Articles tagged linux at null program

Frankenwine: Multiple personas in a Wine process

2026-01-19T21:51:38Z

I came across a recent article on making Linux system calls from a Wine process. Windows programs running under Wine are still normal Linux processes and may interact with the Linux kernel like any other process. None of this was surprising, and the demonstration works just as I expect. Still, it got the wheels spinning and I realized an almost practical application: build my pkg-config implementation such that on Windows pkg-config.exe behaves as a native pkg-config, but when run under Wine this same binary takes the persona of a Linux program and becomes a cross toolchain pkg-config, bypassing Win32 and talking directly with the Linux kernel. Cosmopolitan Libc cleverly does this out-of-the-box, but in this article we’ll mash together a couple of existing sources with a bit of glue.

The results are in the merge-demo branch of u-config, and took hardly any work:

$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)

A platform layer, main_wine.c, is a merge of two existing platform layers, one of which required unavoidable tweaks. We’ll get to those details in a moment. First we’ll need to detect if we’re running under Wine, and the best solution I found was to locate ntdll!wine_get_version. If this function exists, we’re in Wine. That works out to a pretty one-liner because ntdll.dll is already loaded:

bool running_on_wine()
{
    return GetProcAddress(GetModuleHandleA("ntdll"), "wine_get_version");
}

An x86-64 Linux syscall wrapper with thorough inline assembly:

ptrdiff_t syscall3(int n, ptrdiff_t a, ptrdiff_t b, ptrdiff_t c)
{
    ptrdiff_t r;
    asm volatile (
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return r;
}

ptrdiff_t write(int fd, void *buf, ptrdiff_t len)
{
    return syscall3(SYS_write, fd, (ptrdiff_t)buf, len);
}

I’d normally use long for all these integers because Linux is LP64 (long is pointer-sized), but Windows is LLP64 (only long long is 64 bits). It’s so bizarre to interface with Linux from LLP64, and this will have consequences later. With these pieces we can see the basic shape of a split personality program:

    if (running_on_wine()) {
        write(1, "hello, wine\n", 12);
    } else {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        WriteFile(h, "hello, windows\n", 15, 0, 0);
    }

We can cram two programs into this binary and select which program at run time depending on what we see. In typical programs locating and calling into glibc would be a challenge, particularly with the incompatible ABIs involved. We’re avoiding it here by interfacing directly with the kernel.

Application to u-config

Luckily u-config has completely-optional platform layers implemented with Linux system calls. The POSIX platform layer works fine, and that’s what distributions should generally use, but these bonus platforms are unhosted and do not require libc. That means we can shove it into a Windows build with relatively little trouble.

Before we do that, let’s think about what we’re doing. Debian has great cross toolchain support, including Mingw-w64. There are even a few Windows libraries in the Debian package repository, such as zlib, and we can build Windows programs against them. If you’re cross-building and using pkg-config, you ought to use the cross toolchain pkg-config, which in GNU ecosystems gets an architecture prefix like the other cross tools. Debian cross toolchains each include a cross pkg-config, and it sometimes almost works correctly! Here’s what I get on Debian 13:

$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz

Note the architecture in the -I and -L options. It really is querying the cross sysroot. However, these paths are inside the cross sysroot, so pkg-config should not list them. It’s suboptimal and suggests this pkg-config is probably misconfigured. In other cases it’s far from correct:

$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...

A tool prefixed x86_64-w64-mingw32- should not produce paths containing x86_64-linux-gnu (the host architecture in this case). Our version won’t have these issues.
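
The filtering missing from the first example amounts to dropping any -I flag that names a system include directory. Here is a minimal sketch of that idea, not u-config's actual code: keep_cflag is a hypothetical name, and it assumes a single system directory and an exact string match.

```c
#include <string.h>

// Returns 1 if `flag` should be printed, 0 if it is a -I flag naming
// the system include directory (hypothetical helper, exact match only).
int keep_cflag(const char *flag, const char *sysinclude)
{
    if (strncmp(flag, "-I", 2) != 0) {
        return 1;  // not an include path, always keep
    }
    return strcmp(flag + 2, sysinclude) != 0;
}
```

An option like --keep-system-cflags would simply skip this check.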

The u-config platform interface is five functions:

filemap os_mapfile(os *, arena *, s8 path);  // read whole files
s8node *os_listing(os *, arena *, s8 path);  // list directories
void    os_write(os *, i32 fd, s8);          // standard out/err
void    os_fail(os *);                       // non-zero exit

void uconfig(config *);

Platforms implement the first four functions, and call uconfig() with the platform’s configuration, context pointer (os *), command line arguments, environment, and some memory (all in the config object). My strategy is to link two platforms into the binary, and the first challenge is they both define os_write, etc. I did not plan nor intend for one binary to contain more than one platform layer. Unity builds offer a fix without changing a single line of code:

#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include "main_windows.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include "main_linux_amd64.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

This dirty but effective trick may look familiar. It also doesn’t interfere with the other builds. Next I define the real platform functions as a dispatch based on our run-time situation:

b32 wine_detected;

filemap os_mapfile(os *ctx, arena *a, s8 path)
{
    if (wine_detected) {
        return linux_mapfile(ctx, a, path);
    } else {
        return win32_mapfile(ctx, a, path);
    }
}

If I were serious about keeping this experiment, I’d lift os as I did the functions (as win32_os, linux_os) and include wine_detected in the context, eliminating this global variable. That cannot be done with simple hacks and macros.
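
The rename-then-dispatch trick can be seen in miniature in a single file. This is only a sketch: os_pathsep is a hypothetical platform function invented for illustration, and the two inline definitions stand in for the #include lines of the real platform sources.

```c
#include <stdbool.h>

// First "platform" (stands in for #include "main_windows.c").
#define os_pathsep win32_pathsep
char os_pathsep(void) { return ';'; }
#undef os_pathsep

// Second "platform" (stands in for #include "main_linux_amd64.c").
#define os_pathsep linux_pathsep
char os_pathsep(void) { return ':'; }
#undef os_pathsep

bool wine_detected;  // set once at startup

// With the macro gone, this defines the real os_pathsep, which
// dispatches to one of the renamed definitions at run time.
char os_pathsep(void)
{
    return wine_detected ? linux_pathsep() : win32_pathsep();
}
```

Neither platform definition mentions its renamed symbol, which is why the original sources need no changes.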

The next challenge is that I wrote the Linux platform layer assuming LP64, and so it uses long instead of an equivalent platform-agnostic type like ptrdiff_t. I never thought this would be an issue because this source literally contains asm blocks and no conditional compilation, yet here we are. Lesson learned. I wanted to try an extremely janky #define on long to fix it, but this source file has a couple long long that won’t play along. These multi-token type names of C are antithetical to its preprocessor! So I adjusted the source manually instead.
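
One way to make such an assumption visible is to state it in code. A small sketch, assuming a C11 compiler; long_is_pointer_sized is a hypothetical helper, not part of u-config.

```c
#include <stddef.h>

// Expected to hold on both LP64 and LLP64: ptrdiff_t tracks pointer size.
_Static_assert(sizeof(ptrdiff_t) == sizeof(void *),
               "ptrdiff_t is not pointer-sized");

// Returns 1 where long is as wide as a pointer (e.g. x86-64 Linux),
// 0 where it is not (e.g. 64-bit Windows).
int long_is_pointer_sized(void)
{
    return sizeof(long) == sizeof(void *);
}
```

Code that passes pointers through integer registers would use ptrdiff_t and never depend on the second check.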

The Windows and Linux platform entry points are completely different, both in name and form, and so co-exist naturally. The merged platform layer is a new entry point that will pass control to the appropriate entry point:

void entrypoint(ptrdiff_t *stack);  // Linux
void __stdcall mainCRTStartup();    // Windows

On Linux stack is the initial value of the stack pointer, which points to argc, argv, envp, and auxv. We’ll need to construct an artificial “stack” for the Linux platform layer to harvest. On Windows this is the process entry point, and it will find the rest on its own as a normal Windows process. Ultimately this ended up simpler than I expected:

void __stdcall merge_entrypoint()
{
    wine_detected = running_on_wine();
    if (wine_detected) {
        u8 *fakestack[CMDLINE_ARGV_MAX+1];
        c16 *cmd = GetCommandLineW();
        fakestack[0] = (u8 *)(iz)cmdline_to_argv8(cmd, fakestack+1);
        // TODO: append envp to the fake stack
        entrypoint((iz *)fakestack);
    } else {
        mainCRTStartup();
    }
}

Here cmdline_to_argv8 is my Windows argument parser, already used by u-config, and I reserve one element at the front to store argc. Since this is just a proof-of-concept I didn’t bother fabricating and pushing envp onto the fake stack. The Linux entry point doesn’t need auxv, so it can be omitted. Once in the Linux entry point it’s essentially a Linux process from then on, except that the x64 calling convention is still in use internally.
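
The layout a Linux entry point harvests can be sketched in portable C. The names here (startinfo, fakestack_fill, fakestack_read) are hypothetical, and unlike the proof-of-concept above this sketch also writes an empty envp so that the reader side has a terminator to find.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int    argc;
    char **argv;
    char **envp;
} startinfo;

// Fill `stack` (at least argc+3 slots) with argc, the argv pointers, the
// argv terminator, and an empty envp. Returns the number of slots written.
ptrdiff_t fakestack_fill(char **stack, int argc, char **argv)
{
    ptrdiff_t n = 0;
    stack[n++] = (char *)(intptr_t)argc;  // slot 0 holds argc as an integer
    for (int i = 0; i < argc; i++) {
        stack[n++] = argv[i];
    }
    stack[n++] = 0;  // argv terminator
    stack[n++] = 0;  // empty envp: only its terminator
    return n;
}

// Read the layout back the way a Linux-style entry point would.
startinfo fakestack_read(char **stack)
{
    startinfo r;
    r.argc = (int)(intptr_t)stack[0];
    r.argv = stack + 1;
    r.envp = r.argv + r.argc + 1;  // skip argv and its terminator
    return r;
}
```

The envp computation is plain pointer arithmetic: it starts one slot past the null pointer that ends argv.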

Finally, I configure the Linux platform layer for Debian’s cross sysroot:

#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"

And that’s it! We have our platform merge. Build (w64devkit):

$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c

On Debian use x86_64-w64-mingw32-gcc for cc. The -e linker option selects the new, higher level entry point. After installing Wine binfmt, here’s how it looks on Debian:

$ ./pkg-config.exe --cflags --libs zlib
-lz

That’s the correct output, but is it using the cross sysroot? Ask it to include the -I argument despite it being in the cross sysroot:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz

Looking good! It passes the pc_path test, too:

$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig

Running this same binary on Windows after installing zlib in w64devkit:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz

Also:

$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig

My Frankenwine is a success!

Practical libc-free threading on Linux

2023-03-23T05:32:41Z

Suppose you’re not using a C runtime on Linux, and instead you’re programming against its system call API. It’s long-term and stable after all. Memory management and buffered I/O are easily solved, but a lot of software benefits from concurrency. It would be nice to also have thread spawning capability. This article will demonstrate a simple, practical, and robust approach to spawning and managing threads using only raw system calls. It only takes about a dozen lines of C, including a few inline assembly instructions.

The catch is that there’s no way to avoid using a bit of assembly. Neither the clone nor clone3 system calls have threading semantics compatible with C, so you’ll need to paper over it with a bit of inline assembly per architecture. This article will focus on x86-64, but the basic concept should work on all architectures supported by Linux. The glibc clone(2) wrapper fits a C-compatible interface on top of the raw system call, but we won’t be using it here.

Before diving in, the complete, working demo: stack_head.c

The clone system call

On Linux, threads are spawned using the clone system call with semantics like the classic unix fork(2). One process goes in, two processes come out in nearly the same state. For threads, those processes share almost everything and differ only by two registers: the return value — zero in the new thread — and stack pointer. Unlike typical thread spawning APIs, the application does not supply an entry point. It only provides a stack for the new thread. The simple form of the raw clone API looks something like this:

long clone(long flags, void *stack);

Sounds kind of elegant, but it has an annoying problem: The new thread begins life in the middle of a function without any established stack frame. Its stack is a blank slate. It’s not ready to do anything except jump to a function prologue that will set up a stack frame. So besides the assembly for the system call itself, it also needs more assembly to get the thread into a C-compatible state. In other words, a generic system call wrapper cannot reliably spawn threads.

void brokenclone(void (*threadentry)(void *), void *arg)
{
    // ...
    long r = syscall(SYS_clone, flags, stack);
    // DANGER: new thread may access non-existent stack frame here
    if (!r) {
        threadentry(arg);
    }
}

For odd historical reasons, each architecture’s clone has a slightly different interface. The newer clone3 unifies these differences, but it suffers from the same thread spawning issue above, so it’s not helpful here.

The stack “header”

I figured out a neat trick eight years ago which I continue to use today. The parent and child threads are in nearly identical states when the new thread starts, but the immediate goal is to diverge. As noted, one difference is their stack pointers. To diverge their execution, we could make their execution depend on the stack. An obvious choice is to push different return pointers on their stacks, then let the ret instruction do the work.

Carefully preparing the new stack ahead of time is the key to everything, and there’s a straightforward technique that I like to call the stack_head, a structure placed at the high end of the new stack. Its first element must be the entry point pointer, and this entry point will receive a pointer to its own stack_head.

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    // ...
};

The structure must have 16-byte alignment on all architectures. I used an attribute to help keep this straight, and it can help when using sizeof to place the structure, as I’ll demonstrate later.

Now for the cool part: The ... can be anything you want! Use that area to seed the new stack with whatever thread-local data is necessary. It’s a neat feature you don’t get from standard thread spawning interfaces. If I plan to “join” a thread later — wait until it’s done with its work — I’ll put a join futex in this space:

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
    // ...
};

More details on that futex shortly.

The clone wrapper

I call the clone wrapper newthread. It has the inline assembly for the system call, and since it includes a ret to diverge the threads, it’s a “naked” function just like with setjmp. The compiler will generate no prologue or epilogue, and the function body is limited to inline assembly without input/output operands. It cannot even reliably reference its parameters by name. Like clone, it doesn’t accept a thread entry point. Instead it accepts a stack_head seeded with the entry point. The whole wrapper is just six instructions:

__attribute((naked))
static long newthread(struct stack_head *stack)
{
    __asm volatile (
        "mov  %%rdi, %%rsi\n"     // arg2 = stack
        "mov  $0x50f00, %%edi\n"  // arg1 = clone flags
        "mov  $56, %%eax\n"       // SYS_clone
        "syscall\n"
        "mov  %%rsp, %%rdi\n"     // entry point argument
        "ret\n"
        : : : "rax", "rcx", "rsi", "rdi", "r11", "memory"
    );
}

On x86-64, both function calls and system calls use rdi and rsi for their first two parameters. Per the reference clone(2) prototype above: the first system call argument is flags and the second argument is the new stack, which will point directly at the stack_head. However, the stack pointer arrives in rdi. So I copy stack into the second argument register, rsi, then load the flags (0x50f00) into the first argument register, rdi. The system call number goes in rax.

Where does that 0x50f00 come from? That’s the bare minimum thread spawn flag set in hexadecimal. If any flag is missing then threads will not spawn reliably — as discovered the hard way by trial and error across different system configurations, not from documentation. It’s computed normally like so:

    long flags = 0;
    flags |= CLONE_FILES;
    flags |= CLONE_FS;
    flags |= CLONE_SIGHAND;
    flags |= CLONE_SYSVSEM;
    flags |= CLONE_THREAD;
    flags |= CLONE_VM;
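
As a sanity check on the constant, here is a sketch that spells the flags out numerically. The values are copied from Linux's uapi sched.h header and given an X_ prefix (my own naming) so the snippet compiles without any Linux headers.

```c
// Flag values as defined by Linux; the X_ prefix avoids clashing with
// the real macros if a system header is also included.
enum {
    X_CLONE_VM      = 0x00000100,
    X_CLONE_FS      = 0x00000200,
    X_CLONE_FILES   = 0x00000400,
    X_CLONE_SIGHAND = 0x00000800,
    X_CLONE_THREAD  = 0x00010000,
    X_CLONE_SYSVSEM = 0x00040000
};

// The minimal thread-spawn flag set from the text.
long thread_flags(void)
{
    return X_CLONE_FILES | X_CLONE_FS | X_CLONE_SIGHAND |
           X_CLONE_SYSVSEM | X_CLONE_THREAD | X_CLONE_VM;
}
```

The low four flags contribute 0xf00 and the remaining two contribute 0x50000.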

When the system call returns, it copies the stack pointer into rdi, the first argument for the entry point. In the new thread the stack pointer will be the same value as stack, of course. In the old thread this is a harmless no-op because rdi is a volatile register in this ABI. Finally, ret pops the address at the top of the stack and jumps. In the old thread this returns to the caller with the system call result, either an error (negative errno) or the new thread ID. In the new thread it pops the first element of stack_head which, of course, is the entry point. That’s why it must be first!

The thread has nowhere to return from the entry point, so when it’s done it must either block indefinitely or use the exit (not exit_group) system call to terminate itself.

Caller point of view

The caller side looks something like this:

static void threadentry(struct stack_head *stack)
{
    // ... do work ...
    __atomic_store_n(&stack->join_futex, 1, __ATOMIC_SEQ_CST);
    futex_wake(&stack->join_futex);
    exit(0);
}

__attribute((force_align_arg_pointer))
void _start(void)
{
    struct stack_head *stack = newstack(1<<16);
    stack->entry = threadentry;
    // ... assign other thread data ...
    stack->join_futex = 0;
    newthread(stack);

    // ... do work ...

    futex_wait(&stack->join_futex, 0);
    exit_group(0);
}

Despite the minimalist, 6-instruction clone wrapper, this is taking the shape of a conventional threading API. It would only take a bit more to hide the futex, too. Speaking of which, what’s going on there? The same principle as a WaitGroup. The futex, an integer, is zero-initialized, indicating the thread is running (“not done”). The joiner tells the kernel to wait until the integer is non-zero, which it may already be since I don’t bother to check first. When the child thread is done, it atomically sets the futex to non-zero and wakes all waiters, which might be nobody.

Caveat: It’s not safe to free/reuse the stack after a successful join. It only indicates the thread is done with its work, not that it exited. You’d need to wait for its SIGCHLD (or use CLONE_CHILD_CLEARTID). If this sounds like a problem, consider your context more carefully: Why do you feel the need to free the stack? It will be freed when the process exits. Worried about leaking stacks? Why are you starting and exiting an unbounded number of threads? In the worst case park the thread in a thread pool until you need it again. Only worry about this sort of thing if you’re building a general purpose threading API like pthreads. I know it’s tempting, but avoid doing that unless you absolutely must.

What’s with the force_align_arg_pointer? Linux doesn’t align the stack for the process entry point like a System V ABI function call. Processes begin life with an unaligned stack. This attribute tells GCC to fix up the stack alignment in the entry point prologue, just like on Windows. If you want to access argc, argv, and envp you’ll need more assembly. (I wish doing really basic things without libc on Linux didn’t require so much assembly.)

__asm (
    ".global _start\n"
    "_start:\n"
    "   movl  (%rsp), %edi\n"
    "   lea   8(%rsp), %rsi\n"
    "   lea   8(%rsi,%rdi,8), %rdx\n"
    "   call  main\n"
    "   movl  %eax, %edi\n"
    "   movl  $60, %eax\n"
    "   syscall\n"
);

int main(int argc, char **argv, char **envp)
{
    // ...
}

Getting back to the example usage, it has some regular-looking system call wrappers. Where do those come from? Start with this 6-argument generic system call wrapper.

long syscall6(long n, long a, long b, long c, long d, long e, long f)
{
    register long ret;
    register long r10 asm("r10") = d;
    register long r8  asm("r8")  = e;
    register long r9  asm("r9")  = f;
    __asm volatile (
        "syscall"
        : "=a"(ret)
        : "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory"
    );
    return ret;
}

I could define syscall5, syscall4, etc. but instead I’ll just wrap it in macros. The former would be more efficient since the latter wastes instructions zeroing registers for no reason, but for now I’m focused on compacting the implementation source.

#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))

Now we can have some exits:

__attribute((noreturn))
static void exit(int status)
{
    SYSCALL1(SYS_exit, status);
    __builtin_unreachable();
}

__attribute((noreturn))
static void exit_group(int status)
{
    SYSCALL1(SYS_exit_group, status);
    __builtin_unreachable();
}

Simplified futex wrappers:

static void futex_wait(int *futex, int expect)
{
    SYSCALL4(SYS_futex, futex, FUTEX_WAIT, expect, 0);
}

static void futex_wake(int *futex)
{
    SYSCALL3(SYS_futex, futex, FUTEX_WAKE, 0x7fffffff);
}

And so on.

Finally I can talk about that newstack function. It’s just a wrapper around an anonymous memory map allocating pages from the kernel. I’ve hardcoded the constants for the standard mmap allocation since they’re nothing special or unusual. The return value check is a little tricky since a large portion of the negative range is valid, so I only want to check for a small range of negative errnos. (Allocating an arena looks basically the same.)

static struct stack_head *newstack(long size)
{
    unsigned long p = SYSCALL6(SYS_mmap, 0, size, 3, 0x22, -1, 0);
    if (p > -4096UL) {
        return 0;
    }
    long count = size / sizeof(struct stack_head);
    return (struct stack_head *)p + count - 1;
}

The aligned attribute comes into play here: I treat the result like an array of stack_head and return the last element. The attribute ensures each individual element is aligned.
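
Both pieces of arithmetic can be tried without making any system calls. A sketch under stated assumptions: syscall_failed and stack_head_at are hypothetical names, the base address is assumed to be 16-byte aligned (as page-aligned mmap results are), and the attribute syntax requires GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

struct __attribute((aligned(16))) stack_head {
    void (*entry)(struct stack_head *);
    int join_futex;
};

// Raw system calls report failure as a negative errno in [-4095, -1],
// which viewed as unsigned is the very top of the address range.
int syscall_failed(uintptr_t r)
{
    return r > (uintptr_t)-4096;
}

// Treat the mapping as an array of stack_head and return its last
// element, which sits at the high end of the region.
struct stack_head *stack_head_at(void *base, size_t size)
{
    size_t count = size / sizeof(struct stack_head);
    return (struct stack_head *)base + count - 1;
}
```

Because sizeof(struct stack_head) is a multiple of 16, every element of that imaginary array, including the last, lands on a 16-byte boundary.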

That’s it! There’s not much to it other than a few thoughtful assembly instructions. It took doing this a few times in a few different programs before I noticed how simple it can be.

How to build a WaitGroup from a 32-bit integer

2022-10-05T03:19:07Z

Go has a nifty synchronization utility called a WaitGroup, on which one or more goroutines can wait for concurrent task completion. In other languages, the usual task completion convention is joining threads doing the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup replaces joins. Building a WaitGroup using typical, portable primitives is a messy affair involving constructors and destructors, managing lifetimes. However, on at least Linux and Windows, we can build a WaitGroup out of a zero-initialized integer, much like my 32-bit queue and 32-bit barrier.

In case you’re not familiar with it, a typical WaitGroup use case in Go:

var wg sync.WaitGroup
for _, task := range tasks {
    wg.Add(1)
    go func(t Task) {
        // ... do task ...
        wg.Done()
    }(task)
}
wg.Wait()

I zero-initialize the WaitGroup, the main goroutine increments the counter before starting each task goroutine, each goroutine decrements the counter when done, and the main goroutine waits until the counter reaches zero. My goal is to build the same mechanism in C:

void workfunc(task t, int *wg)
{
    // ... do task ...
    waitgroup_done(wg);
}

int main(void)
{
    // ...
    int wg = 0;
    for (int i = 0; i < ntasks; i++) {
        waitgroup_add(&wg, 1);
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);
    // ...
}

When it’s done, the WaitGroup is back to zero, and no cleanup is required.

I’m going to take it a little further than that: Since its meaning and contents are explicit, you may initialize a WaitGroup to any non-negative task count! In other words, waitgroup_add is optional if the total number of tasks is known up front.

    int wg = ntasks;
    for (int i = 0; i < ntasks; i++) {
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);

A sneak peek at the full source: waitgroup.c

The four elements (of synchronization)

To build this WaitGroup, we’re going to need four primitives from the host platform, each operating on an int. The first two are atomic operations, and the second two interact with the system scheduler. To port the WaitGroup to a platform you need only implement these four functions, typically as one-liners.

static int  load(int *);           // atomic load
static int  addfetch(int *, int);  // atomic add-then-fetch
static void wait(int *, int);      // wait on change at address
static void wake(int *);           // wake all waiters by address

The first two should be self-explanatory. The wait function waits for the pointed-at integer to change its value, and the second argument is its expected current value. The scheduler will double-check the integer before putting the thread to sleep in case it changes at the last moment — in other words, an atomic check-then-maybe-sleep. The wake function is the other half. After changing the integer, a thread uses it to wake all threads waiting for the pointed-at integer to change. Together, this mechanism is known as a futex.

I’m going to simplify the WaitGroup semantics a bit in order to make my implementation even simpler. Go’s WaitGroup allows adding negatives, and the Add method essentially does double-duty. My version forbids adding negatives. That means the “add” operation is just an atomic increment:

void waitgroup_add(int *wg, int delta)
{
    addfetch(wg, delta);
}

Since it cannot bring the counter to zero, there’s nothing else to do. The “done” operation can decrement to zero:

void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        wake(wg);
    }
}

If the atomic decrement brought the count to zero, we finished the last task, so we need to wake the waiters. We don’t know if anyone is actually waiting, but that’s fine. Some futex use cases will avoid making the relatively expensive system call if nobody’s waiting — i.e. don’t waste time on a system call for each unlock of an uncontended mutex — but in the typical WaitGroup case we expect a waiter when the count finally goes to zero. That’s the common case.

The most complicated of the three is waiting:

void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        wait(wg, c);
    }
}

First check if the count is already zero and return if it is. Otherwise use the futex to wait for it to change. Unfortunately that’s not exactly the semantics we want, which would be to wait for a certain target. This doesn’t break the wait, but it’s a potential source of inefficiency. If a thread finishes a task between our load and wait, we don’t go to sleep, and instead try again. However, in practice, I ran thousands of threads through this thing concurrently and I couldn’t observe such a “miss.” As far as I can tell, it’s so rare it doesn’t matter.

If this was a concern, the WaitGroup could instead be a pair of integers: the counter and a “latch” that is either 0 or 1. Waiters wait on the latch, and the latch is modified (atomically) when the counter transitions to or from zero. That gives waiters a stable value on which to wait, proxying the counter. However, since this doesn’t seem to matter in practice, I prefer the elegance and simplicity of the single-integer WaitGroup.

Four elements: Linux

With the WaitGroup done at a high level, we now need the per-platform parts. Both GCC and Clang support GNU-style atomics, so I’ll just assume these are available on Linux without worrying about the compiler. The first two functions wrap these built-ins:

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

For wait and wake we need the futex(2) system call. In an attempt to discourage its direct use, glibc doesn’t wrap this system call in a function, so we must make the system call ourselves.

static void wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}

static void wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}

The INT_MAX means “wake as many as possible.” The other common value is 1 for waking a single waiter. Also, these system calls can’t meaningfully fail, so there’s no need to check the return value. If wait wakes up early (e.g. EINTR), it’s going to check the counter again anyway. In fact, if your kernel is more than 20 years old, predating futexes, and returns ENOSYS (“Function not implemented”), it will still work correctly, though it will be incredibly inefficient.
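
To see why a failing futex call is harmless, consider a sketch where the two scheduler primitives are stubs that return immediately, much as a system call failing with ENOSYS would. The stub names and counters are my own additions for illustration; the atomics assume GCC or Clang.

```c
int wait_calls, wake_calls;  // instrumentation, only for this sketch

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

// Stand-ins for the futex calls: they never sleep and never wake anyone.
static void stub_wait(int *p, int current) { (void)p; (void)current; wait_calls++; }
static void stub_wake(int *p) { (void)p; wake_calls++; }

void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        stub_wake(wg);
    }
}

void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        stub_wait(wg, c);  // returns at once, so this loop busy-waits
    }
}
```

Correctness rests entirely on the atomic counter and the re-check in the loop. The futex only decides whether a waiter sleeps or spins.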

Four elements: Windows

Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was still supporting versions of Windows without them into 2020, so they’re still relatively “new” for this platform. Nonetheless, they’re now mature enough that we can count on them being available.

I’d like to support both GCC-ish (via Mingw-w64) and MSVC-ish compilers. Mingw-w64 provides a compatible intrin.h, so I can stick to MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t define atomics for int (or even int32_t), strictly long, so I have to sneak in a little cast. (Recall: sizeof(long) == sizeof(int) on every version of Windows supporting futexes.) The other option is to typedef the WaitGroup so that it’s int on Linux (for the futex) and long on Windows (for atomics).

static int load(int *p)
{
    return _InterlockedOr((long *)p, 0);
}

static int addfetch(int *p, int addend)
{
    return addend + _InterlockedExchangeAdd((long *)p, addend);
}

The official, sanctioned futex functions are WaitOnAddress and WakeByAddressAll. They used to be in kernel32.dll, but as of this writing they live in API-MS-Win-Core-Synch-l1-2-0.dll, linked via -lsynchronization. Gross. Since I can’t stomach this, I instead call the low-level RTL functions where they’re actually implemented: RtlWaitOnAddress and RtlWakeAddressAll. These live in the nice neighborhood of ntdll.dll. They’re undocumented as far as I can tell, but thankfully Wine comes to the rescue, providing both documentation and several different implementations. Reading through it is educational, and hints at ways to construct futexes on systems lacking them.

These functions aren’t declared in any headers, so I have to do it myself. On the plus side, so far I haven’t paid the substantial compile-time costs of including windows.h, and so I can continue avoiding it. These functions are listed in the ntdll.dll import library, so I don’t need to invent the import library entries.

__declspec(dllimport)
long __stdcall RtlWaitOnAddress(void *, void *, size_t, void *);
__declspec(dllimport)
long __stdcall RtlWakeAddressAll(void *);

Rather conveniently, the semantics perfectly line up with Linux futexes!

static void wait(int *p, int current)
{
    RtlWaitOnAddress(p, &current, sizeof(*p), 0);
}

static void wake(int *p)
{
    RtlWakeAddressAll(p);
}

Like with Linux, there’s no meaningful failure, so the return values don’t matter.

That’s the whole implementation. Considering just a single platform, a flexible, lightweight, and easy-to-use synchronization facility in ~50 lines of relatively simple code is a pretty good deal if you ask me!

Illuminating synchronization edges for ThreadSanitizer

2022-10-03T03:09:38Z

Sanitizers are powerful development tools which complement debuggers and fuzzing. I typically have at least one sanitizer active during development. They’re particularly useful during code review, where they can identify issues before I’ve even begun examining the code carefully — sometimes in mere minutes under fuzzing. Accordingly, it’s a good idea to have your own code in good agreement with sanitizers before review. For ThreadSanitizer (TSan), that means dealing with false positives in programs relying on synchronization invisible to TSan.

This article’s motivation is multi-threaded epoll. I mitigate TSan false positives each time it comes up, enough to have gotten the hang of it, so I ought to document it. On Windows I would also run into the same issue with the Win32 message queue, crossing the synchronization edge between PostMessage (release) and GetMessage (acquire), except for the general lack of TSan support in Windows tooling. The same technique would work there as well.

My typical epoll scenario looks like so:

Create an epoll file descriptor (epoll_create1).
Create worker threads, passing the epoll file descriptor.
Worker threads loop on epoll_wait.
Main thread loops on accept, adding sockets to epoll (epoll_ctl).

Between accept and EPOLL_CTL_ADD, the main thread allocates and initializes the client session state, then attaches it to the epoll event. The client socket is added with the EPOLLONESHOT flag, and the session state is not touched after the call to epoll_ctl (note: sans error checks):

for (;;) {
    int fd = accept(...);
    struct session *session = ...;
    session->fd = fd;
    // ...
    struct epoll_event event;
    event.events = EPOLLET | EPOLLONESHOT | ...;
    event.data.ptr = session;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
}

In this example, struct session is defined by the application to contain all the state for handling a session (file descriptor, buffers, state machine, parser state, allocation arena, etc.). Everything else is part of the epoll interface.

When a socket is ready, one of the worker threads receives it. Due to EPOLLONESHOT, it’s immediately disabled and no other thread can receive it. The thread does as much work as possible (i.e. read/write until EAGAIN), then reactivates it with epoll_ctl:

for (;;) {
    struct epoll_event event;
    epoll_wait(epfd, &event, 1, -1);
    struct session *session = event.data.ptr;
    int fd = session->fd;
    // ...
    event.events = EPOLLET | EPOLLONESHOT | ...;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
}
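The one-shot behavior can be observed in a single thread. This is a minimal sketch, assuming a pipe in place of a client socket and a hypothetical oneshot_demo function. It omits EPOLLET and uses a zero timeout so that nothing blocks.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

struct session { int fd; };

// Stores the result of three non-blocking epoll_wait calls in results[].
// Returns 0 on success, -1 on any setup failure.
static int oneshot_demo(int results[3])
{
    assert(results != 0);
    int pipefd[2];
    if (pipe(pipefd)) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    struct session session = {pipefd[0]};
    struct epoll_event event = {0};
    event.events = EPOLLIN | EPOLLONESHOT;
    event.data.ptr = &session;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, session.fd, &event)) return -1;

    // Make the read end ready.
    if (write(pipefd[1], "x", 1) != 1) return -1;

    struct epoll_event out = {0};
    results[0] = epoll_wait(epfd, &out, 1, 0);  // event delivered
    results[1] = epoll_wait(epfd, &out, 1, 0);  // disabled by EPOLLONESHOT

    // Re-arm, as the worker does once it has finished with the session.
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, session.fd, &event)) return -1;
    results[2] = epoll_wait(epfd, &out, 1, 0);  // delivered again

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return out.data.ptr == &session ? 0 : -1;
}
```

Between the first wait and the EPOLL_CTL_MOD, no other waiter can receive the descriptor, which is the window in which a worker owns the session exclusively.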

The shared variables in session are passed between threads through epoll using the event’s .data.ptr. These variables are potentially read and mutated by every thread, but it’s all perfectly safe without any further synchronization, i.e. no need for mutexes, etc. All the necessary synchronization is implicit in epoll.

In the initial hand-off, that EPOLL_CTL_ADD must happen before the corresponding epoll_wait in a worker thread. This establishes that the main thread and worker thread do not touch session variables concurrently. After all, how could the worker see an event on the file descriptor before it’s been added to epoll? The synchronization in epoll itself will also ensure all the architecture-level stores are visible to other threads before the hand-off. We can call the “add” a release and the “wait” an acquire, forming a synchronization edge.

Similarly, in the hand-off between worker threads, the EPOLL_CTL_MOD that reactivates the file descriptor must happen before the wait that observes the next event because, until reactivation, it’s disabled. The EPOLL_CTL_MOD is another release in relation to the acquire wait.

Unfortunately TSan won’t see things this way. It can’t see into the kernel, and it doesn’t know these subtle epoll semantics, so it can’t see these synchronization edges. As far as it can tell, threads might be accessing a session concurrently, and TSan will reliably produce warnings about it. You could shrug your shoulders and give up on using TSan in this case, but there’s an easy solution: introduce redundant, semantically identical synchronization edges, but only when TSan is looking.

WARNING: ThreadSanitizer: data race

Redundant synchronization

I prefer to solve this by introducing the weakest possible synchronization so that I’m not synchronizing beyond epoll’s semantics. This will help TSan catch real mistakes that stronger synchronization might hide.

The weakest option is memory fences. These wouldn’t introduce extra loads or stores. At most it would be a fence instruction. I would use GCC’s built-in __atomic_thread_fence for the job. However, TSan does not currently understand thread fences, so that defeats the purpose. Instead, I introduce a new field to struct session:

struct session {
    int fd;
    // ...
    int _sync;
};

Then just before epoll_ctl I’ll do a release store on this field, “releasing” the session. All session stores are ordered before the release.

    // main thread
    // ...
    __atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);

    // worker thread
    // ...
    __atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);

After epoll_wait I add an acquire load, “acquiring” the session. All session loads are ordered after the acquire.

    epoll_wait(epfd, &event, 1, -1);
    struct session *session = event.data.ptr;
    __atomic_load_n(&session->_sync, __ATOMIC_ACQUIRE);
    int fd = session->fd;
    // ...

For this to work, the thread must not touch session variables in any way before the acquire or after the release. For example, note how I obtained the client file descriptor before the release, i.e. no session->fd argument in the epoll_ctl call.

That’s it! This redundantly establishes the happens-before relationship already implicit in epoll, but now it’s visible to TSan. However, I don’t want to pay for this unless I’m actually running under TSan, so some macros are in order. GCC automatically defines __SANITIZE_THREAD__ when compiling with TSan enabled (-fsanitize=thread):

#if __SANITIZE_THREAD__
# define TSAN_SYNCED     int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif

This also makes it more readable, and intentions clearer:

struct session {
    int fd;
    // ...
    TSAN_SYNCED;
};

    // main thread
    for (;;) {
        // ...
        TSAN_RELEASE(session);
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
    }

    // worker thread
    for (;;) {
        epoll_wait(epfd, &event, 1, -1);
        struct session *session = event.data.ptr;
        TSAN_ACQUIRE(session);
        int fd = session->fd;
        // ...
        TSAN_RELEASE(session);
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
    }

Now I can use TSan again, and it didn’t cost anything in normal builds.
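For builds that must also work with Clang, which exposes sanitizer detection through __has_feature(thread_sanitizer), the detection might be widened as in the following sketch. TSAN_ENABLED, session_publish, and session_take are hypothetical names introduced only for illustration. As a small variation, the semicolon lives inside TSAN_SYNCED so that normal builds do not leave an empty declaration in the struct.

```c
#include <assert.h>

// Detect TSan under GCC (__SANITIZE_THREAD__) or Clang (__has_feature).
#if defined(__SANITIZE_THREAD__)
#  define TSAN_ENABLED 1
#elif defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define TSAN_ENABLED 1
#  endif
#endif
#ifndef TSAN_ENABLED
#  define TSAN_ENABLED 0
#endif

#if TSAN_ENABLED
#  define TSAN_SYNCED     int _sync;
#  define TSAN_ACQUIRE(s) (void)__atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
#  define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
#  define TSAN_SYNCED
#  define TSAN_ACQUIRE(s) ((void)0)
#  define TSAN_RELEASE(s) ((void)0)
#endif

struct session {
    int fd;
    TSAN_SYNCED
};

// Hypothetical helper: call last, just before handing the session to epoll_ctl.
static void session_publish(struct session *s)
{
    assert(s != 0);
    TSAN_RELEASE(s);
}

// Hypothetical helper: call first, just after epoll_wait returns the session.
static int session_take(struct session *s)
{
    assert(s != 0);
    TSAN_ACQUIRE(s);
    return s->fd;
}
```

In a normal build both helpers compile down to nothing more than the field access, so the cost remains zero.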

My new debugbreak command

2022-07-31T12:59:59Z

I previously mentioned the Windows feature where pressing F12 in a debuggee window causes it to break in the debugger. It works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the hotkey simply raises a breakpoint structured exception. It’s been surprisingly useful, and I’ve wanted it available in more contexts, such as console programs or even on Linux. The result is a new debugbreak command, now included in w64devkit. Though, of course, you already have everything you need to build it and try it out right now. I’ve also worked out a Linux implementation.

It’s named after an MSVC intrinsic and Win32 function. It takes no arguments, and its operation is indiscriminate: It raises a breakpoint exception in all debuggee processes system-wide. Reckless? Perhaps, but certainly convenient. You don’t need to tell it which process you want to pause. It just works, and a good debugging experience is one of ease and convenience.

The linchpin is DebugBreakProcess. The command walks the process list and fires this function at each process. Nothing happens for programs without a debugger attached, so it doesn’t even bother checking if it’s a debuggee. It couldn’t be simpler. I’ve used it on everything from Windows XP to Windows 11, and it’s worked flawlessly.

HANDLE s = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
PROCESSENTRY32W p = {sizeof(p)};
for (BOOL r = Process32FirstW(s, &p); r; r = Process32NextW(s, &p)) {
    HANDLE h = OpenProcess(PROCESS_ALL_ACCESS, 0, p.th32ProcessID);
    if (h) {
        DebugBreakProcess(h);
        CloseHandle(h);
    }
}

I use it almost exclusively from Vim, where I’ve given it a leader mapping. With the editor focused, I can type backslash then d to pause the debuggee.

map <leader>d :call system("debugbreak")<cr>

With the debuggee paused, I’m free to add new breakpoints or watchpoints, or print the call stack to see what the heck it’s busy doing. The mechanism behind DebugBreakProcess is to create a new thread in the target, with that thread raising the breakpoint exception. The debugger will be stopped in this new thread. In GDB you can use the thread command to switch over to the thread that actually matters, usually thr 1.

debugbreak on Linux

On unix-like systems the equivalent of a breakpoint exception is a SIGTRAP. There’s already a standard command for sending signals, kill, so a debugbreak command can be built using nothing more than a few lines of shell script. However, unlike DebugBreakProcess, signaling every process with SIGTRAP will only end in tears. The script will need a way to determine which processes are debuggees.

Linux exposes processes in the file system as virtual files under /proc, where each process appears as a directory. Its status file includes a TracerPid field, which will be non-zero for debuggees. The script inspects this field, and if non-zero sends a SIGTRAP.

#!/bin/sh
set -e
for pid in $(find /proc -maxdepth 1 -printf '%f\n' | grep '^[0-9]\+$'); do
    grep -q '^TracerPid:\s[^0]' /proc/$pid/status 2>/dev/null &&
        kill -TRAP $pid
done

This script, now part of my dotfiles, has worked very well so far, and effectively smooths over some debugging differences between Windows and Linux, reducing my context-switching mental load. There’s probably a better way to express this script, but that’s the best I could do so far. On the BSDs you’d need to parse the output of ps, though each system seems to do its own thing for distinguishing debuggees.
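The same TracerPid check could also be written in C. This is a sketch rather than a replacement for the script, and the function names parse_tracer_pid and is_debuggee are hypothetical. It assumes the status file fits in a small buffer, which is reasonable since TracerPid appears near the top.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Extract the TracerPid value from the text of a /proc/PID/status file.
// Returns -1 if the field is missing or malformed.
static long parse_tracer_pid(const char *status)
{
    assert(status != 0);
    const char *key = "TracerPid:";
    const char *p = status;
    while (p && *p) {
        // p always points at the start of a line here.
        if (!strncmp(p, key, strlen(key))) {
            long pid;
            if (sscanf(p + strlen(key), "%ld", &pid) == 1) {
                return pid;
            }
            return -1;
        }
        p = strchr(p, '\n');
        if (p) p++;
    }
    return -1;
}

// Returns 1 if the process has a tracer attached, otherwise 0.
static int is_debuggee(long pid)
{
    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = 0;
    return parse_tracer_pid(buf) > 0;
}
```

A full command would walk the numeric entries of /proc and call kill(pid, SIGTRAP) for each process where is_debuggee returns 1.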

A missing feature

I had originally planned for one flag, -k. Rather than breakpoint debuggees, it would terminate all debuggee processes. This is especially important on Windows where debuggee processes block builds due to file locking shenanigans. I’d just run debugbreak -k as part of the build. However, it’s not possible to terminate debuggees paused in the debugger, which is the common situation. I’ve given up on this for now.

How to build and use DLLs on Windows

2021-05-31T02:13:40Z

I’ve recently been involved with a couple of discussions about Windows’ dynamic linking. One was with Joe Nelson, who was considering how to make libderp accessible on Windows, and the other was about w64devkit, my Mingw-w64 distribution. I use these techniques so infrequently that I need to figure it all out again each time I need them. Unfortunately there’s a whole lot of outdated and incorrect information online which gets in the way every time this happens. While it’s all fresh in my head, I will now document what I know works.

In this article, all commands and examples are being run in the context of w64devkit (1.8.0).

Mingw-w64

If all you care about is the GNU toolchain then DLLs are straightforward, working mostly like shared objects on other platforms. To illustrate, let’s build a “square” library with one “exported” function, square, that returns the square of its input (square.c):

long square(long x)
{
    return x * x;
}

The header file (square.h):

#ifndef SQUARE_H
#define SQUARE_H

long square(long);

#endif

To build a stripped, size-optimized DLL, square.dll:

$ cc -shared -Os -s -o square.dll square.c

Now a test program to link against it (main.c), which “imports” square from square.dll:

#include <stdio.h>
#include "square.h"

int main(void)
{
    printf("%ld\n", square(2));
}

Linking and testing it:

$ cc -Os -s main.c square.dll
$ ./a
4

It’s that simple. Or more traditionally, using the -l flag:

$ cc -Os -s -L. main.c -lsquare

Given -lxyz GCC will look for xyz.dll in the library path.

Viewing exported symbols

Given a DLL, printing a list of its exported functions is not so straightforward. For ELF shared objects there’s nm -D, but despite what the internet will tell you, this tool does not support DLLs. objdump will print the exports as part of the “private” headers (-p). A bit of awk can cut this down to just a list of exports. Since we’ll need this a few times, here’s a script, exports.sh, that composes objdump and awk into the tool I want:

#!/bin/sh
set -e
printf 'LIBRARY %s\nEXPORTS\n' "$1"
objdump -p "$1" | awk '/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'

Running this on square.dll above:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square

This can be helpful when debugging. It also works outside of Windows, such as on Linux. By the way, the output format is no accident: This is the .def file format, which will be particularly useful in a moment.

Mingw-w64 has a gendef tool to produce the above output, and this tool is now included in w64devkit. To print the exports to standard output:

$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square

Alternatively Visual Studio provides dumpbin. It’s not as concise as exports.sh but it’s a lot less verbose than objdump -p.

$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...

Mingw-w64 (improved)

You can get by without knowing anything more, which is usually enough for those looking to support Windows as a secondary platform, even just as a cross-compilation target. However, with a bit more work we can do better. Imagine doing the above with a non-trivial program. GCC doesn’t know which functions are part of the API and which are not. Obviously static functions should not be exported, but what about non-static functions visible between translation units (i.e. object files)?

For instance, suppose square.c also has this function which is not part of its API but may be called by another translation unit.

void internal_func(void) {}

Now when I build:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square

On the other side, when I build main.c how does it know which functions are imported from a DLL and which will be found in another translation unit? GCC makes it work regardless, but it can generate more efficient code if it knows at compile time (vs. link time).

On Windows both are solved by adding __declspec notation on both sides. In square.c the exports are marked as dllexport:

__declspec(dllexport)
long square(long x)
{
    return x * x;
}

void internal_func(void) {}

In the header, it’s marked as an import:

__declspec(dllimport)
long square(long);

The mere presence of dllexport tells the linker to only export those functions marked as exports, and so internal_func disappears from the exports list. Convenient!

On the import side, during compilation of the original program, GCC assumed square wasn’t an import and generated a local function call. When the linker later resolved the symbol to the DLL, it generated a trampoline to fill in as that local function (like a PLT). With dllimport, GCC knows it’s an imported function and so doesn’t go through a trampoline.

While generally unnecessary for the GNU toolchain, it’s good hygiene to use __declspec. It’s also mandatory when using MSVC, in case you care about that as well.

MSVC

Mingw-w64-compiled DLLs will work with LoadLibrary out of the box, which is sufficient in many cases, such as for dynamically-loaded plugins. For example (loadlib.c):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE h = LoadLibrary("square.dll");
    long (*square)(long) = GetProcAddress(h, "square");
    printf("%ld\n", square(2));
}

Compiled with MSVC cl (via vcvars.bat):

$ cl /nologo loadlib.c
$ ./loadlib
4

However, the MSVC linker, unlike Binutils ld, cannot link directly with DLLs. It requires an import library. Conventionally this matches the DLL name but has a .lib extension — square.lib in this case. The Mingw-w64 ecosystem conventionally uses .dll.a, as in square.dll.a, in order to distinguish it from a static library, but it’s the same format. The most convenient way to get an import library is to ask GCC to generate one at link-time via --out-implib:

$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c

Back to cl, just add square.lib as another input. You don’t actually need square.dll present at link time.

$ cl /nologo /Os main.c square.lib
$ ./main
4

What if you already have the DLL and you just need an import library? GNU Binutils’ dlltool can do this, though not without help. It cannot generate an import library from a DLL alone since it requires a .def file enumerating the exports. (Why?) What luck that we have a tool for this!

$ ./exports.sh square.dll >square.def
$ dlltool --input-def square.def --output-lib square.lib

Reversing directions

Going the other way, building a DLL with MSVC and linking it with Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it requires that all exports are tagged with dllexport. The /LD switch (case sensitive) is just like GCC’s -shared.

$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4

cl outputs three files: square.dll, square.lib, and square.exp. The last can be discarded, and the second will be needed if linking with MSVC, but as before, Mingw-w64 requires only the first.

This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at least for C interfaces that don’t share CRT objects.

Tying it all together

If your program is designed to be portable, those __declspec will get in the way. That can be tidied up with some macros, but even better, those macros can be used to control ELF symbol visibility so that the library has good hygiene on, say, Linux as well.

The strategy will be to mark all API functions with SQUARE_API and expand that to whatever is necessary at the time. When building a library, it will expand to dllexport, or default visibility on unix-likes. When consuming a library it will expand to dllimport, or nothing outside of Windows. The new square.h:

#ifndef SQUARE_H
#define SQUARE_H

#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif

SQUARE_API
long square(long);

#endif

The new square.c:

#define SQUARE_BUILD
#include "square.h"

SQUARE_API
long square(long x)
{
    return x * x;
}

main.c remains the same. When compiling on unix-like systems, add the -fvisibility=hidden to hide all symbols by default so that this macro can reveal them.

$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4

Makefile ideas

While Mingw-w64 hides a lot of the differences between Windows and unix-like systems, when it comes to dynamic libraries it can only do so much, especially if you care about import libraries. If I were maintaining a dynamic library — unlikely since I strongly prefer embedding or static linking — I’d probably just use different Makefiles per toolchain and target. Aside from the SQUARE_API type of macros, the source code can fortunately remain fairly agnostic about it.

Here’s what I might use as NMakefile for MSVC nmake:

CC     = cl /nologo
CFLAGS = /Os

all: main.exe square.dll square.lib

main.exe: main.c square.h square.lib
	$(CC) $(CFLAGS) main.c square.lib

square.dll: square.c square.h
	$(CC) /LD $(CFLAGS) square.c

square.lib: square.dll

clean:
	-del /f main.exe square.dll square.lib square.exp

Usage:

nmake /nologo /f NMakefile

For w64devkit and cross-compiling, Makefile.w64, which includes import library generation for the sake of MSVC consumers:

CC      = cc
CFLAGS  = -Os
LDFLAGS = -s
LDLIBS  =

all: main.exe square.dll square.lib

main.exe: main.c square.dll square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c square.dll $(LDLIBS)

square.dll: square.c square.h
	$(CC) -shared -Wl,--out-implib,$(@:dll=lib) \
	    $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

square.lib: square.dll

clean:
	rm -f main.exe square.dll square.lib

Usage:

make -f Makefile.w64

And a Makefile for everyone else:

CC      = cc
CFLAGS  = -Os -fvisibility=hidden
LDFLAGS = -s
LDLIBS  =

all: main libsquare.so

main: main.c libsquare.so square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c ./libsquare.so $(LDLIBS)

libsquare.so: square.c square.h
	$(CC) -shared $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

clean:
	rm -f main libsquare.so

Now that I have this article, I’m glad I won’t have to figure this all out again next time I need it!

Asynchronously Opening and Closing Files in Asyncio

2020-09-04T01:36:20Z

Python asyncio has support for asynchronous networking, subprocesses, and interprocess communication. However, it has nothing for asynchronous file operations — opening, reading, writing, or closing. This is likely in part because operating systems themselves also lack these facilities. If a file operation takes a long time, perhaps because the file is on a network mount, then the entire Python process will hang. It’s possible to work around this, so let’s build a utility that can asynchronously open and close files.

The usual way to work around the lack of operating system support for a particular asynchronous operation is to dedicate threads to waiting on those operations. By using a thread pool, we can even avoid the overhead of spawning threads when we need them. Plus asyncio is designed to play nicely with thread pools anyway.

Test setup

Before we get started, we’ll need some way to test that it’s working. We need a slow file system. One thought is to use ptrace to intercept the relevant system calls, though this isn’t quite so simple. The other threads need to continue running while the thread waiting on open(2) is paused, but ptrace pauses the whole process. Fortunately there’s a simpler solution anyway: LD_PRELOAD.

Setting the LD_PRELOAD environment variable to the name of a shared object will cause the loader to load this shared object ahead of everything else, allowing that shared object to override other libraries. I’m on x86-64 Linux (Debian), and so I’m looking to override open64(2) in glibc. Here’s my open64.c:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <unistd.h>

int
open64(const char *path, int flags, int mode)
{
    if (!strncmp(path, "/tmp/", 5)) {
        sleep(3);
    }
    int (*f)(const char *, int, int) = dlsym(RTLD_NEXT, "open64");
    return f(path, flags, mode);
}

Now Python must go through my C function when it opens files. If the file resides anywhere under /tmp/, opening the file will be delayed by 3 seconds. Since I still want to actually open a file, I use dlsym() to access the real open64() in glibc. I build it like so:

$ cc -shared -fPIC -o open64.so open64.c -ldl

And to test that it works with Python, let’s time how long it takes to open /tmp/x:

$ touch /tmp/x
$ time LD_PRELOAD=./open64.so python3 -c 'open("/tmp/x")'

real    0m3.021s
user    0m0.014s
sys     0m0.005s

Perfect! (Note: It’s a little strange putting time before setting the environment variable, but that’s because I’m using Bash, where time is special since it’s the shell’s version of the command.)

Thread pools

Python’s standard open() is most commonly used as a context manager so that the file is automatically closed no matter what happens.

with open('output.txt', 'w') as out:
    print('hello world', file=out)

I’d like my asynchronous open to follow this pattern using async with. It’s like with, but the context manager is acquired and released asynchronously. I’ll call my version aopen():

async with aopen('output.txt', 'w') as out:
    ...

So aopen() will need to return an asynchronous context manager, an object with methods __aenter__ and __aexit__ that both return awaitables. Usually this is by virtue of these methods being coroutine functions, but a normal function that directly returns an awaitable also works, which is what I’ll be doing for __aenter__.

class _AsyncOpen():
    def __init__(self, args, kwargs):
        ...

    def __aenter__(self):
        ...

    async def __aexit__(self, exc_type, exc, tb):
        ...

Ultimately we have to call open(). The arguments for open() will be given to the constructor to be used later. This will make more sense when you see the definition for aopen().

    def __init__(self, args, kwargs):
        self._args = args
        self._kwargs = kwargs

When it’s time to actually open the file, Python will call __aenter__. We can’t call open() directly since that will block, so we’ll use a thread pool to wait on it. Rather than create a thread pool, we’ll use the one that comes with the current event loop. The run_in_executor() method runs a function in a thread pool — where None means use the default pool — returning an asyncio future representing the future result, in this case the opened file object.

    def __aenter__(self):
        def thread_open():
            return open(*self._args, **self._kwargs)
        loop = asyncio.get_event_loop()
        self._future = loop.run_in_executor(None, thread_open)
        return self._future

Since this __aenter__ is not a coroutine function, it returns the future directly as its awaitable result. The caller will await it.

The default thread pool has a limited number of threads derived from the core count (since Python 3.8, min(32, os.cpu_count() + 4)), which I suppose is an obvious choice, though not ideal here. That’s fine for CPU-bound operations but not for I/O-bound operations. In a real program we may want to use a larger thread pool.

Closing a file may block, so we’ll do that in a thread pool as well. First pull the file object from the future, then close it in the thread pool, waiting until the file has actually closed:

    async def __aexit__(self, exc_type, exc, tb):
        file = await self._future
        def thread_close():
            file.close()
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(None, thread_close)

The open and close are paired in this context manager, but it may be concurrent with an arbitrary number of other _AsyncOpen context managers. There will be some upper limit to the number of open files, so we need to be careful not to use too many of these things concurrently, something which easily happens when using unbounded queues. Lacking back pressure, all it takes is for tasks to be opening files slightly faster than they close them.

With all the hard work done, the definition for aopen() is trivial:

def aopen(*args, **kwargs):
    return _AsyncOpen(args, kwargs)

That’s it! Let’s try it out with the LD_PRELOAD test.

A test drive

First define a “heartbeat” task that will tell us the asyncio loop is still chugging away while we wait on opening the file.

async def heartbeat():
    while True:
        await asyncio.sleep(0.5)
        print('HEARTBEAT')

Here’s a test function for aopen() that asynchronously opens a file under /tmp/ named by an integer, (synchronously) writes that integer to the file, then asynchronously closes it.

async def write(i):
    async with aopen(f'/tmp/{i}', 'w') as out:
        print(i, file=out)

The main() function creates the heartbeat task and opens 4 files concurrently through the intercepted file opening routine:

async def main():
    beat = asyncio.create_task(heartbeat())
    tasks = [asyncio.create_task(write(i)) for i in range(4)]
    await asyncio.gather(*tasks)
    beat.cancel()

asyncio.run(main())

The result:

$ LD_PRELOAD=./open64.so python3 aopen.py
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
$ cat /tmp/{0,1,2,3}
0
1
2
3

As expected, there are 6 heartbeats corresponding to the 3 seconds that all 4 tasks spent concurrently waiting on the intercepted open(). Here’s the full source if you want to try it out for yourself:

https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd

Caveat: no asynchronous reads and writes

Only opening and closing the file is asynchronous. Reads and writes are unchanged, still fully synchronous and blocking, so this is only a half solution. A full solution is not nearly as simple because asyncio is async/await. Asynchronous reads and writes would require all new APIs with different coloring. You’d need an aprint() to complement print(), and so on, each returning an awaitable to be awaited.

This is one of the unfortunate downsides of async/await. I strongly prefer conventional, preemptive concurrency, but we don’t always have that luxury.

Purgeable Memory Allocations for Linux

2019-12-29T00:25:49Z

I saw (part of) a video, OS hacking: Purgeable memory, by Andreas Kling, who’s writing an operating system called Serenity and recording videos of his progress. In the video he implements purgeable memory as found on some Apple platforms by adding special support in the kernel. A process tells the kernel that a particular range of memory isn’t important, and so the kernel can reclaim it if the system is under memory pressure. In other words, the memory is purgeable.

Linux has a mechanism like this, madvise(2), that allows processes to provide hints to the kernel on how memory is expected to be used. The flag of interest is MADV_FREE:

The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page.

So, given this, I built a proof of concept / toy on top of MADV_FREE that provides this functionality for Linux:

https://github.com/skeeto/purgeable

It allocates anonymous pages using mmap(2). When the allocation is “unlocked” — i.e. the process isn’t actively using it — its pages are marked with MADV_FREE so that the kernel can reclaim them at any time. To lock the allocation so that the process can safely make use of them, the MADV_FREE is canceled. This is all a little trickier than it sounds, and that’s the subject of this article.

Note: There’s also MADV_DONTNEED which seems like it would fit the bill, but it’s implemented incorrectly in Linux. It immediately frees the pages, and so it’s useless for implementing purgeable memory.

Purgeable API

Before diving into the implementation, here’s the API. It’s just four functions with no structure definitions. The pointer used by the API is the memory allocation itself. All the bookkeeping associated with that pointer is hidden away, out of sight from the API’s consumer. The full documentation is in purgeable.h.

void *purgeable_alloc(size_t);
void  purgeable_unlock(void *);
void *purgeable_lock(void *);
void  purgeable_free(void *);

The semantics are much like a C++ weak_ptr in that locking both validates that the allocation is still available and creates a “strong” reference to it that prevents it from being purged. Though unlike a weak reference, the allocation is stickier. It will remain until the system is actually under pressure, not just when the garbage collector happens to run or the last strong reference is gone.

Here’s how it might be used to, say, store decoded PNG data that can be decompressed again if needed:

uint32_t *texture = 0;
struct png *png = png_load("texture.png");
if (!png) die();

/* ... */

for (;;) {
    if (!texture) {
        texture = purgeable_alloc(png->width * png->height * 4);
        if (!texture) die();
        png_decode_rgba(png, texture);
    } else if (!purgeable_lock(texture)) {
        purgeable_free(texture);
        texture = 0;
        continue;
    }
    glTexImage2D(
        GL_TEXTURE_2D, 0,
        GL_RGBA, png->width, png->height, 0,
        GL_RGBA, GL_UNSIGNED_BYTE, texture
    );
    purgeable_unlock(texture);
    break;
}

Memory is allocated in a locked state since it’s very likely to be immediately filled with data. The application should unlock it before moving on with other tasks. The purgeable memory must always be freed using purgeable_free(), even if purgeable_lock() failed. This not only frees the bookkeeping, but also releases the now-zero pages and the mapping itself. Originally I had purgeable_lock() free the purgeable memory on failure, but I felt this was clearer. There’s no technical reason it couldn’t, though.

Purgeable Implementation

The main challenge is that the kernel doesn’t necessarily treat the MADV_FREE range contiguously. It might reclaim just some pages, and do so in an arbitrary order. In order to lock the region, each page must be handled individually. Per the man page quoted above, reversing MADV_FREE requires a write to each page — to either trigger a page fault or set a dirty bit.

The only way to tell if a page has been purged is to check if it’s been filled with zeros. That’s easy if we’re sure a particular byte in the page should be zero, but, since this is a library, the caller might just store anything on these pages.

So here’s my solution: To unlock a page, look at the first byte on the page. Remember whether or not it’s zero. If it’s zero, write a 1 into that byte. Once this has been done for all pages, use madvise(2) to mark them all MADV_FREE.

With this approach, the library only needs to track one bit of information per page regardless of the page’s contents. Assuming 4kB pages, each 32kB of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead. Not too bad!
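The unlock step can be sketched as follows. This is a minimal illustration, not the library's actual code: the names sketch_mark_pages and sketch_unlock are hypothetical, the bit table layout (bit i of the table belongs to page i) is an assumption, and MADV_FREE requires a reasonably recent Linux kernel and C library.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for madvise(2) and MADV_FREE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: record one bit per page (1 = the first byte was
 * zero) and make sure every first byte is non-zero afterwards. */
void
sketch_mark_pages(unsigned char *buf, size_t numpages, size_t pagesize,
                  unsigned char *table)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = buf + i*pagesize;
        unsigned char bit = (unsigned char)(1u << (i%8));
        if (*first) {
            table[i/8] &= (unsigned char)~bit;  /* was non-zero: keep it */
        } else {
            table[i/8] |= bit;                  /* was zero: remember that */
            *first = 1;                         /* and plant a marker */
        }
    }
}

/* Hypothetical helper: mark the pages, then tell the kernel it may
 * reclaim them. buf must be page-aligned. Returns 0 on success, -1 on
 * failure, like madvise(2). */
int
sketch_unlock(unsigned char *buf, size_t numpages, size_t pagesize,
              unsigned char *table)
{
    sketch_mark_pages(buf, numpages, pagesize, table);
    return madvise(buf, numpages*pagesize, MADV_FREE);
}
```

After this runs, a zero in the first byte of any page can only mean the kernel purged that page.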

Locking purgeable memory is a little trickier. Again, each page must be visited in turn, and if any page was purged, then the whole allocation is considered lost. If the first byte was non-zero when unlocking, the library checks that it’s still non-zero. If the first byte was zero when unlocking, then it prepares to write a zero back into that byte, which must currently be non-zero.

In either case, the MADV_FREE needs to be canceled using a write, so the library does an atomic compare-and-swap (CAS) to write the correct byte into the page, even if it’s the same value in the non-zero case. The atomic CAS is essential because it ensures the page wasn’t purged between the check and the write, as both are done together, atomically. If every page has the expected first byte, and every CAS succeeded, then the purgeable memory has been successfully locked.
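Here is a rough sketch of that locking loop. It is not the library's code: sketch_lock is a hypothetical name, the table layout (bit i set means the first byte of page i was zero at unlock time) is an assumption, and the CAS uses the GCC/Clang __atomic_compare_exchange_n builtin rather than whatever the library actually uses.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every page survived and is now
 * locked again, 0 if any page was purged. */
int
sketch_lock(unsigned char *buf, size_t numpages, size_t pagesize,
            const unsigned char *table)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = buf + i*pagesize;
        int waszero = (table[i/8] >> (i%8)) & 1;
        unsigned char expected, desired;

        if (waszero) {
            expected = 1;       /* the marker planted when unlocking */
            desired = 0;        /* restore the caller's original zero */
        } else {
            expected = *first;  /* must still be non-zero */
            if (!expected)
                return 0;       /* page was purged */
            desired = expected; /* rewrite the same value */
        }

        /* The write cancels MADV_FREE for this page. Doing the check
         * and the write as one atomic step means a purge between them
         * makes the CAS fail instead of going unnoticed. */
        if (!__atomic_compare_exchange_n(first, &expected, desired, 0,
                                         __ATOMIC_SEQ_CST,
                                         __ATOMIC_SEQ_CST))
            return 0;
    }
    return 1;
}
```

On failure the caller still owns the mapping and must release it, which matches the rule that purgeable_free() is always required.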

As an optimization, the library could consider more than just the first byte, and look at, say, the first long int on each page. The library does less work when the page contains a non-zero value, and the chance of an arbitrary 8-byte value being zero is much lower. However, I wanted to avoid potential aliasing issues, especially if this library were to be embedded, so I passed on the idea.

Bookkeeping

The bookkeeping data is stored just before the buffer returned as the purgeable memory, and it’s never marked with MADV_FREE. Assuming 4kB pages, for each 128MB of purgeable memory the library allocates one extra anonymous page to track it. The number of pages in the allocation is stored just before the purgeable memory as a size_t, and the rest is the per-page bit table described above.

size_t *p = purgeable_alloc(1<<14);
size_t numpages = p[-1];

So the library can immediately find it starting from the purgeable memory address. Here’s an illustration:

      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
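To make the layout concrete, here is a minimal sketch of an allocator that produces it. The names sketch_alloc and sketch_table are hypothetical, the exact position of the bit table (ending right where numpages begins) is my assumption from the illustration, and overflow checks on the size arithmetic are omitted for brevity.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical allocator: header pages first, purgeable pages after.
 * The header holds the bit table followed by numpages, so numpages
 * sits immediately before the returned pointer. */
void *
sketch_alloc(size_t len)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t numpages = (len + pagesize - 1) / pagesize;
    size_t tablelen = (numpages + 7) / 8;        /* one bit per page */
    size_t headlen  = sizeof(size_t) + tablelen;
    headlen = (headlen + pagesize - 1) / pagesize * pagesize;

    int prot  = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *m = mmap(0, headlen + numpages*pagesize, prot, flags, -1, 0);
    if (m == MAP_FAILED)
        return 0;

    unsigned char *p = (unsigned char *)m + headlen;
    ((size_t *)p)[-1] = numpages;
    return p;
}

/* Hypothetical accessor: find the bit table from the user pointer. */
unsigned char *
sketch_table(void *p)
{
    size_t numpages = ((size_t *)p)[-1];
    return (unsigned char *)p - sizeof(size_t) - (numpages + 7)/8;
}
```

Only the pages starting at the returned pointer would ever be passed to madvise(2). The header pages stay resident.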

The downside is that buffer underflows in the application would easily trample the numpages value because it’s located immediately adjacent. It would be safer to move it to the beginning of the first page before the purgeable memory, but this would have made bit table access more complicated. While the region is locked, the contents of the bit table don’t matter, so it won’t be damaged by an underflow. Another idea: put a checksum alongside numpages. It could just be a simple integer hash.
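The checksum idea might look something like this. It is only a sketch of the suggestion, not something the library does: the names are hypothetical, the check word is assumed to live in the size_t slot just before numpages, and the mixing constant is an arbitrary choice.

```c
#include <stddef.h>

/* Arbitrary integer hash, for illustration only. */
static size_t
sketch_hash(size_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x;
}

/* Store numpages at p[-1] and its hash at p[-2]. */
void
sketch_set_numpages(size_t *p, size_t numpages)
{
    p[-1] = numpages;
    p[-2] = sketch_hash(numpages);
}

/* Returns 1 if the header looks intact, 0 if it was trampled. */
int
sketch_check_numpages(const size_t *p)
{
    return p[-2] == sketch_hash(p[-1]);
}
```

A mismatch would let the library refuse to trust numpages rather than walk off into unmapped memory.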

This makes for a really slick API since the consumer doesn’t need to track anything more than a single pointer, the address of the purgeable memory allocation itself.

Worth using?

I’m not quite sure how often I’d actually use purgeable memory in real programs, especially in software intended to be portable. Each operating system needs its own implementation, and this library is not portable since it relies on interfaces and behaviors specific to Linux.

It also has a not-so-unlikely pathological case: Imagine a program that makes two purgeable memory allocations, and they’re large enough that one always evicts the other. The program would thrash back and forth fighting itself as it used each allocation. Detecting this situation might be difficult, especially as the number of purgeable memory allocations increases.

Regardless, it’s another tool for my software toolbelt.

A Survey of $RANDOM

2018-12-25T00:05:38Z

Most Bourne shell clones support a special RANDOM environment variable that evaluates to a random value between 0 and 32,767 (i.e. 15 bits). Assignment to the variable seeds the generator. This variable is an extension and did not appear in the original Unix Bourne shell. Despite this, the different Bourne-like shells that implement it have converged to the same interface, but only the interface. Each implementation differs in interesting ways. In this article we’ll explore how $RANDOM is implemented in various Bourne-like shells.

~~Unfortunately I was unable to determine the origin of $RANDOM.~~ Nobody was doing a good job tracking source code changes before the mid-1990s, so that history appears to be lost. Bash was first released in 1989, but the earliest version I could find was 1.14.7, released in 1996. KornShell was first released in 1983, but the earliest source I could find was from 1993. In both cases $RANDOM already existed. My guess is that it first appeared in one of these two shells, probably KornShell.

Update: Quentin Barnes has informed me that his 1986 copy of KornShell (a.k.a. ksh86) implements $RANDOM. This predates Bash and makes it likely that this feature originated in KornShell.

Bash

Of all the shells I’m going to discuss, Bash has the most interesting history. It never made use of srand(3) / rand(3) and instead uses its own generator, which is generally what I prefer. Prior to Bash 4.0, it used the crummy linear congruential generator (LCG) found in the C89 standard:

static unsigned long rseed = 1;

static int
brand ()
{
  rseed = rseed * 1103515245 + 12345;
  return ((unsigned int)((rseed >> 16) & 32767));
}

For some reason it was naïvely decided that $RANDOM should never produce the same value twice in a row. The caller of brand() filters the output and discards repeats before returning to the shell script. This actually reduces the quality of the generator further since it increases correlation between separate outputs.

When the shell starts up, rseed is seeded from the PID and the current time in seconds. These values are literally summed and used as the seed.

/* Note: not the literal code, but equivalent. */
rseed = getpid() + time(0);

Subshells, which fork and initially share an rseed, are given similar treatment:

rseed = rseed + getpid() + time(0);

Notice there’s no hashing or mixing of these values, so there’s no avalanche effect. That would have prevented shells that start around the same time from having related initial random sequences.

With Bash 4.0, released in 2009, the algorithm was changed to a Park–Miller multiplicative LCG from 1988:

static int
brand ()
{
  long h, l;

  /* can't seed with 0. */
  if (rseed == 0)
    rseed = 123459876;
  h = rseed / 127773;
  l = rseed % 127773;
  rseed = 16807 * l - 2836 * h;
  return ((unsigned int)(rseed & 32767));
}

There’s actually a subtle mistake in this implementation compared to the generator described in the paper. This function will generate different numbers than the paper, and it will generate different numbers on different hosts! More on that later.

This algorithm is a much better choice than the previous LCG. There were many more options available in 2009 compared to 1989, but, honestly, this generator is pretty reasonable for this application. Bash is so slow that you’re never practically going to generate enough numbers for the small state to matter. Since the Park–Miller algorithm is older than Bash, they could have used this in the first place.

I considered submitting a patch to switch to something more modern. However, given Bash’s constraints, it’s easier said than done. Portability to weird systems is still a concern, and I expect they’d reject a patch that started making use of long long in the PRNG. They still support pre-ANSI C compilers that don’t have 64-bit arithmetic.

However, what still really could be improved is seeding. In Bash 4.x here’s what it looks like:

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (tv.tv_sec ^ tv.tv_usec ^ getpid ());
}

Seeding is both better and worse. It’s better that it’s seeded from a higher resolution clock (tv_usec counts microseconds), so two shells started close in time have more variation. However, it’s “mixed” with XOR, which, in this case, is worse than addition.

For example, imagine two Bash shells started back to back such that both tv_usec and getpid() are incremented by one. Those increments are likely to cancel each other out by an XOR, and they end up with the same seed.
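A small sketch makes the cancellation visible. The function names and the sample values are hypothetical, chosen so that both tv_usec and the PID are even before the increment:

```c
/* XOR mixing, in the style of the Bash 4.x seeding. */
unsigned long
seed_xor(unsigned long sec, unsigned long usec, unsigned long pid)
{
    return sec ^ usec ^ pid;
}

/* Plain addition, in the style of the older seeding. */
unsigned long
seed_add(unsigned long sec, unsigned long usec, unsigned long pid)
{
    return sec + usec + pid;
}
```

With sec fixed, seed_xor(sec, 500000, 4000) and seed_xor(sec, 500001, 4001) are identical because each increment only flips the lowest bit, and the two flips cancel. The corresponding seed_add() results differ by two.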

Instead, each of those quantities should be hashed before mixing. Here’s a rough example using my triple32() hash (adapted to glorious GNU-style pre-ANSI C):

static unsigned long
hash32 (x)
     unsigned long x;
{
  x ^= x >> 17;
  x *= 0xed5ad4bbUL;
  x &= 0xffffffffUL;
  x ^= x >> 11;
  x *= 0xac4c1b51UL;
  x &= 0xffffffffUL;
  x ^= x >> 15;
  x *= 0x31848babUL;
  x &= 0xffffffffUL;
  x ^= x >> 14;
  return x;
}

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (hash32 (tv.tv_sec) ^
          hash32 (hash32 (tv.tv_usec) ^ getpid ()));
}

I had said there’s a mistake in the Bash implementation of Park–Miller. Take a closer look at the types and the assignment to rseed:

  /* The variables */
  long h, l;
  unsigned long rseed;

  /* The assignment */
  rseed = 16807 * l - 2836 * h;

The result of the subtraction can be negative, and that negative value is converted to unsigned long. The C standard says ULONG_MAX + 1 is added to make the value positive. ULONG_MAX varies by platform (typically long is either 32 bits or 64 bits), so the results also vary. Here’s how the paper defined it:

  long test;

  test = 16807 * l - 2836 * h;
  if (test > 0)
    rseed = test;
  else
    rseed = test + 2147483647;

As far as I can tell, this mistake doesn’t hurt the quality of the generator.

$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634

$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
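The divergence can be reproduced on a single machine by simulating both widths with fixed-width integer types. This is a sketch rather than Bash's source: brand32 and brand64 are hypothetical names, and the seed is passed by pointer instead of living in a global.

```c
#include <stdint.h>

/* Bash 4.x brand() as it behaves where long is 32 bits. */
unsigned
brand32(uint32_t *rseed)
{
    int32_t h, l;
    if (*rseed == 0)
        *rseed = 123459876;
    h = (int32_t)(*rseed / 127773);
    l = (int32_t)(*rseed % 127773);
    /* A negative result wraps modulo 2^32. */
    *rseed = (uint32_t)(16807 * l - 2836 * h);
    return (unsigned)(*rseed & 32767);
}

/* The same function where long is 64 bits. */
unsigned
brand64(uint64_t *rseed)
{
    int64_t h, l;
    if (*rseed == 0)
        *rseed = 123459876;
    h = (int64_t)(*rseed / 127773);
    l = (int64_t)(*rseed % 127773);
    /* A negative result wraps modulo 2^64 instead. */
    *rseed = (uint64_t)(16807 * l - 2836 * h);
    return (unsigned)(*rseed & 32767);
}
```

Seeding both with 127773 makes the very first subtraction negative (h is 1 and l is 0), so the two states diverge immediately. The first outputs still agree because only the low 15 bits are returned, but the second outputs are 13634 and 29115 respectively, matching the shell sessions above.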

Zsh

In contrast to Bash, Zsh is the most straightforward: defer to rand(3). Its $RANDOM can return the same value twice in a row, assuming that rand(3) does.

zlong
randomgetfn(UNUSED(Param pm))
{
    return rand() & 0x7fff;
}

void
randomsetfn(UNUSED(Param pm), zlong v)
{
    srand((unsigned int)v);
}

A cool side effect is that you could override it with a custom generator if you wanted.

int
rand(void)
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

Usage:

$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4

This trick also applies to the rest of the shells below.

KornShell (ksh)

KornShell originated in 1983, but it was finally released under an open source license in 2005. There’s a clone of KornShell called Public Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but I’ll get to that next.

KornShell defers to rand(3), but it does some additional naïve filtering on the output. When the shell starts up, it generates 10 values from rand(). If any of them are larger than 32,767 then it will shift all generated numbers right by three.

#define RANDMASK 0x7fff

    for (n = 0; n < 10; n++) {
        // Don't use lower bits when rand() generates large numbers.
        if (rand() > RANDMASK) {
            rand_shift = 3;
            break;
        }
    }

Why not just look at RAND_MAX? I guess they didn’t think of it.

Update: Quentin Barnes pointed out that RAND_MAX didn’t exist until POSIX standardization in 1988. The constant first appeared in Unix in 1990. This KornShell code either predates the standard or needed to work on systems that predate the standard.
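On a system that does define RAND_MAX, the probing loop could be replaced by a direct check. A minimal sketch, where pick_rand_shift is a hypothetical name and not KornShell code:

```c
#include <stdlib.h>

#define RANDMASK 0x7fff

/* Decide the shift from the generator's documented range instead of
 * sampling it ten times. */
int
pick_rand_shift(long rand_max)
{
    return rand_max > RANDMASK ? 3 : 0;
}
```

At startup this would be called as pick_rand_shift(RAND_MAX). Unlike the sampling loop, it cannot guess wrong when ten consecutive outputs happen to be small.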

Like Bash, repeated values are not allowed. I suspect one shell got this idea from the other.

    do {
        cur = (rand() >> rand_shift) & RANDMASK;
    } while (cur == last);

Who came up with this strange idea first?

OpenBSD’s Public Domain Korn Shell (pdksh)

I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I ever touch in practice, and its $RANDOM is the most interesting of the pdksh forks — at least since 2014.

Like Zsh, pdksh simply defers to rand(3). However, OpenBSD’s rand(3) is infamously and proudly non-standard. By default it returns non-deterministic, cryptographic-quality results seeded from system entropy (via the misnamed arc4random(3)), à la /dev/urandom. Its $RANDOM inherits this behavior.

    setint(vp, (int64_t) (rand() & 0x7fff));

However, if a value is assigned to $RANDOM in order to seed it, it reverts to its old pre-2014 deterministic generation via srand_deterministic(3).

    srand_deterministic((unsigned int)intval(vp));

OpenBSD’s deterministic rand(3) is the crummy LCG from the C89 standard, just like Bash 3.x. So if you assign to $RANDOM, you’ll get nearly the same results as Bash 3.x and earlier — the only difference being that it can repeat numbers.

That’s a slick upgrade to the old interface without breaking anything, making it my favorite version of $RANDOM for any shell.

A JIT Compiler Skirmish with SELinux

2018-11-15T18:57:47Z

This is a debugging war story.

Once upon a time I wrote a fancy data conversion utility. The input was a complex binary format defined by a data dictionary supplied at run time by the user alongside the input data. Since the converter was typically used to process massive quantities of input, and the nature of that input wasn’t known until run time, I wrote an x86-64 JIT compiler to speed it up. The converter generated a fast, native binary parser in memory according to the data dictionary specification. Processing data now took much less time and everyone rejoiced.

Then along came SELinux, Sheriff of Pedantry. Not liking all the shenanigans with page protections, SELinux huffed and puffed and made mprotect(2) return EACCES (“Permission denied”). Believing I was following all the rules and so this would never happen, I foolishly did not check the result and the converter was now crashing for its users. What made SELinux so unhappy, and could this somehow be resolved?

Allocating memory

Before going further, let’s back up and review how this works. Suppose I want to generate code at run time and execute it. In the old days this was as simple as writing some machine code into a buffer and jumping to that buffer — e.g. by converting the buffer to a function pointer and calling it.

typedef int (*jit_func)(void);

/* NOTE: This doesn't work anymore! */
jit_func
jit_compile(int retval)
{
    unsigned char *buf = malloc(6);
    if (buf) {
        /* mov eax, retval */
        buf[0] = 0xb8;
        buf[1] = retval >>  0;
        buf[2] = retval >>  8;
        buf[3] = retval >> 16;
        buf[4] = retval >> 24;
        /* ret */
        buf[5] = 0xc3;
    }
    return (jit_func)buf;
}

int
main(void)
{
    jit_func f = jit_compile(1001);
    printf("f() = %d\n", f());
    free(f);
}

This situation was far too easy for malicious actors to abuse. An attacker could supply instructions of their own choosing — i.e. shell code — as input and exploit a buffer overflow vulnerability to execute the input buffer. These exploits were trivial to craft.

Modern systems have hardware checks to prevent this from happening. Memory containing instructions must have their execute protection bit set before those instructions can be executed. This is useful both for making attackers work harder and for catching bugs in programs — no more executing data by accident.

This is further complicated by the fact that memory protections have page granularity. You can’t adjust the protections for a 6-byte buffer. You do it for the entire surrounding page — typically 4kB, but sometimes as large as 2MB. This requires replacing that malloc(3) with a more careful allocation strategy. There are a few ways to go about this.

Anonymous memory mapping

The most common and most sensible is to create an anonymous memory mapping: a file memory map that’s not actually backed by a file. The mmap(2) function has a flag specifically for this purpose: MAP_ANONYMOUS.

#include <sys/mman.h>

void *
anon_alloc(size_t len)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, -1, 0);
    return p != MAP_FAILED ? p : 0;
}

void
anon_free(void *p, size_t len)
{
    munmap(p, len);
}

Unfortunately, MAP_ANONYMOUS is not part of POSIX. If you’re being super strict with your includes, as I tend to be, this flag won’t be defined, even on systems where it’s supported.

#define _POSIX_C_SOURCE 200112L
#include <sys/mman.h>
// MAP_ANONYMOUS undefined!

To get the flag, you must use the _BSD_SOURCE, or, more recently, the _DEFAULT_SOURCE feature test macro to explicitly enable that feature.

#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#include <sys/mman.h>

The POSIX way to do this is to instead map /dev/zero. So, wanting to be Mr. Portable, this is what I did in my tool. Take careful note of this.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        return 0;
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

Aligned allocation

Another, less common (and less portable) strategy is to lean on the existing C memory allocator, being careful to allocate on page boundaries so that the page protections don’t affect other allocations. The classic allocation functions, like malloc(3), don’t allow for this kind of control. However, there are a couple of aligned allocation alternatives.

The first is posix_memalign(3):

int posix_memalign(void **ptr, size_t alignment, size_t size);

By choosing page alignment and a size that’s a multiple of the page size, it’s guaranteed to return whole pages. When done, pages are freed with free(3). Though, unlike unmapping, the original page protections must first be restored since those pages may be reused.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    void *p;
    long pagesize = sysconf(_SC_PAGE_SIZE); // TODO: cache this
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return posix_memalign(&p, pagesize, roundup) ? 0 : p;
}

If you’re using C11, there’s also aligned_alloc(3). This is the most uncommon of all since most C programmers refuse to switch to a new standard until it’s at least old enough to drive a car.
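For completeness, here is a minimal sketch of the same allocator on top of aligned_alloc(3). The name anon_alloc_c11 is hypothetical. Note that C11 requires the size to be a multiple of the alignment, so the rounding step is not optional here.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical C11 variant: returns whole pages, or a null pointer on
 * failure. Release the result with free(3) after restoring the
 * original page protections. */
void *
anon_alloc_c11(size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0)
        return 0;
    size_t ps = (size_t)pagesize;
    size_t roundup = (len + ps - 1) / ps * ps;
    return aligned_alloc(ps, roundup);
}
```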

Changing page protections

So we’ve allocated our memory, but it’s not going to start in an executable state. Why? Because a W^X (“write xor execute”) policy is becoming increasingly common. Attempting to set both write and execute protections at the same time may be denied. (In fact, there’s an SELinux policy for this.)

As a JIT compiler, we need to write to a page and execute it. Again, there are two strategies. The complicated strategy is to map the same memory at two different places, one with the execute protection, one with the write protection. This allows the page to be modified as it’s being executed without violating W^X.

The simpler and more secure strategy is to write the machine instructions, then swap the page over to executable using mprotect(2) once it’s ready. This is what I was doing in my tool.

unsigned char *buf = anon_alloc(len);
/* ... write instructions into the buffer ... */
mprotect(buf, len, PROT_EXEC);
jit_func func = (jit_func)buf;
func();

At a high level, that’s pretty close to what I was actually doing. That includes neglecting to check the result of mprotect(2). This worked fine and dandy for several years, when suddenly (shown here in the style of strace):

mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)

Then the program would crash trying to execute the buffer. Suddenly it wasn’t allowed to make this buffer executable. My program hadn’t changed. What had changed was the SELinux security policy on this particular system.

Asking for help

The problem is that I don’t administer this (Red Hat) system. I can’t access the logs and I didn’t set the policy. I don’t have any insight on why this call was suddenly being denied. To make this more challenging, the folks that manage this system didn’t have the necessary knowledge to help with this either.

So to figure this out, I need to treat it like a black box and probe at system calls until I can figure out just what SELinux policy I’m up against. I only have practical experience administrating Debian systems (and its derivatives like Ubuntu), which means I’ve hardly ever had to deal with SELinux. I’m flying fairly blind here.

Since my real application is large and complicated, I code up a minimal example, around a dozen lines of code: allocate a single page of memory, write a single return (ret) instruction into it, set it as executable, and call it. The program checks for errors, and I can run it under strace if that’s not insightful enough. This program is also something simple I could provide to the system administrators, since they were willing to turn some of the knobs to help narrow down the problem.

However, here’s where I made a major mistake. Assuming the problem was solely in mprotect(2), and wanting to keep this as absolutely simple as possible, I used posix_memalign(3) to allocate that page. I saw the same EACCES as before, and assumed I was demonstrating the same problem. Take note of this, too.

Finding a resolution

Eventually I’d need to figure out what policy was blocking my JIT compiler, then see if there was an alternative route. The system loader still worked after all, and I could plainly see that with strace. So it wasn’t a blanket policy that completely blocked the execute protection. Perhaps the loader was given an exception?

However, the very first order of business was to actually check the result from mprotect(2) and do something more graceful rather than crash. In my case, that meant falling back to executing a byte-code virtual machine. I added the check, and now the program ran slower instead of crashing.
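The check itself is tiny. Here is a sketch of the shape it might take, with make_executable as a hypothetical name and the fallback left to the caller:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical wrapper: returns 1 if buf may now be executed, 0 if the
 * protection change was refused and the caller should use the
 * byte-code interpreter instead. */
int
make_executable(void *buf, size_t len)
{
    if (mprotect(buf, len, PROT_EXEC) == -1) {
        fprintf(stderr, "mprotect: %s\n", strerror(errno));
        return 0;
    }
    return 1;
}
```

The caller only converts the buffer to a function pointer when this returns 1. Under the SELinux policy in this story it would print the EACCES message and return 0.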

The program runs on both Linux and Windows, and the allocation and page protection management is abstracted. On Windows it uses VirtualAlloc() and VirtualProtect() instead of mmap(2) and mprotect(2). Neither implementation checked that the protection change succeeded, so I fixed the Windows implementation while I was at it.

Thanks to Mingw-w64, I actually do most of my Windows development on Linux. And, thanks to Wine, I mean everything, including running and debugging. Calling VirtualProtect() in Wine would ultimately call mprotect(2) in the background, which I expected would be denied. So running the Windows version with Wine under this SELinux policy would be the perfect test. Right?

Except that mprotect(2) succeeded under Wine! The Windows version of my JIT compiler was working just fine on Linux. Huh?

This system doesn’t have Wine installed. I had built and packaged it myself. This Wine build definitely has no SELinux exceptions. Not only did the Wine loader work correctly, it can change page protections in ways my own Linux programs could not. What’s different?

Debugging this with all these layers is starting to look silly, but this is exactly why doing Windows development on Linux is so useful. I run my program under Wine under strace:

$ strace wine ./mytool.exe

I study the system calls around mprotect(2). Perhaps there’s some stricter alignment issue? No. Perhaps I need to include PROT_READ? No. The only difference I can find is they’re using the MAP_ANONYMOUS flag. So, armed with this knowledge, I modify my minimal example to allocate 1024 pages instead of just one, and suddenly it works correctly. I was most of the way to figuring this all out.

Inside glibc allocation

Why did increasing the allocation size change anything? This is a typical Linux system, so my program is linked against the GNU C library, glibc. This library allocates memory from two places depending on the allocation size.

For small allocations, glibc uses brk(2) to extend the executable image — i.e. to extend the .bss section. These resources are not returned to the operating system after they’re freed with free(3). They’re reused.

For large allocations, glibc uses mmap(2) to create a new, anonymous mapping for that allocation. When freed with free(3), that memory is unmapped and its resources are returned to the operating system.

By increasing the allocation size, it became a “large” allocation and was backed by an anonymous mapping. Even though I didn’t use mmap(2), to the operating system this would be indistinguishable from what Wine was doing (and succeeding at).

Consider this little example program:

int
main(void)
{
    printf("%p\n", malloc(1));
    printf("%p\n", malloc(1024 * 1024));
}

When not compiled as a Position Independent Executable (PIE), here’s what the output looks like. The first pointer is near where the program was loaded, low in memory. The second pointer is a randomly selected address high in memory.

0x1077010
0x7fa9b998e010

And if you run it under strace, you’ll see that the first allocation comes from brk(2) and the second comes from mmap(2).

Two SELinux policies

With a little bit of research, I found the two SELinux policies at play here. In my minimal example, I was blocked by allow_execheap.

/selinux/booleans/allow_execheap

This prohibits programs from setting the execute protection on any “heap” page.

The POSIX specification does not permit it, but the Linux implementation of mprotect allows changing the access protection of memory on the heap (e.g., allocated using malloc). This error indicates that heap memory was supposed to be made executable. Doing this is really a bad idea. If anonymous, executable memory is needed it should be allocated using mmap which is the only portable mechanism.

Obviously this is pretty loose since I was still able to do it with posix_memalign(3), which, technically speaking, allocates from the heap. So this policy applies to pages mapped by brk(2).

The second policy was allow_execmod.

/selinux/booleans/allow_execmod

The program mapped from a file with mmap and the MAP_PRIVATE flag and write permission. Then the memory region has been written to, resulting in copy-on-write (COW) of the affected page(s). This memory region is then made executable […]. The mprotect call will fail with EACCES in this case.

I don’t understand what purpose this policy serves, but this is what was causing my original problem. Pages mapped to /dev/zero are not actually considered anonymous by Linux, at least as far as this policy is concerned. I think this is a mistake, and that mapping the special /dev/zero device should result in effectively anonymous pages.

From this I learned a little lesson about baking assumptions — that mprotect(2) was solely at fault — into my minimal debugging examples. And the fix was ultimately easy: I just had to suck it up and use the slightly less pure MAP_ANONYMOUS flag.

Brute Force Incognito Browsing

2018-09-06T14:07:13Z

Both Firefox and Chrome have a feature for creating temporary private browsing sessions. Firefox calls it Private Browsing and Chrome calls it Incognito Mode. Both work essentially the same way. A temporary browsing session is started without carrying over most existing session state (cookies, etc.), and no state (cookies, browsing history, cached data, etc.) is preserved after ending the session. Depending on the configuration, some browser extensions will be enabled in the private session, and their own internal state may be preserved.

The most obvious use is for visiting websites that you don’t want listed in your browsing history. Another use for more savvy users is to visit websites with a fresh, empty cookie file. For example, some news websites use a cookie to track the number of visits and require a subscription after a certain number of “free” articles. Manually deleting cookies is a pain (especially without a specialized extension), but opening the same article in a private session is two clicks away.

For web development there’s yet another use. A private session is a way to view your website from the perspective of a first-time visitor. You’ll be logged out and will have little or no existing state.

However, sometimes it just doesn’t go far enough. Some of those news websites have adapted, and in addition to counting the number of visits, they’ve figured out how to detect private sessions and block them. I haven’t looked into how they do this — maybe something to do with local storage, or detecting previously cached content. Sometimes I want a private session that’s truly fully isolated. The existing private session features just aren’t isolated enough or they behave differently, which is how they’re being detected.

Some time ago I put together a couple of scripts to brute force my own private sessions when I need them, generally for testing websites in a guaranteed fresh, fully-functioning instance. It also lets me run multiple such sessions in parallel. My scripts don’t rely on any private session feature of the browser, so the behavior is identical to a real browser, making it undetectable.

The downside is that, for better or worse, no browser extensions are carried over. In some ways this can be considered a feature, but a lot of the time I would like my ad-blocker to carry over. Your ad-blocker is probably the most important security software on your computer, so you should hesitate to give it up.

Another downside is that both Firefox and Chrome have some irritating first-time behaviors that can’t be disabled. The intent is to be newbie-friendly but it just gets in my way. For example, both bug me about logging into their browser platforms. Firefox starts with two tabs. Chrome creates a popup to ask me to configure a printer. Both start with a junk URL in the location bar so I can’t just middle-click paste (i.e. the X11 selection clipboard) into it. It’s definitely not designed for my use case.

Firefox

Here’s my brute force private session script for Firefox:

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/firefox-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
firefox -profile "$TEMP" -no-remote "$@"

It creates a temporary directory under $XDG_CACHE_HOME and tells Firefox to use the profile in that directory. No such profile exists, of course, so Firefox creates a fresh profile.

In theory I could just create a new profile alongside the default within my existing ~/.mozilla directory. However, I’ve never liked Firefox’s profile feature, especially with the intentionally unpredictable way it stores the profile itself: behind a random path. I also don’t trust it to be fully isolated and to fully clean up when I’m done.

Before starting Firefox, I register a trap with the shell to clean up the profile directory regardless of what happens. It doesn’t matter if Firefox exits cleanly, if it crashes, or if I CTRL-C it to death.

The -no-remote option prevents the new Firefox instance from joining onto an existing Firefox instance, which it really prefers to do even though it’s technically supposed to be a different profile.

Note the "$@", which passes arguments through to Firefox — most often the URL of the site I want to test.

Chromium

I don’t actually use Chrome but rather the open source version, Chromium. I think this script will also work with Chrome.

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/chromium-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
chromium --user-data-dir="$TEMP" \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

It’s exactly the same as the Firefox script and only the browser arguments have changed. I tell it not to ask about being the default browser, and --no-first-run disables some of the irritating first-time behaviors.

Chromium is very noisy on the command line, so I also redirect all output to /dev/null.

If you’re on Debian like me, its version of Chromium comes with a --temp-profile option that handles the throwaway profile automatically. So the script can be simplified:

#!/bin/sh -e
chromium --temp-profile \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

In my own use case, these scripts have fully replaced the built-in private session features. In fact, since Chromium is not my primary browser, my brute force private session script is how I usually launch it. I only run it to test things, and I always want to test using a fresh profile.

Intercepting and Emulating Linux System Calls with Ptrace

2018-06-23T20:41:08Z

The ptrace(2) (“process trace”) system call is usually associated with debugging. It’s the primary mechanism through which native debuggers monitor debuggees on unix-like systems. It’s also the usual approach for implementing strace — system call trace. With Ptrace, tracers can pause tracees, inspect and set registers and memory, monitor system calls, or even intercept system calls.

By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.

The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.

For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. For the article I’ll also be omitting error checks, but the full source code listings will have them.

You can find runnable code for the examples in this article here:

https://github.com/skeeto/ptrace-examples

strace

Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.

Ptrace has never been standardized. Its interface is similar across different operating systems, especially in its core functionality, but it’s still subtly different from system to system. The ptrace(2) prototype generally looks something like this, though the specific types may be different.

long ptrace(int request, pid_t pid, void *addr, void *data);

The pid is the tracee’s process ID. While a tracee can have only one tracer attached at a time, a tracer can be attached to many tracees.

The request field selects a specific Ptrace function, just like the ioctl(2) interface. For strace, only three are needed:

PTRACE_TRACEME: This process is to be traced by its parent.
PTRACE_SYSCALL: Continue, but stop at the next system call entrance or exit.
PTRACE_GETREGS: Get a copy of the tracee’s registers.

The other two fields, addr and data, serve as generic arguments for the selected Ptrace function. One or both are often ignored, in which case I pass zero.

The strace interface is essentially a prefix to another command.

$ strace [strace options] program [arguments]

My minimal strace doesn’t have any options, so the first thing to do — assuming it has at least one argument — is fork(2) and exec(2) the tracee process on the tail of argv. But before loading the target program, the new process will inform the kernel that it’s going to be traced by its parent. The tracee will be paused by this Ptrace system call.

pid_t pid = fork();
switch (pid) {
    case -1: /* error */
        FATAL("%s", strerror(errno));
    case 0:  /* child */
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execvp(argv[1], argv + 1);
        FATAL("%s", strerror(errno));
}

The parent waits for the child’s PTRACE_TRACEME using wait(2). When wait(2) returns, the child will be paused.

waitpid(pid, 0, 0);

Before allowing the child to continue, we tell the operating system that the tracee should be terminated along with its parent. A real strace implementation may want to set other options, such as PTRACE_O_TRACEFORK.

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);

All that’s left is a simple, endless loop that catches system calls one at a time. The body of the loop has four steps:

Wait for the process to enter the next system call.
Print a representation of the system call.
Allow the system call to execute and wait for the return.
Print the system call return value.

The PTRACE_SYSCALL request is used in both waiting for the next system call to begin, and waiting for that system call to exit. As before, a wait(2) is needed to wait for the tracee to enter the desired state.

ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);

When wait(2) returns, the registers for the thread that made the system call are filled with the system call number and its arguments. However, the operating system has not yet serviced this system call. This detail will be important later.

The next step is to gather the system call information. This is where it gets architecture specific. On x86-64, the system call number is passed in rax, and the arguments (up to 6) are passed in rdi, rsi, rdx, r10, r8, and r9. Reading the registers is another Ptrace call, though there’s no need to wait(2) since the tracee isn’t changing state.

struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;

fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
        syscall,
        (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
        (long)regs.r10, (long)regs.r8,  (long)regs.r9);

There’s one caveat. For internal kernel purposes, the system call number is stored in orig_rax rather than rax. All the other system call arguments are straightforward.

Next it’s another PTRACE_SYSCALL and wait(2), then another PTRACE_GETREGS to fetch the result. The result is stored in rax.

ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);

The output from this simple program is very crude. There is no symbolic name for the system call and every argument is printed numerically, even if it’s a pointer to a buffer. A more complete strace would know which arguments are pointers and use process_vm_readv(2) to read those buffers from the tracee in order to print them appropriately.
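As a rough illustration of the first improvement, symbolic names could come from a lookup table keyed on the SYS_* constants in <sys/syscall.h>. The syscall_name() helper and its tiny table are hypothetical, not part of the example repository, and a real strace would generate a complete table for each architecture.

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical helper: map a handful of system call numbers to names.
 * Returns NULL for numbers that are not in the table. */
static const char *syscall_name(long nr)
{
    static const struct {
        long nr;
        const char *name;
    } table[] = {
        {SYS_read,       "read"},
        {SYS_write,      "write"},
        {SYS_close,      "close"},
        {SYS_brk,        "brk"},
        {SYS_execve,     "execve"},
        {SYS_exit_group, "exit_group"},
    };
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (table[i].nr == nr) {
            return table[i].name;
        }
    }
    return NULL;
}
```

The tracer could print the name when syscall_name(regs.orig_rax) is non-NULL and fall back to the raw number otherwise.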

However, this does lay the groundwork for system call interception.

System call interception

Suppose we want to use Ptrace to implement something like OpenBSD’s pledge(2), in which a process pledges to use only a restricted set of system calls. The idea is that many programs typically have an initialization phase where they need lots of system access (opening files, binding sockets, etc.). After initialization they enter a main loop in which they process input and only a small set of system calls are needed.

Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.

Using the same strace model, rather than print out all system calls, we could either block certain system calls or simply terminate the tracee when it misbehaves. Termination is easy: just call exit(2) in the tracer, since it’s configured to also terminate the tracee. Blocking the system call and allowing the child to continue is a little trickier.

The tricky part is that there’s no way to abort a system call once it’s started. When the tracer returns from wait(2) on the entrance to the system call, the only way to stop a system call from happening is to terminate the tracee.

However, not only can we mess with the system call arguments, we can change the system call number itself, converting it to a system call that doesn’t exist. On return we can report a “friendly” EPERM error in errno via the normal in-band signaling.

for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}
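The in-band signaling works because Linux reports a failed system call as a negated errno value in the return register, reserving the range -4095 to -1 for errors. The C library wrapper translates that into a -1 return with errno set. A minimal sketch of that translation, with decode_syscall_return() as a hypothetical name, shows why writing -EPERM into rax looks like an ordinary failure to the tracee:

```c
#include <errno.h>

/* Hypothetical helper mirroring what a libc system call wrapper does
 * with the raw value the kernel leaves in rax. */
static long decode_syscall_return(long rax)
{
    if (rax < 0 && rax >= -4095) {
        errno = (int)-rax;  /* e.g. -EPERM becomes errno = EPERM */
        return -1;
    }
    return rax;             /* success: pass the value through */
}
```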

This simple example only checks against a whitelist or blacklist of system calls. And there’s no nuance, such as allowing files to be opened (open(2)) read-only but not as writable, allowing anonymous memory maps but not non-anonymous mappings, etc. There’s also no way for the tracee to dynamically drop privileges.

How could the tracee communicate to the tracer? Use an artificial system call!

Creating an artificial system call

For my new pledge-like system call — which I call xpledge() to distinguish it from the real thing — I picked system call number 10000, a nice high number that’s unlikely to ever be used for a real system call.

#define SYS_xpledge 10000

Just for demonstration purposes, I put together a minuscule interface that’s not good for much in practice. It has little in common with OpenBSD’s pledge(2), which uses a string interface. Actually designing robust and secure sets of privileges is really complicated, as the pledge(2) manpage shows. Here’s the entire interface and implementation of the system call for the tracee:

#define _GNU_SOURCE
#include <unistd.h>

#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

#define xpledge(arg) syscall(SYS_xpledge, arg)

If it passes zero for the argument, only a few basic system calls are allowed, including those used to allocate memory (e.g. brk(2)). The XPLEDGE_RDWR bit allows various read and write system calls (read(2), readv(2), pread(2), preadv(2), etc.). The XPLEDGE_OPEN bit allows open(2).

To prevent privileges from being escalated back, xpledge() blocks itself, though this also prevents dropping more privileges later down the line.

In the xpledge tracer, I just need to check for this system call:

/* Handle entrance */
switch (regs.orig_rax) {
    case SYS_xpledge:
        register_pledge(regs.rdi);
        break;
}

The operating system will return ENOSYS (Function not implemented) since this isn’t a real system call. So on the way out I overwrite this with a success (0).

/* Handle exit */
switch (regs.orig_rax) {
    case SYS_xpledge:
        ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
        break;
}
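The register_pledge() and is_syscall_blocked() helpers are left undefined above. One possible sketch of the policy described earlier follows. The function bodies, the state variables, and the particular set of always-allowed system calls are assumptions for illustration, not the code from the example repository:

```c
#include <sys/syscall.h>

#define SYS_xpledge   10000
#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

static int pledged;            /* nonzero once the tracee has pledged */
static unsigned long allowed;  /* XPLEDGE_* bits granted by the pledge */

/* Record the tracee's pledge (assumed behavior). */
static void register_pledge(unsigned long mask)
{
    pledged = 1;
    allowed = mask;
}

/* Returns nonzero if the system call should be blocked. */
static int is_syscall_blocked(long nr)
{
    if (!pledged) {
        return 0;  /* nothing is restricted before the pledge */
    }
    switch (nr) {
    case SYS_brk:
    case SYS_close:
    case SYS_exit:
    case SYS_exit_group:
        return 0;  /* assumed "basic" set, always allowed */
    case SYS_read:
    case SYS_readv:
    case SYS_pread64:
    case SYS_write:
    case SYS_writev:
    case SYS_pwrite64:
        return !(allowed & XPLEDGE_RDWR);
#ifdef SYS_open
    case SYS_open:
#endif
    case SYS_openat:
        return !(allowed & XPLEDGE_OPEN);
    default:
        return 1;  /* everything else, including SYS_xpledge itself */
    }
}
```

Because the first xpledge() call arrives before any pledge is registered, it passes through, while any later attempt lands in the default case and is blocked.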

I wrote a little test program that opens /dev/urandom, makes a read, tries to pledge, then tries to open /dev/urandom a second time, then confirms it can read from the original /dev/urandom file descriptor. Running without a pledge tracer, the output looks like this:

$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604

Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:

$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4

The pledge succeeds but the second fopen(3) does not since the tracer blocked it with EPERM.

This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.

Foreign system emulation

Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.

You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.

The Linux version of Ptrace has had a more efficient operation for this technique since 2005: PTRACE_SYSEMU. Ptrace stops only once per system call, and it’s up to the tracer to service that system call before allowing the tracee to continue.

for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
        case OS_read:
            /* ... */

        case OS_write:
            /* ... */

        case OS_open:
            /* ... */

        case OS_exit:
            /* ... */

        /* ... and so on ... */
    }
}

To run binaries for the same architecture from any system with a stable (enough) system call ABI, you just need this PTRACE_SYSEMU tracer, a loader (to take the place of exec(2)), and whatever system libraries the binary needs (or only run static binaries).

In fact, this sounds like a fun weekend project.

When FFI Function Calls Beat Native C

2018-05-27T20:03:15Z

Update: There’s a good discussion on Hacker News.

Over on GitHub, David Yu has an interesting performance benchmark for function calls of various Foreign Function Interfaces (FFI):

https://github.com/dyu/ffi-overhead

He created a shared object (.so) file containing a single, simple C function. Then for each FFI he wrote a bit of code to call this function many times, measuring how long it took.

For the C “FFI” he used standard dynamic linking, not dlopen(). This distinction is important, since it really makes a difference in the benchmark. There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure.

The most surprising result of the benchmark is that LuaJIT’s FFI is substantially faster than C. It’s about 25% faster than a native C function call to a shared object function. How could a weakly and dynamically typed scripting language come out ahead on a benchmark? Is this accurate?

It’s actually quite reasonable. The benchmark was run on Linux, so the performance penalty we’re seeing comes from the Procedure Linkage Table (PLT). I’ve put together a really simple experiment to demonstrate the same effect in plain old C:

https://github.com/skeeto/dynamic-function-benchmark

Here are the results on an Intel i7-6700 (Skylake):

plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call

These are three different types of function calls:

Through the PLT
An indirect function call (via dlsym(3))
A direct function call (via a JIT-compiled function)

As shown, the last one is the fastest. It’s typically not an option for C programs, but it’s natural in the presence of a JIT compiler, including, apparently, LuaJIT.

In my benchmark, the function being called is named empty():

void empty(void) { }

And to compile it into a shared object:

$ cc -shared -fPIC -Os -o empty.so empty.c

Just as in my PRNG shootout, the benchmark calls this function repeatedly as many times as possible before an alarm goes off.

Procedure Linkage Tables

When a program or library calls a function in another shared object, the compiler cannot know where that function will be located in memory. That information isn’t known until run time, after the program and its dependencies are loaded into memory. These are usually at randomized locations — e.g. Address Space Layout Randomization (ASLR).

How is this resolved? Well, there are a couple of options.

One option is to make a note about each such call in the binary’s metadata. The run-time dynamic linker can then patch in the correct address at each call site. How exactly this would work depends on the particular code model used when compiling the binary.

The downsides to this approach are slower loading, larger binaries, and less sharing of code pages between different processes. It’s slower loading because every dynamic call site needs to be patched before the program can begin execution. The binary is larger because each of these call sites needs an entry in the relocation table. And the lack of sharing is due to the code pages being modified.

On the other hand, the overhead for dynamic function calls would be eliminated, giving JIT-like performance as seen in the benchmark.

The second option is to route all dynamic calls through a table. The original call site calls into a stub in this table, which jumps to the actual dynamic function. With this approach the code does not need to be patched, meaning it’s trivially shared between processes. Only one place needs to be patched per dynamic function: the entries in the table. Even more, these patches can be performed lazily, on the first function call, making the load time even faster.

On systems using ELF binaries, this table is called the Procedure Linkage Table (PLT). The PLT itself doesn’t actually get patched — it’s mapped read-only along with the rest of the code. Instead the Global Offset Table (GOT) gets patched. The PLT stub fetches the dynamic function address from the GOT and indirectly jumps to that address. To lazily load function addresses, these GOT entries are initialized with an address of a function that locates the target symbol, updates the GOT with that address, and then jumps to that function. Subsequent calls use the lazily discovered address.
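Lazy binding can be imitated in plain C with a function pointer standing in for a GOT entry. This is only an analogy under made-up names (got_square, plt_square, resolver); the real mechanism lives in the dynamic linker and in the PLT stubs emitted by the linker:

```c
typedef int (*square_fn)(int);

static int resolver(int x);

/* The "GOT" slot initially points at the resolver. */
static square_fn got_square = resolver;
static int resolve_count;  /* how many times resolution ran */

/* The real target, standing in for a function in a shared object. */
static int real_square(int x)
{
    return x * x;
}

/* Stand-in for the lazy resolver: find the target, patch the slot,
 * then forward the call to the target. */
static int resolver(int x)
{
    resolve_count++;
    got_square = real_square;
    return got_square(x);
}

/* The "PLT stub": an indirect jump through the slot. */
static int plt_square(int x)
{
    return got_square(x);
}
```

Only the first call through plt_square() pays for the lookup. Every later call still pays for the extra indirect jump, which is the overhead the benchmark measures.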

The downside of a PLT is extra overhead per dynamic function call, which is what shows up in the benchmark. Since the benchmark only measures function calls, this appears to be pretty significant, but in practice it’s usually drowned out in noise.

Here’s the benchmark:

/* Cleared by an alarm signal. */
volatile sig_atomic_t running;

static long
plt_benchmark(void)
{
    long count;
    for (count = 0; running; count++)
        empty();
    return count;
}

Since empty() is in the shared object, that call goes through the PLT.
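The running flag and the alarm fit together roughly like this. It is a sketch of the timing approach, not the harness from the benchmark repository: ns_per_call(), on_alarm(), and count_only() are hypothetical names, and count_only() times an empty loop body in place of the call to empty():

```c
#include <signal.h>
#include <unistd.h>

/* Cleared by the alarm signal handler. */
static volatile sig_atomic_t running;

static void on_alarm(int sig)
{
    (void)sig;
    running = 0;
}

/* Stand-in benchmark: count iterations until the alarm fires. */
static long count_only(void)
{
    long count;
    for (count = 0; running; count++) {
        /* a real benchmark would make the function call here */
    }
    return count;
}

/* Run a benchmark for the given number of seconds and return the
 * average time per iteration in nanoseconds. */
static double ns_per_call(long (*bench)(void), unsigned seconds)
{
    signal(SIGALRM, on_alarm);
    running = 1;
    alarm(seconds);
    long count = bench();
    return count > 0 ? seconds * 1e9 / (double)count : 0.0;
}
```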

Indirect dynamic calls

Another way to dynamically call functions is to bypass the PLT and fetch the target function address within the program, e.g. via dlsym(3).

void *h = dlopen("path/to/lib.so", RTLD_NOW);
void (*f)(void) = dlsym(h, "f");
f();

Once the function address is obtained, the overhead is smaller than function calls routed through the PLT. There’s no intermediate stub function and no GOT access. (Caveat: If the program has a PLT entry for the given function then dlsym(3) may actually return the address of the PLT stub.)

However, this is still an indirect function call. On conventional architectures, direct function calls have an immediate relative address. That is, the target of the call is some hard-coded offset from the call site. The CPU can see well ahead of time where the call is going.

An indirect function call has more overhead. First, the address has to be stored somewhere. Even if that somewhere is just a register, it increases register pressure by using up a register. Second, it provokes the CPU’s branch predictor since the call target isn’t static, making for extra bookkeeping in the CPU. In the worst case the function call may even cause a pipeline stall.

Here’s the benchmark:

volatile sig_atomic_t running;

static long
indirect_benchmark(void (*f)(void))
{
    long count;
    for (count = 0; running; count++)
        f();
    return count;
}

The function passed to this benchmark is fetched with dlsym(3) so the compiler can’t do something tricky like convert that indirect call back into a direct call.

If the body of the loop was complicated enough that there was register pressure, thereby requiring the address to be spilled onto the stack, this benchmark might not fare as well against the PLT benchmark.

Direct function calls

The first two types of dynamic function calls are simple and easy to use. Direct calls to dynamic functions are trickier business since they require modifying code at run time. In my benchmark I put together a little JIT compiler to generate the direct call.

There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB range due to a signed 32-bit immediate. This means the JIT code has to be placed virtually nearby the target function, empty(). If the JIT code needed to call two different dynamic functions separated by more than 2GB, then it’s not possible for both to be direct.

To keep things simple, my benchmark isn’t precise or very careful about picking the JIT code address. After being given the target function address, it blindly subtracts 4MB, rounds down to the nearest page, allocates some memory, and writes code into it. To do this correctly would mean inspecting the program’s own memory mappings to find space, and there’s no clean, portable way to do this. On Linux this requires parsing virtual files under /proc.

Here’s what my JIT’s memory allocation looks like. It assumes reasonable behavior for uintptr_t casts:

static void
jit_compile(struct jit_func *f, void (*empty)(void))
{
    uintptr_t addr = (uintptr_t)empty;
    void *desired = (void *)((addr - SAFETY_MARGIN) & PAGEMASK);
    /* ... */
    unsigned char *p = mmap(desired, len, prot, flags, fd, 0);
    /* ... */
}

It allocates two pages, one writable and the other containing non-writable code. Similar to my closure library, the lower page is writable and holds the running variable that gets cleared by the alarm. It needs to be near the JIT code in order to be an efficient RIP-relative access, just like the other two benchmark functions. The upper page contains this assembly:

jit_benchmark:
        push  rbx
        xor   ebx, ebx
.loop:  mov   eax, [rel running]
        test  eax, eax
        je    .done
        call  empty
        inc   ebx
        jmp   .loop
.done:  mov   eax, ebx
        pop   rbx
        ret

The call empty is the only instruction that is dynamically generated — necessary to fill out the relative address appropriately (the minus 5 is because it’s relative to the end of the instruction):

    // call empty
    uintptr_t rel = (uintptr_t)empty - (uintptr_t)p - 5;
    *p++ = 0xe8;
    *p++ = rel >>  0;
    *p++ = rel >>  8;
    *p++ = rel >> 16;
    *p++ = rel >> 24;
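Since the displacement must fit in a signed 32-bit immediate, it is worth checking the range before emitting the instruction. The following sketch wraps the encoding above in a hypothetical emit_call() helper that returns NULL when the target is out of reach:

```c
#include <stddef.h>
#include <stdint.h>

/* Write "call target" at p. Returns a pointer just past the
 * instruction, or NULL if the target is outside the rel32 range. */
static unsigned char *emit_call(unsigned char *p, uintptr_t target)
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    intptr_t rel = (intptr_t)(target - ((uintptr_t)p + 5));
    if (rel < INT32_MIN || rel > INT32_MAX) {
        return NULL;
    }
    uint32_t u = (uint32_t)(int32_t)rel;
    *p++ = 0xe8;  /* opcode: call rel32 */
    *p++ = (unsigned char)(u >>  0);
    *p++ = (unsigned char)(u >>  8);
    *p++ = (unsigned char)(u >> 16);
    *p++ = (unsigned char)(u >> 24);
    return p;
}
```

A caller would pass (uintptr_t)empty as the target and treat a NULL return as a sign that the JIT buffer was placed too far away.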

If empty() wasn’t in a shared object and instead located in the same binary, this is essentially the direct call that the compiler would have generated for plt_benchmark(), assuming somehow it didn’t inline empty().

Ironically, calling the JIT-compiled code requires an indirect call (e.g. via a function pointer), and there’s no way around this. What are you going to do, JIT compile another function that makes the direct call? Fortunately this doesn’t matter since the part being measured in the loop is only a direct call.

It’s no mystery

Given these results, it’s really no mystery that LuaJIT can generate more efficient dynamic function calls than a PLT, even if they still end up being indirect calls. In my benchmark, the non-PLT indirect calls were 28% faster than the PLT, and the direct calls 43% faster than the PLT. That’s a small edge that JIT-enabled programs have over plain old native programs, though it comes at the cost of absolutely no code sharing between processes.

A Crude Personal Package Manager

2018-03-27T02:10:35Z

For the past couple of months I’ve been using a custom package manager to manage a handful of software packages within various unix-like environments. Packages are installed in my home directory under ~/.local/bin, and the package manager itself is just a 110 line Bourne shell script. It is not intended to replace the system’s package manager but, instead, to complement it in some cases where I need more flexibility. I use it to run custom versions of specific pieces of software (newer or older than the system-installed versions, or with my own patches and modifications) without interfering with the rest of the system, and without a need for root access. It’s worked out really well so far and I expect to continue making heavy use of it in the future.

It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):

https://github.com/skeeto/dotfiles/blob/master/bin/qpkg

Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.

I say it’s crude because its “install” (-I) procedure is little more than a wrapper around tar. It doesn’t invoke libtool after installing a library, and there’s no post-install script — or postinst as Debian calls it. It doesn’t check for conflicts between packages, though there’s a command for doing so manually ahead of time. It doesn’t manage dependencies, nor even have them as a concept. That’s all on the user to screw up.

In other words, it doesn’t attempt to solve most of the hard problems tackled by package managers… except for three important issues:

It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.
Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.
It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.

The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.

The second caveat is that some builds tailor themselves by default to the host (e.g. -march=native). If care isn’t taken, those packages may not be very portable. This is more common than I had expected and has mildly annoyed me.

Birth of a package manager

While the package manager is new, I’ve been building and installing software in my home directory for years. I’d follow the normal process of setting the install prefix to $HOME/.local, running the build, and then letting the “install” target do its thing.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.

A common difficulty has been handling the release of new versions of software. I’d like to upgrade to the new version, but lack a way to cleanly uninstall the previous version. Simply clobbering the old version by installing it on top usually works. Occasionally it wouldn’t, and I’d have to blow away ~/.local and start all over again. With more and more software installed in my home directory, restarting has become more and more of a chore that I’d like to avoid.

What I needed was a way to track exactly which files were installed so that I could remove them later when I needed to uninstall. Fortunately there’s a widely-used convention for exactly this purpose: DESTDIR.

It’s expected that when a Makefile provides an “install” target, it prefixes the installation path with the DESTDIR macro, which is assigned to the empty string by default. This allows the user to install the software to a temporary location for the purposes of packaging. Unlike the installation prefix (--prefix) configured before the build began, the software is not expected to function properly when run in the DESTDIR location.

$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install

A different tool will be used to copy these files into place and actually install it. This tool can track what files were installed, allowing them to be removed later when uninstalling. My package manager uses the tar program for both purposes. First it creates a package by packing up the DESTDIR (at the root of the actual install prefix):

$ tar czf package.tgz -C $DESTDIR$HOME/.local .

So a package is nothing more than a gzipped tarball. To install, it unpacks the tarball in ~/.local.

$ cd $HOME/.local
$ tar xzf ~/package.tgz

But how does it uninstall a package? It didn’t keep track of what was installed. Easy! The tarball itself contains the package list, and it’s printed with tar’s t mode.

cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done

I’m using grep to skip directories, which are conveniently listed with a trailing slash. Note that in the example above, there are a couple of issues with file names containing whitespace. If the file contains a space character, it will word split incorrectly in the for loop. A Makefile couldn’t handle such a file in the first place, but, in case it’s still necessary, my package manager sets IFS to just a newline.

If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.

Commands

There are five commands, each assigned to a capital letter: -B, -C, -I, -V, and -U. It’s an interface pattern inspired by Ted Unangst’s signify (see signify(1)). I also used this pattern with Blowpipe and, in retrospect, wish I had also used it with Enchive.

Build (`-B`)

Unlike the other four commands, the “build” command isn’t essential, and is just for convenience. It assumes the build uses an Autoconf-like configure script and runs it automatically, followed by make with the appropriate -j (jobs) option. It automatically sets the --prefix argument when running the configure script.

If the build uses something other than an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.

Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/

In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.

Once qpkg returns, the program will be fully built — or stuck on a build error if you’re unlucky. If you need to pass custom configure options, just tack them on the qpkg command:

$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses

Since creating the build directory and moving into it are so common, there’s an optional switch for those two steps: -d. This option’s argument is the build directory. qpkg creates that directory and runs the build inside it. In practice I just use “x” for the build directory since it’s so quick to add “dx” to the command.

$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
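
Bundled switches like -Bdx fall out naturally from POSIX getopts. The sketch below is not qpkg’s actual source: the function parse_args and the variables cmd, builddir, keep, nrest, and first are hypothetical names. It only shows how one command letter, an optional -d argument, and the -k flag could be parsed from a single bundle.

```sh
# Hypothetical option parser; names are illustrative only.
# Sets: cmd (B, C, I, V or U), builddir (-d argument), keep (-k flag),
# nrest (number of remaining operands), first (first remaining operand).
parse_args() {
    cmd='' builddir='' keep=no
    OPTIND=1
    while getopts 'BCIVUd:k' opt; do
        case $opt in
            [BCIVU]) cmd=$opt ;;
            d) builddir=$OPTARG ;;
            k) keep=yes ;;
            *) return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    nrest=$#
    first=${1-}
}
```

Because getopts stops at the first operand that isn’t an option, anything after the source directory, such as --with-ncurses, is left alone and can be handed to the configure script.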

With the software compiled, the next step is creating the package.

Create (`-C`)

The “create” command creates the DESTDIR (_destdir in the working directory) and runs the “install” Makefile target to fill it with files. Continuing with the example above and its x/ build directory:

$ qpkg -Cdx name

Where “name” is the name of the package, without any file name extension. Like with “build”, extra arguments after the package name are passed to make in case there needs to be any additional tweaking.

When the “create” command finishes, there will be a new package named name.tgz in the working directory. At this point the source and build directories are no longer needed, assuming everything went fine.

$ rm -rf name-version/
$ rm -rf x/

This package is ready to install, though you may want to verify it first.

Verify (`-V`)

The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.

$ qpkg -V name.tgz

The most common conflict I’ve seen is in the info index (info/dir) file, which is safe to ignore since I don’t care about it.

If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.
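
The collision check can be sketched in a few lines. This is an illustration rather than qpkg’s real implementation: the function name verify_package and the explicit prefix argument are assumptions. It walks the tarball listing, prints every path that already exists under the prefix, and reports a conflict through its exit status.

```sh
# Hypothetical sketch, not qpkg's actual code.
# usage: verify_package PACKAGE.tgz PREFIX
# Prints each path that already exists; exit status is 1 on any collision.
verify_package() (
    IFS='
'
    set -f
    status=0
    for file in $(tar tzf "$1" | grep -v '/$'); do
        # -L catches dangling symlinks, which -e reports as missing.
        if [ -e "$2/$file" ] || [ -L "$2/$file" ]; then
            printf '%s\n' "$2/$file"
            status=1
        fi
    done
    exit "$status"
)
```

Directories are skipped because shared directories such as bin/ exist in nearly every package and are not real conflicts.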

Install (`-I`)

The “install” command is just the dumb tar xzf explained above. It will clobber anything in its way without warning, which is why, if that matters, “verify” should be used first.

$ qpkg -I name.tgz

When qpkg returns, the package has been installed and is probably ready to go. A lot of packages complain that you need to run libtool to finalize an installation, but I’ve never had a problem skipping it. This dumb unpacking generally works fine.

Uninstall (`-U`)

Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.

$ qpkg -U name.tgz

Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.

I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.

Package by accumulation

Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages, a couple of which will conflict (one being a subset of another):

binutils
C headers
core GCC
C runtime
rest of GCC

Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)

To deal with this, I added a “keep” (-k) option that leaves the DESTDIR around after creating the package. To keep things tidy, the intermediate packages exist and are installed, but the final, big cross-compiler package accumulates into the DESTDIR. The final package at the end is actually the whole cross compiler in one package, a superset of them all.

Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.

My use case, and an alternative

The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.

All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.

Initial Evaluation of the Windows Subsystem for Linux

2017-11-30T21:03:53Z

Recently I had my first experiences with the Windows Subsystem for Linux (WSL), evaluating its potential as an environment for getting work done. This subsystem, introduced to Windows 10 in August 2016, allows Windows to natively run x86 and x86-64 Linux binaries. It’s essentially the counterpart to Wine, which allows Linux to natively run Windows binaries.

WSL interfaces with Linux programs only at the kernel level, servicing system calls the same way the Linux kernel would. The subsystem’s main job is translating Linux system calls into NT requests. There’s a series of articles about its internals if you’re interested in learning more.

I was honestly impressed by how well this all works, especially since Microsoft has long had an affinity for producing flimsy imitations (Windows console, PowerShell, Arial, etc.). WSL’s design allows Microsoft to dump an Ubuntu system wholesale inside Windows — and, more recently, other Linux distributions — bypassing a bunch of annoying issues, particularly in regards to glibc.

WSL processes can exec(2) Windows binaries, which then run under their appropriate subsystem, similar to binfmt on Linux. In theory this nice interop should allow for some automation Linux-style even for Windows’ services and programs. More on that later.

There are some notable issues, though.

Lack of device emulation

No soundcard devices are exposed to the subsystem, so Linux programs can’t play sound. There’s a hack to talk PulseAudio with a Windows process that can access the sound hardware, but that’s about it. Generally there’s not much reason to be playing media or games under WSL, but this can be an annoyance if you’re, say, writing software that synthesizes audio.

Really, there’s almost no device emulation at all and /proc is pretty empty. You won’t see hard drives or removable media under /dev, nor will you see USB devices like webcams and joysticks. A lot of the useful things you might do on a Linux system aren’t available under WSL.

No Filesystem in Userspace (FUSE)

Microsoft hasn’t implemented any of the system calls for FUSE, so don’t expect to use your favorite userspace filesystems. The biggest loss for me is sshfs, which I use frequently.

If FUSE was supported, it would be interesting to see how the rest of Windows interacts with these mounted filesystems, if at all.

Fragile services

Services running under WSL are flaky. The big issue is that when the initial WSL shell process exits, all WSL processes are killed and the entire subsystem is torn down. This includes any services that are running. That’s certainly surprising to anyone with experience running services on any kind of unix system. This is probably the worst part of WSL.

While systemd is the standard for Linux these days and may even be “installed” in the WSL virtual filesystem, it’s not actually running and you can’t use systemctl to interact with services. Services can only be controlled the old fashioned way, and, per above, that initial WSL console window has to remain open while services are running.

That’s a bit of a damper if you’re intending to spend a lot of time remotely SSHing into your Windows 10 system. So yes, it’s trivial to run an OpenSSH server under WSL, but it won’t feel like a proper system service.

Limited graphics support

WSL doesn’t come with an X server, so you have to supply one separately (Xming, etc.) that runs outside WSL, as a normal Windows process. WSL processes can connect to that server (DISPLAY) allowing you to run most Linux graphical software.

However, this means there’s no hardware acceleration. There will be no GLX extensions available. If your goal is to run the Emacs or Vim GUIs, that’s not a big deal, but it might matter if you were interested in running a browser under WSL. It also means it’s not a suitable environment for developing software using OpenGL.

Filesystem woes

The filesystem manages to be both one of the smallest issues as well as one of the biggest.

Filename translation

On the small issue side is filename translation. Under most Linux filesystems — and even more broadly for unix — a filename is just a bytestring. They’re not necessarily UTF-8 or any other particular encoding, and that’s partly why filenames are case-sensitive — the meaning of case depends on the encoding.

However, Windows uses a pseudo-UTF-16 scheme for filenames, incompatible with bytestrings. Since WSL lives within a Windows’ filesystem, there must be some bijection between bytestring filenames and pseudo-UTF-16 filenames. It will also have to reject filenames that can’t be mapped. WSL does both.

I couldn’t find any formal documentation about how filename translation works, but most of it can be reverse engineered through experimentation. In practice, Linux filenames are UTF-8 encoded strings, and WSL’s translation takes advantage of this. Filenames are decoded as UTF-8 and re-encoded as UTF-16 for Windows. Any byte that doesn’t decode as valid UTF-8 is silently converted to REPLACEMENT CHARACTER (U+FFFD), and decoding continues from the next byte.

I wonder if there are security consequences for different filenames silently mapping to the same underlying file.

Exercise for the reader: How is an unmatched surrogate half from Windows translated to WSL, where it doesn’t have a UTF-8 equivalent? I haven’t tried this yet.

Even for valid UTF-8, there are many bytes that most Linux filesystems allow in filenames that Windows does not. This ranges from simple things like ASCII backslash and colon — special components of Windows’ paths — to unusual characters like newlines, escape, and other ASCII control characters. There are two different ways these are handled:

The C drive is available under /mnt/c, and WSL processes can access regular Windows files under this “mountpoint.” Attempting to access filenames with invalid characters under this mountpoint always results in ENOENT: “No such file or directory.”
Outside of /mnt/c is WSL territory, and Windows processes aren’t supposed to touch these files. This allows for more freedom when translating filenames. REPLACEMENT CHARACTER is still used for invalid UTF-8 sequences, but the forbidden characters, including backslashes, are all permitted. They’re translated to #XXXX where X is hexadecimal for the normally invalid character. For example, a:b becomes a#003Ab.
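
The #XXXX escaping is simple enough to sketch. The function name wsl_escape is made up, and the exact set of characters escaped here (the characters Windows reserves in file names, plus ASCII control characters) is an assumption based on the behavior described above, not on any documentation. How WSL treats a literal # is not covered.

```sh
# Sketch of the #XXXX escaping. The escaped set below is an assumption:
# Windows' reserved file name characters plus ASCII control characters.
wsl_escape() {
    name=$1
    out=''
    while [ -n "$name" ]; do
        rest=${name#?}       # everything after the first character
        c=${name%"$rest"}    # the first character itself
        case $c in
            '\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' | [[:cntrl:]])
                # A leading quote makes printf use the character's code.
                out=$out$(printf '#%04X' "'$c") ;;
            *)
                out=$out$c ;;
        esac
        name=$rest
    done
    printf '%s\n' "$out"
}
```

With this sketch, wsl_escape 'a:b' prints a#003Ab, matching the example above.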

While WSL doesn’t let you get away with all the crazy, ill-advised filenames that Linux allows, it’s still quite reasonable. Since Windows and Linux filenames aren’t entirely compatible, there’s going to be some trade-off no matter how this translation is done.

Filesystem performance

On the other hand, filesystem performance is abysmal, and I doubt the subsystem is to blame. This isn’t a surprise to anyone who’s used moderately-sized Git repositories on Windows, where the large numbers of loose files bring things to a crawl. This has been a Windows issue for years, and that’s even before you start plugging in the typical “security” services (virus scanners, whitelists, etc.) that are usually present on a Windows system and make this even worse.

To test out WSL, I went around my normal business compiling tools and making myself at home, just as I would on Linux. Doing nearly anything in WSL was noticeably slower than doing the same on Linux on the exact same hardware. I didn’t run any benchmarks, but I’d expect to see around an order of magnitude difference on average for filesystem operations. Building LLVM and Clang took a couple hours rather than the typical 20 minutes.

I don’t expect this issue to get fixed anytime soon, and it’s probably always going to be a notable limitation of WSL.

So is WSL useful?

One of my hopes for WSL appears to be unfeasible. I thought it might be a way to avoid porting software from POSIX to Win32. I could just supply Windows users with the same Linux binary and they’d be fine. However, WSL requires switching Windows into a special “developer mode,” putting it well out of reach of the vast majority of users, especially considering the typical corporate computing environment that will lock this down. In practice, WSL is only useful to developers. I’m sure this is no accident. (Developer mode is no longer required as of October 2017.)

Mostly I see WSL as a Cygwin killer. Unix is my IDE and, on Windows, Cygwin has been my preferred go-to for getting a solid unix environment for software development. Unlike WSL, Cygwin processes can make direct Win32 calls, which is occasionally useful. But, in exchange, WSL will overall be better equipped. It has native Linux tools, including a better suite of debugging tools (even better than you get in Windows itself): Valgrind, strace, and a properly working GDB (which has always been flaky in Cygwin). WSL is not nearly as good as actual Linux, but it’s better than Cygwin if you can get access to it.

Building and Installing Software in $HOME

2017-06-19T02:34:39Z

For more than 5 years now I’ve kept a private “root” filesystem within my home directory under $HOME/.local/. Within are the standard /usr directories, such as bin/, include/, lib/, etc., containing my own software, libraries, and man pages. These are first-class citizens, indistinguishable from the system-installed programs and libraries. With one exception (setuid programs), none of this requires root privileges.

Installing software in $HOME serves two important purposes, both of which are indispensable to me on a regular basis.

No root access: Sometimes I’m using a system administered by someone else, and I don’t have root access.

This prevents me from installing packaged software myself through the system’s package manager. Building and installing the software myself in my home directory, without involvement from the system administrator, neatly works around this issue. As a software developer, it’s already perfectly normal for me to build and run custom software, and this is just an extension of that behavior.

In the most desperate situation, all I need from the sysadmin is a decent C compiler and at least a minimal POSIX environment. I can bootstrap anything I might need, both libraries and programs, including a better C compiler along the way. This is one major strength of open source software.

I have noticed one alarming trend: Both GCC (since 4.8) and Clang are written in C++, so it’s becoming less and less reasonable to bootstrap a C++ compiler from a C compiler, or even from a C++ compiler that’s more than a few years old. So you may also need your sysadmin to supply a fairly recent C++ compiler if you want to bootstrap an environment that includes C++. I’ve had to avoid some C++ software (such as CMake) for this reason.

Custom software builds: Even if I am root, I may still want to install software not available through the package manager, a version not available in the package manager, or a version with custom patches.

In theory this is what /usr/local is all about. It’s typically the location for software not managed by the system’s package manager. However, I think it’s cleaner to put this in $HOME/.local, so long as other system users don’t need it.

For example, I have an installation of each version of Emacs from 24.3 (the oldest version worth supporting) through the latest stable release, each suffixed with its version number, under $HOME/.local. This is useful for quickly running a test suite under different releases.

$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...

Another example is NetHack, which I prefer to play with a couple of custom patches (Menucolors, wchar). The install to $HOME/.local is also captured as a patch.

$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install

Normally NetHack wants to be setuid (e.g. run as the “games” user) in order to restrict access to high scores, saves, and bones — saved levels where a player died, to be inserted randomly into other players’ games. This prevents cheating, but requires root to set up. Fortunately, when I install NetHack in my home directory, this isn’t a feature I actually care about, so I can ignore it.

Mutt is in a similar situation, since it wants to install a special setgid program (mutt_dotlock) that synchronizes mailbox access. All MUAs need something like this.

Everything described below is relevant to basically any modern unix-like system: Linux, BSD, etc. I personally install software in $HOME across a variety of systems and, fortunately, it mostly works the same way everywhere. This is probably in large part due to everyone standardizing around the GCC and GNU binutils interfaces, even if the system compiler is actually LLVM/Clang.

Configuring for $HOME installs

Out of the box, installing things in $HOME/.local won’t do anything useful. You need to set up some environment variables in your shell configuration (i.e. .profile, .bashrc, etc.) to tell various programs, such as your shell, about it. The most obvious variable is $PATH:

export PATH=$HOME/.local/bin:$PATH

Notice I put it in the front of the list. This is because I want my home directory programs to override system programs with the same name. For what other reason would I install a program with the same name if not to override the system program?

In the simplest situation this is good enough, but in practice you’ll probably need to set a few more things. If you install libraries in your home directory and expect to use them just as if they were installed on the system, you’ll need to tell the compiler where else to look for those headers and libraries, both for C and C++.

export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib

The first two are like the -I compiler option and the third is like -L linker option, except you usually won’t need to use them explicitly. Unfortunately LIBRARY_PATH doesn’t override the system library paths, so in some cases, you will need to explicitly set -L. Otherwise you will still end up linking against the system library rather than the custom packaged version. I really wish GCC and Clang didn’t behave this way.

Some software uses pkg-config to determine its compiler and linker flags, and your home directory will contain some of the needed information. So set that up too:

export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig
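
Shell configuration files sometimes get sourced more than once, and plain exports like the ones above either pile up duplicate entries or discard whatever was already set. A small helper can prepend a directory only when it’s missing. This is a sketch: prepend_path is a hypothetical name, not a standard utility.

```sh
# Hypothetical helper: prepend DIR to the colon-separated variable NAME
# unless it is already listed, then export the variable.
# usage: prepend_path NAME DIR
prepend_path() {
    eval "current=\${$1-}"
    case ":$current:" in
        *":$2:"*) ;;                          # already listed, leave as-is
        ::)       eval "$1=\$2" ;;            # empty or unset: no stray colon
        *)        eval "$1=\$2:\$current" ;;
    esac
    export "$1"
}

prepend_path PATH               "$HOME/.local/bin"
prepend_path C_INCLUDE_PATH     "$HOME/.local/include"
prepend_path CPLUS_INCLUDE_PATH "$HOME/.local/include"
prepend_path LIBRARY_PATH       "$HOME/.local/lib"
prepend_path PKG_CONFIG_PATH    "$HOME/.local/lib/pkgconfig"
```

Avoiding a stray colon matters for the include path variables, since GCC treats an empty element there as the current working directory. Note that a directory already present later in the list is left where it is rather than moved to the front.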

Run-time linker

Finally, when you install libraries in your home directory, the run-time dynamic linker will need to know where to find them. There are three ways to deal with this:

The crude, easy way: LD_LIBRARY_PATH.
The elegant, difficult way: ELF runpath.
Screw it, just statically link the bugger. (Not always possible.)

For the crude way, point the run-time linker at your lib/ and you’re done:

export LD_LIBRARY_PATH=$HOME/.local/lib

However, this is like using a shotgun to kill a fly. If you install a library in your home directory that is also installed on the system, and then run a system program, it may be linked against your library rather than the library installed on the system as was originally intended. This could have detrimental effects.

The precision method is to set the ELF “runpath” value. It’s like a per-binary LD_LIBRARY_PATH. The run-time linker searches this path ahead of the system default locations (though after LD_LIBRARY_PATH, if that is set), and it will only have an effect on that particular program/library. This also applies to dlopen().

Some software will configure the runpath by default in their build system, but often you need to configure this yourself. The simplest way is to set the LD_RUN_PATH environment variable when building software. Another option is to manually pass -rpath options to the linker via LDFLAGS. It’s used directly like this:

$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux

Verify with readelf:

$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]

ELF supports a special $ORIGIN “variable” set to the binary’s location. This allows the program and associated libraries to be installed anywhere without changes, so long as they have the same relative position to each other. (Note the quotes to prevent shell interpolation.)

$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux

There is one situation where runpath won’t work: when you want a system-installed program to find a home directory library with dlopen() — e.g. as an extension to that program. You either need to ensure it uses a relative or absolute path (i.e. the argument to dlopen() contains a slash) or you must use LD_LIBRARY_PATH.

Personally, I always use the Worse is Better LD_LIBRARY_PATH shotgun. Occasionally it’s caused some annoying issues, but the vast majority of the time it gets the job done with little fuss. This is just my personal development environment, after all, not a production server.

Manual pages

Another potentially tricky issue is man pages. When a program or library installs a man page in your home directory, it would certainly be nice to access it with man just like it was installed on the system. Fortunately, Debian and Debian-derived systems, using a mechanism I haven’t yet figured out, discover home directory man pages automatically without any assistance. No configuration needed.

It’s more complicated on other systems, such as the BSDs. You’ll need to set the MANPATH variable to include $HOME/.local/share/man. It’s unset by default and it overrides the system settings, which means you need to manually include the system paths. The manpath program can help with this … if it’s available.

export MANPATH=$HOME/.local/share/man:$(manpath)

I haven’t figured out a portable way to deal with this issue, so I mostly ignore it.
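
One possible fallback when manpath is missing relies on a feature documented by man-db and mandoc: when MANPATH ends with a colon, the default search path is appended after it. Other man implementations may not behave this way, so treat the following as a sketch. The function name home_manpath is made up.

```sh
# Hypothetical helper: print a MANPATH value that puts the home directory
# first and still includes the system paths.
home_manpath() {
    if command -v manpath >/dev/null 2>&1; then
        printf '%s:%s\n' "$HOME/.local/share/man" "$(manpath 2>/dev/null)"
    else
        # Trailing colon: man-db and mandoc append their default path here.
        printf '%s:\n' "$HOME/.local/share/man"
    fi
}

MANPATH=$(home_manpath)
export MANPATH
```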

How to install software in $HOME

While I’ve pooh-poohed autoconf in the past, the standard configure script usually makes it trivial to build and install software in $HOME. The key ingredient is the --prefix option:

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

Most of the time it’s that simple! If you’re linking against your own libraries and want to use runpath, it’s a little more complicated:

$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"

For CMake, there’s CMAKE_INSTALL_PREFIX:

$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..

The CMake builds I’ve seen use ELF runpath by default, and no further configuration may be required to make that work. I’m sure that’s not always the case, though.

Some software is just a single, static, standalone binary with everything baked in. It doesn’t need to be given a prefix, and installation is as simple as copying the binary into place. For example, Enchive works like this:

$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin

Some software uses its own unique configuration interface. I can respect that, but it does add some friction for users who now have something additional and non-transferable to learn. I demonstrated a NetHack build above, which has a configuration much more involved than it really should be. Another example is LuaJIT, which uses make variables that must be provided consistently on every invocation:

$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install

(You can use the “install” target to both build and install, but I wanted to illustrate the repetition of PREFIX.)

Some libraries aren’t so smart about pkg-config and need some handholding. One example is ncurses. I mention it because it’s required for both Vim and Emacs, among many others, so I’m often building it myself. When installing its pkg-config (.pc) files it ignores --prefix and needs to be told a second time where to install things:

$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH

Another issue is that a whole lot of software has been hardcoded for ncurses 5.x (i.e. ncurses5-config), and it requires hacks/patching to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x for this reason.

Learning through experience

I could go on and on like this, discussing the quirks for the various libraries and programs that I use. Over the years I’ve gotten used to many of these issues, committing the solutions to memory. Unfortunately, even within the same version of a piece of software, the quirks can change between major operating system releases, so I’m continuously learning my way around new issues. It’s really given me an appreciation for all the hard work that package maintainers put into customizing and maintaining software builds to fit properly into a larger ecosystem.

Asynchronous Requests from Emacs Dynamic Modules

2017-02-14T02:30:00Z

A few months ago I had a discussion with Vladimir Kazanov about his Orgfuse project: a Python script that exposes an Emacs Org-mode document as a FUSE filesystem. It permits other programs to navigate the structure of an Org-mode document through the standard filesystem APIs. I suggested that, with the new dynamic modules in Emacs 25, Emacs itself could serve a FUSE filesystem. In fact, support for FUSE services in general could be a package of its own.

So that’s what he did: Elfuse. It’s an old joke that Emacs is an operating system, and here it is handling system calls.

However, there’s a tricky problem to solve, an issue also present in my joystick module. Both modules handle asynchronous events (filesystem requests or joystick events), but Emacs runs the event loop and owns the main thread. The external events somehow need to feed into the main event loop. It’s even more difficult with FUSE because FUSE also wants control of its own thread for its own event loop. This requires Elfuse to spawn a dedicated FUSE thread and negotiate a request/response hand-off.

When a filesystem request or joystick event arrives, how does Emacs know to handle it? The simple and obvious solution is to poll the module from a timer.

struct queue requests;

emacs_value
Frequest_next(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
    emacs_value next = Qnil;
    queue_lock(requests);
    if (queue_length(requests) > 0) {
        void *request = queue_pop(requests, env);
        next = env->make_user_ptr(env, fin_empty, request);
    }
    queue_unlock(requests);
    return next;
}

And then ask Emacs to check the module every, say, 10ms:

(defun request--poll ()
  (let ((next (request-next)))
    (when next
      (request-handle next))))

(run-at-time 0 0.01 #'request--poll)

Blocking directly on the module’s event pump with Emacs’ thread would prevent Emacs from doing important things like, you know, being a text editor. The timer allows it to handle its own events uninterrupted. It gets the job done, but it’s far from perfect:

It imposes an arbitrary latency to handling requests. Up to the poll period could pass before a request is handled.
Polling the module 100 times per second is inefficient. Unless you really enjoy recharging your laptop, that’s no good.

The poll period is a sliding trade-off between latency and battery life. If only there was some mechanism to, ahem, signal the Emacs thread, informing it that a request is waiting…

SIGUSR1

Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals, which is exactly the mechanism we need. The interface is a “key” binding on special-event-map, the keymap that handles these kinds of events. When the signal arrives, Emacs queues it up for the main event loop.

(define-key special-event-map [sigusr1]
  (lambda ()
    (interactive)
    (request-handle (request-next))))

The module blocks on its own thread on its own event pump. When a request arrives, it queues the request, rings the bell for Emacs to come handle it (raise()), and waits on a semaphore. For illustration purposes, assume the module reads requests from and writes responses to a file descriptor, like a socket.

int event_fd = /* ... */;
struct request request;
sem_init(&request.sem, 0, 0);

for (;;) {
    /* Blocking read for request event */
    read(event_fd, &request.event, sizeof(request.event));

    /* Put request on the queue */
    queue_lock(requests);
    queue_push(requests, &request);
    queue_unlock(requests);
    raise(SIGUSR1);  // TODO: Should raise() go inside the lock?

    /* Wait for Emacs */
    while (sem_wait(&request.sem))
        ;

    /* Reply with Emacs' response */
    write(event_fd, &request.response, sizeof(request.response));
}

The sem_wait() is in a loop because signals will wake it up prematurely. In fact, it may even wake up due to its own signal on the line before. This is the only way this particular use of sem_wait() might fail, so there’s no need to check errno.

If there are multiple module threads making requests to the same global queue, the lock is necessary to protect the queue. The semaphore is only for blocking the thread until Emacs has finished writing its particular response. Each thread has its own semaphore.

When Emacs is done writing the response, it releases the module thread by incrementing the semaphore. It might look something like this:

emacs_value
Frequest_complete(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
    struct request *request = env->get_user_ptr(env, args[0]);
    if (request)
        sem_post(&request->sem);
    return Qnil;
}

The top-level handler dispatches to the specific request handler, calling request-complete above when it’s done.

(defun request-handle (next)
  (condition-case e
      (cl-ecase (request-type next)
        (:open  (request-handle-open  next))
        (:close (request-handle-close next))
        (:read  (request-handle-read  next)))
    (error (request-respond-as-error next e)))
  (request-complete next))

This SIGUSR1+semaphore mechanism is roughly how Elfuse currently processes requests.

Making it work on Windows

Windows doesn’t have signals. This isn’t a problem for Elfuse since Windows doesn’t have FUSE either. Nor does it matter for Joymacs since XInput isn’t event-driven and always requires polling. But someday someone will need this mechanism for a dynamic module on Windows.

Fortunately there’s a solution: input language change events, WM_INPUTLANGCHANGE. It’s also on special-event-map:

(define-key special-event-map [language-change]
  (lambda ()
    (interactive)
    (request-handle (request-next))))

Instead of raise() (or pthread_kill()), broadcast the window event with PostMessage(). Outside of invoking the language-change key binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t belong to any particular window. We don’t really want to change the input language, after all.

PostMessageA(HWND_BROADCAST, WM_INPUTLANGCHANGE, 0, 0);

Naturally you’ll also need to replace the POSIX threading primitives with the Windows versions (CreateThread(), CreateSemaphore(), etc.). With a bit of abstraction in the right places, it should be pretty easy to support both POSIX and Windows in these asynchronous dynamic module events.
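As a rough sketch of what that abstraction might look like, here is a thin wrapper over the two semaphore APIs. The request_sem names are hypothetical and not part of Elfuse or Emacs; only the underlying POSIX and Win32 calls are real.

```c
#ifdef _WIN32
#include <windows.h>
typedef HANDLE request_sem;  /* hypothetical name */

static int request_sem_init(request_sem *s)
{
    /* Initial count 0, maximum count 1: each request is posted once. */
    *s = CreateSemaphoreA(NULL, 0, 1, NULL);
    return *s ? 0 : -1;
}

static void request_sem_wait(request_sem *s)
{
    WaitForSingleObject(*s, INFINITE);
}

static void request_sem_post(request_sem *s)
{
    ReleaseSemaphore(*s, 1, NULL);
}

static void request_sem_destroy(request_sem *s)
{
    CloseHandle(*s);
}
#else
#include <semaphore.h>
typedef sem_t request_sem;  /* hypothetical name */

static int request_sem_init(request_sem *s)
{
    return sem_init(s, 0, 0);
}

static void request_sem_wait(request_sem *s)
{
    while (sem_wait(s))
        ;  /* woken early by a signal: try again */
}

static void request_sem_post(request_sem *s)
{
    sem_post(s);
}

static void request_sem_destroy(request_sem *s)
{
    sem_destroy(s);
}
#endif
```

The module thread would call request_sem_wait() after notifying Emacs, and the completion function would call request_sem_post(), regardless of platform.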

Manual Control Flow Guard in C

2017-01-21T22:44:15Z

Recent versions of Windows have a new exploit mitigation feature called Control Flow Guard (CFG). Before an indirect function call (e.g. through function pointers and virtual functions), the target address is checked against a table of valid call addresses. If the address isn’t the entry point of a known function, then the program is aborted.

If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.

The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.

For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.

CFG works similarly, performing a check prior to passing control to the address in a pointer, except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed of all valid call targets, whether from the main program or from shared libraries.

While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.

An exploit example

Consider this trivial C program, demo.c:

#include <stdio.h>

int
main(void)
{
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    return 0;
}

It reads a name into a buffer and prints it back out with a greeting. While trivial, it’s far from innocent. That naive call to gets() doesn’t check the bounds of the buffer, introducing an exploitable buffer overflow. It’s so obvious that both the compiler and linker will yell about it.

For simplicity, suppose the program also contains a dangerous function.

void
self_destruct(void)
{
    puts("**** GO BOOM! ****");
}

The attacker can use the buffer overflow to call this dangerous function.

To make this attack simpler for the sake of the article, assume the program isn’t using ASLR (e.g. without -fpie/-pie, or with -fno-pie/-no-pie). For this particular example, I’ll also explicitly disable buffer overflow protections (e.g. _FORTIFY_SOURCE and stack protectors).

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
      -o demo demo.c

First, find the address of self_destruct().

$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct

This is on x86-64, so it’s a 64-bit address. The size of the name buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes allocated above, so there’s 16 bytes to fill, then 8 bytes to overwrite the return pointer with the address of self_destruct.
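The payload layout described above (filler bytes followed by the address in little-endian byte order) can be sketched as a small helper. The name build_payload is hypothetical, and it uses a single filler character rather than the x/y mix in the shell command.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write `pad` filler bytes, then `addr` as 8
 * little-endian bytes. Returns the total payload length. */
static size_t
build_payload(unsigned char *out, size_t pad, uint64_t addr)
{
    memset(out, 'x', pad);
    for (int i = 0; i < 8; i++)
        out[pad + i] = (unsigned char)(addr >> (8 * i));
    return pad + 8;
}
```

With pad = 16 and addr = 0x4005c5 the last eight bytes come out as c5 05 40 00 00 00 00 00, which is the byte sequence the exploit needs on top of the return pointer.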

$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault

With this input I’ve successfully exploited the buffer overflow to divert control to self_destruct(). When main tries to return into libc, it instead jumps to the dangerous function, and then crashes when that function tries to return — though, presumably, the system would have self-destructed already. Turning on the stack protector stops this exploit.

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
      -o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...

The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.

The stack protector transformed the program into something that looks like the following:

int
main(void)
{
    long __canary = __get_thread_canary();
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    if (__canary != __get_thread_canary())
        abort();
    return 0;
}

However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, and a canary is only affected by a buffer overflow, allowing the compiler to optimize it away.

Function pointers and virtual functions

After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:

void
self_destruct(char *password)
{
    if (strcmp(password, "12345") == 0)
        puts("**** GO BOOM! ****");
}

The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.

Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.

struct greeter {
    char name[8];
    void (*greet)(struct greeter *);
};

void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

void
greet_aloha(struct greeter *g)
{
    printf("Aloha, %s.\n", g->name);
}

There’s now a greeter object and the function pointer makes its behavior polymorphic. Think of it as a hand-coded virtual function for C. Here’s the new (contrived) main:

int
main(void)
{
    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    greeter.greet(&greeter);
    return 0;
}

(In a real program, something else provides greeter and picks its own function pointer for greet.)

Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.

$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct

We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).

$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****

Neither the stack protector nor the password was of any help. The stack protector only protects the return pointer, not the function pointer on the struct.

This is where the Control Flow Guard comes into play. With CFG enabled, the compiler inserts a check before calling the greet() function pointer. It must point to the beginning of a known function, otherwise it will abort just like the stack protector. Since the middle of self_destruct() isn’t the beginning of a function, it would abort if this exploit is attempted.

However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.

Function address bitmap

As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contain the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.

For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit bit index means the bitmap will only be 2MB.

These 24 bits are perfectly sufficient for 32-bit systems, but it means on 64-bit systems there may be false positives: some addresses will not represent the start of a function, but will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the table, reducing the false positive rate.

Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.

Here are the CFG parameters. I’ve made them macros so that they can easily be tuned at compile-time. The cfg_bits is the integer type backing the bitmap array. The CFG_RESOLUTION is the number of bits dropped, so “3” is a granularity of 8 bytes.

typedef unsigned long cfg_bits;
#define CFG_RESOLUTION  3
#define CFG_BITS        24

Given a function pointer f, this macro extracts the bitmap index.

#define CFG_INDEX(f) \
    (((uintptr_t)(f) >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))
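To make the index arithmetic and the false positives concrete, here is the same computation on plain 64-bit integers. The names cfg_index_of and CFG_ALIAS_STRIDE are made up for this illustration, and the addresses in the note below are the ones from the earlier readelf output.

```c
#include <stdint.h>

#define CFG_RESOLUTION  3
#define CFG_BITS        24

/* Same computation as CFG_INDEX, but on a plain integer so it can be
 * examined without real function pointers. */
static uint32_t
cfg_index_of(uint64_t addr)
{
    return (uint32_t)((addr >> CFG_RESOLUTION) &
                      ((UINT64_C(1) << CFG_BITS) - 1));
}

/* Two addresses this far apart share a bitmap bit: 2^(3+24) bytes. */
#define CFG_ALIAS_STRIDE (UINT64_C(1) << (CFG_RESOLUTION + CFG_BITS))
```

The entry point 0x4006a5 maps to index 0x800d4, while the exploit target 0x4006b5 maps to a different index. An address exactly 128MB (CFG_ALIAS_STRIDE) above the entry point maps to 0x800d4 again, which is the false positive case.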

The CFG bitmap is just an array of integers. Zero it to initialize.

struct cfg {
    cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};

Functions are manually registered in the bitmap using cfg_register().

void
cfg_register(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    cfg->bitmap[i / z] |= 1UL << (i % z);
}

Because functions are registered at run-time, it’s fully compatible with ASLR. If ASLR is enabled, the bitmap will be a little different each run. On the same note, it may be worth XORing each bitmap element with a random, run-time value — along the same lines as the stack canary value — to make it harder for an attacker to manipulate the bitmap should he get the ability to overwrite it by a vulnerability. Alternatively the bitmap could be switched to read-only (e.g. mprotect()) once everything is registered.

And finally, the check function, used immediately before indirect calls. It ensures f was previously passed to cfg_register() (except for false positives, as discussed). Since it will be invoked often, it needs to be fast and simple.

void
cfg_check(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg->bitmap[i / z] >> (i % z)) & 1))
        abort();
}

And that’s it! Now augment main to make use of it:

struct cfg cfg;

int
main(void)
{
    cfg_register(&cfg, self_destruct);  // to prove this works
    cfg_register(&cfg, greet_hello);
    cfg_register(&cfg, greet_aloha);

    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    cfg_check(&cfg, greeter.greet);
    greeter.greet(&greeter);
    return 0;
}

And now attempting the exploit:

$ ./demo < boom
Aborted

Normally self_destruct() wouldn’t be registered since it’s not a legitimate target of an indirect call, but the exploit still didn’t work because it called into the middle of self_destruct(), which isn’t a valid address in the bitmap. The check aborts the program before it can be exploited.

In a real application I would have a global cfg bitmap for the whole program, and define cfg_check() in a header as an inline function.

Despite being possible to implement in straight C without the help of the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.

Update: Ted Unangst pointed out OpenBSD performing a similar check in its mbuf library. Instead of a bitmap, the function pointer is replaced with an index into an array of registered function pointers. That approach is cleaner, more efficient, completely portable, and has no false positives.
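Applied to the greeter example, the index approach might look like the following. This is only an illustration of the idea, not OpenBSD's code. The names greet_table and greeter_greet are hypothetical, and it returns an error where a real check would probably abort().

```c
#include <stddef.h>
#include <stdio.h>

struct greeter {
    char name[8];
    unsigned greet;  /* index into greet_table, not a pointer */
};

static void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

static void
greet_aloha(struct greeter *g)
{
    printf("Aloha, %s.\n", g->name);
}

/* The only functions that can ever be reached through a greeter. */
static void (*const greet_table[])(struct greeter *) = {
    greet_hello,
    greet_aloha,
};
#define GREET_COUNT (sizeof(greet_table) / sizeof(greet_table[0]))

/* Returns 0 after calling the greeter, or -1 if the index is invalid. */
static int
greeter_greet(struct greeter *g)
{
    if (g->greet >= GREET_COUNT)
        return -1;
    greet_table[g->greet](g);
    return 0;
}
```

An overflow can still change which registered greeter runs, but it can no longer direct the call to an arbitrary address such as the middle of self_destruct().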

C Closures as a Library

2017-01-08T22:45:38Z

A common idiom in C is the callback function pointer, either to deliver information (i.e. a visitor or handler) or to customize the function’s behavior (e.g. a comparator). Examples of the latter in the C standard library are qsort() and bsearch(), each requiring a comparator function in order to operate on arbitrary types.

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));

void *bsearch(const void *key, const void *base,
              size_t nmemb, size_t size,
              int (*compar)(const void *, const void *));

A problem with these functions is that there’s no way to pass context to the callback. The callback may need information beyond the two element pointers when making its decision, or to update a result. For example, suppose I have a structure representing a two-dimensional coordinate, and a coordinate distance function.

struct coord {
    float x;
    float y;
};

static inline float
distance(const struct coord *a, const struct coord *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return sqrtf(dx * dx + dy * dy);
}

If I have an array of coordinates and I want to sort them based on their distance from some target, the comparator needs to know the target. However, the qsort() interface has no way to directly pass this information. Instead it has to be passed by another means, such as a global variable.

struct coord *target;

int
coord_cmp(const void *a, const void *b)
{
    float dist_a = distance(a, target);
    float dist_b = distance(b, target);
    if (dist_a < dist_b)
        return -1;
    else if (dist_a > dist_b)
        return 1;
    else
        return 0;
}

And its usage:

    size_t ncoords = /* ... */;
    struct coord *coords = /* ... */;
    struct coord current_target = { /* ... */ };
    // ...
    target = &current_target;
    qsort(coords, ncoords, sizeof(coords[0]), coord_cmp);

Potential problems are that it’s neither thread-safe nor re-entrant. Two different threads cannot use this comparator at the same time. Also, on some platforms and configurations, repeatedly accessing a global variable in a comparator may have a significant cost. A common workaround for thread safety is to make the global variable thread-local by allocating it in thread-local storage (TLS):

_Thread_local struct coord *target;       // C11
__thread struct coord *target;            // GCC and Clang
__declspec(thread) struct coord *target;  // Visual Studio

This makes the comparator thread-safe. However, it’s still not re-entrant (usually unimportant) and accessing thread-local variables on some platforms is even more expensive — which is the situation for Pthreads TLS, though not a problem for native x86-64 TLS.

Modern libraries usually provide some sort of “user data” pointer — a generic pointer that is passed to the callback function as an additional argument. For example, the GNU C Library has long had qsort_r(): re-entrant qsort.

void qsort_r(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *, void *),
           void *arg);

The new comparator looks like this:

int
coord_cmp_r(const void *a, const void *b, void *target)
{
    float dist_a = distance(a, target);
    float dist_b = distance(b, target);
    if (dist_a < dist_b)
        return -1;
    else if (dist_a > dist_b)
        return 1;
    else
        return 0;
}

And its usage:

    void *arg = &current_target;
    qsort_r(coords, ncoords, sizeof(coords[0]), coord_cmp_r, arg);

User data arguments are thread-safe, re-entrant, performant, and perfectly portable. They completely and cleanly solve the entire problem with virtually no drawbacks. If every library did this, there would be nothing left to discuss and this article would be boring.
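To show how little it takes for a library to offer this, here is a sketch of a sort routine that accepts a user data pointer and forwards it to the comparator. The coord_sort function is a hypothetical stand-in (a plain insertion sort), and distance2 compares squared distances, which gives the same ordering as distance() without needing sqrtf().

```c
#include <stddef.h>

struct coord {
    float x;
    float y;
};

/* Squared distance: same ordering as distance(), no sqrtf() needed. */
static float
distance2(const struct coord *a, const struct coord *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return dx * dx + dy * dy;
}

static int
coord_cmp_r(const void *a, const void *b, void *target)
{
    float da = distance2(a, target);
    float db = distance2(b, target);
    return (da > db) - (da < db);
}

/* Hypothetical library routine: insertion sort that passes `arg`
 * through to every comparator call. */
static void
coord_sort(struct coord *v, size_t n,
           int (*cmp)(const void *, const void *, void *), void *arg)
{
    for (size_t i = 1; i < n; i++) {
        struct coord tmp = v[i];
        size_t j = i;
        for (; j > 0 && cmp(&v[j - 1], &tmp, arg) > 0; j--)
            v[j] = v[j - 1];
        v[j] = tmp;
    }
}
```

Two threads can call coord_sort() with different targets at the same time, since the only state involved travels through the argument list.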

The closure solution

In order to make things more interesting, suppose you’re stuck calling a function in some old library that takes a callback but doesn’t support a user data argument. A global variable is insufficient, and the thread-local storage solution isn’t viable for one reason or another. What do you do?

The core problem is that a function pointer is just an address, and it’s the same address no matter the context for any particular callback. On any particular call, the callback has three ways to distinguish this call from other calls. These align with the three solutions above:

Inspect some global state: the global variable solution. The caller will change this state for some other calls.
Query its unique thread ID: the thread-local storage solution. Calls on different threads will have different thread IDs.
Examine a context argument: the user pointer solution.

A wholly different approach is to use a unique function pointer for each callback. The callback could then inspect its own address to differentiate itself from other callbacks. Imagine defining multiple instances of coord_cmp each getting their context from a different global variable. Using a unique copy of coord_cmp on each thread for each usage would be both re-entrant and thread-safe, and wouldn’t require TLS.
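Done by hand at compile time, the "multiple instances" idea could look like this. The DEFINE_COORD_CMP macro and the names it generates are hypothetical, and distance2 again compares squared distances to keep the sketch free of sqrtf().

```c
#include <stdlib.h>

struct coord {
    float x;
    float y;
};

static float
distance2(const struct coord *a, const struct coord *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return dx * dx + dy * dy;
}

/* Each expansion defines one global and one comparator bound to it,
 * so every comparator has a distinct address and distinct context. */
#define DEFINE_COORD_CMP(name)                          \
    static const struct coord *name##_target;           \
    static int name(const void *a, const void *b)       \
    {                                                   \
        float da = distance2(a, name##_target);         \
        float db = distance2(b, name##_target);         \
        return (da > db) - (da < db);                   \
    }

DEFINE_COORD_CMP(coord_cmp_0)
DEFINE_COORD_CMP(coord_cmp_1)
```

Each comparator works with plain qsort(), but the number of instances is fixed when the program is compiled. Generating them at run time removes that limit.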

Taking this idea further, I’d like to generate these new functions on demand at run time akin to a JIT compiler. This can be done as a library, mostly agnostic to the implementation of the callback. Here’s an example of what its usage will be like:

void *closure_create(void *f, int nargs, void *userdata);
void  closure_destroy(void *);

The callback to be converted into a closure is f and the number of arguments it takes is nargs. A new closure is allocated and returned as a function pointer. This closure takes nargs - 1 arguments, and it will call the original callback with the additional argument userdata.

So, for example, this code uses a closure to convert coord_cmp_r into a function suitable for qsort():

int (*closure)(const void *, const void *);
closure = closure_create(coord_cmp_r, 3, &current_target);

qsort(coords, ncoords, sizeof(coords[0]), closure);

closure_destroy(closure);

Caveat: This API is utterly insufficient for any sort of portability. The number of arguments isn’t nearly enough information for the library to generate a closure. For practically every architecture and ABI, it’s going to depend on the types of each of those arguments. On x86-64 with the System V ABI — where I’ll be implementing this — this argument will only count integer/pointer arguments. To find out what it takes to do this properly, see the libjit documentation.

Memory design

This implementation will be for x86-64 Linux, though the high level details will be the same for any program running in virtual memory. My closures will span exactly two consecutive pages (typically 8kB), though it’s possible to use exactly one page depending on the desired trade-offs. The reason I need two pages is that each page will have different protections.

Native code (the thunk) lives in the upper page. The user data pointer and callback function pointer live at the high end of the lower page. The two pointers could really be anywhere in the lower page, and they’re only at the end for aesthetic reasons. The thunk code will be identical for all closures of the same number of arguments.

The upper page will be executable and the lower page will be writable. This allows new pointers to be set without writing to executable thunk memory. In the future I expect operating systems to enforce W^X (“write xor execute”), and this code will already be compliant. Alternatively, the pointers could be “baked in” with the thunk page and immutable, but since creating a closure requires two system calls, I figure it’s better that the pointers be mutable and the closure object reusable.

The address for the closure itself will be the upper page, being what other functions will call. The thunk will load the user data pointer from the lower page as an additional argument, then jump to the actual callback function also given by the lower page.

Thunk assembly

The x86-64 thunk assembly for a 2-argument closure calling a 3-argument callback looks like this:

user:  dq 0
func:  dq 0
;; --- page boundary here ---
thunk2:
        mov  rdx, [rel user]
        jmp  [rel func]

As a reminder, the integer/pointer argument register order for the System V ABI calling convention is: rdi, rsi, rdx, rcx, r8, r9. The third argument is passed through rdx, so the user pointer is loaded into this register. Then it jumps to the callback address with the original arguments still in place, plus the new argument. The user and func values are loaded RIP-relative (rel) to the address of the code. The thunk is using the callback address (its own address) to determine the context.

The assembled machine code for the thunk is just 13 bytes:

unsigned char thunk2[16] = {
    // mov  rdx, [rel user]
    0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
    // jmp  [rel func]
    0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
};
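The two displacements in that machine code (e9 ff ff ff and eb ff ff ff) follow from the memory layout. As a sketch of the arithmetic, measure every offset from the page boundary, so the thunk starts at 0, user sits at -16, and func at -8. The helper names rel32 and put_le32 are hypothetical.

```c
#include <stdint.h>

/* RIP-relative displacement: the target minus the address of the
 * *next* instruction (instruction offset plus its length). */
static int32_t
rel32(int32_t target, int32_t insn_offset, int32_t insn_length)
{
    return target - (insn_offset + insn_length);
}

/* Store a 32-bit displacement as little-endian bytes. */
static void
put_le32(unsigned char *out, int32_t v)
{
    uint32_t u = (uint32_t)v;
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)(u >> (8 * i));
}
```

The 7-byte mov at offset 0 reaches user with rel32(-16, 0, 7), which is -23, or e9 ff ff ff. The 6-byte jmp at offset 7 reaches func with rel32(-8, 7, 6), which is -21, or eb ff ff ff.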

All closure_create() has to do is allocate two pages, copy this buffer into the upper page, adjust the protections, and return the address of the thunk. Since closure_create() will work for nargs number of arguments, there will actually be 6 slightly different thunks, one for each of the possible register arguments (rdi through r9).

static unsigned char thunk[6][13] = {
    {
        0x48, 0x8b, 0x3d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x35, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x4C, 0x8b, 0x05, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x4C, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    },
};

Given a closure pointer returned from closure_create(), here are the setter functions for setting the closure’s two pointers.

void
closure_set_data(void *closure, void *data)
{
    void **p = closure;
    p[-2] = data;
}

void
closure_set_function(void *closure, void *f)
{
    void **p = closure;
    p[-1] = f;
}

In closure_create(), allocation is done with an anonymous mmap(), just like in my JIT compiler. It’s initially mapped writable in order to copy the thunk, then the thunk page is set to executable.

void *
closure_create(void *f, int nargs, void *userdata)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    char *p = mmap(0, page_size * 2, prot, flags, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    void *closure = p + page_size;
    memcpy(closure, thunk[nargs - 1], sizeof(thunk[0]));
    mprotect(closure, page_size, PROT_READ | PROT_EXEC);

    closure_set_function(closure, f);
    closure_set_data(closure, userdata);
    return closure;
}

Destroying a closure is done by computing the lower page address and calling munmap() on it:

void
closure_destroy(void *closure)
{
    long page_size = sysconf(_SC_PAGESIZE);
    munmap((char *)closure - page_size, page_size * 2);
}

And that’s it! You can see the entire demo here:

closure-demo.c

It’s a lot simpler for x86-64 than it is for x86, where there’s no RIP-relative addressing and arguments are passed on the stack. The arguments must all be copied back onto the stack, above the new argument, and it cannot be a tail call since the stack has to be fixed before returning. Here’s what the thunk looks like for a 2-argument closure:

data:	dd 0
func:	dd 0
;; --- page boundary here ---
thunk2:
        call .rip2eax
.rip2eax:
        pop eax
        push dword [eax - 13]
        push dword [esp + 12]
        push dword [esp + 12]
        call [eax - 9]
        add esp, 12
        ret

Exercise for the reader: Port the closure demo to a different architecture or to the Windows x64 ABI.

Relocatable Global Data on x86

2016-12-23T22:50:51Z

Relocatable code — program code that executes correctly from any properly-aligned address — is an essential feature for shared libraries. Otherwise all of a system’s shared libraries would need to coordinate their virtual load addresses. Loading programs and libraries to random addresses is also a valuable security feature: Address Space Layout Randomization (ASLR). But how does a compiler generate code for a function that accesses a global variable if that variable’s address isn’t known at compile time?

Consider this simple C code sample.

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    return x < 4 ? values[x] : 0.0f;
}

This function needs the base address of values in order to dereference it for values[x]. The easiest way to find out how this works, especially without knowing where to start, is to compile the code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian Jessie).

$ gcc -c -Os -fPIC get_value.c

I optimized for size (-Os) to make the disassembly easier to follow. Next, disassemble this pre-linked code with objdump. Alternatively I could have asked for the compiler’s assembly output with -S, but this will be good reverse engineering practice.

$ objdump -d -Mintel get_value.o
0000000000000000 <get_value>:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 <get_value+0x16>
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret

There are a couple of interesting things going on, but let’s start from the beginning.

The ABI specifies that the first integer/pointer argument (the 32-bit integer x) is passed through the edi register. The function compares x to 3, to satisfy x < 4.
The ABI specifies that floating point values are returned through the SSE2 SIMD register xmm0. It’s cleared by XORing the register with itself — the conventional way to clear registers on x86 — setting up for a return value of 0.0f.
It then uses the result of the previous comparison to perform a jump, ja (“jump if above”). That is, jump to the relative address specified by the jump’s operand if the first operand to cmp (edi) is above the second operand (0x3) as unsigned values. Its cousin, jg (“jump if greater”), is for signed values. If x is outside the array bounds, it jumps straight to ret, returning 0.0f.
If x was in bounds, it uses a lea (“load effective address”) to load something into the 64-bit rax register. This is the complicated bit, and I’ll start by giving the answer: The value loaded into rax is the address of the values array. More on this in a moment.
Finally it uses x as an index into the address in rax. The movss (“move scalar single-precision”) instruction loads a 32-bit float into the first lane of xmm0, where the caller expects to find the return value. This is all preceded by a mov edi, edi which looks like a hotpatch nop, but it isn’t. x86-64 always uses 64-bit registers for addressing, meaning it uses rdi not edi. All 32-bit register assignments clear the upper 32 bits, and so this mov zero-extends edi into rdi. This is in case of the unlikely event that the caller left garbage in those upper bits.

Clearing xmm0

The first interesting part: xmm0 is cleared even when its first lane is loaded with a value. There are two reasons to do this.

The obvious reason is that the alternative requires additional instructions, and I told GCC to optimize for size. It would need either an extra ret or a conditional jmp over the “else” branch.

The less obvious reason is that it breaks a data dependency. For over 20 years now, x86 micro-architectures have employed an optimization technique called register renaming. Architectural registers (rax, edi, etc.) are just temporary names for underlying physical registers. This disconnect allows for more aggressive out-of-order execution. Two instructions sharing an architectural register can be executed independently so long as there are no data dependencies between these instructions.

For example, take this assembly sample. It assembles to 9 bytes of machine code.

    mov  edi, [rcx]
    mov  ecx, 7
    shl  eax, cl

This reads a 32-bit value from the address stored in rcx, then assigns ecx and uses cl (the lowest byte of rcx) in a shift operation. Without register renaming, the shift couldn’t be performed until the load in the first instruction completed. However, the second instruction is a 32-bit assignment, which, as I mentioned before, also clears the upper 32 bits of rcx, wiping the unused parts of the register.

So after the second instruction, it’s guaranteed that the value in rcx has no dependencies on code that comes before it. Because of this, it’s likely a different physical register will be used for the second and third instructions, allowing these instructions to be executed out of order, before the load. Ingenious!

Compare it to this example, where the second instruction assigns to cl instead of ecx. This assembles to just 6 bytes.

    mov  edi, [rcx]
    mov  cl, 7
    shl  eax, cl

The result is 3 bytes smaller, but since it’s not a 32-bit assignment, the upper bits of rcx still hold the original register contents. This creates a false dependency and may prevent out-of-order execution, reducing performance.

By clearing xmm0, instructions in get_value involving xmm0 have the opportunity to be executed prior to instructions in the caller that use xmm0.

RIP-relative addressing

Going back to the instruction that computes the address of values.

   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]

Normally load/store addresses are absolute, based off an address either in a general purpose register, or at some hard-coded base address. The latter is not an option in relocatable code. With RIP-relative addressing that’s still the case, but the register with the absolute address is rip, the instruction pointer. This addressing mode was introduced in x86-64 to make relocatable code more efficient.

That means this instruction copies the instruction pointer (pointing to the next instruction) into rax, plus a 32-bit displacement, currently zero. This isn’t the right way to encode a displacement of zero (unless you want a larger instruction). That’s because the displacement will be filled in later by the linker. The compiler adds a relocation entry to the object file so that the linker knows how to do this.

On platforms that use ELF we can inspect relocations with readelf.

$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4

The relocation type is R_X86_64_PC32. In the AMD64 Architecture Processor Supplement, this is defined as “S + A - P”.

S: Represents the value of the symbol whose index resides in the relocation entry.
A: Represents the addend used to compute the value of the relocatable field.
P: Represents the place of the storage unit being relocated.

The symbol, S, is .rodata: the final address for this object file’s portion of .rodata (where values resides). The addend, A, is -4 since the instruction pointer points at the next instruction. That is, this will be relative to four bytes after the relocation offset. Finally, the address of the relocation, P, is the address of the last four bytes of the lea instruction. These values are all known at link-time, so no run-time support is necessary.

Being “S - P” (overall), this will be the displacement between these two addresses: the 32-bit value is relative. It’s relocatable so long as these two parts of the binary (code and data) maintain a fixed distance from each other. The binary is relocated as a whole, so this assumption holds.
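The arithmetic can be checked with a small C sketch. The function names and any addresses passed in are made up for illustration; only the formula "S + A - P" and the addend of -4 come from the text above.

```c
#include <stdint.h>

/* Compute an R_X86_64_PC32 relocation value: S + A - P, truncated to
 * 32 bits. The inputs are hypothetical link-time addresses. */
static int32_t reloc_pc32(uint64_t s, int64_t a, uint64_t p)
{
    return (int32_t)(s + (uint64_t)a - p);
}

/* What the lea does at run time: rip points just past the 4-byte
 * displacement (the last field of the instruction), so the result
 * is (P + 4) + displacement. */
static uint64_t lea_rip(uint64_t p, int32_t disp)
{
    return p + 4 + (uint64_t)(int64_t)disp;
}
```

With an addend of -4 the two functions cancel: lea_rip(P, reloc_pc32(S, -4, P)) gives back S, and it keeps doing so when S and P are shifted by the same amount, which is what makes the code relocatable.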

32-bit relocation

Since RIP-relative addressing wasn’t introduced until x86-64, how did this all work on x86? Again, let’s just see what the compiler does. Add the -m32 flag for a 32-bit target, and -fomit-frame-pointer to make it simpler for explanatory purposes.

$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 <get_value>:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 <get_value+0x7>
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f <get_value+0x1f>
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 <__x86.get_pc_thunk.cx>:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret

Hmm, this one includes an extra function.

In this calling convention, arguments are passed on the stack. The first instruction loads the argument, x, into eax.
The fldz instruction clears the x87 floating point return register, just like clearing xmm0 in the x86-64 version.
Next it calls __x86.get_pc_thunk.cx. The call pushes the instruction pointer, eip, onto the stack. This function reads that value off the stack into ecx and returns. In other words, calling this function copies eip into ecx. It’s setting up to load data at an address relative to the code. Notice the function name starts with two underscores, a name which is reserved exactly for these sorts of implementation purposes.
Next a 32-bit displacement is added to ecx. In this case it’s 2, but, like before, this is actually going to be filled in later by the linker.
Then it’s just like before: a branch to optionally load a value. The floating point load (fld) is another relocation.

Let’s look at the relocations. There are three this time:

$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata

The first relocation is the call-site for the thunk. The thunk has external linkage and may be merged with a matching thunk in another object file, and so may be relocated. (Clang inlines its thunk.) Calls are relative, so its type is R_386_PC32: a code-relative displacement just like on x86-64.

The next is of type R_386_GOTPC and sets the second operand in that add ecx. It’s defined as “GOT + A - P” where “GOT” is the address of the Global Offset Table — a table of addresses of the binary’s relocated objects. Since values is static, the GOT won’t actually hold an address for it, but the relative address of the GOT itself will be useful.

The final relocation is of type R_386_GOTOFF. This is defined as “S + A - GOT”. Another displacement between two addresses. This is the displacement in the load, fld. Ultimately the load adds these last two relocations together, canceling the GOT:

  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P

So the GOT isn’t relevant in this case. It’s just a mechanism for constructing a custom relocation type.
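A quick way to convince yourself of the cancellation is to compute both relocations with two different GOT addresses and compare the sums. This is only a sketch: the function names are invented, and the addresses a caller passes in are arbitrary.

```c
#include <stdint.h>

/* R_386_GOTPC: GOT + A - P, the value patched into "add ecx, imm32". */
static uint32_t reloc_gotpc(uint32_t got, int32_t a0, uint32_t p)
{
    return got + (uint32_t)a0 - p;
}

/* R_386_GOTOFF: S + A - GOT, the displacement patched into the fld. */
static uint32_t reloc_gotoff(uint32_t s, int32_t a1, uint32_t got)
{
    return s + (uint32_t)a1 - got;
}

/* Sum of both relocations. GOT cancels, leaving S + A0 + A1 - P. */
static uint32_t combined(uint32_t got, uint32_t s, uint32_t p,
                         int32_t a0, int32_t a1)
{
    return reloc_gotpc(got, a0, p) + reloc_gotoff(s, a1, got);
}
```

Unsigned 32-bit arithmetic wraps, so the identity holds for any choice of GOT address, including ones that make the intermediate values overflow.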

Branch optimization

Notice in the x86 version the thunk is called before checking the argument. What if it’s most likely that x will be out of bounds of the array, and the function usually returns zero? That means it’s usually wasting its time calling the thunk. Without profile-guided optimization the compiler probably won’t know this.

The typical way to provide such a compiler hint is with a pair of macros, likely() and unlikely(). With GCC and Clang, these would be defined to use __builtin_expect. Compilers without this sort of feature would have macros that do nothing instead. So I gave it a shot:

#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    return unlikely(x < 4) ? values[x] : 0.0f;
}

Unfortunately this makes no difference even in the latest version of GCC. In Clang it changes branch fall-through (for static branch prediction), but still always calls the thunk. It seems compilers have difficulty with optimizing relocatable code on x86.

x86-64 isn’t just about more memory

It’s commonly understood that the advantage of 64-bit versus 32-bit systems is processes having access to more than 4GB of memory. But as this shows, there’s more to it than that. Even programs that don’t need that much memory can really benefit from newer features like RIP-relative addressing.

A Showerthoughts Fortune File

2016-12-01T23:58:15Z

I have created a fortune file for the all-time top 10,000 /r/Showerthoughts posts, as of October 2016. As a word of warning: Many of these entries are adult humor and may not be appropriate for your work computer. These fortunes would be categorized as “offensive” (fortune -o).

Download: showerthoughts (1.3 MB)

The copyright status of this file is subject to each of its thousands of authors. Since it’s not possible to contact many of these authors — some may not even still live — it’s obviously never going to be under an open source license (Creative Commons, etc.). Even more, some quotes are probably from comedians and such, rather than by the redditor who made the post. I distribute it only for fun.

Installation

To install this into your fortune database, first process it with strfile to create a random-access index, showerthoughts.dat, then copy them to the directory with the rest.

$ strfile showerthoughts
"showerthoughts.dat" created
There were 10000 strings
Longest string: 343 bytes
Shortest string: 39 bytes

$ cp showerthoughts* /usr/share/games/fortunes/

Alternatively, fortune can be told to use this file directly:

$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
        ―AndItsDeepToo, Aug 2016

If you didn’t already know, fortune is an old unix utility that displays a random quotation from a quotation database — a digital fortune cookie. I use it as an interactive login shell greeting on my ODROID-C2 server:

if shopt -q login_shell; then
    fortune ~/.fortunes
fi

How was it made?

Fortunately I didn’t have to do something crazy like scrape reddit for weeks on end. Instead, I downloaded the pushshift.io submission archives, which are currently around 70 GB compressed. Each file contains one month’s worth of JSON data, one object per submission, one submission per line, all compressed with bzip2.

Unlike so many other datasets, especially when it’s made up of arbitrary inputs from millions of people, the format of the /r/Showerthoughts posts is surprisingly very clean and requires virtually no touching up. It’s some really fantastic data.

A nice feature of bzip2 is that concatenating compressed files also concatenates the uncompressed files. Additionally, it’s easy to parallelize bzip2 compression and decompression, which gives it an edge over xz. I strongly recommend using lbzip2 to decompress this data, should you want to process it yourself.

cat RS_*.bz2 | lbunzip2 > everything.json

jq is my favorite command line tool for processing JSON (and rendering fractals). To filter all the /r/Showerthoughts posts, it’s a simple select expression. Just mind the capitalization of the subreddit’s name. The -c tells jq to keep it one per line.

cat RS_*.bz2 | \
    lbunzip2 | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

However, you’ll quickly find that jq is the bottleneck, parsing all that JSON. Your cores won’t be exploited by lbzip2 as they should. So I throw grep in front to dramatically decrease the workload for jq.

cat *.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

This will let some extra things through, but it’s a superset. The -a option is necessary because the data contains some null bytes. Without it, grep switches into binary mode and breaks everything. This is incredibly frustrating when you’ve already waited half an hour for results.

To reduce the workload further down the pipeline, I take advantage of the fact that only four fields will be needed: title, score, author, and created_utc. The rest can, and for efficiency’s sake should, be thrown away where it’s cheap to do so.

cat *.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts") |
               {title, score, author, created_utc}' \
    > showerthoughts.json

This gathers all 1,199,499 submissions into a 185 MB JSON file (as of this writing). Most of these submissions are terrible, so the next step is narrowing it to the small set of good submissions and putting them into the fortune database format.

It turns out reddit already has a method for finding the best submissions: a voting system. Just pick the highest scoring posts. Through experimentation I arrived at 10,000 as the magic cut-off number. After this the quality really starts to drop off. Over time this should probably be scaled up with the total number of submissions.

I did both steps at the same time using a bit of Emacs Lisp, which is particularly well-suited to the task:

https://github.com/skeeto/showerthoughts

This Elisp program reads one JSON object at a time and sticks each into an AVL tree sorted by score (descending), then timestamp (ascending), then title (ascending). The AVL tree is limited to 10,000 items, with the lowest items being dropped. This was a lot faster than the more obvious approach: collecting everything into a big list, sorting it, and keeping the top 10,000 items.
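The same bounded "keep only the best N" idea can be sketched in C. Note the swap: this uses a fixed-capacity binary min-heap rather than the AVL tree in the Elisp program, and it orders by score alone, ignoring the timestamp and title tie-breakers. All names are invented for the example.

```c
#include <stddef.h>

/* Fixed-capacity min-heap of scores: heap[0] is always the lowest
 * score currently kept, i.e. the next one to be dropped. */
struct topn {
    long  *heap;  /* caller-provided storage for cap entries */
    size_t len;
    size_t cap;
};

static void topn_swap(long *a, long *b)
{
    long t = *a; *a = *b; *b = t;
}

/* Restore heap order downward from index i. */
static void topn_sift_down(struct topn *t, size_t i)
{
    for (;;) {
        size_t l = 2*i + 1, r = l + 1, m = i;
        if (l < t->len && t->heap[l] < t->heap[m]) m = l;
        if (r < t->len && t->heap[r] < t->heap[m]) m = r;
        if (m == i) return;
        topn_swap(&t->heap[i], &t->heap[m]);
        i = m;
    }
}

/* Offer one score; it is kept only if it belongs in the top cap. */
static void topn_offer(struct topn *t, long score)
{
    if (t->len < t->cap) {
        /* Still room: append and sift up. */
        size_t i = t->len++;
        t->heap[i] = score;
        while (i > 0 && t->heap[i] < t->heap[(i - 1)/2]) {
            topn_swap(&t->heap[i], &t->heap[(i - 1)/2]);
            i = (i - 1)/2;
        }
    } else if (t->cap > 0 && score > t->heap[0]) {
        /* Full: replace the current lowest and restore order. */
        t->heap[0] = score;
        topn_sift_down(t, 0);
    }
}
```

Either structure avoids holding and sorting all 1.2 million submissions at once: memory stays proportional to the cut-off, and most low-scoring submissions are rejected after a single comparison against the current minimum.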

Formatting

The most complicated part is actually paragraph wrapping the submissions. Most are too long for a single line, and letting the terminal hard wrap them is visually unpleasing. The submissions are encoded in UTF-8, some with characters beyond simple ASCII. Proper wrapping requires not just Unicode awareness, but also some degree of Unicode rendering. The algorithm needs to recognize grapheme clusters and know the size of the rendered text. This is not so trivial! Most paragraph wrapping tools and libraries get this wrong, some counting width by bytes, others counting width by codepoints.

Emacs’ M-x fill-paragraph knows how to do all these things — only for a monospace font, which is all I needed — and I decided to leverage it when generating the fortune file. Here’s an example that paragraph-wraps a string:

(defun string-fill-paragraph (s)
  (with-temp-buffer
    (insert s)
    (fill-paragraph)
    (buffer-string)))

For the file format, items are delimited by a % on a line by itself. I put the wrapped content, followed by a quotation dash, the author, and the date. A surprising number of these submissions have date-sensitive content (“on this day X years ago”), so I found it was important to include a date.

April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
        ―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
        ―MKLV, Aug 2015
%

There’s the potential that a submission itself could end with a lone % and, with a bit of bad luck, it happens to wrap that onto its own line. Fortunately this hasn’t happened yet. But, now that I’ve advertised it, someone could make such a submission, popular enough for the top 10,000, with the intent to personally trip me up in a future update. I accept this, though it’s unlikely, and it would be fairly easy to work around if it happened.

The strfile program looks for the % delimiters and fills out a table of file offsets. The header of the .dat file indicates the number of strings along with some other metadata. What follows is a table of 32-bit file offsets.

struct {
    uint32_t str_version;  /* version number */
    uint32_t str_numstr;   /* # of strings in the file */
    uint32_t str_longlen;  /* length of longest string */
    uint32_t str_shortlen; /* shortest string length */
    uint32_t str_flags;    /* bit field for flags */
    char str_delim;        /* delimiting character */
}

Note that the table doesn’t necessarily need to list the strings in the same order as they appear in the original file. In fact, recent versions of strfile can sort the strings by sorting the table, all without touching the original file. Though none of this is important to fortune.
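To make the delimiter scan concrete, here is a rough C sketch of the indexing step. It is not strfile’s actual code: the function name is invented, it works on a buffer already in memory, and it only records where each string starts rather than writing a .dat file.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan a fortune file held in memory and record the offset at which
 * each string begins. A delimiter is a '%' alone on a line. Returns
 * the number of strings found; at most cap offsets are stored. */
static size_t index_fortunes(const char *buf, size_t len,
                             uint32_t *offsets, size_t cap)
{
    size_t count = 0;
    size_t start = 0;  /* where the current string began */
    size_t i = 0;
    while (i < len) {
        /* Find the end of the current line. */
        size_t eol = i;
        while (eol < len && buf[eol] != '\n') eol++;
        size_t next = eol < len ? eol + 1 : len;

        if (eol - i == 1 && buf[i] == '%') {
            /* Delimiter line: the string spans [start, i). */
            if (i > start) {
                if (count < cap) offsets[count] = (uint32_t)start;
                count++;
            }
            start = next;
        }
        i = next;
    }
    /* Trailing string with no final delimiter. */
    if (len > start) {
        if (count < cap) offsets[count] = (uint32_t)start;
        count++;
    }
    return count;
}
```

Picking a random fortune is then a matter of choosing a random table entry and seeking to that offset, which is why the index makes access constant-time regardless of file size.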

Now that you know how it all works, you can build your own fortune file from your own inputs!

A Magnetized Needle and a Steady Hand

2016-11-17T23:35:26Z

Now they’ve gone and done it. An unidentified agency has spread a potent computer virus across all the world’s computers and deleted the binaries for every copy of every software development tool. Even the offline copies. It’s that potent.

Most of the source code still exists, even for the compilers, and most computer systems will continue operating without disruption, but no new software can be developed unless it’s written byte by byte in raw machine code. Only real programmers can get anything done.

The world’s top software developers have been put to work bootstrapping a C compiler (and others) completely from scratch so that we can get back to normal. Without even an assembler, it’s a slow, tedious process.

In the meantime, rather than wait around for the bootstrap work to complete, the rest of us have been assigned individual programs hit by the virus. For example, many basic unix utilities have been wiped out, and the bootstrap would benefit from having them. Having different groups tackle each missing program will allow the bootstrap effort to move forward somewhat in parallel. At least that’s what the compiler nerds told us. The real reason is that they’re tired of being asked if they’re done yet, and these tasks will keep the rest of us quietly busy.

Fortunately you and I have been assigned the easiest task of all: We’re to write the true command from scratch. We’ll have to figure it out byte by byte. The target is x86-64 Linux, which means we’ll need the following documentation:

1. Executable and Linking Format (ELF) Specification. This is the binary format used by modern Unix-like systems, including Linux. A more convenient way to access this document is man 5 elf.
2. Intel 64 and IA-32 Architectures Software Developer’s Manual (Volume 2). This fully documents the instruction set and its encoding. It’s all the information needed to write x86 machine code by hand. The AMD manuals would work too.
3. System V Application Binary Interface: AMD64 Architecture Processor Supplement. Only a few pieces of information are needed from this document, but more would be needed for a more substantial program.
4. Some magic numbers from header files.

Manual Assembly

The program we’re writing is true, whose behavior is documented as “do nothing, successfully.” All command line arguments are ignored and no input is read. The program only needs to perform the exit system call, immediately terminating the process.

According to the ABI document (3) Appendix A, the registers for system call arguments are: rdi, rsi, rdx, r10, r8, r9. The system call number goes in rax. The exit system call takes only one argument, and that argument will be 0 (success), so rdi should be set to zero. It’s likely that it’s already zero when the program starts, but the ABI document says its contents are undefined (§3.4), so we’ll set it explicitly.

For Linux on x86-64, the system call number for exit is 60 (see /usr/include/asm/unistd_64.h), so rax will be set to 60, followed by syscall.

    xor  edi, edi
    mov  eax, 60
    syscall

There’s no assembler available to turn this into machine code, so it has to be assembled by hand. For that we need the Intel manual (2).

The first instruction is xor, so look up that mnemonic in the manual. Like most x86 mnemonics, there are many different opcodes and multiple ways to encode the same operation. For xor, we have 22 opcodes to examine.

The operands are two 32-bit registers, so there are two options: opcodes 0x31 and 0x33.

31 /r      XOR r/m32, r32
33 /r      XOR r32, r/m32

The “r/m32” means the operand can be either a register or the address of a 32-bit region of memory. With two register operands, both encodings are equally valid, both have the same length (2 bytes), and neither is canonical, so the decision is entirely arbitrary. Let’s pick the first one, opcode 0x31, since it’s listed first.

The “/r” after the opcode means the register-only operand (“r32” in both cases) will be specified in the ModR/M byte. This is the byte that immediately follows the opcode and specifies one or two of the operands.

The ModR/M byte is broken into three parts: mod (2 bits), reg (3 bits), r/m (3 bits). This gets a little complicated, but if you stare at Table 2-1 in the Intel manual for long enough it eventually makes sense. In short, two high bits (11) for mod indicates we’re working with a register rather than a load. Here’s where we’re at for ModR/M:

11 ??? ???

The order of the x86 registers is unintuitive: ax, cx, dx, bx, sp, bp, si, di. With 0-indexing, that gives di a value of 7 (111 in binary). With edi as both operands, this makes ModR/M:

11 111 111

Or, in hexadecimal, FF. And that’s it for this instruction. With the opcode (0x31) and the ModR/M byte (0xFF):

31 FF
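Once compilers exist again, the bit packing could be written down as a few lines of C. This is just a sketch of the encoding rules described above; the enum and function names are invented for the example.

```c
#include <stdint.h>

/* Register numbers in x86 encoding order: ax, cx, dx, bx, sp, bp, si, di. */
enum { REG_AX, REG_CX, REG_DX, REG_BX, REG_SP, REG_BP, REG_SI, REG_DI };

/* Pack a ModR/M byte: mod (2 bits), reg (3 bits), r/m (3 bits). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)((mod & 3) << 6 | (reg & 7) << 3 | (rm & 7));
}

/* Encode "xor r/m32, r32" (opcode 0x31) for two registers. */
static void encode_xor_r32_r32(uint8_t out[2], unsigned dst, unsigned src)
{
    out[0] = 0x31;
    out[1] = modrm(3, src, dst);  /* mod=11: r/m is a register */
}
```

For opcode 0x31 the destination sits in the r/m field and the source in the reg field. With edi on both sides the distinction disappears, which is part of why this instruction is a gentle first exercise.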

The encoding for mov is a bit different. Look it up and match the operands. Like before, there are two possible options:

B8+rd id   MOV r32, imm32
C7 /0 id   MOV r/m32, imm32

The B8+rd notation means the 32-bit register operand (rd for “register double word”) is added to the opcode instead of having a ModR/M byte. It’s followed by a 32-bit immediate value (id for “immediate double word”). That’s a total of 5 bytes.

The “/0” in the second means 0 goes in the “reg” field of ModR/M, and the whole instruction is followed by the 32-bit immediate (id). That’s a total of 6 bytes. Since this is longer, we’ll use the first encoding.

So, that’s opcode 0xB8 + 0, since eax is register number 0, followed by 60 (0x3C) as a little endian, 4-byte value. Here’s the encoding for the second instruction:

B8 3C 00 00 00
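The same steps as a C sketch, for whenever a compiler is available again. The function name is hypothetical; the opcode and byte order are the ones worked out above.

```c
#include <stdint.h>

/* Encode "mov r32, imm32" using the B8+rd form: the register number is
 * added to the opcode and the immediate follows in little-endian order. */
static void encode_mov_r32_imm32(uint8_t out[5], unsigned reg, uint32_t imm)
{
    out[0] = (uint8_t)(0xB8 + (reg & 7));
    out[1] = (uint8_t)(imm >>  0);
    out[2] = (uint8_t)(imm >>  8);
    out[3] = (uint8_t)(imm >> 16);
    out[4] = (uint8_t)(imm >> 24);
}
```

Writing the immediate out one shifted byte at a time keeps the result little-endian no matter what byte order the machine running the encoder uses.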

The final instruction is a cakewalk. There are no operands, and it comes in only one form, with two opcode bytes.

0F 05   SYSCALL

So the encoding for this instruction is:

0F 05

Putting it all together the program is 9 bytes:

31 FF B8 3C 00 00 00 0F 05

Aren’t you glad you don’t normally have to assemble entire programs by hand?

Constructing the ELF

Back in the old days you may have been able to simply drop these bytes into a file and execute it. That’s how DOS COM programs worked. But this definitely won’t work if you try it on Linux. Binaries must be in the Executable and Linking Format (ELF). This format tells the loader how to initialize the program in memory and how to start it.

Fortunately for this program we’ll only need to fill out two structures: the ELF header and one program header. The binary will be the ELF header, followed immediately by the program header, followed immediately by the program.

To fill this binary out, we’d use whatever method the virus left behind for writing raw bytes to a file. For now I’ll assume the echo command is still available, and we’ll use hexadecimal \xNN escapes to write raw bytes. If this isn’t available, you might need to use the magnetic needle and steady hand method, or the butterflies.

The very first structure in an ELF file must be the ELF header, from the ELF specification (1):

    typedef struct {
        unsigned char e_ident[EI_NIDENT];
        uint16_t      e_type;
        uint16_t      e_machine;
        uint32_t      e_version;
        ElfN_Addr     e_entry;
        ElfN_Off      e_phoff;
        ElfN_Off      e_shoff;
        uint32_t      e_flags;
        uint16_t      e_ehsize;
        uint16_t      e_phentsize;
        uint16_t      e_phnum;
        uint16_t      e_shentsize;
        uint16_t      e_shnum;
        uint16_t      e_shstrndx;
    } ElfN_Ehdr;

No other data is at a fixed location because this header specifies where it can be found. If you’re writing a C program in the future, once compilers have been bootstrapped back into existence, you can access this structure in elf.h.

The ELF header

The EI_NIDENT macro is 16, so e_ident is 16 bytes. The first 4 bytes are fixed: 0x7F, E, L, F.

The 5th byte is called EI_CLASS: a 32-bit program (ELFCLASS32 = 1) or a 64-bit program (ELFCLASS64 = 2). This will be a 64-bit program (2).

The 6th byte indicates the integer format (EI_DATA). The one we want for x86-64 is ELFDATA2LSB (1), two’s complement, little-endian.

The 7th byte is the ELF version (EI_VERSION), always 1 as of this writing.

The 8th byte is the ABI (EI_OSABI), which in this case is ELFOSABI_SYSV (0).

The 9th byte is the ABI version (EI_ABIVERSION), which is just 0 again.

The rest is zero padding.

So writing the ELF header:

echo -ne '\x7FELF\x02\x01\x01\x00' > true
echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The next field is the e_type. This is an executable program, so it’s ET_EXEC (2). Other options are object files (ET_REL = 1), shared libraries (ET_DYN = 3), and core files (ET_CORE = 4).

echo -ne '\x02\x00' >> true

The value for e_machine is EM_X86_64 (0x3E). This value isn’t in the ELF specification but rather the ABI document (§4.1.1). On BSD this is instead named EM_AMD64.

echo -ne '\x3E\x00' >> true

For e_version it’s always 1, like in the header.

echo -ne '\x01\x00\x00\x00' >> true

The e_entry field will be 8 bytes because this is a 64-bit ELF. This is the virtual address of the program’s entry point. It’s where the loader will pass control and so it’s where we’ll load the program. The typical entry address is somewhere around 0x400000. Here we’ll use the even rounder base 0x40000000 and, for a reason I’ll explain shortly, place our entry point 120 bytes (0x78) after it, at 0x40000078.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The e_phoff field holds the offset of the program header table. The ELF header is 64 bytes (0x40) and this structure will immediately follow. It’s also 8 bytes.

echo -ne '\x40\x00\x00\x00\x00\x00\x00\x00' >> true

The e_shoff field holds the offset of the section table. In an executable program we don’t need sections, so this is zero.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The e_flags field has processor-specific flags, which in our case is just 0.

echo -ne '\x00\x00\x00\x00' >> true

The e_ehsize holds the size of the ELF header, which, as I said, is 64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_phentsize is the size of one program header, which is 56 bytes (0x38).

echo -ne '\x38\x00' >> true

The e_phnum field indicates how many program headers there are. We only need the one: the segment with the 9 program bytes, to be loaded into memory.

echo -ne '\x01\x00' >> true

The e_shentsize is the size of a section header. We’re not using this, but we’ll do our due diligence. These are 64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_shnum field is the number of sections (0).

echo -ne '\x00\x00' >> true

The e_shstrndx is the index of the section with the string table. It doesn’t exist, so it’s 0.

echo -ne '\x00\x00' >> true
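For comparison, here is roughly how the same header might be filled out in C using elf.h once compilers are back. The function name is made up, and the sketch assumes a Linux system whose elf.h provides Elf64_Ehdr and the constants named above.

```c
#include <elf.h>
#include <string.h>

/* Build the ELF header described above. Field values mirror the echo
 * commands; make_true_ehdr is a hypothetical name. */
static Elf64_Ehdr make_true_ehdr(void)
{
    Elf64_Ehdr h;
    memset(&h, 0, sizeof(h));                  /* zero padding in e_ident */
    h.e_ident[EI_MAG0]       = ELFMAG0;        /* 0x7F */
    h.e_ident[EI_MAG1]       = ELFMAG1;        /* 'E'  */
    h.e_ident[EI_MAG2]       = ELFMAG2;        /* 'L'  */
    h.e_ident[EI_MAG3]       = ELFMAG3;        /* 'F'  */
    h.e_ident[EI_CLASS]      = ELFCLASS64;
    h.e_ident[EI_DATA]       = ELFDATA2LSB;
    h.e_ident[EI_VERSION]    = EV_CURRENT;
    h.e_ident[EI_OSABI]      = ELFOSABI_SYSV;
    h.e_ident[EI_ABIVERSION] = 0;
    h.e_type      = ET_EXEC;
    h.e_machine   = EM_X86_64;
    h.e_version   = EV_CURRENT;
    h.e_entry     = 0x40000078;                /* as written by echo */
    h.e_phoff     = sizeof(Elf64_Ehdr);        /* 64 */
    h.e_shoff     = 0;
    h.e_flags     = 0;
    h.e_ehsize    = sizeof(Elf64_Ehdr);        /* 64 */
    h.e_phentsize = sizeof(Elf64_Phdr);        /* 56 */
    h.e_phnum     = 1;
    h.e_shentsize = sizeof(Elf64_Shdr);        /* 64 */
    h.e_shnum     = 0;
    h.e_shstrndx  = 0;
    return h;
}
```

On a little-endian x86-64 host, writing this struct to a file with fwrite should produce the same 64 bytes as the echo commands, since the struct layout is the file layout.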

The program header

Next is our program header.

    typedef struct {
        uint32_t   p_type;
        uint32_t   p_flags;
        Elf64_Off  p_offset;
        Elf64_Addr p_vaddr;
        Elf64_Addr p_paddr;
        uint64_t   p_filesz;
        uint64_t   p_memsz;
        uint64_t   p_align;
    } Elf64_Phdr;

The p_type field indicates the segment type. This segment will hold the program and will be loaded into memory, so we want PT_LOAD (1). Other kinds of segments set up dynamic loading and such.

echo -ne '\x01\x00\x00\x00' >> true

The p_flags field gives the memory protections. We want executable (PF_X = 1) and readable (PF_R = 4). These are ORed together to make 5.

echo -ne '\x05\x00\x00\x00' >> true

The p_offset is the file offset for the content of this segment. This will be the program we assembled. It will immediately follow this header. The ELF header was 64 bytes, plus a 56-byte program header, which makes 120 (0x78).

echo -ne '\x78\x00\x00\x00\x00\x00\x00\x00' >> true

The p_vaddr is the virtual address where this segment will be loaded. This is the entry point from before. A restriction is that this value must be congruent with p_offset modulo the page size. That’s why the entry point was offset by 120 bytes.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The p_paddr is unused for this platform.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The p_filesz is the size of the segment in the file: 9 bytes.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_memsz is the size of the segment in memory, also 9 bytes. It might sound redundant, but these are allowed to differ, in which case it’s either truncated or padded with zeroes.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_align indicates the segment’s alignment. We don’t care about alignment.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true
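The program header can be sketched the same way in C, along with a check of the congruence rule mentioned for p_vaddr. The function names and the 4096-byte page size used when calling the check are assumptions for the example, and elf.h is assumed to be the Linux one.

```c
#include <elf.h>
#include <string.h>

/* Build the single PT_LOAD program header described above.
 * make_true_phdr is a hypothetical name. */
static Elf64_Phdr make_true_phdr(void)
{
    Elf64_Phdr p;
    memset(&p, 0, sizeof(p));
    p.p_type   = PT_LOAD;
    p.p_flags  = PF_X | PF_R;                              /* 1 | 4 = 5 */
    p.p_offset = sizeof(Elf64_Ehdr) + sizeof(Elf64_Phdr);  /* 120 */
    p.p_vaddr  = 0x40000078;
    p.p_paddr  = 0;
    p.p_filesz = 9;
    p.p_memsz  = 9;
    p.p_align  = 0;
    return p;
}

/* The loader's requirement: p_vaddr and p_offset agree modulo the
 * page size. Returns nonzero when the header satisfies it. */
static int phdr_congruent(const Elf64_Phdr *p, unsigned long page)
{
    return p->p_vaddr % page == p->p_offset % page;
}
```

Calling phdr_congruent with a page size of 4096 shows why the entry point carries the 0x78: a p_vaddr of exactly 0x40000000 with a p_offset of 120 would fail the check.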

Append the program

Finally, append the program we assembled at the beginning.

echo -ne '\x31\xFF\xB8\x3C\x00\x00\x00\x0F\x05' >> true

Set it executable (hopefully chmod survived!):

chmod +x true

And test it:

./true && echo 'Success'

Here’s the whole thing as a shell script:

make-true.sh

Is the C compiler done bootstrapping yet?

Emacs, Dynamic Modules, and Joysticks

2016-11-05T04:01:51Z

Two months ago Emacs 25 was released and introduced a new dynamic module feature. Emacs can now load shared libraries built against Emacs’ module API, defined in emacs-module.h. What’s interesting about this API is that it doesn’t require linking against Emacs or any sort of library. Instead, at run time Emacs supplies the module’s initialization function with function pointers for the entire API.

As a demonstration, in this article I’ll build an Emacs joystick interface (Linux only) using a dynamic module. It will allow Emacs to read events from any joystick on the system. All the source code is here:

https://github.com/skeeto/joymacs

It includes a calibration interface (M-x joydemo) within Emacs.

Currently, Emacs’ emacs-module.h header is the entirety of the module documentation. It’s a bit thin and leaves ambiguities that require some reading of the Emacs source code. Even reading the source, it’s not clear which behaviors are a reliable part of the interface. For example, if there’s a pending non-local exit, it’s safe for a function to return NULL since the return value is never inspected (Emacs 25.1), but will this always be the case? While mistakes are unforgiving (a hard crash), the API is mostly intuitive and it’s been pretty easy to feel my way around it.

Update: Philipp Stephani has written thorough, reliable module documentation.

Dynamic Module Types

All Emacs values — integers, floats, cons cells, vectors, strings, etc. — are represented as the polymorphic, pointer-valued type, emacs_value. Despite being a pointer, NULL is not a valid value, as convenient as that would be. The API includes functions for creating and extracting the fundamental types: integers, floats, strings. Almost all other object types can only be accessed by making Lisp function calls to regular Emacs functions from the module.

Modules also introduce a brand new Emacs object type: a user pointer. These are non-readable, opaque pointer values returned by modules, typically representing a handle to some resource, be it a memory block, database connection, or a joystick. These objects include a finalizer function pointer — which, surprisingly, is not permitted to be NULL — and their lifetime is managed by Emacs’ garbage collector.

User pointers are a somewhat dangerous feature since there’s little to stop Emacs Lisp code from misusing them. A Lisp program can take a user pointer from one module and pass it to a function in a different module. Since it’s just a pointer, there’s no way to type check it. At best, a module could maintain a table of all its live pointers, checking all user pointer arguments against the table before dereferencing. But I don’t expect this to be normal practice.
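For the curious, such a table might look something like the following sketch. It is not part of joymacs or the module API: the fixed capacity, the linear search, and every name here are assumptions made to keep the example short.

```c
#include <stddef.h>

/* A tiny registry of live handles. LIVE_MAX and all names are
 * hypothetical. */
#define LIVE_MAX 64
static void *live[LIVE_MAX];

/* Record a handle; returns 1 on success, 0 if NULL or the table is full. */
static int live_add(void *p)
{
    if (!p) return 0;
    for (size_t i = 0; i < LIVE_MAX; i++) {
        if (!live[i]) {
            live[i] = p;
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if p is a handle this module created and has not freed. */
static int live_check(const void *p)
{
    if (!p) return 0;
    for (size_t i = 0; i < LIVE_MAX; i++) {
        if (live[i] == p) return 1;
    }
    return 0;
}

/* Forget a handle, e.g. from the finalizer or an explicit close. */
static void live_remove(const void *p)
{
    if (!p) return;
    for (size_t i = 0; i < LIVE_MAX; i++) {
        if (live[i] == p) {
            live[i] = NULL;
            return;
        }
    }
}
```

A module function would call live_check on each user pointer argument before dereferencing it and signal an error on failure. The cost is a lookup on every call, which is probably why this is unlikely to become normal practice.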

Module Initialization

After loading the module through the platform’s mechanism, the first thing Emacs does is check for the symbol plugin_is_GPL_compatible. While tacky, this is not surprising given the culture around Emacs.

Next it calls emacs_module_init(), passing it the first function pointer. From this, the module can get a Lisp environment and start doing Emacs things, such as binding module functions to Lisp symbols.

Here’s a complete “Hello, world!” example:

#include "emacs-module.h"

int plugin_is_GPL_compatible;

int
emacs_module_init(struct emacs_runtime *ert)
{
    emacs_env *env = ert->get_environment(ert);
    emacs_value message = env->intern(env, "message");
    const char hi[] = "Hello, world!";
    emacs_value string = env->make_string(env, hi, sizeof(hi) - 1);
    env->funcall(env, message, 1, &string);
    return 0;
}

In a real module, it’s common to create function objects for native functions, then fetch the fset symbol and make a Lisp call on it to bind the newly-created function object to a name. You’ll see this in action later.

Joystick API

The joystick API will closely resemble Linux’s own joystick API, making for a fairly thin wrapper. It’s so thin that Emacs almost doesn’t even need a dynamic module. This is because, on Linux, joysticks are just files under /dev/input/. Want to see the input events on the first joystick? Just read /dev/input/js0. So Plan 9.

Emacs already knows how to read files, but these virtual files are a little too special for that. The header linux/joystick.h defines a struct js_event:

struct js_event {
    uint32_t time;  /* event timestamp in milliseconds */
    int16_t value;
    uint8_t type;
    uint8_t number; /* axis/button number */
};

The idea is to read from the joystick device into this structure. The first several reads are initialization that define the axes and buttons of the joystick and their initial state. Further events are queued up for the file descriptor. This all means that the file can’t just be opened each time joystick input is needed. It has to be held open for the duration, and is typically configured non-blocking.
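Outside of Emacs, polling such a descriptor might look like this minimal sketch. The struct is redeclared under a different name (joy_event) so the example doesn’t depend on linux/joystick.h, and joy_read is an invented helper, not the joymacs implementation.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Same layout as struct js_event above, redeclared under a
 * hypothetical name to avoid a header dependency. */
struct joy_event {
    uint32_t time;    /* event timestamp in milliseconds */
    int16_t  value;
    uint8_t  type;
    uint8_t  number;  /* axis/button number */
};

/* Try to read one event from a non-blocking descriptor.
 * Returns 1 if an event was read, 0 if none is pending, -1 on error
 * (including a short read). */
static int joy_read(int fd, struct joy_event *e)
{
    ssize_t r = read(fd, e, sizeof(*e));
    if (r == (ssize_t)sizeof(*e)) return 1;
    if (r == -1 && errno == EAGAIN) return 0;
    return -1;
}
```

A caller would open the device once with O_RDONLY | O_NONBLOCK, keep the descriptor around, and call joy_read in a loop until it returns 0 to drain the queued events.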

The Emacs package will be called joymacs and there will be three functions:

(joymacs-open N)
(joymacs-close JOYSTICK)
(joymacs-read JOYSTICK EVENT-VECTOR)

joymacs-open

The joymacs-open function will take an integer, opening the Nth joystick (/dev/input/jsN). It will create a file descriptor for the joystick device, returning it as a user pointer. Think of it as a sort of “joystick handle.” Now, it could instead return the file descriptor as an integer, but the user pointer has two significant benefits:

The resource will be garbage collected. If the caller loses track of a file descriptor returned as an integer, the joystick device will be held open until Emacs shuts down, using up one of Emacs’ file descriptors. By putting it in a user pointer, the garbage collector will have the module release the file descriptor if the user loses track of it.
It should be difficult for the user to make a dangerous call. Emacs Lisp can’t create user pointers (they only come from modules) and so the module is less likely to get passed the wrong thing. In the case of joymacs-close, the module will be calling close(2) on the argument. We definitely don’t want to make that system call on file descriptors owned by Emacs. Further, since user pointers are mutable, the module can ensure it doesn’t call close(2) twice.

Here’s the implementation for joymacs-open. I’ll go over each part in detail.

static emacs_value
joymacs_open(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)ptr;
    (void)n;
    int id = env->extract_integer(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    char buf[64];
    int buflen = sprintf(buf, "/dev/input/js%d", id);
    int fd = open(buf, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        emacs_value signal = env->intern(env, "file-error");
        emacs_value message = env->make_string(env, buf, buflen);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }
    return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);
}

The C function name doesn’t matter to Emacs. It’s static because it doesn’t even matter if the function is visible to Emacs. It will get the function pointer later as part of initialization.

This is the prototype for all functions callable by Emacs Lisp, regardless of arity. It has four arguments:

It gets an environment, env, through which to call back into Emacs.
It gets n, the number of arguments. This is guaranteed to be the correct number of arguments, as specified later when creating the function object, so only variadic functions need to inspect this argument.
The Lisp arguments are passed as an array of values, args. There’s no type declaration when declaring a function object, so these may be of the wrong type. I’ll go over how to deal with this.
Finally, it gets an arbitrary pointer, supplied at function object creation time. This allows the module to create closures, but will usually be ignored.

The first thing the function does is extract its integer argument. This is actually an intmax_t, but I don’t think anyone has that many USB ports. An int will suffice.

    int id = env->extract_integer(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;

As for not underestimating fools, what if the user passed a value that isn’t an integer? Will the world come crashing down? Fortunately Emacs checks that in extract_integer and, if there’s a mismatch, sets a pending error signal in the environment. This is really great because checking types directly in the module is a real pain in the ass. So, before committing to anything further, such as opening a file, I check for this signal and bail out early if necessary. In Emacs 25.1 it’s safe to return NULL since the return value will be completely ignored, but I’d rather hedge my bets.

By the way, the nil here is a global variable set in initialization. You don’t just get that for free!

The next step is opening the joystick device, read-only and non-blocking. The non-blocking is vital because the module would otherwise hang Emacs later if there are no events (well, except for the read being quickly interrupted by a POSIX signal).

    char buf[64];
    int buflen = sprintf(buf, "/dev/input/js%d", id);
    int fd = open(buf, O_RDONLY | O_NONBLOCK);

If the joystick fails to open (e.g. it doesn’t exist, or the user lacks permission), manually set an error signal for a non-local exit. I chose the file-error signal and I’m just using the filename as the signal data.

    if (fd == -1) {
        emacs_value signal = env->intern(env, "file-error");
        emacs_value message = env->make_string(env, buf, buflen);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }

Otherwise create the user pointer. No need to allocate any memory; just stuff it in the pointer itself. If the user mistakenly passes it to another module, it will sure be in for a surprise when it tries to dereference it.

    return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);

The fin_close() function is defined as:

static void
fin_close(void *fdptr)
{
    int fd = (intptr_t)fdptr;
    if (fd != -1)
        close(fd);
}

The garbage collector will call this function when the user pointer is lost. If the user closes it early with joymacs-close, that function will set the user pointer to -1, an invalid file descriptor, so that it doesn’t get closed a second time here.
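The same close-at-most-once idea can be sketched without Emacs. Here a plain void * slot stands in for the user pointer, and the names fd_to_ptr, ptr_to_fd, and close_once are hypothetical, not part of the module API.

```c
#include <stdint.h>
#include <unistd.h>

/* Pack a file descriptor into a pointer-sized value and back.
 * The pointer is never dereferenced; it is only a container. */
void *fd_to_ptr(int fd)  { return (void *)(intptr_t)fd; }
int   ptr_to_fd(void *p) { return (int)(intptr_t)p; }

/* Close the descriptor held in *slot unless it was already closed,
 * then overwrite the slot with -1 so later calls do nothing.
 * Returns 1 if close(2) was called, 0 otherwise. */
int
close_once(void **slot)
{
    int fd = ptr_to_fd(*slot);
    if (fd == -1)
        return 0;
    close(fd);
    *slot = fd_to_ptr(-1);
    return 1;
}
```

Whichever of the explicit close and the finalizer runs first does the real work, and the other sees -1 and backs off.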

joymacs-close

Here’s joymacs-close, which is a bit simpler.

static emacs_value
joymacs_close(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)ptr;
    (void)n;
    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    if (fd != -1) {
        close(fd);
        env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
    }
    return nil;
}

Again, it starts by extracting its argument, relying on Emacs to do the check:

    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;

If the user pointer hasn’t been closed yet, then close it and strip out the file descriptor to prevent further closes.

    if (fd != -1) {
        close(fd);
        env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
    }

joymacs-read

The joymacs-read function is doing something a little unusual for an Emacs Lisp function. It takes two arguments: the joystick handle and a 5-element vector. Instead of returning the event in some representation, it fills the vector with the event details. There are two reasons for this:

The API has no function for creating vectors … though the module could get the make-vector symbol and call it to create a vector.
The idiom for event pumps is for the caller to supply a buffer to the pump. This has better performance by avoiding lots of unnecessary allocations, especially since events tend to be message-like objects with a short, well-defined extent.

Here’s the full definition:

static emacs_value
joymacs_read(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)n;
    (void)ptr;
    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    struct js_event e;
    int r = read(fd, &e, sizeof(e));
    if (r == -1 && errno == EAGAIN) {
        /* No more events. */
        return nil;
    } else if (r == -1) {
        /* An actual read error (joystick unplugged, etc.). */
        emacs_value signal = env->intern(env, "file-error");
        const char *error = strerror(errno);
        size_t len = strlen(error);
        emacs_value message = env->make_string(env, error, len);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    } else {
        /* Fill out event vector. */
        emacs_value v = args[1];
        emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
        emacs_value value;
        if (type == button)
            value = e.value ? t : nil;
        else
            value = env->make_float(env, e.value / (double)INT16_MAX);
        env->vec_set(env, v, 0, env->make_integer(env, e.time));
        env->vec_set(env, v, 1, type);
        env->vec_set(env, v, 2, value);
        env->vec_set(env, v, 3, env->make_integer(env, e.number));
        env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
        return args[1];
    }
}

As before, extract the first argument and check for a signal. Then call read(2) to get an event. If the read fails with EAGAIN, it’s not a real failure. There are just no more events, so return nil.

    struct js_event e;
    int r = read(fd, &e, sizeof(e));
    if (r == -1 && errno == EAGAIN) {
        /* No more events. */
        return nil;
    }

If the read failed with something else — perhaps the joystick was unplugged — signal an error. The strerror(3) string is used for the signal data.

    if (r == -1) {
        /* An actual read error (joystick unplugged, etc.). */
        emacs_value signal = env->intern(env, "file-error");
        const char *error = strerror(errno);
        emacs_value message = env->make_string(env, error, strlen(error));
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }

Otherwise fill out the event vector. If the second argument isn’t a vector, or if it’s too short, the signal will automatically get raised by Emacs. The module can keep plowing through the vec_set() calls safely since it’s not committing to anything.

        /* Fill out event vector. */
        emacs_value v = args[1];
        emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
        emacs_value value;
        if (type == button)
            value = e.value ? t : nil;
        else
            value = env->make_float(env, e.value / (double)INT16_MAX);
        env->vec_set(env, v, 0, env->make_integer(env, e.time));
        env->vec_set(env, v, 1, type);
        env->vec_set(env, v, 2, value);
        env->vec_set(env, v, 3, env->make_integer(env, e.number));
        env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
        return args[1];

The Linux event struct has four fields and the function fills out five values of the vector. This is because the type field has a bit flag indicating initialization events. This is split out into an extra t/nil value. It also normalizes axis values and converts button values into t/nil, which makes more sense for Emacs Lisp. The event itself is returned since it’s a truthy value and it’s convenient for the caller.

The astute programmer might notice that the negative side of the axis could go just below -1.0, since INT16_MIN has one extra value over INT16_MAX (two’s complement). It doesn’t seem to be documented, but the joystick drivers I’ve seen never exactly return INT16_MIN, so this is in fact the correct way to normalize it.
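To make the edge case concrete, here’s the normalization pulled out into a standalone function, plus a clamped variant for anyone who would rather not lean on that undocumented driver behavior. Both function names are hypothetical, and the clamp is my own addition for illustration, not something the module does.

```c
#include <stdint.h>

/* Normalize the way the module does: divide by INT16_MAX. */
double
axis_normalize(int16_t v)
{
    return v / (double)INT16_MAX;
}

/* Hypothetical defensive variant: clamp the low end in case a
 * driver ever does report INT16_MIN. */
double
axis_normalize_clamped(int16_t v)
{
    double x = v / (double)INT16_MAX;
    return x < -1.0 ? -1.0 : x;
}
```

With the plain version, INT16_MAX maps to exactly 1.0 and -INT16_MAX to exactly -1.0, while INT16_MIN lands slightly below -1.0.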

Initialization

Update 2021: In a previous version of this article, I talked about interning symbols during initialization so that they do not need to be re-interned each time the module is called. This no longer works, and it was probably never intended to work in the first place. The lesson is simple: Do not reuse Emacs objects between module calls.

First grab the fset symbol since this function will be needed to bind names to the module’s functions.

    emacs_value fset = env->intern(env, "fset");

Using fset, bind the functions. The second and third arguments to make_function are the minimum and maximum number of arguments, which may look familiar. The last argument is that closure pointer I mentioned at the beginning.

    emacs_value args[2];
    args[0] = env->intern(env, "joymacs-open");
    args[1] = env->make_function(env, 1, 1, joymacs_open, doc, 0);
    env->funcall(env, fset, 2, args);

If the module is to be loaded with require like any other package, it needs to provide: (provide 'joymacs).

    emacs_value provide = env->intern(env, "provide");
    emacs_value joymacs = env->intern(env, "joymacs");
    env->funcall(env, provide, 1, &joymacs);

And that’s it!

The source repository now includes a port to Windows (XInput). If you’re on Linux or Windows, have Emacs 25 with modules enabled, and a joystick is plugged in, then make run in the repository should bring up Emacs running a joystick calibration demonstration. The module can’t poke at Emacs when events are ready, so instead there’s a timer that polls the module for events.

I’d like to someday see an Emacs Lisp game well-suited for a joystick.

An Array of Pointers vs. a Multidimensional Array

2016-10-27T21:01:33Z

In a C program, suppose I have a table of color names of similar length. There are two straightforward ways to construct this table. The most common would be an array of char *.

char *colors_ptr[] = {
    "red",
    "orange",
    "yellow",
    "green",
    "blue",
    "violet"
};

The other is a two-dimensional char array.

char colors_2d[][7] = {
    "red",
    "orange",
    "yellow",
    "green",
    "blue",
    "violet"
};

The initializers are identical, and the syntax by which these tables are used is the same, but the underlying data structures are very different. For example, suppose I had a lookup() function that searches the table for a particular color.

int
lookup(const char *color)
{
    int ncolors = sizeof(colors) / sizeof(colors[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors[i], color) == 0)
            return i;
    return -1;
}

Thanks to array decay — array arguments are implicitly converted to pointers (§6.9.1-10) — it doesn’t matter if the table is char colors[][7] or char *colors[]. It’s a little bit misleading because the compiler generates different code depending on the type.
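A quick way to convince yourself that the two tables really are different objects behind the shared syntax is to compare their sizes. This is just a sketch: lookup_ptr and lookup_2d are copies of lookup() above, renamed so both tables can live in one file.

```c
#include <string.h>

char *colors_ptr[] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

char colors_2d[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* Same body as lookup(), searching the array of pointers. */
int
lookup_ptr(const char *color)
{
    int ncolors = sizeof(colors_ptr) / sizeof(colors_ptr[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors_ptr[i], color) == 0)
            return i;
    return -1;
}

/* Same body again, searching the two-dimensional array. Each
 * colors_2d[i] is a char[7] that decays to char * at the call. */
int
lookup_2d(const char *color)
{
    int ncolors = sizeof(colors_2d) / sizeof(colors_2d[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors_2d[i], color) == 0)
            return i;
    return -1;
}
```

Both functions return the same index for the same color, but sizeof(colors_2d) is 42 (six rows of seven bytes) while sizeof(colors_ptr) is six times the size of a pointer, with the string bytes stored somewhere else entirely.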

Memory Layout

Here’s what colors_ptr, a jagged array, typically looks like in memory.

The array of six pointers will point into the program’s string table, usually stored in a separate page. The strings aren’t in any particular order and will be interspersed with the program’s other string constants. The type of the expression colors_ptr[n] is char *.

On x86-64, suppose the base of the table is in rax, the index of the string I want to retrieve is rcx, and I want to put the string’s address back into rax. It’s one load instruction.

mov   rax, [rax + rcx*8]

Contrast this with colors_2d: six 7-byte elements in a row. No pointers or addresses. Only strings.

The strings are in their defined order, packed together. The type of the expression colors_2d[n] is char [7], an array rather than a pointer. If this was a large table used by a hot function, it would have friendlier cache characteristics — both in locality and predictability.

In the same scenario before with x86-64, it takes two instructions to put the string’s address in rax, but neither is a load.

imul  rcx, rcx, 7
add   rax, rcx

In this particular case, the generated code can be slightly improved by increasing the string size to 8 (e.g. char colors_2d[][8]). The multiply turns into a simple shift and the ALU no longer needs to be involved, cutting it to one instruction. This looks like a load due to the LEA (Load Effective Address), but it’s not.

lea   rax, [rax + rcx*8]
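If you want to look at this in your own compiler’s output, a pair of accessors like the following can be compiled with optimization and inspected (e.g. gcc -O2 -S). The table and function names are invented for the experiment, and the exact instruction sequence depends on the compiler and flags, so it may not match the listings above instruction for instruction.

```c
#include <stddef.h>

const char colors_7[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

const char colors_8[][8] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

/* Row address is base + i*7: the stride is not a power of two. */
const char *
name_7(size_t i)
{
    return colors_7[i];
}

/* Row address is base + i*8: the stride fits a scaled index. */
const char *
name_8(size_t i)
{
    return colors_8[i];
}
```

The cost of the friendlier stride is one padding byte per row: 48 bytes for the table instead of 42.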

Relocation

There’s another factor to consider: relocation. Nearly every process running on a modern system takes advantage of a security feature called Address Space Layout Randomization (ASLR). The virtual address of code and data is randomized at process load time. For shared libraries, it’s not just a security feature, it’s essential to their basic operation. Libraries cannot possibly coordinate their preferred load addresses with every other library on the system, and so must be relocatable.

If the program is compiled with GCC or Clang configured for position independent code — -fPIC (for libraries) or -fpie + -pie (for programs) — extra work has to be done to support colors_ptr. Those are all addresses in the pointer array, but the compiler doesn’t know what those addresses will be. The compiler fills the elements with temporary values and adds six relocation entries to the binary, one for each element. The loader will fill out the array at load time.

However, colors_2d doesn’t have any addresses other than the address of the table itself. The loader doesn’t need to be involved with each of its elements. Score another point for the two-dimensional array.

On x86-64, in both cases the table itself typically doesn’t need a relocation entry because it will be RIP-relative (in the small code model). That is, code that uses the table will be at a fixed offset from the table no matter where the program is loaded. It won’t need to be looked up using the Global Offset Table (GOT).

In case you’re ever reading compiler output, in Intel syntax the assembly for putting the table’s RIP-relative address in rax looks like so:

;; NASM:
lea    rax, [rel address]
;; Some others:
lea    rax, [rip + address]

Or in AT&T syntax:

lea    address(%rip), %rax

Virtual Memory

Besides (trivially) more work for the loader, there’s another consequence to relocations: Pages containing relocations are not shared between processes (except after fork()). When loading a program, the loader doesn’t copy programs and libraries to memory so much as it memory maps their binaries with copy-on-write semantics. If another process is running with the same binaries loaded (e.g. libc.so), they’ll share the same physical memory so long as those pages haven’t been modified by either process. Modifying the page creates a unique copy for that process.

Relocations modify parts of the loaded binary, so these pages aren’t shared. This means colors_2d has the possibility of being shared between processes, but colors_ptr (and its entire page) definitely does not. Shucks.

This is one of the reasons why the Procedure Linkage Table (PLT) exists. The PLT is an array of function stubs for shared library functions, such as those in the C standard library. Sure, the loader could go through the program and fill out the address of every library function call, but this would modify lots and lots of code pages, creating a unique copy of large parts of the program. Instead, the dynamic linker lazily supplies jump addresses for PLT function stubs, one per accessed library function.

However, as I’ve written it above, it’s unlikely that even colors_2d will be shared. It’s still missing an important ingredient: const.

Const

They say const isn’t for optimization but, darnit, this situation keeps coming up. Since colors_ptr and colors_2d are both global, writable arrays, the compiler puts them in the same writable data section of the program, and, in my test program, they end up right next to each other in the same page. The other relocations doom colors_2d to being a local copy.

Fortunately it’s trivial to fix by adding a const:

const char colors_2d[][7] = { /* ... */ };

Writing to this memory is now undefined behavior, so the compiler is free to put it in read-only memory (.rodata) and separate from the dirty relocations. On my system, this is close enough to the code to wind up in executable memory.

Note, the equivalent for colors_ptr requires two const qualifiers, one for the array and another for the strings. (Obviously the const doesn’t apply to the loader.)

const char *const colors_ptr[] = { /* ... */ };

String literals are already effectively const, though the C specification (unlike C++) doesn’t actually define them to be this way. But, like setting your relationship status on Facebook, declaring it makes it official.

It’s just micro-optimization

These little details are all deep down the path of micro-optimization and will rarely ever matter in practice, but perhaps you learned something broader from all this. This stuff fascinates me.

Linux System Calls, Error Numbers, and In-Band Signaling

2016-09-23T01:07:40Z

Today I got an e-mail asking about a previous article on creating threads on Linux using raw system calls (specifically x86-64). The questioner was looking to use threads in a program without any libc dependency. However, he was concerned about checking for mmap(2) errors when allocating the thread’s stack. The mmap(2) man page says it returns -1 (a.k.a. MAP_FAILED) on error and sets errno. But how do you check errno without libc?

As a reminder here’s what the (unoptimized) assembly looks like.

stack_create:
    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

As usual, the system call return value is in rax, which becomes the return value for stack_create(). Again, its C prototype would look like this:

void *stack_create(void);

If you were to, say, intentionally botch the arguments to force an error, you might notice that the system call isn’t returning -1, but other negative values. What gives?

The trick is that errno is a C concept. That’s why it’s documented as errno(3) — the 3 means it belongs to C. Just think about how messy this thing is: it’s a thread-local value living in the application’s address space. The kernel rightfully has nothing to do with it. Instead, the mmap(2) wrapper in libc assigns errno (if needed) after the system call returns. This is how all system calls through libc work, even with the syscall(2) wrapper.

So how does the kernel report the error? It’s an old-fashioned return value. If you have any doubts, take it straight from the horse’s mouth: mm/mmap.c:do_mmap(). Here’s a sample of return statements.

if (!len)
        return -EINVAL;

/* Careful about overflows.. */
len = PAGE_ALIGN(len);
if (!len)
        return -ENOMEM;

/* offset overflow? */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

/* Too many mappings? */
if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

It’s returning the negated error number. Simple enough.

If you think about it a moment, you might notice a complication: This is a form of in-band signaling. On success, mmap(2) returns a memory address. All those negative error numbers are potentially addresses that a caller might want to map. How can we tell the difference?

1) None of the possible error numbers align on a page boundary, so they’re not actually valid return values. NULL does lie on a page boundary, which is one reason why it’s not used as an error return value for mmap(2). The other is that you might actually want to map NULL, for better or worse.

2) Those low negative values lie in a region of virtual memory reserved exclusively for the kernel (sometimes called “low memory”). On x86-64, any address with the most significant bit set (i.e. the sign bit of a signed integer) is one of these addresses. Processes aren’t allowed to map these addresses, and so mmap(2) will never return such a value on success.

So what’s a clean, safe way to go about checking for error values? It’s a lot easier to read musl than glibc, so let’s take a peek at how musl does it in its own mmap: src/mman/mmap.c.

if (off & OFF_MASK) {
    errno = EINVAL;
    return MAP_FAILED;
}
if (len >= PTRDIFF_MAX) {
    errno = ENOMEM;
    return MAP_FAILED;
}
if (flags & MAP_FIXED) {
    __vm_wait();
}
return (void *)syscall(SYS_mmap, start, len, prot, flags, fd, off);

Hmm, it looks like it’s returning the result directly. What happened to setting errno? Well, syscall() is actually a macro that runs the result through __syscall_ret().

#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))

Looking a little deeper: src/internal/syscall_ret.c.

long __syscall_ret(unsigned long r)
{
    if (r > -4096UL) {
        errno = -r;
        return -1;
    }
    return r;
}

Bingo. As documented, if the value falls within that “high” (unsigned) range of negative values for any system call, it’s an error number.

Getting back to the original question, we could employ this same check in the assembly code. However, since this is an anonymous memory map with a kernel-selected address, there’s only one possible error: ENOMEM (12). This error happens if the maximum number of memory maps has been reached, or if there’s no contiguous region available for the 4MB stack. The check will only need to test the result against -12.
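For a libc-free program written in C rather than assembly, the general check might look something like this. It borrows the range test from musl’s __syscall_ret() but returns the error number instead of assigning errno. The function name is hypothetical.

```c
/* Interpret a raw Linux system call return value using the same
 * range test as musl's __syscall_ret(). Returns the positive error
 * number, or 0 if r is a successful result. */
int
raw_syscall_error(long r)
{
    if ((unsigned long)r > -4096UL)
        return (int)-r;
    return 0;
}
```

The caller of stack_create() would convert the returned pointer to a long, pass it through this function, and treat any non-zero result as a failure.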

Inspecting C's qsort Through Animation

2016-09-05T21:17:11Z

The C standard library includes a qsort() function for sorting arbitrary buffers given a comparator function. The name comes from its original Unix implementation, “quicker sort,” a variation of the well-known quicksort algorithm. The C standard doesn’t specify an algorithm, except to say that it may be unstable (C99 §7.20.5.2¶4) — equal elements have an unspecified order. As such, different C libraries use different algorithms, and even when using the same algorithm they make different implementation trade-offs.

I added a drawing routine to a comparison function to see what the sort function was doing for different C libraries. Every time it’s called for a comparison, it writes out a snapshot of the array as a Netpbm PPM image. It’s easy to turn concatenated PPMs into a GIF or video. Here’s my code if you want to try it yourself:

qsort-animate.c

Adjust the parameters at the top to taste. Rather than call rand() in the standard library, I included xorshift64star() with a hard-coded seed so that the array will be shuffled exactly the same across all platforms. This makes for a better comparison.
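The general shape of the technique, minus the drawing, is small enough to sketch here. This is not the code from qsort-animate.c: the generator is the textbook xorshift64*, the shuffle is a plain Fisher-Yates, and the comparator just counts calls at the point where the real program would write a frame. All names besides xorshift64star are my own for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

long ncompares;  /* incremented on every comparator call */

/* Textbook xorshift64*. The state must be seeded non-zero. */
uint64_t
xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
}

/* Fisher-Yates shuffle driven by the fixed-seed generator, so the
 * starting order is the same on every platform. */
void
shuffle(int *a, int n, uint64_t seed)
{
    for (int i = n - 1; i > 0; i--) {
        int j = (int)(xorshift64star(&seed) % (uint64_t)(i + 1));
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Comparator with a side effect: count the call, then compare. */
int
counting_cmp(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    ncompares++;  /* the animation would emit a snapshot here */
    return (a > b) - (a < b);
}
```

Fill an array with 0 to n-1, shuffle it with a fixed seed, reset ncompares, and hand counting_cmp to qsort(). The final count corresponds to the number of frames.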

To get an optimized GIF on unix-like systems, run it like so. (Microsoft’s UCRT currently has serious bugs with pipes, so it was run differently in that case.)

./a.out | convert -delay 10 ppm:- gif:- | gifsicle -O3 > sort.gif

The number of animation frames reflects the efficiency of the sort, but this isn’t really a benchmark. The input array is fully shuffled, and real data often is not. For a benchmark, have a look at a libc qsort() shootout of sorts instead.

To help you follow along, clicking on any animation will restart it.

glibc

Sorted in 307 frames. glibc prefers to use mergesort, which, unlike quicksort, isn’t an in-place algorithm, so it has to allocate memory. That allocation could fail for huge arrays, and, since qsort() can’t fail, it uses quicksort as a backup. You can really see the mergesort in action: changes are made that we cannot see until later, when it’s copied back into the original array.

dietlibc (0.32)

Sorted in 503 frames. dietlibc is an alternative C standard library for Linux. It’s optimized for size, which shows through its slower performance. It looks like a quicksort that always chooses the last element as the pivot.

Update: Felix von Leitner, the primary author of dietlibc, has alerted me that, as of version 0.33, it now chooses a random pivot. This comment from the source describes it:

We chose the rightmost element in the array to be sorted as pivot, which is OK if the data is random, but which is horrible if the data is already sorted. Try to improve by exchanging it with a random other pivot.

musl

Sorted in 637 frames. musl libc is another alternative C standard library for Linux. It’s my personal preference when I statically link Linux binaries. Its qsort() looks a lot like a heapsort, and with some research I see it’s actually smoothsort, a heapsort variant.

BSD

Sorted in 354 frames. I ran it on both OpenBSD and FreeBSD with identical results, so, unsurprisingly, they share an implementation. It’s quicksort, and what’s neat about it is at the beginning you can see it searching for a median for use as the pivot. This helps avoid the O(n^2) worst case.

BSD also includes a mergesort() with the same prototype, except with an int return for reporting failures. This one sorted in 247 frames. Like glibc before, there’s some behind-the-scenes that isn’t captured. But even more, notice how the markers disappear during the merge? It’s running the comparator against copies, stored outside the original array. Sneaky!

Again, BSD also includes heapsort(), so I ran that too. It sorted in 418 frames. It definitely looks like a heapsort, and the worse performance is similar to musl. It seems heapsort is a poor fit for this data.

Cygwin

It turns out Cygwin borrowed its qsort() from BSD. It’s pixel identical to the above. I hadn’t noticed until I looked at the frame counts.

MSVCRT.DLL (MinGW) and UCRT (Visual Studio)

MinGW builds against MSVCRT.DLL, found on every Windows system despite its unofficial status. Until recently Microsoft didn’t include a C standard library as part of the OS, but that changed with their Universal CRT (UCRT) announcement. I thought I’d try them both.

Turns out they borrowed their old qsort() for the UCRT, and the result is the same: sorted in 417 frames. It chooses a pivot from the median of the ends and the middle, swaps the pivot to the middle, then partitions. Looking to the middle for the pivot makes sorting pre-sorted arrays much more efficient.
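For reference, here’s a generic median-of-three selection on an int array. It’s only meant to illustrate the idea, not to reproduce Microsoft’s implementation, and the function name is made up.

```c
#include <stddef.h>

/* Sample the first, middle, and last elements of a[lo..hi] and
 * return the index of whichever holds the median value. */
size_t
median_of_three(const int *a, size_t lo, size_t hi)
{
    size_t mid = lo + (hi - lo) / 2;
    if (a[lo] < a[mid]) {
        if (a[mid] < a[hi])
            return mid;
        return a[lo] < a[hi] ? hi : lo;
    } else {
        if (a[lo] < a[hi])
            return lo;
        return a[mid] < a[hi] ? hi : mid;
    }
}
```

On an already-sorted (or reverse-sorted) range the middle element is the median of the three samples, so the pivot splits the range evenly instead of peeling off one element at a time.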

Pelles C

Finally I ran it against Pelles C, a C compiler for Windows. It sorted in 463 frames. I can’t find any information about it, but it looks like some sort of hybrid between quicksort and insertion sort. Like BSD qsort(), it finds a good median for the pivot, partitions the elements, and if a partition is small enough, it switches to insertion sort. This should behave well on mostly-sorted arrays, but poorly on well-shuffled arrays (like this one).

More Implementations

That’s everything that was readily accessible to me. If you can run it against something new, I’m certainly interested in seeing more implementations.

How to Read and Write Other Process Memory

2016-09-03T21:53:26Z

I recently put together a little game memory cheat tool called MemDig. It can find the address of a particular game value (score, lives, gold, etc.) after being given that value at different points in time. With the address, it can then modify that value to whatever is desired.

I’ve been using tools like this going back 20 years, but I never tried to write one myself until now. There are many memory cheat tools to pick from these days, the most prominent being Cheat Engine. These tools use the platform’s debugging API, so of course any good debugger could do the same thing, though a debugger won’t be specialized appropriately (e.g. locating the particular address and locking its value).

My motivation was bypassing an in-app purchase in a single player Windows game. I wanted to convince the game I had made the purchase when, in fact, I hadn’t. Once I had it working successfully, I ported MemDig to Linux since I thought it would be interesting to compare. I’ll start with Windows for this article.

Windows

Only three Win32 functions are needed, and you could almost guess at how it works.

It’s very straightforward ~~and, for this purpose, is probably the simplest API for any platform~~ (see update).

As you probably guessed, you first need to open the process, given its process ID (integer). You’ll need to select the desired access rights as a bit set. To read memory, you need the PROCESS_VM_READ and PROCESS_QUERY_INFORMATION rights. To write memory, you need the PROCESS_VM_WRITE and PROCESS_VM_OPERATION rights. Alternatively you could just ask for all rights with PROCESS_ALL_ACCESS, but I prefer to be precise.

DWORD access = PROCESS_VM_READ |
               PROCESS_QUERY_INFORMATION |
               PROCESS_VM_WRITE |
               PROCESS_VM_OPERATION;
HANDLE proc = OpenProcess(access, FALSE, pid);

And then to read or write:

void *addr; // target process address
SIZE_T written;
ReadProcessMemory(proc, addr, &value, sizeof(value), &written);
// or
WriteProcessMemory(proc, addr, &value, sizeof(value), &written);

Don’t forget to check the return value and verify written. Finally, don’t forget to close it when you’re done.

CloseHandle(proc);

That’s all there is to it. For the full cheat tool you’d need to find the mapped regions of memory, via VirtualQueryEx. It’s not as simple, but I’ll leave that for another article.

Linux

Unfortunately there’s no standard, cross-platform debugging API for unix-like systems. Most have a ptrace() system call, though each works a little differently. Note that ptrace() is not part of POSIX, but appeared in System V Release 4 (SVr4) and BSD, then copied elsewhere. The following will all be specific to Linux, though the procedure is similar on other unix-likes.

In typical Linux fashion, if it involves other processes, you use the standard file API on the /proc filesystem. Each process has a directory under /proc named as its process ID. In this directory is a virtual file called “mem”, which is a file view of that process’ entire address space, including unmapped regions.

char file[64];
sprintf(file, "/proc/%ld/mem", (long)pid);
int fd = open(file, O_RDWR);

The catch is that while you can open this file, you can’t actually read or write on that file without attaching to the process as a debugger. You’ll just get EIO errors. To attach, use ptrace() with PTRACE_ATTACH. This asynchronously delivers a SIGSTOP signal to the target, which has to be waited on with waitpid().

You could select the target address with lseek(), but it’s cleaner and more efficient just to do it all in one system call with pread() and pwrite(). I’ve left out the error checking, but the return value of each function should be checked:

ptrace(PTRACE_ATTACH, pid, 0, 0);
waitpid(pid, NULL, 0);

off_t addr = ...; // target process address
pread(fd, &value, sizeof(value), addr);
// or
pwrite(fd, &value, sizeof(value), addr);

ptrace(PTRACE_DETACH, pid, 0, 0);

The process will (and must) be stopped during this procedure, so do your reads/writes quickly and get out. The kernel will deliver the writes to the other process’ virtual memory.

Like before, don’t forget to close.

close(fd);
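Here is a minimal sketch of the same procedure with the error checking put back in. The helper name peek_process is hypothetical, and it only reads. It attaches, waits for the stop, reads through /proc/pid/mem, and always detaches before returning.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: copy len bytes from address addr in process pid
// into buf. Returns 0 on success, -1 on any failure.
int
peek_process(pid_t pid, off_t addr, void *buf, size_t len)
{
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1; // no permission, no such process, etc.

    int result = -1;
    int status;
    // Wait for the SIGSTOP from PTRACE_ATTACH to take effect.
    if (waitpid(pid, &status, 0) == pid && WIFSTOPPED(status)) {
        char file[64];
        snprintf(file, sizeof(file), "/proc/%ld/mem", (long)pid);
        int fd = open(file, O_RDONLY);
        if (fd != -1) {
            // A short read counts as failure here.
            if (pread(fd, buf, len, addr) == (ssize_t)len)
                result = 0;
            close(fd);
        }
    }
    // Let the target run again, whether or not the read worked.
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return result;
}
```

A write variant would open with O_RDWR and swap pread() for pwrite(). Whether the attach is permitted depends on system policy, so a -1 here often just means the tracer lacks the privilege.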

To find the mapped regions in the real cheat tool, you would read and parse the virtual text file /proc/pid/maps. I don’t know if I’d call this stringly-typed method elegant — the kernel converts the data into string form and the caller immediately converts it right back — but that’s the official API.
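As a rough sketch of that parsing, the hypothetical helper below counts the regions of a process that are both readable and writable, which is where a cheat tool would search. It assumes each line starts with the usual "start-end perms" fields and ignores the rest.

```c
#include <stdio.h>
#include <sys/types.h>

// Hypothetical helper: count the readable+writable regions listed in
// /proc/pid/maps. Returns -1 if the file cannot be opened.
long
count_writable_regions(pid_t pid)
{
    char file[64];
    snprintf(file, sizeof(file), "/proc/%ld/maps", (long)pid);
    FILE *maps = fopen(file, "r");
    if (!maps)
        return -1;

    long count = 0;
    char line[4096];
    while (fgets(line, sizeof(line), maps)) {
        unsigned long beg, end;
        char perms[5]; // e.g. "rw-p"
        if (sscanf(line, "%lx-%lx %4s", &beg, &end, perms) == 3)
            if (perms[0] == 'r' && perms[1] == 'w')
                count++;
    }
    fclose(maps);
    return count;
}
```

A real tool would keep beg and end for each match rather than just counting, then read those ranges from the target.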

Update: Konstantin Khlebnikov has pointed out the process_vm_readv() and process_vm_writev() system calls, available since Linux 3.2 (January 2012) and glibc 2.15 (March 2012). These system calls do not require ptrace(), nor does the remote process need to be stopped. They’re equivalent to ReadProcessMemory() and WriteProcessMemory(), except there’s no requirement to first “open” the process.
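A minimal sketch of a read using process_vm_readv(), with a hypothetical helper name. It describes one local buffer and one remote range of the same length, and treats a partial transfer as a failure.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

// Hypothetical helper: copy len bytes from remote_addr in process pid
// into buf without ptrace(). Returns 0 on success, -1 on failure.
int
peek_process_vm(pid_t pid, void *remote_addr, void *buf, size_t len)
{
    struct iovec local  = { buf, len };         // where the bytes land
    struct iovec remote = { remote_addr, len }; // where they come from
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    return n == (ssize_t)len ? 0 : -1;
}
```

The write direction is the same shape with process_vm_writev(). The caller still needs permission to access the target, so failure with EPERM is possible.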

Automatic Deletion of Incomplete Output Files

2016-08-07T02:00:37Z

Conventionally, a program that creates an output file will delete its incomplete output should an error occur while writing the file. It’s risky to leave behind a file that the user may rightfully confuse for a valid file. They might not have noticed the error.

For example, compression programs such as gzip, bzip2, and xz when given a compressed file as an argument will create a new file with the compression extension removed. They write to this file as the compressed input is being processed. If the compressed stream contains an error in the middle, the partially-completed output is removed.

There are exceptions of course, such as programs that download files over a network. The partial result has value, especially if the transfer can be continued from where it left off. The convention is to append another extension, such as “.part”, to indicate a partial output.

The straightforward solution is to always delete the file as part of error handling. A non-interactive program would report the error on standard error, delete the file, and exit with an error code. However, there are at least two situations where error handling would be unable to operate: unhandled signals (usually including a segmentation fault) and power failures. A partial or corrupted output file will be left behind, possibly looking like a valid file.

A common, more complex approach is to name the file differently from its final name while being written. If written successfully, the completed file is renamed into place. This is already required for durable replacement, so it’s basically free for many applications. In the worst case, where the program is unable to clean up, the obviously incomplete file is left behind only wasting space.
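A minimal sketch of the rename approach using only the C standard library. The helper name and the ".tmp" suffix are assumptions for illustration, and it relies on POSIX rename() semantics to replace an existing file. It doesn’t call fsync(), so it isn’t a durable replacement on its own.

```c
#include <stdio.h>

// Hypothetical helper: write len bytes to "<path>.tmp", then rename it
// to path. Returns 0 on success, -1 on failure (temporary removed).
int
write_then_rename(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1; // path too long for the temporary name

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    int ok = fwrite(data, 1, len, f) == len;
    ok &= fclose(f) == 0; // fclose flushes, so check it too
    if (ok && rename(tmp, path) == 0)
        return 0;
    remove(tmp); // normal error path: clean up the partial file
    return -1;
}
```

If the program dies before the rename, only the ".tmp" file is left behind, and its name makes it obvious that it’s incomplete.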

Looking to be more robust, I had the following misguided idea: Rely completely on the operating system to perform cleanup in the case of a failure. Initially the file would be configured to be automatically deleted when the final handle is closed. This takes care of all abnormal exits, and possibly even power failures. The program can just exit on error without deleting the file. Once written successfully, the automatic-delete indicator is cleared so that the file survives.

The target application for this technique supports both Linux and Windows, so I would need to figure it out for both systems. On Windows, there’s the flag FILE_FLAG_DELETE_ON_CLOSE. I’d just need to find a way to clear it. On POSIX, the file would be unlinked while being written, and linked into the filesystem on success. The latter turns out to be a lot harder than I expected.

Solution for Windows

I’ll start with Windows since the technique actually works fairly well here — ignoring the usual, dumb Win32 filesystem caveats. This is a little surprising, since it’s usually Win32 that makes these things far more difficult than they should be.

The primary Win32 function for opening and creating files is CreateFile. There are many options, but the key is FILE_FLAG_DELETE_ON_CLOSE. Here’s how an application might typically open a file for output.

DWORD access = GENERIC_WRITE;
DWORD create = CREATE_ALWAYS;
DWORD flags = FILE_FLAG_DELETE_ON_CLOSE;
HANDLE f = CreateFile("out.tmp", access, 0, 0, create, flags, 0);

This special flag asks Windows to delete the file as soon as the last handle to the file object is closed. Notice I said file object, not file, since these are different things. The catch: This flag is a property of the file object, not the file, and cannot be removed.

However, the solution is simple. Create a new link to the file so that it survives deletion. This even works for files residing on network shares.

CreateHardLink("out", "out.tmp", 0);
CloseHandle(f);  // deletes out.tmp file

The gotcha is that the underlying filesystem must be NTFS. FAT32 doesn’t support hard links. Unfortunately, since FAT32 remains the least common denominator and is still widely used for removable media, depending on the application, your users may expect support for saving files to FAT32. A workaround is probably required.

Solution for Linux

This is where things really fall apart. It’s just barely possible on Linux, it’s messy, and it’s not portable anywhere else. There’s no way to do this for POSIX in general.

My initial thought was to create a file then unlink it. Unlike the situation on Windows, files can be unlinked while they’re currently open by a process. These files are finally deleted when the last file descriptor (the last reference) is closed. Unfortunately, using unlink(2) to remove the last link to a file prevents that file from being linked again.

Instead, the solution is to use the relatively new (since Linux 3.11), Linux-specific O_TMPFILE flag when creating the file. Instead of a filename, this variation of open(2) takes a directory and creates an unnamed, temporary file in it. These files are special in that they’re permitted to be given a name in the filesystem at some future point.

For this example, I’ll assume the output is relative to the current working directory. If it’s not, you’ll need to open an additional file descriptor for the parent directory, and also use openat(2) to avoid possible race conditions (since paths can change from under you). The number of ways this can fail is already rapidly multiplying.

int fd = open(".", O_TMPFILE|O_WRONLY, 0600);

The catch is that only a handful of filesystems support O_TMPFILE. It’s like the FAT32 problem above, but worse. You could easily end up in a situation where it’s not supported, so this will almost certainly require a workaround.

Linking a file from a file descriptor is where things get messier. The file descriptor must be linked with linkat(2) from its name on the /proc virtual filesystem, constructed as a string. The following snippet comes straight from the Linux open(2) manpage.

char buf[64];
sprintf(buf, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, buf, AT_FDCWD, "out", AT_SYMLINK_FOLLOW);

Even on Linux, /proc isn’t always available, such as within a chroot or a container, so this part can fail as well. In theory there’s a way to do this with the Linux-specific AT_EMPTY_PATH and avoid /proc, but I couldn’t get it to work.

// Note: this doesn't actually work for me.
linkat(fd, "", AT_FDCWD, "out", AT_EMPTY_PATH);
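Putting the O_TMPFILE and /proc steps together with error checking gives something like the sketch below. The helper name is hypothetical, and it assumes the destination path is on the same filesystem as the directory and doesn’t already exist.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper: write data to an unnamed file created in dir,
// then link it into the filesystem at path. Returns 0 on success, -1
// if any step fails, in which case nothing is left behind.
int
write_unnamed_then_link(const char *dir, const char *path,
                        const void *data, size_t len)
{
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
    if (fd == -1)
        return -1; // e.g. the filesystem doesn't support O_TMPFILE

    int result = -1;
    if (write(fd, data, len) == (ssize_t)len) {
        char buf[64];
        snprintf(buf, sizeof(buf), "/proc/self/fd/%d", fd);
        // Fails if /proc is unavailable or path already exists.
        if (linkat(AT_FDCWD, buf, AT_FDCWD, path, AT_SYMLINK_FOLLOW) == 0)
            result = 0;
    }
    close(fd); // if never linked, the file is deleted here
    return result;
}
```

Every -1 from this function is a case where the caller needs the fallback, which is a fair illustration of how many ways the technique can fail.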

Given the poor portability (even within Linux), the number of ways this can go wrong, and that a workaround is definitely needed anyway, I’d say this technique is worthless. I’m going to stick with the tried-and-true approach for this one.

Appending to a File from Multiple Processes

2016-08-03T16:17:44Z

Suppose you have multiple processes appending output to the same file without explicit synchronization. These processes might be working in parallel on different parts of the same problem, or these might be threads blocked individually reading different external inputs. There are two concerns that come into play:

1) The append must be atomic such that it doesn’t clobber previous appends by other threads and processes. For example, suppose a write requires two separate operations: first moving the file pointer to the end of the file, then performing the write. There would be a race condition should another process or thread intervene in between with its own write.

2) The output will be interleaved. The primary solution is to design the data format as atomic records, where the ordering of records is unimportant — like rows in a relational database. This could be as simple as a text file with each line as a record. The concern is then ensuring records are written atomically.

This article discusses processes, but the same applies to threads when directly dealing with file descriptors.

Appending

The first concern is solved by the operating system, with one caveat. On POSIX systems, opening a file with the O_APPEND flag will guarantee that writes always safely append.

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

However, this says nothing about interleaving. Two processes successfully appending to the same file will result in all their bytes in the file in order, but not necessarily contiguously.

The caveat is that not all filesystems are POSIX-compatible. Two famous examples are NFS and the Hadoop Distributed File System (HDFS). On these networked filesystems, appends are simulated and subject to race conditions.

On POSIX systems, fopen(3) with the a flag will use O_APPEND, so you don’t necessarily need to use open(2). On Linux this can be verified for any language’s standard library with strace.

#include <stdio.h>

int main(void)
{
    fopen("/dev/null", "a");
    return 0;
}

And the result of the trace:

$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
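The same check can be made from inside the program rather than with strace. This is a small sketch with a hypothetical helper name: it asks fcntl(2) for the file status flags behind a stream and tests for O_APPEND.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>

// Hypothetical helper: returns 1 if the stream's file descriptor has
// O_APPEND set, 0 if it doesn't, and -1 on error.
int
stream_is_append(FILE *f)
{
    int flags = fcntl(fileno(f), F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_APPEND) != 0;
}
```

On a system where fopen(3) maps the a flag to O_APPEND, a stream opened with "a" should report 1 and one opened with "w" should report 0.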

For Win32, the equivalent is the FILE_APPEND_DATA access right, and similarly only applies to “local files.”

Interleaving and Pipes

The interleaving problem has two layers, and gets more complicated the more correct you want to be. Let’s start with pipes.

On POSIX, a pipe is unseekable and doesn’t have a file position, so appends are the only kind of write possible. When writing to a pipe (or FIFO), writes of at most the system-defined PIPE_BUF bytes are guaranteed to be atomic and non-interleaving.

Write requests of PIPE_BUF bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than PIPE_BUF bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, […]

The minimum value for PIPE_BUF for POSIX systems is 512 bytes. On Linux it’s 4kB, and on other systems it’s as high as 32kB. As long as each record is no larger than 512 bytes, a simple write(2) will do. None of this depends on a filesystem since no files are involved.
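A minimal sketch of that rule as code: a hypothetical write_record() that refuses anything larger than PIPE_BUF and otherwise issues exactly one write(2). It falls back to the POSIX minimum of 512 if the system doesn’t define PIPE_BUF.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF _POSIX_PIPE_BUF // the guaranteed minimum, 512
#endif

// Hypothetical helper: write one record to a pipe with a single
// write(2). Returns 0 on success, -1 if the record is empty, too
// large to be atomic, or the write fails.
int
write_record(int fd, const void *rec, size_t len)
{
    if (len == 0 || len > PIPE_BUF)
        return -1;
    ssize_t n = write(fd, rec, len);
    return n == (ssize_t)len ? 0 : -1;
}
```

A strictly portable program would compare against 512 rather than the local PIPE_BUF, since the record format has to work everywhere the program runs.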

If PIPE_BUF bytes isn’t enough, the POSIX writev(2) can be used to atomically write up to IOV_MAX buffers of PIPE_BUF bytes. The minimum value for IOV_MAX is 16, but is typically 1024. This means the maximum safe atomic write size for pipes, and therefore the largest record size, for a perfectly portable program is 8kB (16✕512). On Linux it’s 4MB.
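For reference, here is a sketch of what a writev(2) call looks like when a record is assembled from several pieces. The helper name and the "key,value" record layout are made up for illustration. The sketch doesn’t enforce any size limit, so the caller is assumed to keep the record within whatever atomic limit applies.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/uio.h>

// Hypothetical helper: emit "key,value\n" as one record using a single
// writev(2) instead of concatenating the pieces first. Returns 0 on
// success, -1 on a failed or short write.
int
write_pair(int fd, const char *key, const char *value)
{
    struct iovec iov[4] = {
        { (void *)key,   strlen(key)   },
        { ",",           1             },
        { (void *)value, strlen(value) },
        { "\n",          1             },
    };
    size_t total = 0;
    for (int i = 0; i < 4; i++)
        total += iov[i].iov_len;
    ssize_t n = writev(fd, iov, 4);
    return n == (ssize_t)total ? 0 : -1;
}
```

The benefit is that the pieces go to the kernel in one system call without first copying them into a staging buffer.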

That’s all at the system call level. There’s another layer to contend with: buffered I/O in your language’s standard library. Your program may pass data in appropriately-sized pieces for atomic writes to the I/O library, but it may be undoing your hard work, concatenating all these writes into a buffer, splitting apart your records. For this part of the article, I’ll focus on single-threaded C programs.

Suppose you’re writing a simple space-separated format with one line per record.

int foo, bar;
float baz;
while (condition) {
    // ...
    printf("%d %d %f\n", foo, bar, baz);
}

Whether or not this works depends on how stdout is buffered. C standard library streams (FILE *) have three buffering modes: unbuffered, line buffered, and fully buffered. Buffering is configured through setbuf(3) and setvbuf(3), and the initial buffering state of a stream depends on various factors. For buffered streams, the default buffer is at least BUFSIZ bytes, itself at least 256 (C99 §7.19.2¶7). Note: threads share this buffer.

Since each record in the above program easily fits inside 256 bytes, if stdout is a line buffered pipe then this program will interleave correctly on any POSIX system without further changes.

If instead your output is comma-separated values (CSV) and your records may contain new line characters, there are two approaches. In each, the record must still be no larger than PIPE_BUF bytes.

Unbuffered pipe: construct the record in a buffer (i.e. sprintf(3)) and output the entire buffer in a single fwrite(3). While I believe this will always work in practice, it’s not guaranteed by the C specification, which defines fwrite(3) as a series of fputc(3) calls (C99 §7.19.8.2¶2).
Fully buffered pipe: set a sufficiently large stream buffer and follow each record with a fflush(3). Unlike fwrite(3) on an unbuffered stream, the specification says the buffer will be “transmitted to the host environment as a block” (C99 §7.19.3¶3), so this should be perfectly correct on any POSIX system.

If your situation is more complicated than this, you’ll probably have to bypass your standard library buffered I/O and call write(2) or writev(2) yourself.

Practical Application

If interleaving writes to a pipe stdout sounds contrived, here’s the real life scenario: GNU xargs with its --max-procs (-P) option to process inputs in parallel.

xargs -n1 -P$(nproc) myprogram < inputs.txt | cat > outputs.csv

The | cat ensures the output of each myprogram process is connected to the same pipe rather than to the same file.

A non-portable alternative to | cat, especially if you’re dispatching processes and threads yourself, is the splice(2) system call on Linux. It efficiently moves the output from the pipe to the output file without an intermediate copy to userspace. GNU Coreutils’ cat doesn’t use this.

Win32 Pipes

On Win32, anonymous pipes have no semantics regarding interleaving. Named pipes have per-client buffers that prevent interleaving. However, the pipe buffer size is unspecified, and requesting a particular size is only advisory, so it comes down to trial and error, though the unstated limits should be comparatively generous.

Interleaving and Files

Suppose instead of a pipe we have an O_APPEND file on POSIX. Common wisdom states that the same PIPE_BUF atomic write rule applies. While this often works, especially on Linux, this is not correct. The POSIX specification doesn’t require it and there are systems where it doesn’t work.

If you know the particular limits of your operating system and filesystem, and you don’t care much about portability, then maybe you can get away with interleaving appends. For full portability, pipes are required.

On Win32, writes on local files up to the underlying drive’s sector size (typically 512 bytes to 4kB) are atomic. Otherwise the only options are deprecated Transactional NTFS (TxF), or manually synchronizing your writes. All in all, it’s going to take more work to get correct.

Conclusion

My true use case for mucking around with clean, atomic appends is to compute giant CSV tables in parallel, with the intention of later loading into a SQL database (i.e. SQLite) for analysis. A more robust and traditional approach would be to write results directly into the database as they’re computed. But I like the platform-neutral intermediate CSV files — good for archival and sharing — and the simplicity of programs generating the data — concerned only with atomic write semantics rather than calling into a particular SQL database API.

Mapping Multiple Memory Views in User Space

2016-04-10T21:59:16Z

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

Some useful applications of this feature include:

An extremely fast, large memory “copy” by mapping the source memory overtop the destination memory.
Trivial interoperability between code instrumented with baggy bounds checking [PDF] and non-instrumented code. A few bits of each pointer are reserved to tag the pointer with the size of its memory allocation. For compactness, the stored size is rounded up to a power of two, making it “baggy.” Instrumented code checks this tag before making a possibly-unsafe dereference. Normally, instrumented code would need to clear (or set) these bits before dereferencing or before passing it to non-instrumented code. Instead, the allocation could be mapped simultaneously at each location for every possible tag, making the pointer valid no matter its tag bits.
Two responses to my last post on hotpatching suggested that, instead of modifying the instruction directly, memory containing the modification could be mapped over top of the code. I would copy the code to another place in memory, safely modify it in private, switch the page protections from write to execute (both for W^X and for other hardware limitations), then map it over the target. Restoring the original behavior would be as simple as unmapping the change.

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit
shm_unlink("/example");

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don’t plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
close(fd);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef
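Here is the whole sequence condensed into one checked function. Note the swap: instead of shm_open it uses the Linux-specific memfd_create(2) (Linux 3.17 and later), which returns a descriptor to anonymous memory without needing a unique name. That substitution is my assumption for the sketch, not what the steps above use, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo: map the same memory at two addresses and check
// that a store through one is visible through the other. Returns 1 if
// aliased, 0 if not, and -1 if setup fails.
int
alias_demo(void)
{
    size_t size = sizeof(uint32_t);
    int fd = memfd_create("alias", 0); // name is only a debugging label
    if (fd == -1)
        return -1;
    if (ftruncate(fd, size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd); // the mappings keep the memory alive
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED) munmap(a, size);
        if (b != MAP_FAILED) munmap(b, size);
        return -1;
    }

    *a = 0xdeafbeef;
    int aliased = a != b && *b == 0xdeafbeef;
    munmap(a, size);
    munmap(b, size);
    return aliased;
}
```

The shm_open version has the same shape, with the shm_open/shm_unlink pair in place of memfd_create.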

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFileEx.

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
                             NULL,
                             PAGE_READWRITE,
                             0, size,
                             NULL);

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
CloseHandle(h);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as an additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.

The full source is here: memalias.c

POSIX

Starting with the simpler of the two functions, the POSIX implementation looks like so:

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}

The complex part is creating the mapping:

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%ld,%p)",
             __FUNCTION__, (long)getpid(), (void *)addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.

Windows

The Windows version is very similar.

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}

Since Windows tracks the size internally, it’s unneeded and ignored.

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL,
                                 PAGE_READWRITE,
                                 0, size,
                                 NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);
    return 0;
}

In the future I’d like to find some unique applications of these multiple memory views.

Hotpatching a C Function on x86

2016-03-31T23:59:59Z

In this post I’m going to do a silly, but interesting, exercise that should never be done in any program that actually matters. I’m going to write a program that changes one of its function definitions while it’s actively running and using that function. Unlike last time, this won’t involve shared libraries, but it will require x86_64 and GCC. Most of the time it will work with Clang, too, but it’s missing an important compiler option that makes it stable.

If you want to see it all up front, here’s the full source: hotpatch.c

Here’s the function that I’m going to change:

void
hello(void)
{
    puts("hello");
}

It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:

void
hello(void)
{
    static int x;
    printf("goodbye %d\n", x++);
}

I was only going to change the string, but I figured I should make it a little more interesting.

Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a far more complex problem.

But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with:

I want to hotpatch this function while it is being used by another thread without any synchronization. It may even be executing the function at the same time I clobber its first instructions with my jump. If it’s in between these instructions, disaster will strike.

The solution is the ms_hook_prologue function attribute. This tells GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP that I can safely clobber. This idea originated in Microsoft’s Win32 API, hence the “ms” in the name.

The prologue NOP needs to be updated atomically. I can’t let the other thread see a half-written instruction or, again, disaster. On x86 this means I have an alignment requirement. Since I’m overwriting an 8-byte instruction, I’m specifically going to need 8-byte alignment to get an atomic write.

The solution is the aligned function attribute, ensuring the hotpatch prologue is properly aligned.

The final problem is that there must be exactly one copy of this function in the compiled program. It must never be inlined or cloned, since these won’t be hotpatched.

As you might have guessed, this is primarily fixed with the noinline function attribute. GCC may also clone the function and call that instead, so it also needs the noclone attribute.

Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.

What does the function look like now?

__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))
void
hello(void)
{
    __asm("");
    puts("hello");
}

And what does the assembly look like?

$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 <puts@plt>

It’s 8-byte aligned and it has the 8-byte NOP: that lea instruction does nothing. It copies rsp into itself and changes no flags. Why not 8 1-byte NOPs? I need to replace exactly one instruction with exactly one other instruction. I can’t have another thread in between those NOPs.

Hotpatching

Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.

void
hotpatch(void *target, void *replacement)
{
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);
}

It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.

There are a number of things going on here, so let’s go through them one at a time:

Make the function writeable

The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writeable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.

It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry hugepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail on systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.

What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.

At the end of the function, I mprotect() the page back to non-writeable.

Create the instruction

I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work within anyway. Looking that up in the Intel manual, I find the entry for a near relative jump (JMP rel32, encoded as E9 cd).

Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.

To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.

I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.
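To make the wrap-around concrete, here is a small sketch that only builds the bytes and patches nothing. The name encode_jmp() is hypothetical, and it takes plain integer addresses so it can be tried with made-up values.

```c
#include <stdint.h>

/* Hypothetical helper: fill out[0..7] with a 5-byte "jmp rel32" from
 * target to replacement, followed by three zero padding bytes.
 * Assumes the two addresses are within 2GB of each other. */
static void
encode_jmp(uint8_t out[8], uintptr_t target, uintptr_t replacement)
{
    /* The displacement is relative to the end of the 5-byte instruction.
     * Unsigned subtraction wraps around, which yields the two's
     * complement encoding of a backward (negative) displacement. */
    uint32_t rel = (uint32_t)(replacement - target - 5);
    out[0] = 0xe9;                  /* jmp rel32 opcode */
    out[1] = (uint8_t)(rel >>  0);  /* little endian displacement */
    out[2] = (uint8_t)(rel >>  8);
    out[3] = (uint8_t)(rel >> 16);
    out[4] = (uint8_t)(rel >> 24);
    out[5] = out[6] = out[7] = 0;   /* padding after the jump */
}
```

Jumping forward from 0x1000 to 0x2000 gives the displacement 0x00000ffb. Jumping backward from 0x2000 to 0x1000 gives 0xffffeffb, which is -0x1005 as a signed 32-bit value.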

Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.

    *(uint64_t *)target = instruction.value;

Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even though I’ve clobbered it, but that’s fine.
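The plain assignment leans on the compiler emitting a single 64-bit store, which C itself doesn’t promise. One way to spell that intent out, assuming GCC or Clang, is the __atomic_store_n builtin. The name store_instruction() is hypothetical, and this only constrains the compiler; it is a sketch, not a statement about how other cores fetch instructions.

```c
#include <stdint.h>

/* Hypothetical helper: publish an 8-byte value with one store.
 * __atomic_store_n is a GCC/Clang builtin, and target must be
 * 8-byte aligned for the store to be a single access. */
static void
store_instruction(uint64_t *target, uint64_t value)
{
    __atomic_store_n(target, value, __ATOMIC_RELAXED);
}
```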

Trying it out

Here’s what my test program looks like:

void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        hello();
        usleep(100000);
    }
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    getchar();
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;
}

I fire off the other thread to keep it pinging at hello(). In the main thread, it waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the “worker” thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.

Small, Freestanding Windows Executables

2016-01-31T22:53:03Z

Update: This is old and was updated in 2023!

Recently I’ve been experimenting with freestanding C programs on Windows. Freestanding refers to programs that don’t link, either statically or dynamically, against a standard library (i.e. libc). This is typical for operating systems and similar, bare metal situations. Normally a C compiler can make assumptions about the semantics of functions provided by the C standard library. For example, the compiler will likely replace a call to a small, fixed-size memmove() with move instructions. Since a freestanding program would supply its own, it may have different semantics.

My usual go to for C/C++ on Windows is Mingw-w64, which has greatly suited my needs the past couple of years. It’s packaged on Debian, and, when combined with Wine, allows me to fully develop Windows applications on Linux. Being GCC, it’s also great for cross-platform development since it’s essentially the same compiler as the other platforms. The primary difference is the interface to the operating system (POSIX vs. Win32).

However, it has one glaring flaw inherited from MinGW: it links against msvcrt.dll, an ancient version of the Microsoft C runtime library that currently ships with Windows. Besides being dated and quirky, it’s not an official part of Windows and never has been, despite its inclusion with every release since Windows 95. Mingw-w64 doesn’t have a C library of its own, instead patching over some of the flaws of msvcrt.dll and linking against it.

Since so much depends on msvcrt.dll despite its unofficial nature, it’s unlikely Microsoft will ever drop it from future releases of Windows. However, if strict correctness is a concern, we must ask Mingw-w64 not to link against it. An alternative would be PlibC, though the LGPL licensing is unfortunate. Another is Cygwin, which is a very complete POSIX environment, but is heavy and GPL-encumbered.

Sometimes I’d prefer to be more direct: skip the C standard library altogether and talk directly to the operating system. On Windows that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only links against system DLLs.

Linux vs. Windows

The most important benefit of a standard library like libc is a portable, uniform interface to the host system. So long as the standard library suits its needs, the same program can run anywhere. Without it, the program needs an implementation of each host-specific interface.

On Linux, operating system requests at the lowest level are made directly via system calls. This requires a bit of assembly language for each supported architecture (int 0x80 on x86, syscall on x86-64, swi on ARM, etc.). The POSIX functions of the various Linux libc implementations are built on top of this mechanism.

For example, here’s a function for a 1-argument system call on x86-64.

long
syscall1(long n, long arg)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(arg)
        // syscall overwrites rcx and r11; the kernel may touch memory
        : "rcx", "r11", "memory"
    );
    return result;
}

Then exit() is implemented on top. Note: A real libc would do cleanup before exiting, like calling registered atexit() functions.

#include <sys/syscall.h>  // defines SYS_exit

void
exit(int code)
{
    syscall1(SYS_exit, code);
}

The situation is simpler on Windows. Its low level system calls are undocumented and unstable, changing across even minor updates. The formal, stable interface is through the exported functions in kernel32.dll. In fact, kernel32.dll is essentially a standard library on its own (making the term “freestanding” in this case dubious). It includes functions usually found only in user-space, like string manipulation, formatted output, font handling, and heap management (similar to malloc()). It’s not POSIX, but it has analogs to much of the same functionality.

Program Entry

The standard entry for a C program is main(). However, this is not the application’s true entry. The entry is in the C library, which does some initialization before calling your main(). When main() returns, it performs cleanup and exits. Without a C library, programs don’t start at main().

On Linux the default entry is the symbol _start. Its prototype would look like so:

void _start(void);

Returning from this function leads to a segmentation fault, so it’s up to your application to perform the exit system call rather than return.

On Windows, the entry depends on the type of application. The two relevant subsystems today are the console and windows subsystems. The former is for console applications (duh). These programs may still create windows and such, but must always have a controlling console. The latter is primarily for programs that don’t run in a console, though they can still create an associated console if they like. In Mingw-w64, give -mconsole (default) or -mwindows to the linker to choose the subsystem.

The default entry for each is slightly different.

int WINAPI mainCRTStartup(void);
int WINAPI WinMainCRTStartup(void);

Unlike Linux’s _start, Windows programs can safely return from these functions, similar to main(), hence the int return. The WINAPI macro means the function may have a special calling convention, depending on the platform.

On any system, you can choose a different entry symbol or address using the --entry option to the GNU linker.

Disabling libgcc

One problem I’ve run into is Mingw-w64 generating code that calls __chkstk_ms() from libgcc. I believe this is a long-standing bug, since -ffreestanding should prevent these sorts of helper functions from being used. The workaround I’ve found is to disable the stack probe and pre-commit the whole stack.

-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000

Alternatively you could link against libgcc (statically) with -lgcc, but, again, I’m going for a tiny executable.

A freestanding example

Here’s an example of a Windows “Hello, World” that doesn’t use a C library.

#include <windows.h>

int WINAPI
mainCRTStartup(void)
{
    char msg[] = "Hello, world!\n";
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    // sizeof(msg) - 1 leaves out the string's null terminator
    WriteFile(stdout, msg, sizeof(msg) - 1, (DWORD[]){0}, NULL);
    return 0;
}

To build it:

x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32

Notice I manually linked against kernel32.dll. The stripped final result is only 4kB, mostly PE padding. There are techniques to trim this down even further, but for a substantial program it wouldn’t make a significant difference.

From here you could create a GUI by linking against user32.dll and gdi32.dll (both also part of Win32) and calling the appropriate functions. I already ported my OpenGL demo to a freestanding .exe, dropping GLFW and directly using Win32 and WGL. It’s much less portable, but the final .exe is only 4kB, down from the original 104kB (static linking against GLFW).

I may go this route for the upcoming 7DRL 2016 in March.

Raw Linux Threads via System Calls

2015-05-15T17:33:40Z

This article has a followup.

Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.

Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.
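As a small illustration of that copy (a sketch with a hypothetical function name, not taken from the demo): a child created with fork() changes a variable, and the parent checks that its own copy is untouched. A thread created with the sharing flags described below would see the change instead.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a write made by a fork()ed child is invisible to the
 * parent, 0 if it leaked through, and -1 on error. */
static int
fork_isolates_memory(void)
{
    volatile int value = 1;
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        value = 2;  /* modifies only the child's copy */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return value == 1;
}
```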

It’s so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86-64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).

I’ve put the complete demo here if you want to see it all at once:

Pure assembly, library-free Linux threading demo

An x86-64 Primer

I want you to be able to follow along even if you aren’t familiar with x86-64 assembly, so here’s a short primer of the relevant pieces. If you already know x86-64 assembly, feel free to skip to the next section.

x86-64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won’t need them for threading.

rsp : stack pointer
rbp : “base” pointer (still used in debugging and profiling)
rax rbx rcx rdx : general purpose (notice: a, b, c, d)
rdi rsi : “destination” and “source”, now meaningless names
r8 r9 r10 r11 r12 r13 r14 r15 : added for x86-64

The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32 bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32 bits, then to 64 bits. Historically each of these registers had a specific, unique purpose, but on x86-64 they’re almost completely interchangeable.

There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.

The Stack

The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.

The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86-64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order:

rdi, rsi, rdx, rcx, r8, r9

So, for example, to perform a function call like foo(1, 2, 3), store 1, 2 and 3 in rdi, rsi, and rdx, then call the function. The mov instruction stores the source (second) operand in its destination (first) operand. The call instruction pushes the current value of rip onto the stack, then sets rip (jumps) to the address of the target function. When the callee is ready to return, it uses the ret instruction to pop the original rip value off the stack and back into rip, returning control to the caller.

    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    call foo

Called functions must preserve the contents of these registers (the same value must be stored when the function returns):

rbx, rsp, rbp, r12, r13, r14, r15

System Calls

When making a system call, the argument registers are slightly different. Notice rcx has been changed to r10.

rdi, rsi, rdx, r10, r8, r9

Each system call has an integer identifying it. This number is different on each platform, but, in Linux’s case, it will never change. Instead of call, rax is set to the number of the desired system call and the syscall instruction makes the request to the OS kernel. Prior to x86-64, this was done with an old-fashioned interrupt. Because interrupts are slow, a special, statically-positioned “vsyscall” page (now deprecated as a security hazard), later vDSO, is provided to allow certain system calls to be made as function calls. We’ll only need the syscall instruction in this article.

So, for example, the write() system call has this C prototype.

ssize_t write(int fd, const void *buf, size_t count);

On x86-64, the write() system call is at the top of the system call table as call 1 (read() is 0). Standard output is file descriptor 1 by default (standard input is 0). The following bit of code will write 10 bytes of data from the memory address buffer (a symbol defined elsewhere in the assembly program) to standard output. The number of bytes written, or a negative error code on failure, will be returned in rax. (Returning -1 and setting errno is a convention added by the libc wrapper, not the kernel.)

    mov rdi, 1        ; fd
    mov rsi, buffer
    mov rdx, 10       ; 10 bytes
    mov rax, 1        ; SYS_write
    syscall
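For readers more comfortable in C, the same request can be sketched with GCC-style inline assembly. This assumes x86-64 Linux, and the names raw_write() and to_libc_result() are hypothetical. The second function shows the translation a libc wrapper typically performs on the raw result.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical wrapper: make the write system call (number 1 on x86-64
 * Linux) directly. Returns the number of bytes written, or a negative
 * error code such as -EBADF. */
static long
raw_write(int fd, const void *buf, size_t len)
{
    long r;
    __asm__ volatile (
        "syscall"
        : "=a"(r)
        : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
        : "rcx", "r11", "memory"  /* syscall overwrites rcx and r11 */
    );
    return r;
}

/* Hypothetical helper: convert a raw result into the libc convention
 * of returning -1 and storing the error code in errno. */
static long
to_libc_result(long r)
{
    if (r < 0) {
        errno = (int)-r;
        return -1;
    }
    return r;
}
```

Writing to an invalid descriptor, as in raw_write(-1, "x", 1), returns -EBADF directly in the result.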

Effective Addresses

There’s one last thing you need to know: registers often hold a memory address (i.e. a pointer), and you need a way to read the data behind that address. In NASM syntax, wrap the register in brackets (e.g. [rax]), which, if you’re familiar with C, would be the same as dereferencing the pointer.

These bracket expressions, called effective addresses, may be limited mathematical expressions that offset the base address entirely within a single instruction. The expression can include another register (index), a power-of-two scalar (bit shift), and an immediate signed offset. For example, [rax + rdx*8 + 12]. If rax is a pointer to a struct, and rdx is an array index to an element of an array in that struct, only a single instruction is needed to read that element. NASM is smart enough to allow the assembly programmer to break this mold a little bit with more complex expressions, so long as it can reduce it to the [base + index*2^exp + offset] form.
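The arithmetic can be modeled in C as a sketch. The function names and struct record here are hypothetical. The struct places an array of 8-byte elements at offset 8, so the matching effective address would be [base + index*8 + 8].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an effective address:
 * base + index * 2^exp + offset, where exp is 0, 1, 2, or 3. */
static uintptr_t
effective_address(uintptr_t base, uintptr_t index, unsigned exp,
                  int32_t offset)
{
    return base + (index << exp) + (uintptr_t)(intptr_t)offset;
}

/* Hypothetical struct with an array member. */
struct record {
    uint32_t tag;
    uint64_t values[4];
};

/* Computes &r->values[i] the way a single instruction would. */
static uint64_t *
element_address(struct record *r, uintptr_t i)
{
    return (uint64_t *)effective_address(
        (uintptr_t)r, i, 3, (int32_t)offsetof(struct record, values));
}
```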

The details of addressing aren’t important for this article, so don’t worry too much about it if that didn’t make sense.

Allocating a Stack

Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. That leaves the stack, and before we can spawn a new thread, we need to allocate a stack, which is nothing more than a memory buffer.

The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.

You can’t just read from and write to arbitrary addresses in virtual memory, you first have to ask the kernel to allocate pages. There are two system calls on Linux to do this:

brk(): Extends (or shrinks) the heap of a running process, typically located somewhere shortly after the .bss segment. Many allocators will do this for small or initial allocations. This is a less optimal choice for thread stacks because the stacks will be very near other important data, near other stacks, and lack a guard page (by default). It would be somewhat easier for an attacker to exploit a buffer overflow. A guard page is a locked-down page just past the absolute end of the stack that will trigger a segmentation fault on a stack overflow, rather than allow a stack overflow to trash other memory undetected. A guard page could still be created manually with mprotect(). There’s also no room for these stacks to grow.
mmap(): Use an anonymous mapping to allocate a contiguous set of pages at some randomized memory location. As we’ll see, you can even tell the kernel specifically that you’re going to use this memory as a stack. Also, this is simpler than using brk() anyway.

On x86-64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.

void *stack_create(void);

The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.

void *mmap(void *addr, size_t length, int prot, int flags);

For flags, we’ll choose a private, anonymous mapping that, being a stack, grows downward. Even with that last flag, the system call will still return the bottom address of the mapping, which will be important to remember later. It’s just a simple matter of setting the arguments in the registers and making the system call.

%define SYS_mmap	9
%define STACK_SIZE	(4096 * 1024)	; 4 MB

stack_create:
    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

Now we can allocate new stacks (or stack-sized buffers) as needed.
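For comparison, here is roughly the same allocation in C through the libc mmap() wrapper. This is a sketch: stack_destroy() is a hypothetical companion that isn’t in the assembly above, and unlike the raw system call the wrapper reports failure as MAP_FAILED, which I convert to NULL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for MAP_ANONYMOUS and MAP_GROWSDOWN */
#endif
#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (4096 * 1024)  /* 4 MB, as in the assembly version */

/* Returns the low address of a new stack-sized mapping, or NULL. */
static void *
stack_create(void)
{
    void *p = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical companion: release a stack made by stack_create().
 * Returns 0 on success, -1 on failure. */
static int
stack_destroy(void *stack)
{
    return munmap(stack, STACK_SIZE);
}
```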

Spawning a Thread

Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.

long sys_clone(unsigned long flags, void *child_stack);

Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.

long thread_create(void (*)(void));

The function pointer argument is passed via rdi, per the ABI. Store this for safekeeping on the stack (push) in preparation for calling stack_create(). When it returns, the address of the low end of stack will be in rax.

thread_create:
    push rdi
    call stack_create
    lea rsi, [rax + STACK_SIZE - 8]
    pop qword [rsi]
    mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
             CLONE_PARENT | CLONE_THREAD | CLONE_IO
    mov rax, SYS_clone
    syscall
    ret

The second argument to clone() is a pointer to the high address of the stack (specifically, just above the stack). So we need to add STACK_SIZE to rax to get the high end. This is done with the lea instruction: load effective address. Despite the brackets, it doesn’t actually read memory at that address, but instead stores the address in the destination register (rsi). I’ve moved it back by 8 bytes because I’m going to place the thread function pointer at the “top” of the new stack in the next instruction. You’ll see why in a moment.

Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.
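In C terms, the lea and pop pair amounts to something like the following sketch. The names stack_seed() and example_entry() are hypothetical, and the slot size is written as sizeof(uintptr_t), which is 8 on x86-64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store fn in the highest slot of a stack buffer
 * and return the address of that slot, which is the value the new
 * thread's stack pointer would start with. */
static void *
stack_seed(void *stack_low, size_t size, void (*fn)(void))
{
    uintptr_t *slot =
        (uintptr_t *)((char *)stack_low + size - sizeof(uintptr_t));
    *slot = (uintptr_t)fn;  /* the address a later ret would jump to */
    return slot;
}

/* Hypothetical thread function, used only to have an address to store. */
static void
example_entry(void)
{
}
```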

As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the new process by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.

CLONE_THREAD: Put the new process in the same thread group.
CLONE_VM: Runs in the same virtual memory space.
CLONE_PARENT: Share a parent with the calling process.
CLONE_SIGHAND: Share signal handlers.
CLONE_FS, CLONE_FILES, CLONE_IO: Share filesystem information.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).

Now here’s the really cool part, and the reason branching isn’t needed. There’s no reason to check rax to determine if we are the original thread (in which case we return to the caller) or if we’re the new thread (in which case we jump to the thread function). Remember how we seeded the new stack with the thread function? When the new thread returns (ret), it will jump to the thread function with a completely empty stack. The original thread, using the original stack, will return to the caller.

The value returned by thread_create() is the process ID of the new thread, which is essentially the thread object (e.g. Pthread’s pthread_t).

Cleaning Up

The thread function has to be careful not to return (ret) since there’s nowhere to return. It will fall off the stack and terminate the program with a segmentation fault. Remember that threads are just processes? It must use the exit() syscall to terminate. This won’t terminate the other threads.

%define SYS_exit	60

exit:
    mov rax, SYS_exit
    syscall

Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.

More Exploration

If you found this interesting, be sure to check out the full demo link at the top of this article. Now with the ability to spawn threads, it’s a great opportunity to explore and experiment with x86’s synchronization primitives, such as the lock instruction prefix, xadd, and compare-and-exchange (cmpxchg). I’ll discuss these in a future article.
