<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged optimization at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/optimization/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/optimization/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:6022e81a-277c-4339-b204-59145c368baf</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>An easy-to-implement, arena-friendly hash map</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/09/30/"/>
    <id>urn:uuid:4a457832-7d23-4dab-80f2-31f683379d7b</id>
    <updated>2023-09-30T23:18:40Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My last article had <a href="/blog/2023/09/27/">tips for arena allocation</a>. This next
article demonstrates a technique for building bespoke hash maps that
compose nicely with arena allocation. In addition, they’re fast, simple,
and automatically scale to any problem that could reasonably be solved
with an in-memory hash map. To avoid resizing — both to better support
arenas and to simplify implementation — they have slightly above average
memory requirements. The design, which we’re calling a <em>hash-trie</em>, is the
result of <a href="https://nrk.neocities.org/articles/hash-trees-and-tries">fruitful collaboration with NRK</a>, whose sibling article
includes benchmarks. It’s my new favorite data structure, and has proven
incredibly useful. With a couple well-placed acquire/release atomics, we
can even turn it into a <em>lock-free concurrent hash map</em>.</p>

<p>I’ve written before about <a href="/blog/2022/08/08/">MSI hash tables</a>, a simple, <em>very</em> fast
map that can be quickly implemented from scratch as needed, tailored to
the problem at hand. The trade-off is that one must know the upper bound
<em>a priori</em> in order to size the base array. Scaling up requires resizing
the array — an impedance mismatch with arena allocation. Search trees
scale better, as there’s no underlying array, but tree balancing tends to
be finicky and complex, unsuitable for rapid, on-demand implementation.
<strong>We want the ease of an MSI hash table with the scaling of a tree.</strong></p>

<p>I’ll motivate the discussion with example usage. Suppose we have an array
of pointer+length strings, as defined last time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">uint8_t</span>  <span class="o">*</span><span class="n">data</span><span class="p">;</span>
    <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">}</span> <span class="n">str</span><span class="p">;</span>
</code></pre></div></div>

<p>And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps <code class="language-plaintext highlighter-rouge">str</code> to
<code class="language-plaintext highlighter-rouge">bool</code>, and it is represented as type <code class="language-plaintext highlighter-rouge">strmap</code> and one insert+lookup
function, <code class="language-plaintext highlighter-rouge">upsert</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Insert/get bool value for given str key.</span>
<span class="n">bool</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">strmap</span> <span class="o">**</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">strmap</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">bool</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="n">upsert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">b</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// previously seen (discard)</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="c1">// newly-seen (keep)</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
            <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In particular, note:</p>

<ul>
  <li>
    <p>A null pointer is an empty hash map and initialization is trivial. As
discussed in the last article, one of my arena allocation principles is
default zero-initialization. Put together, that means any data structure
containing a map comes with a ready-to-use, empty map.</p>
  </li>
  <li>
    <p>The map is allocated out of the scratch arena so it’s automatically
freed upon any return. It’s as care-free as garbage collection.</p>
  </li>
  <li>
    <p>The map directly uses strings in the input array as keys, without making
copies or worrying about ownership. Arenas own objects, not references.
If I wanted to carve out some fixed keys ahead of time, I could even
insert static strings.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">upsert</code> returns a pointer to a value. That is, a pointer into the map.
This is not strictly required, but usually makes for a simple interface.
When an entry is new, this value will be false (zero-initialized).</p>
  </li>
</ul>

<p>So, what is this wonderful data structure? Here’s the basic shape:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">hashmap</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">keytype</span>  <span class="n">key</span><span class="p">;</span>
    <span class="n">valtype</span>  <span class="n">value</span><span class="p">;</span>
<span class="p">}</span> <span class="n">hashmap</span><span class="p">;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">child</code> and <code class="language-plaintext highlighter-rouge">key</code> fields are essential to the map. Adding a <code class="language-plaintext highlighter-rouge">child</code>
to any data structure turns it into a hash map over whatever field you
choose as the key. In other words, a hash-trie can serve as an <em>intrusive
hash map</em>. In several programs I’ve combined intrusive lists and hash maps
to create an insert-ordered hash map. Going the other direction, omitting
<code class="language-plaintext highlighter-rouge">value</code> turns it into a hash set. (Which is what <code class="language-plaintext highlighter-rouge">unique</code> <em>really</em> needs!)</p>

<p>As you probably guessed, this hash-trie is a 4-ary tree. It can easily be
2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but
4-ary strikes a good balance, if a bit bulky. In the example above,
<code class="language-plaintext highlighter-rouge">keytype</code> would be <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">valtype</code> would be <code class="language-plaintext highlighter-rouge">bool</code>. The most general
form of <code class="language-plaintext highlighter-rouge">upsert</code> looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">hashmap</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">hashmap</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This will take some unpacking. The first argument is a pointer to a
pointer. That’s the destination for any newly-allocated element. As it
travels down the tree, this points into the parent’s <code class="language-plaintext highlighter-rouge">child</code> array. If
it points to null, then it’s an empty tree which, by definition, does not
contain the key.</p>

<p>We need two “methods” for keys: <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">equals</code>. The hash function
should return a uniformly distributed integer. As is usually the case,
less uniform fast hashes generally do better than highly-uniform slow
hashes. For hash maps under ~100K elements a 32-bit hash is fine, but
larger maps should use a 64-bit hash state and result. Hash collisions
revert to linear, linked list performance and, per the birthday paradox,
that will happen often with 32-bit hashes on large hash maps.</p>

<p>If you’re worried about pathological inputs, add a seed parameter to
<code class="language-plaintext highlighter-rouge">upsert</code> and <code class="language-plaintext highlighter-rouge">hash</code>. Or maybe even use the address <code class="language-plaintext highlighter-rouge">m</code> as a seed. The
specifics depend on your security model. It’s not an issue for most hash
maps, so I don’t demonstrate it here.</p>
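As a minimal sketch of the seeded variant, using the FNV-style hash that appears later in this article, one option is folding the seed into the initial state. How the seed is mixed in is my illustration, not a fixed recipe:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Same FNV-style hash, but the seed perturbs the initial state so an
// attacker cannot predict branch choices without knowing the seed.
uint64_t hash_seeded(str s, uint64_t seed)
{
    uint64_t h = 0x100 ^ seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```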

<p>The top two bits of the hash are used to select a branch. These tend to be
higher quality for <a href="/blog/2018/07/31/">multiplicative hash functions</a>. At each level
two bits are shifted out. This is what gives it its name: a <em>trie</em> of the
<em>hash bits</em>. Though it’s un-trie-like in the way it deposits elements at
the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a
time.</p>

<p>I initially tried a <a href="/blog/2019/11/19/">Multiplicative Congruential Generator</a> (MCG) to
select the next branch at each trie level, instead of bit shifting, but
NRK noticed it was consistently slower than shifting.</p>

<p>While “delete” could be handled with tombstones, workloads with many
deletes would not work well. After all, the underlying allocator is an
arena. A combination
of uniformly distributed branching and no deletion means that rebalancing
is unnecessary. This is what grants it its simplicity!</p>

<p>If no arena is provided, it reverts to a lookup and returns null when the
key is not found. This lets one function flexibly serve both modes. In
<code class="language-plaintext highlighter-rouge">unique</code>, pure lookups are unneeded, so this condition could be skipped in
its <code class="language-plaintext highlighter-rouge">strmap</code>.</p>

<p>Sometimes it’s useful to return the entire <code class="language-plaintext highlighter-rouge">hashmap</code> object itself rather
than an internal pointer, particularly when it’s intrusive. Use whichever
works best for the situation. Regardless, exploit zero-initialization to
detect newly-allocated elements when possible.</p>

<p>In some cases we may deep copy the key in its arena before inserting it
into the map. The provided key may be a temporary (e.g. <code class="language-plaintext highlighter-rouge">sprintf</code>) which
the map outlives, and the caller doesn’t want to allocate a longer-lived
key unless it’s needed. It’s all part of tailoring the map to the problem,
which we can do because it’s so short and simple!</p>
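Concretely, the copy could happen just before the key is stored, via a hypothetical arena helper like the one below. It assumes a bump arena of the beg/end form from the previous article; the helper name is mine:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Minimal bump arena for illustration.
typedef struct {
    char *beg;
    char *end;
} arena;

// Deep-copy a possibly-temporary key into the arena so the map can
// outlive it. A real version would check capacity; byte data needs no
// alignment, so a plain bump suffices.
str clonestr(str s, arena *perm)
{
    str copy = {(uint8_t *)perm->beg, s.len};
    memcpy(copy.data, s.data, (size_t)s.len);
    perm->beg += s.len;
    return copy;
}
```

The insertion path in upsert would then store clonestr(key, perm) rather than the caller's pointer, and only on the branch that creates a new entry.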

<h3 id="fleshing-it-out">Fleshing it out</h3>

<p>Putting it all together, <code class="language-plaintext highlighter-rouge">unique</code> could look like the following, with
<code class="language-plaintext highlighter-rouge">strmap</code>/<code class="language-plaintext highlighter-rouge">upsert</code> renamed to <code class="language-plaintext highlighter-rouge">strset</code>/<code class="language-plaintext highlighter-rouge">ismember</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="n">str</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">ptrdiff_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">s</span><span class="p">.</span><span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111u</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>

<span class="n">bool</span> <span class="nf">equals</span><span class="p">(</span><span class="n">str</span> <span class="n">a</span><span class="p">,</span> <span class="n">str</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="o">==</span><span class="n">b</span><span class="p">.</span><span class="n">len</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">memcmp</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">len</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">strset</span> <span class="o">*</span><span class="n">child</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
    <span class="n">str</span>     <span class="n">key</span><span class="p">;</span>
<span class="p">}</span> <span class="n">strset</span><span class="p">;</span>

<span class="n">bool</span> <span class="nf">ismember</span><span class="p">(</span><span class="n">strset</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">str</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="o">*</span><span class="n">m</span><span class="p">;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">child</span><span class="p">[</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="o">*</span><span class="n">m</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">strset</span><span class="p">);</span>
    <span class="p">(</span><span class="o">*</span><span class="n">m</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">ptrdiff_t</span> <span class="nf">unique</span><span class="p">(</span><span class="n">str</span> <span class="o">*</span><span class="n">strings</span><span class="p">,</span> <span class="kt">ptrdiff_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">arena</span> <span class="n">scratch</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">ptrdiff_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">strset</span> <span class="o">*</span><span class="n">seen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">count</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">ismember</span><span class="p">(</span><span class="o">&amp;</span><span class="n">seen</span><span class="p">,</span> <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">],</span> <span class="o">&amp;</span><span class="n">scratch</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">strings</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">=</span> <span class="n">strings</span><span class="p">[</span><span class="o">--</span><span class="n">len</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">count</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using
an intrusive linked list on <code class="language-plaintext highlighter-rouge">strset</code>.</p>

<h3 id="relative-pointers">Relative pointers?</h3>

<p>As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes
these hash-tries a bit heavier than average. It’s not an issue for smaller
hash maps, but has practical consequences for huge hash maps.</p>

<p>In an attempt to address this, I experimented with <a href="https://www.youtube.com/watch?v=Z0tsNFZLxSU">relative pointers</a>
(example: <a href="https://github.com/skeeto/scratch/blob/master/misc/markov.c"><code class="language-plaintext highlighter-rouge">markov.c</code></a>). That is, instead of pointers I use signed
integers whose value indicates an offset <em>relative to itself</em>. Because
relative pointers can only refer to nearby memory, a custom allocator is
imperative, and arenas fit the bill perfectly. Range can be extended by
exploiting memory alignment. In particular, 32-bit relative pointers can
reference up to 8GiB in either direction. Zero is reserved to represent a
null pointer, and relative pointers cannot refer to themselves.</p>
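The core mechanism can be sketched in a few lines. The helper names are my own, and this plain int32_t offset reaches only ±2GiB; the full 8GiB range requires the alignment scaling described above:

```c
#include <stddef.h>
#include <stdint.h>

// Self-relative 32-bit pointer: zero encodes null, any other value is
// a byte offset from the field's own address. A field can never point
// at itself, since that offset would be the null encoding.
typedef int32_t relptr;

void relptr_set(relptr *field, void *target)
{
    *field = target ? (int32_t)((char *)target - (char *)field) : 0;
}

void *relptr_get(relptr *field)
{
    return *field ? (char *)field + *field : NULL;
}
```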

<p>As a bonus, data structures built out of relative pointers are <em>position
independent</em>. A collection of them — perhaps even a whole arena — can be
dumped out to, say, a file, loaded back at a different position, then
continue to operate as-is. Very cool stuff.</p>

<p>Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead
in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are
guaranteed to work. No object is ever too far away. It’s a compounding
effect, too. Smaller map nodes means a larger number of them are in reach
of a relative pointer. Also very cool.</p>

<p>However, as far as I know, no generally available programming language
implementation supports this concept well enough to put into practice. You
could implement relative pointers with language extension facilities, such
as C++ operator overloads, but <em>no tools will understand them</em> — a major
bummer. You can no longer use a debugger to examine such structures, and
it’s just not worth that cost. If only arena allocation were more popular…</p>

<h3 id="as-a-concurrent-hash-map">As a concurrent hash map</h3>

<p>For the finale, let’s convert <code class="language-plaintext highlighter-rouge">upsert</code> into a concurrent, lock-free hash
map. That is, multiple threads can call upsert concurrently on the same
map. Each thread must still have its own arena, probably a per-thread
arena, so allocation requires no implicit locking.</p>

<p>The structure itself requires no changes! Instead we need two atomic
operations: atomic load (acquire), and atomic compare-and-exchange
(acquire/release). They operate only on <code class="language-plaintext highlighter-rouge">child</code> array elements and the
tree root. To illustrate I will use <a href="https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html">GCC atomics</a>, also supported by
Clang.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">valtype</span> <span class="o">*</span><span class="nf">upsert</span><span class="p">(</span><span class="n">map</span> <span class="o">**</span><span class="n">m</span><span class="p">,</span> <span class="n">keytype</span> <span class="n">key</span><span class="p">,</span> <span class="n">arena</span> <span class="o">*</span><span class="n">perm</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">);;</span> <span class="n">h</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">map</span> <span class="o">*</span><span class="n">n</span> <span class="o">=</span> <span class="n">__atomic_load_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">n</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">perm</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="n">arena</span> <span class="n">rollback</span> <span class="o">=</span> <span class="o">*</span><span class="n">perm</span><span class="p">;</span>
            <span class="n">map</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="n">new</span><span class="p">(</span><span class="n">perm</span><span class="p">,</span> <span class="n">map</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
            <span class="n">new</span><span class="o">-&gt;</span><span class="n">key</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">pass</span> <span class="o">=</span> <span class="n">__ATOMIC_RELEASE</span><span class="p">;</span>
            <span class="kt">int</span> <span class="n">fail</span> <span class="o">=</span> <span class="n">__ATOMIC_ACQUIRE</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">__atomic_compare_exchange_n</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">n</span><span class="p">,</span> <span class="n">new</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pass</span><span class="p">,</span> <span class="n">fail</span><span class="p">))</span> <span class="p">{</span>
                <span class="k">return</span> <span class="o">&amp;</span><span class="n">new</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="o">*</span><span class="n">perm</span> <span class="o">=</span> <span class="n">rollback</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">equals</span><span class="p">(</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">,</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">n</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">m</span> <span class="o">=</span> <span class="n">n</span><span class="o">-&gt;</span><span class="n">child</span> <span class="o">+</span> <span class="p">(</span><span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">62</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First an atomic load retrieves the current node. If there is no such node,
then attempt to insert one using atomic compare-and-exchange. The <a href="/blog/2014/09/02/">ABA
problem</a> is not an issue thanks again to lack of deletion: Once set,
a pointer never changes. Before allocating a node, take a snapshot of the
arena so that the allocation can be reverted on failure. If another thread
got there first, continue tumbling down the tree <em>as though a null was
never observed</em>.</p>

<p>On compare-and-swap failure, it turns into an acquire load, just as it
began. On success, it’s a release store, synchronizing with acquire loads
on other threads.</p>

<p>The <code class="language-plaintext highlighter-rouge">key</code> field does not require atomics because it’s synchronized by the
compare-and-swap. That is, the assignment will happen before the node is
inserted, and keys do not change after insertion. The same goes for any
zeroing done by the arena.</p>

<p><strong>Loads and stores through the returned pointer are the caller’s
responsibility.</strong> These likely require further synchronization. If
<code class="language-plaintext highlighter-rouge">valtype</code> is a shared counter then an atomic increment is sufficient. In
other cases, <code class="language-plaintext highlighter-rouge">upsert</code> should probably be modified to accept an initial
value to be assigned alongside the key so that the entire key/value pair is
inserted atomically. Alternatively, <a href="/blog/2022/05/14/">break it into two steps</a>. The
details depend on the needs of the program.</p>

<p>On small trees there will be much contention near the root during
inserts. Fortunately, a contentious tree will not stay small for long! The
hash function will spread threads around a large tree, generally keeping
them off each other’s toes.</p>

<p>A complete demo you can try yourself: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/concurrent-hash-trie.c"><code class="language-plaintext highlighter-rouge">concurrent-hash-trie.c</code></a></strong>.
It returns a value pointer like above, and store/load is synchronized by
the thread join. Each thread is given a per-thread subarena allocated out
of the main arena, and the final tree is built from these subarenas.</p>

<p>For a practical example: a <a href="https://github.com/skeeto/scratch/blob/master/misc/rainbow.c"><strong>multithreaded rainbow table</strong></a> to find
hash function collisions. Threads are synchronized solely through atomics
in the shared hash-trie.</p>

<p>A complete, fast, concurrent, lock-free hash map in under 30 lines of C
sounds like a sweet deal to me!</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Solving "Two Sum" in C with a tiny hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/06/26/"/>
    <id>urn:uuid:5d15318f-6915-4f72-8690-74a84d43d2f7</id>
    <updated>2023-06-26T19:38:18Z</updated>
    <category term="c"/><category term="go"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I came across a question: How does one efficiently solve <a href="https://leetcode.com/problems/two-sum/">Two Sum</a> in C?
There’s a naive quadratic time solution, but also an amortized linear time
solution using a hash table. Without a built-in or standard library hash
table, the latter sounds onerous. However, a <a href="/blog/2022/08/08/">mask-step-index table</a>,
a hash table construction suitable for many problems, requires only a few
lines of code. This approach is useful even when a standard hash table is
available, because by <a href="https://vimeo.com/644068002">exploiting the known problem constraints</a>, it
beats typical generic hash table performance by an order of magnitude
(<a href="https://gist.github.com/skeeto/7119cf683662deae717c0d4e79ebf605">demo</a>).</p>

<p>The Two Sum exercise, restated:</p>

<blockquote>
  <p>Given an integer array and target, return the distinct indices of two
elements that sum to the target.</p>
</blockquote>

<p>In particular, the solution doesn’t find elements, but their indices. The
exercise also constrains input ranges — important but easy to overlook:</p>

<ul>
  <li>2 &lt;= <code class="language-plaintext highlighter-rouge">count</code> &lt;= 10<sup>4</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">nums[i]</code> &lt;= 10<sup>9</sup></li>
  <li>-10<sup>9</sup> &lt;= <code class="language-plaintext highlighter-rouge">target</code> &lt;= 10<sup>9</sup></li>
</ul>

<p>Notably, indices fit in a 16-bit integer with lots of room to spare. In
fact, they fit in a 14-bit address space (16,384) with still plenty of
headroom. Elements fit in a signed 32-bit integer, and we can add and
subtract elements without overflow, if just barely. The last constraint
isn’t redundant, but it’s not readily exploitable either.</p>

<p>The naive solution is to linearly search the array for the complement.
With nested loops, it’s obviously quadratic time. At 10k elements, we
expect an abysmal 25M comparisons on average.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">count</span> <span class="o">=</span> <span class="p">...;</span>
<span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span> <span class="o">=</span> <span class="p">...;</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">target</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// found</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nums</code> array is “keyed” by index. It would be better to also have the
inverse mapping: key on elements to obtain the <code class="language-plaintext highlighter-rouge">nums</code> index. Then for each
element we could compute the complement and find its index, if any, using
this second mapping.</p>

<p>The input range is finite, so an inverse map is simple. Allocate an array,
one element per integer in range, and store the index there. However, the
input range is 2 billion, and even with 16-bit indices that’s a 4GB array.
Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed
to make it so. This array would be very sparse, with less than half a
percent of its elements populated. That’s a hint: Associative arrays are
far more appropriate for representing such sparse mappings. That is, a
hash table.</p>

<p>Using Go’s built-in hash table:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithMap</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">seen</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="kt">int32</span><span class="p">]</span><span class="kt">int16</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="k">if</span> <span class="n">j</span><span class="p">,</span> <span class="n">ok</span> <span class="o">:=</span> <span class="n">seen</span><span class="p">[</span><span class="n">complement</span><span class="p">];</span> <span class="n">ok</span> <span class="p">{</span>
            <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">j</span><span class="p">),</span> <span class="n">i</span><span class="p">,</span> <span class="no">true</span>
        <span class="p">}</span>
        <span class="n">seen</span><span class="p">[</span><span class="n">num</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In essence, the hash table folds the sparse 2 billion element array onto a
smaller array, with collision resolution when elements inevitably land in
the same slot. For this exercise, that small array could be as small as
10,000 elements because that’s the most we’d ever need to track. For
folding the large key space onto the smaller, we could use modulo. For
collision resolution, we could keep walking the table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">10000</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

<span class="c1">// Find or insert nums[index].</span>
<span class="kt">int16_t</span> <span class="nf">lookup</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// empty slot</span>
            <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// insert biased index</span>
            <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">nums</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">j</span><span class="p">;</span>  <span class="c1">// match found</span>
        <span class="p">}</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10000</span><span class="p">;</span>  <span class="c1">// keep looking</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Take note of a few details:</p>

<ol>
  <li>
    <p>An empty slot is zero, and an empty table is a zero-initialized array.
Since zero is a valid index, and all indices are non-negative, indices are
stored biased by 1 in the table.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">nums</code> array is part of the table structure, necessary for lookups.
<strong>The two mappings — element-by-index and index-by-element — share
structure.</strong></p>
  </li>
  <li>
    <p>It uses <em>open addressing</em> with <em>linear probing</em>, and so walks the table
until it either finds the element or hits an empty slot.</p>
  </li>
  <li>
    <p>The “hash” function is modulo. If inputs are not random, they’ll tend
to bunch up in the table. Combined with linear probing, that makes for lots
of collisions. For the worst case, imagine sequentially ordered inputs.</p>
  </li>
  <li>
    <p>Sometimes the table will almost completely fill, and lookups will be no
better than the linear scans of the naive solution.</p>
  </li>
  <li>
    <p>Most subtle of all: This hash table is not enough for the exercise. The
keyed-on element may not even be in <code class="language-plaintext highlighter-rouge">nums</code>, and when lookup fails, that
element is not inserted in the table. Instead, a different element is
inserted. The conventional solution has at least two hash table
lookups. <strong>In the Go code, it’s <code class="language-plaintext highlighter-rouge">seen[complement]</code> for lookups and
<code class="language-plaintext highlighter-rouge">seen[num]</code> for inserts.</strong></p>
  </li>
</ol>

<p>To solve (4) we’ll use a hash function to more uniformly distribute
elements in the table. We’ll also probe the table in a random-ish order
that depends on the key. In practice there will be little bunching even
for non-random inputs.</p>

<p>To solve (5) we’ll use a larger table: 2<sup>14</sup> or 16,384 elements.
This has breathing room, and with a power of two we can use a fast mask
instead of a slow division (though in practice, compilers usually
replace division by a constant with a multiply-and-shift sequence anyway).</p>

<p>To solve (6) we’ll map each element and its complement to the same key. It looks
for the complement, but on failure it inserts the current element in the
empty slot. In other words, <strong>this solution will only need a single hash
table lookup per element!</strong></p>

<p>Laying down some groundwork:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int16_t</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">;</span>
    <span class="kt">_Bool</span> <span class="n">ok</span><span class="p">;</span>
<span class="p">}</span> <span class="n">TwoSum</span><span class="p">;</span>

<span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">seen</code> array is a 32KiB hash table large enough for all inputs, small
enough that it can be a local variable. In the loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>Compute the complement, then apply a “max” operation to derive a key. Any
commutative operation works, though obviously addition would be a poor
choice. XOR is similar enough to cause many collisions. Multiplication
works well, and is probably better if the ternary produces a branch.</p>

<p>The hash function is multiplication with <a href="/blog/2019/11/19/">a randomly-chosen prime</a>.
As we’ll see in a moment, <code class="language-plaintext highlighter-rouge">step</code> will also add-shift the hash before use.
The initial index will be the bottom 14 bits of this hash. For <code class="language-plaintext highlighter-rouge">step</code>,
recall from the MSI article that it must be odd so that every slot is
eventually probed. I shift out 13 bits and then override the 14th bit, so
<code class="language-plaintext highlighter-rouge">step</code> effectively skips over the 14 bits used for the initial table
index.</p>

<p>I used <code class="language-plaintext highlighter-rouge">unsigned</code> because I don’t really care about the width of the hash
table index, but more importantly, I want defined overflow from all the
bit twiddling, even in the face of implicit promotion. As a bonus, it can
help in reasoning about indirection: <code class="language-plaintext highlighter-rouge">seen</code> indices are <code class="language-plaintext highlighter-rouge">unsigned</code>, <code class="language-plaintext highlighter-rouge">nums</code>
indices are <code class="language-plaintext highlighter-rouge">int16_t</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
</code></pre></div></div>

<p>The step is added before using the index the first time, helping to
scatter the start point and reduce collisions. If it’s an empty slot,
insert the <em>current</em> element, not the complement — which wouldn’t be
possible anyway. Unlike conventional solutions, this doesn’t require
another hash and lookup. If it finds the complement, problem solved,
otherwise keep going.</p>

<p>Putting it all together, it’s only slightly longer than solutions using a
generic hash table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TwoSum</span> <span class="nf">twosum</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="n">nums</span><span class="p">,</span> <span class="kt">int16_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">target</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">TwoSum</span> <span class="n">r</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="kt">int16_t</span> <span class="n">seen</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int16_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">n</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">complement</span> <span class="o">=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">int32_t</span> <span class="n">key</span> <span class="o">=</span> <span class="n">complement</span><span class="o">&gt;</span><span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">?</span> <span class="n">complement</span> <span class="o">:</span> <span class="n">nums</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">key</span> <span class="o">*</span> <span class="mi">489183053u</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">mask</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="mi">13</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
            <span class="kt">int16_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// unbias</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">j</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// bias and insert</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">r</span><span class="p">.</span><span class="n">i</span> <span class="o">=</span> <span class="n">j</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">j</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
                <span class="n">r</span><span class="p">.</span><span class="n">ok</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Applying this technique to Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TwoSumWithBespoke</span><span class="p">(</span><span class="n">nums</span> <span class="p">[]</span><span class="kt">int32</span><span class="p">,</span> <span class="n">target</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">bool</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">var</span> <span class="n">seen</span> <span class="p">[</span><span class="m">1</span> <span class="o">&lt;&lt;</span> <span class="m">14</span><span class="p">]</span><span class="kt">int16</span>
    <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">num</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">nums</span> <span class="p">{</span>
        <span class="n">complement</span> <span class="o">:=</span> <span class="n">target</span> <span class="o">-</span> <span class="n">num</span>
        <span class="n">hash</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">num</span> <span class="o">*</span> <span class="n">complement</span> <span class="o">*</span> <span class="m">489183053</span><span class="p">)</span>
        <span class="n">mask</span> <span class="o">:=</span> <span class="nb">len</span><span class="p">(</span><span class="n">seen</span><span class="p">)</span> <span class="o">-</span> <span class="m">1</span>
        <span class="n">step</span> <span class="o">:=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="m">13</span> <span class="o">|</span> <span class="m">1</span>
        <span class="k">for</span> <span class="n">i</span> <span class="o">:=</span> <span class="n">hash</span><span class="p">;</span> <span class="p">;</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span>
            <span class="n">j</span> <span class="o">:=</span> <span class="kt">int</span><span class="p">(</span><span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="c">// unbias</span>
            <span class="k">if</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="m">0</span> <span class="p">{</span>
                <span class="n">seen</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="kt">int16</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">+</span> <span class="m">1</span> <span class="c">// bias</span>
                <span class="k">break</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">nums</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="n">complement</span> <span class="p">{</span>
                <span class="k">return</span> <span class="n">j</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="no">true</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="no">false</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With Go 1.20 this is an order of magnitude faster than <code class="language-plaintext highlighter-rouge">map[int32]int16</code>,
which isn’t surprising. I used multiplication as the key operator because,
in my first take, Go produced a branch for the “max” operation — at a 25%
performance penalty on random inputs.</p>

<p>A full-featured, generic hash table may be overkill for your problem, and
a bit of hashed indexing with collision resolution over a small array
might be sufficient. The problem constraints might open up such shortcuts.</p>

]]>
    </content>
  </entry>

  <entry>
    <title>Practical libc-free threading on Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2023/03/23/"/>
    <id>urn:uuid:631a8107-2eef-420b-9594-752e6f013048</id>
    <updated>2023-03-23T05:32:41Z</updated>
    <category term="c"/><category term="optimization"/><category term="linux"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Suppose you’re <a href="/blog/2023/02/15/">not using a C runtime</a> on Linux, and instead you’re
programming against its system call API. It’s long-term and stable after
all. <a href="https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator">Memory management</a> and <a href="/blog/2023/02/13/">buffered I/O</a> are easily
solved, but a lot of software benefits from concurrency. It would be nice
to also have thread spawning capability. This article will demonstrate a
simple, practical, and robust approach to spawning and managing threads
using only raw system calls. It only takes about a dozen lines of C,
including a few inline assembly instructions.</p>

<p>The catch is that there’s no way to avoid using a bit of assembly. Neither
the <code class="language-plaintext highlighter-rouge">clone</code> nor <code class="language-plaintext highlighter-rouge">clone3</code> system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The <a href="https://man7.org/linux/man-pages/man2/clone.2.html">glibc <code class="language-plaintext highlighter-rouge">clone(2)</code>
wrapper</a> fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.</p>

<p>Before diving in, the complete, working demo: <a href="https://github.com/skeeto/scratch/blob/master/misc/stack_head.c"><strong><code class="language-plaintext highlighter-rouge">stack_head.c</code></strong></a></p>

<h3 id="the-clone-system-call">The clone system call</h3>

<p>On Linux, threads are spawned using the <code class="language-plaintext highlighter-rouge">clone</code> system call with semantics
like the classic unix <code class="language-plaintext highlighter-rouge">fork(2)</code>. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">clone</span><span class="p">(</span><span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">);</span>
</code></pre></div></div>

<p>Sounds kind of elegant, but it has an annoying problem: The new thread
begins life in the <em>middle</em> of a function without any established stack
frame. Its stack is a blank slate. It’s not ready to do anything except
jump to a function prologue that will set up a stack frame. So besides the
assembly for the system call itself, it also needs more assembly to get
the thread into a C-compatible state. In other words, <strong>a generic system
call wrapper cannot reliably spawn threads</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">brokenclone</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">threadentry</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="kt">long</span> <span class="n">r</span> <span class="o">=</span> <span class="n">syscall</span><span class="p">(</span><span class="n">SYS_clone</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">stack</span><span class="p">);</span>
    <span class="c1">// DANGER: new thread may access non-existent stack frame here</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">threadentry</span><span class="p">(</span><span class="n">arg</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For odd historical reasons, each architecture’s <code class="language-plaintext highlighter-rouge">clone</code> has a slightly
different interface. The newer <code class="language-plaintext highlighter-rouge">clone3</code> unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.</p>
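<p>For perspective, here is a sketch of that <code class="language-plaintext highlighter-rouge">clone3</code> interface, using the original (version 0) fields of <code class="language-plaintext highlighter-rouge">struct clone_args</code> from <code class="language-plaintext highlighter-rouge">linux/sched.h</code>. Even though the caller passes the stack explicitly, the child still wakes up mid-function on a blank stack:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: clone3 takes a versioned, extensible argument struct.
struct clone_args {
    unsigned long long flags;
    unsigned long long pidfd;
    unsigned long long child_tid;
    unsigned long long parent_tid;
    unsigned long long exit_signal;
    unsigned long long stack;       // lowest address of the new stack
    unsigned long long stack_size;
    unsigned long long tls;
};
// long r = syscall(SYS_clone3, &amp;args, sizeof(args));
// As with clone, r is zero in the child, which has no stack frame.
</code></pre></div></div>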

<h3 id="the-stack-header">The stack “header”</h3>

<p>I <a href="/blog/2015/05/15/">figured out a neat trick eight years ago</a> which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the <code class="language-plaintext highlighter-rouge">ret</code> instruction
do the work.</p>

<p>Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like to call the <code class="language-plaintext highlighter-rouge">stack_head</code>,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own <code class="language-plaintext highlighter-rouge">stack_head</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>The structure must have 16-byte alignment on all architectures. I used an
attribute to keep this straight, which also helps when using <code class="language-plaintext highlighter-rouge">sizeof</code>
to place the structure, as I’ll demonstrate later.</p>

<p>Now for the cool part: The <code class="language-plaintext highlighter-rouge">...</code> can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">__attribute</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">16</span><span class="p">)))</span> <span class="n">stack_head</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">entry</span><span class="p">)(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">join_futex</span><span class="p">;</span>
    <span class="c1">// ...</span>
<span class="err">}</span><span class="p">;</span>
</code></pre></div></div>

<p>More details on that futex shortly.</p>

<h3 id="the-clone-wrapper">The clone wrapper</h3>

<p>I call the <code class="language-plaintext highlighter-rouge">clone</code> wrapper <code class="language-plaintext highlighter-rouge">newthread</code>. It has the inline assembly for the
system call, and since it includes a <code class="language-plaintext highlighter-rouge">ret</code> to diverge the threads, it’s a
“naked” function <a href="/blog/2023/02/12/">just like with <code class="language-plaintext highlighter-rouge">setjmp</code></a>. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like <code class="language-plaintext highlighter-rouge">clone</code>, it doesn’t accept a thread
entry point. Instead it accepts a <code class="language-plaintext highlighter-rouge">stack_head</code> seeded with the entry
point. The whole wrapper is just six instructions:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="kr">naked</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">long</span> <span class="nf">newthread</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"mov  %%rdi, %%rsi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// arg2 = stack</span>
        <span class="s">"mov  $0x50f00, %%edi</span><span class="se">\n</span><span class="s">"</span>  <span class="c1">// arg1 = clone flags</span>
        <span class="s">"mov  $56, %%eax</span><span class="se">\n</span><span class="s">"</span>       <span class="c1">// SYS_clone</span>
        <span class="s">"syscall</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"mov  %%rsp, %%rdi</span><span class="se">\n</span><span class="s">"</span>     <span class="c1">// entry point argument</span>
        <span class="s">"ret</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"rax"</span><span class="p">,</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"rsi"</span><span class="p">,</span> <span class="s">"rdi"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x86-64, both function calls and system calls use <code class="language-plaintext highlighter-rouge">rdi</code> and <code class="language-plaintext highlighter-rouge">rsi</code> for
their first two parameters. Per the reference <code class="language-plaintext highlighter-rouge">clone(2)</code> prototype above:
the first system call argument is <code class="language-plaintext highlighter-rouge">flags</code> and the second argument is the
new <code class="language-plaintext highlighter-rouge">stack</code>, which will point directly at the <code class="language-plaintext highlighter-rouge">stack_head</code>. However, the
stack pointer arrives in <code class="language-plaintext highlighter-rouge">rdi</code>. So I copy <code class="language-plaintext highlighter-rouge">stack</code> into the second argument
register, <code class="language-plaintext highlighter-rouge">rsi</code>, then load the flags (<code class="language-plaintext highlighter-rouge">0x50f00</code>) into the first argument
register, <code class="language-plaintext highlighter-rouge">rdi</code>. The system call number goes in <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<p>Where does that <code class="language-plaintext highlighter-rouge">0x50f00</code> come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">long</span> <span class="n">flags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FILES</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_FS</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SIGHAND</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_SYSVSEM</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_THREAD</span><span class="p">;</span>
    <span class="n">flags</span> <span class="o">|=</span> <span class="n">CLONE_VM</span><span class="p">;</span>
</code></pre></div></div>
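<p>As a sanity check, plugging in the standard flag values from <code class="language-plaintext highlighter-rouge">linux/sched.h</code> confirms the sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define CLONE_VM      0x00000100  // share address space
#define CLONE_FS      0x00000200  // share cwd, root, umask
#define CLONE_FILES   0x00000400  // share file descriptor table
#define CLONE_SIGHAND 0x00000800  // share signal handlers
#define CLONE_THREAD  0x00010000  // join the parent's thread group
#define CLONE_SYSVSEM 0x00040000  // share System V semaphores
// 0x00f00 | 0x50000 == 0x50f00
</code></pre></div></div>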

<p>When the system call returns, it copies the stack pointer into <code class="language-plaintext highlighter-rouge">rdi</code>, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as <code class="language-plaintext highlighter-rouge">stack</code>, of course. In the old thread this is a
harmless no-op because <code class="language-plaintext highlighter-rouge">rdi</code> is a volatile register in this ABI. Finally,
<code class="language-plaintext highlighter-rouge">ret</code> pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (<a href="/blog/2016/09/23/">negative errno</a>) or the new thread ID. In the new thread
<strong>it pops the first element of <code class="language-plaintext highlighter-rouge">stack_head</code></strong> which, of course, is the
entry point. That’s why it must be first!</p>

<p>The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the <code class="language-plaintext highlighter-rouge">exit</code> (<em>not</em> <code class="language-plaintext highlighter-rouge">exit_group</code>)
system call to terminate itself.</p>

<h3 id="caller-point-of-view">Caller point of view</h3>

<p>The caller side looks something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">threadentry</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ... do work ...</span>
    <span class="n">__atomic_store_n</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">__ATOMIC_SEQ_CST</span><span class="p">);</span>
    <span class="n">futex_wake</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">);</span>
    <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">force_align_arg_pointer</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="n">stack</span> <span class="o">=</span> <span class="n">newstack</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">);</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">entry</span> <span class="o">=</span> <span class="n">threadentry</span><span class="p">;</span>
    <span class="c1">// ... assign other thread data ...</span>
    <span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">newthread</span><span class="p">(</span><span class="n">stack</span><span class="p">);</span>

    <span class="c1">// ... do work ...</span>

    <span class="n">futex_wait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">stack</span><span class="o">-&gt;</span><span class="n">join_futex</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Despite the minimalist, 6-instruction clone wrapper, this is taking the
shape of a conventional threading API. It would only take a bit more to
hide the futex, too. Speaking of which, what’s going on there? The <a href="/blog/2022/10/05/">same
principle as a WaitGroup</a>. The futex, an integer, is zero-initialized,
indicating the thread is running (“not done”). The joiner tells the kernel
to wait until the integer is non-zero, which it may already be since I
don’t bother to check first. When the child thread is done, it atomically
sets the futex to non-zero and wakes all waiters, which might be nobody.</p>

<p>Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its <code class="language-plaintext highlighter-rouge">SIGCHLD</code> (or use <code class="language-plaintext highlighter-rouge">CLONE_CHILD_CLEARTID</code>). If this
sounds like a problem, consider <a href="https://vimeo.com/644068002">your context</a> more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.</p>

<p>What’s with the <code class="language-plaintext highlighter-rouge">force_align_arg_pointer</code>? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. Processes
begin life with an unaligned stack. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, <a href="/blog/2023/02/15/#stack-alignment-on-32-bit-x86">just like on Windows</a>.
If you want to access <code class="language-plaintext highlighter-rouge">argc</code>, <code class="language-plaintext highlighter-rouge">argv</code>, and <code class="language-plaintext highlighter-rouge">envp</code> you’ll need <a href="/blog/2022/02/18/">more
assembly</a>. (I wish doing <em>really basic things</em> without libc on Linux
didn’t require so much assembly.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__asm</span> <span class="p">(</span>
    <span class="s">".global _start</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"_start:</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  (%rsp), %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsp), %rsi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   lea   8(%rsi,%rdi,8), %rdx</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   call  main</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  %eax, %edi</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   movl  $60, %eax</span><span class="se">\n</span><span class="s">"</span>
    <span class="s">"   syscall</span><span class="se">\n</span><span class="s">"</span>
<span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">envp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Getting back to the example usage, it has some regular-looking system call
wrappers. Where do those come from? Start with this 6-argument generic
system call wrapper.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">syscall6</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="n">b</span><span class="p">,</span> <span class="kt">long</span> <span class="n">c</span><span class="p">,</span> <span class="kt">long</span> <span class="n">d</span><span class="p">,</span> <span class="kt">long</span> <span class="n">e</span><span class="p">,</span> <span class="kt">long</span> <span class="n">f</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r10</span> <span class="n">asm</span><span class="p">(</span><span class="s">"r10"</span><span class="p">)</span> <span class="o">=</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r8</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r8"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="k">register</span> <span class="kt">long</span> <span class="n">r9</span>  <span class="n">asm</span><span class="p">(</span><span class="s">"r9"</span><span class="p">)</span>  <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"S"</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="s">"d"</span><span class="p">(</span><span class="n">c</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r10</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r8</span><span class="p">),</span> <span class="s">"r"</span><span class="p">(</span><span class="n">r9</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"rcx"</span><span class="p">,</span> <span class="s">"r11"</span><span class="p">,</span> <span class="s">"memory"</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I could define <code class="language-plaintext highlighter-rouge">syscall5</code>, <code class="language-plaintext highlighter-rouge">syscall4</code>, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define SYSCALL1(n, a) \
    syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
    syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
    syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
    syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
</span></code></pre></div></div>

<p>Now we can have some exits:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>

<span class="n">__attribute</span><span class="p">((</span><span class="n">noreturn</span><span class="p">))</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">exit_group</span><span class="p">(</span><span class="kt">int</span> <span class="n">status</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL1</span><span class="p">(</span><span class="n">SYS_exit_group</span><span class="p">,</span> <span class="n">status</span><span class="p">);</span>
    <span class="n">__builtin_unreachable</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Simplified futex wrappers:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">expect</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL4</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAIT</span><span class="p">,</span> <span class="n">expect</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">futex_wake</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">futex</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">SYSCALL3</span><span class="p">(</span><span class="n">SYS_futex</span><span class="p">,</span> <span class="n">futex</span><span class="p">,</span> <span class="n">FUTEX_WAKE</span><span class="p">,</span> <span class="mh">0x7fffffff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And so on.</p>
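<p>These wrappers lean on a handful of constants that, without libc, must be defined by hand. For x86-64 they are (from <code class="language-plaintext highlighter-rouge">asm/unistd_64.h</code> and <code class="language-plaintext highlighter-rouge">linux/futex.h</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// x86-64 system call numbers
#define SYS_mmap         9
#define SYS_clone       56
#define SYS_exit        60
#define SYS_futex      202
#define SYS_exit_group 231

// futex operations
#define FUTEX_WAIT       0
#define FUTEX_WAKE       1
</code></pre></div></div>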

<p>Finally I can talk about that <code class="language-plaintext highlighter-rouge">newstack</code> function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)</p>
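<p>Decoded, the hardcoded <code class="language-plaintext highlighter-rouge">3</code> and <code class="language-plaintext highlighter-rouge">0x22</code> in the mmap call below are just the usual protection and mapping flags:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define PROT_READ     0x01
#define PROT_WRITE    0x02
#define MAP_PRIVATE   0x02
#define MAP_ANONYMOUS 0x20
// prot  = PROT_READ|PROT_WRITE      ==    3
// flags = MAP_PRIVATE|MAP_ANONYMOUS == 0x22
</code></pre></div></div>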

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="nf">newstack</span><span class="p">(</span><span class="kt">long</span> <span class="n">size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">p</span> <span class="o">=</span> <span class="n">SYSCALL6</span><span class="p">(</span><span class="n">SYS_mmap</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x22</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;</span> <span class="o">-</span><span class="mi">4096UL</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="kt">long</span> <span class="n">count</span> <span class="o">=</span> <span class="n">size</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span><span class="p">);</span>
    <span class="k">return</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack_head</span> <span class="o">*</span><span class="p">)</span><span class="n">p</span> <span class="o">+</span> <span class="n">count</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">aligned</code> attribute comes into play here: I treat the result like an
array of <code class="language-plaintext highlighter-rouge">stack_head</code> and return the last element. The attribute ensures
each individual element is aligned.</p>

<p>That’s it! There’s not much to it other than a few thoughtful assembly
instructions. It took doing this a few times in a few different programs
before I noticed how simple it can be.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I solved the Dandelions paper-and-pencil game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/12/"/>
    <id>urn:uuid:14edf491-dcdd-4c2f-a75f-5e89838e6b40</id>
    <updated>2022-10-12T03:02:27Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve been reading <a href="https://mathwithbaddrawings.com/2022/01/19/math-games-with-bad-drawings-2/"><em>Math Games with Bad Drawings</em></a>, a great book
well-aligned to my interests. It’s given me a lot of new, interesting
programming puzzles to consider. The first to truly nerd snipe me was
<a href="https://mathwithbaddrawings.com/dandelions/">Dandelions</a> (<a href="https://mathwithbaddrawings.com/wp-content/uploads/2020/06/game-5-dandelions-1.pdf">full rules</a>), an asymmetric paper-and-pencil game
invented by the book’s author, Ben Orlin. Just as with <a href="/blog/2020/10/19/">British Square two
years ago</a> — and essentially following the same technique — I wrote a
program that explores the game tree sufficiently to play either side
perfectly, “solving” the game in its standard 5-by-5 configuration.</p>

<p>The source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/dandelions.c"><code class="language-plaintext highlighter-rouge">dandelions.c</code></a></strong></p>

<p>The game is played on a 5-by-5 grid where one player plays the dandelions,
the other plays the wind. Players alternate, dandelions placing flowers
and wind blowing in one of the eight directions, spreading seeds from all
flowers along the direction of the wind. Each side gets seven moves, and
the wind cannot blow in the same direction twice. The dandelions’ goal is
to fill the grid with seeds, and the wind’s goal is to prevent this.</p>

<p>Try playing a few rounds with a friend, and you will probably find that
dandelions is difficult, at least in your first games, as though it cannot
be won. However, my engine proves the opposite: <strong>The dandelions always
win with perfect play.</strong> In fact, it’s so lopsided that the dandelions’
first move is irrelevant. Every first move is winnable. If the dandelions
blunder, typically wind has one narrow chance to seize control, after
which wind probably wins with any (or almost any) move.</p>

<p>For reasons I’ll discuss later, I only solved the 5-by-5 game, and the
situation may be different for the 6-by-6 variant. Also, unlike British
Square, my engine does not exhaustively explore the entire game tree
because it’s far too large. Instead it does a minimax search to the bottom
of the tree and stops when it finds a branch where all leaves are wins for
the current player. Because of this, it cannot maximize the outcome —
winning as early as possible as dandelions or maximizing the number of
empty grid spaces as wind. I also can’t quantify the exact size of the tree.</p>

<p>Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> prompt.</p>

<h3 id="bitboard-seeding">Bitboard seeding</h3>

<p>I used <a href="https://www.chessprogramming.org/Bitboards">bitboards</a> of course: a 25-bit bitboard for flowers, a 25-bit
bitboard for seeds, and an 8-bit set to track which directions the wind
has blown. It’s especially well-suited for this game since seeds can be
spread in parallel using bitwise operations. Shift the flower bitboard in
the direction of the wind four times, ORing it into the seeds bitboard
on each shift:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int wind;
uint32_t seeds, flowers;

flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
</code></pre></div></div>

<p>Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code> below). Consider, to rotate an N-bit
integer <em>left</em> by R, one can <em>right</em>-rotate it by <code class="language-plaintext highlighter-rouge">N-R</code> — ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative <code class="language-plaintext highlighter-rouge">wind</code> that goes in the other direction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>flowers &gt;&gt;&gt; (wind &amp; 31);
</code></pre></div></div>

<p>With such a “programmable shift” I can implement the bulk of the game
rules using a couple of tables and no branches:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
</code></pre></div></div>

<p>The masks clear out the column/row about to be shifted “out” so that it
doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5
times.</p>

<h3 id="bitboard-packing-and-canonicalization">Bitboard packing and canonicalization</h3>

<p>The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58
bits, which fits in a 64-bit integer with bits to spare. How incredibly
convenient! So I represent the game state using a 64-bit integer, using a
packing like I did with British Square. The bottom 25 bits are the seeds,
the next 25 bits are the flowers, and the next 8 bits are the wind set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
</code></pre></div></div>

<p>Even more convenient, I could reuse my bitboard canonicalization code from
British Square, also a 5-by-5 grid packed in the same way, saving me the
trouble of working out all the bit sieves. I only had to figure out how to
transpose and flip the wind bitset. Turns out that’s pretty easy, too.
Here’s how I represent the 8 wind directions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>567
4 0
321
</code></pre></div></div>

<p>Flipping this vertically I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>321
4 0
567
</code></pre></div></div>

<p>Unroll these to show how old maps onto new:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old: 01234567
new: 07654321
</code></pre></div></div>

<p>The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See <code class="language-plaintext highlighter-rouge">revrot</code>.)</p>

<p>To determine how many moves have been made, popcount the flower bitboard
and wind bitset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int moves = POPCOUNT64(g &amp; 0x3fffffffe000000);
</code></pre></div></div>

<p>To test if dandelions have won:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int win = (g&amp;0x1ffffff) == 0x1ffffff;
</code></pre></div></div>

<p>Since the plan is to store all the game states in a big hash table — an
<a href="/blog/2022/08/08/">MSI double hash</a> in this case — I’d like to reserve the zero value
as a “null” board state. This lets me zero-initialize the hash table. To
do this, I invert the wind bitset such that a 1 indicates the direction is
still available. So the initial game state looks like this (in the real
program this is accounted for in the previously-discussed turn popcount):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define GAME_INIT ((uint64_t)255 &lt;&lt; 50)
</span></code></pre></div></div>

<p>The remaining 6 bits can be used to cache information about the rest of
tree under this game state, namely who wins from this position, and this
serves as the “value” in the hash table. Turns out the bitboards are
already noisy enough that a <a href="/blog/2018/07/31/">single xorshift</a> makes for a great hash
function. The hash table, including hash function, is under a dozen lines
of code.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Find the hash table slot for the given game state.</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">g</span> <span class="o">^</span> <span class="n">g</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1L</span> <span class="o">&lt;&lt;</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="mh">0x3ffffffffffffff</span><span class="p">)</span> <span class="o">==</span> <span class="n">g</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">ht</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To explore a 6-by-6 grid I’d need to change my representation, which is
part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit
integer, so I’d need to double my storage requirements, which are already
strained.</p>

<h3 id="computational-limitations">Computational limitations</h3>

<p>Due to the way seeds spread, game states resulting from different moves
rarely converge back to a common state later in the tree, so the hash
table isn’t doing much deduplication. Exhaustively exploring the entire
game tree, even cutting it down to an 8th using canonicalization, requires
substantial computing resources, more than I personally have available for
this project. So I had to stop at the slightly weaker form, find a winning
branch rather than maximizing a “score.”</p>

<p>I configure the program to allocate 2GiB for the hash table, but if you
run just a few dozen games off the same table (same program instance),
each exploring different parts of the game tree, you’ll exhaust this
table. A 6-by-6 doubles the memory requirements just to represent the
game, but it also slows the search and substantially increases the width
of the tree, which grows 44% faster. I’m sure it can be done, but it’s
just beyond the resources available to me.</p>

<h3 id="dandelion-puzzles">Dandelion Puzzles</h3>

<p>As a side effect, I wrote a small routine to randomly play out games in
search of “mate-in-two”-style puzzles. The dandelions have two flowers to
place and can force a win with two specific placements — and only those
two placements — regardless of how the wind blows. Here are two of the
better ones, each involving a small trick that I won’t give away here
(note: arrowheads indicate directions wind can still blow):</p>

<p><img src="/img/dandelions/puzzle1.svg" alt="" /></p>

<p><img src="/img/dandelions/puzzle2.svg" alt="" /></p>

<p>There are a variety of potential single-player puzzles of this form.</p>

<ul>
  <li>Cooperative: place a dandelion <em>and</em> pick the wind direction</li>
  <li>Avoidance: <em>don’t</em> seed a particular tile</li>
  <li>Hard ground: certain tiles can’t grow flowers (but still get seeded)</li>
  <li>Weeding: as wind, figure out which flower to remove before blowing</li>
</ul>

<p>There could be a whole “crossword book” of such dandelion puzzles.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>The quick and practical "MSI" hash table</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/08/08/"/>
    <id>urn:uuid:4a7d8c3d-3bcf-4b10-b50a-64227c02b254</id>
    <updated>2022-08-08T23:57:08Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Follow-up: <a href="/blog/2023/06/26/">Solving “Two Sum” in C with a tiny hash table</a></em></p>

<p>I <a href="https://skeeto.s3.amazonaws.com/share/onward17-essays2.pdf">generally prefer C</a>, so I’m accustomed to building whatever I need
on the fly, such as heaps, <a href="/blog/2022/05/22/#inverting-the-tree-links">linked lists</a>, and especially hash
tables. Few programs use more than a small subset of a data structure’s
features, making their implementation smaller, simpler, and <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">more
efficient</a> than the general case, which must handle every edge
case. A typical hash table tutorial will describe a relatively lengthy
program, but in practice, bespoke hash tables are <a href="/blog/2020/10/19/#hash-table-memoization">only a few lines of
code</a>. Over the years I’ve worked out some basic principles for hash
table construction that aid in quick and efficient implementation. This
article covers the technique and philosophy behind what I’ve come to call
the “mask-step-index” (MSI) hash table, which is my standard approach.</p>

<!--more-->

<p>MSI hash tables are nothing novel, just a <a href="https://en.wikipedia.org/wiki/Double_hashing">double hashed</a>, <a href="https://en.wikipedia.org/wiki/Open_addressing">open
address</a> hash table layered generically atop an external array. It’s
best regarded as a kind of database index — <em>a lookup index over an
existing array</em>. The array exists independently, and the hash table
provides an efficient lookup into that array over some property of its
entries.</p>

<p>The core of the MSI hash table is this iterator function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Compute the next candidate index. Initialize idx to the hash.</span>
<span class="kt">int32_t</span> <span class="nf">ht_lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">hash</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">idx</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">step</span> <span class="o">=</span> <span class="p">(</span><span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">exp</span><span class="p">))</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">idx</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The <code class="language-plaintext highlighter-rouge">exp</code> parameter is a power-of-two exponent for the hash table
size, <a href="/blog/2022/05/14/">which may look familiar</a>. I’ve used <code class="language-plaintext highlighter-rouge">int32_t</code> for the index,
but it’s easy to substitute, say, <code class="language-plaintext highlighter-rouge">size_t</code>. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and I use a signed
type since <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf">subscripts should be signed</a>. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.</p>

<p>It’s the caller’s responsibility to compute the hash, and the MSI iterator
tells the caller <em>where to look next</em>. For insertion, the caller (maybe)
looks either for an existing entry to override, or an empty slot. For
lookup, the caller looks for a matching entry, giving up as soon as it
finds an empty slot. An insertion loop looks like this string intern table:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 15
</span>
<span class="c1">// Initialize all slots to an "empty" value (null)</span>
<span class="cp">#define HT_INIT { {0}, 0 }
</span><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">ht</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// empty, insert here</span>
            <span class="k">if</span> <span class="p">((</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="mi">1</span> <span class="o">==</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
            <span class="p">}</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">key</span><span class="p">))</span> <span class="p">{</span>
            <span class="c1">// found, return canonical instance</span>
            <span class="k">return</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
<em>double hashing</em>: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via <code class="language-plaintext highlighter-rouge">| 1</code>), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).</p>

<p>Note: The example out of memory check pushes the hash table to the
absolute limit, and in practice you’d want to stop at a smaller load
factor — perhaps even as low as 50% since that’s simple and fast.
Otherwise it degrades into a linear search as the table approaches
capacity.</p>

<p>Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For a while I used plain linear probing — i.e.
<code class="language-plaintext highlighter-rouge">step=1</code> — but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally <code class="language-plaintext highlighter-rouge">ht_lookup</code> would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.</p>

<h3 id="deletion">Deletion</h3>

<p>What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used <code class="language-plaintext highlighter-rouge">del</code> on a dictionary
in Python, or <code class="language-plaintext highlighter-rouge">delete</code> on a <code class="language-plaintext highlighter-rouge">map</code> in Go? This operation is rarely needed.
However, when you <em>do</em> need it, reserve a gravestone value in addition to
the empty value.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">char</span> <span class="n">gravestone</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"(deleted)"</span><span class="p">;</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">dest</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="c1">// ...</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="o">*</span><span class="n">dest</span> <span class="o">=</span> <span class="n">key</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">key</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">dest</span> <span class="o">=</span> <span class="n">dest</span> <span class="o">?</span> <span class="n">dest</span> <span class="o">:</span> <span class="o">&amp;</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="c1">// ...</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>

<span class="kt">char</span> <span class="o">*</span><span class="nf">unintern</span><span class="p">(</span><span class="k">struct</span> <span class="n">ht</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">key</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// ...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">gravestone</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// skip over</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strcmp</span><span class="p">(...))</span> <span class="p">{</span>
            <span class="kt">char</span> <span class="o">*</span><span class="n">old</span> <span class="o">=</span> <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">t</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">gravestone</span><span class="p">;</span>
            <span class="k">return</span> <span class="n">old</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When searching, skip over gravestones. Note that gravestones are compared
with <code class="language-plaintext highlighter-rouge">==</code> (identity), so this does not preclude a string <code class="language-plaintext highlighter-rouge">"(deleted)"</code>.
When inserting, use the first gravestone found if no entry was found.</p>

<h3 id="as-a-database-index">As a database index</h3>

<p>Iterating over the example string intern table is simple: Iterate over the
underlying array, skipping empty slots (and maybe gravestones). Entries
will be in a random order rather than, say, insertion order. This is a
useful introductory example, but it isn’t where MSI shines most. As
mentioned, it’s best when treated like a database index.</p>

<p>Let’s take a step back and consider the caller of <code class="language-plaintext highlighter-rouge">intern</code>. How does it
allocate these strings? Perhaps they’re <a href="/blog/2022/05/22/">appended to a buffer</a>, and
<code class="language-plaintext highlighter-rouge">intern</code> indicates whether or not the string is unique so far.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// lookup table over the buffer</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span><span class="p">;</span>

    <span class="c1">// a collection of strings</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUFLEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Strings are only appended to the buffer when unique, and the hash table
can make that determination in constant time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">candidate</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">buf</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>

    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">==</span> <span class="n">candidate</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// string is unique, keep it</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In my first example, <code class="language-plaintext highlighter-rouge">EXP</code> was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">ht</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">exp</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">**</span><span class="n">ht</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">ht</span>
<span class="nf">ht_new</span><span class="p">(</span><span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ht</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">exp</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="n">assert</span><span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">exp</span> <span class="o">&gt;=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>  <span class="c1">// request too large</span>
    <span class="p">}</span>

    <span class="n">ht</span><span class="p">.</span><span class="n">ht</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">((</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">exp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="k">return</span> <span class="n">ht</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">intern</code> fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, <em>the hash table can be discarded and rebuilt from scratch</em>. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">buf_rehash</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">off</span> <span class="o">&lt;</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;)</span> <span class="p">{</span>
        <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="n">off</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">off</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
        <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
                <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when <code class="language-plaintext highlighter-rouge">intern</code> hits
its capacity:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">result</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">free</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">.</span><span class="n">ht</span><span class="p">);</span>
        <span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span> <span class="o">=</span> <span class="n">ht_new</span><span class="p">(</span><span class="n">ht</span><span class="p">.</span><span class="n">exp</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="n">buf_rehash</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">,</span> <span class="n">candidate</span><span class="p">);</span>  <span class="c1">// cannot fail</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>I freed and reallocated the table, but it would be trivial to use a
<code class="language-plaintext highlighter-rouge">realloc</code> instead, unlike the case where the old table <em>isn’t</em> redundant.</p>

<h3 id="multimaps">Multimaps</h3>

<p>An MSI hash table is trivially converted into a multimap, a hash table
with multiple values per key. Callers just make one small change: <em>Don’t
stop searching until an empty slot is found</em>. Each match is an additional
multimap value. The “value array” is stored within the hash table itself,
in insertion order, without additional allocations.</p>

<p>For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like <code class="language-plaintext highlighter-rouge">city:Austin</code> and <code class="language-plaintext highlighter-rouge">state:Texas</code>. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="c1">// ..</span>
    <span class="k">struct</span> <span class="n">ht</span> <span class="n">ns</span><span class="p">;</span>
    <span class="c1">// ..</span>
<span class="p">};</span>
</code></pre></div></div>

<p>When a unique string is appended it’s also registered in the namespace
multimap. It doesn’t check for an existing key, only for an empty slot,
since it’s a multimap:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Check outside the loop since it always inserts.</span>
    <span class="k">if</span> <span class="p">(</span><span class="cm">/* ... ns multimap lacks capacity ... */</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// ... grow+rehash ns mutilmap ...</span>
    <span class="p">}</span>

    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strcspn</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">":"</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">len</span><span class="o">++</span><span class="p">;</span>
            <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>It includes the <code class="language-plaintext highlighter-rouge">:</code> as a terminator which simplifies lookups. Here’s a
lookup loop to print all strings under a namespace (includes terminal <code class="language-plaintext highlighter-rouge">:</code>
in the key):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">char</span> <span class="o">*</span><span class="n">ns</span> <span class="o">=</span> <span class="s">"city:"</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">nslen</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">ns</span><span class="p">);</span>
    <span class="c1">// ...</span>

    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">h</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">ht_lookup</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">ns</span><span class="o">-&gt;</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">ns</span><span class="p">,</span> <span class="n">nslen</span><span class="p">))</span> <span class="p">{</span>
            <span class="n">puts</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">ns</span><span class="p">.</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">nslen</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed <code class="language-plaintext highlighter-rouge">{"city", 0}</code>, the next
<code class="language-plaintext highlighter-rouge">{"city", 1}</code>, etc. The value subscript could be mixed into the string
hash with an <a href="/blog/2018/07/31/">integer permutation</a> (more on this below):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hash64</span><span class="p">(</span><span class="n">val_idx</span> <span class="o">^</span> <span class="n">hash</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">nslen</span><span class="p">));</span>
</code></pre></div></div>

<p>The lookup loop would compare both the string and the value subscript, and
stop when it finds a match. The underlying hash table is not truly a
multimap, but rather a plain hash table with a larger key. This requires
extra bookkeeping — tracking individual subscripts and the number of
values per key — but provides constant time random access on the multimap
value array.</p>
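
<p>Here’s a sketch of that scheme with hypothetical names. The entry struct and
the per-key value counts are assumptions;
<code class="language-plaintext highlighter-rouge">hash</code> and
<code class="language-plaintext highlighter-rouge">ht_lookup</code> follow the
article, and <code class="language-plaintext highlighter-rouge">hash64</code> is
a multiply-xorshift permutation:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXP 6

// Hypothetical entry: the "larger key" is a string plus a subscript.
struct entry {
    char   *key;
    int32_t idx;  // value subscript: 0 for the first value, 1 for the next
    char   *val;
};

struct map {
    struct entry *ht[1<<EXP];
};

// FNV-1a-style string hash, as defined in the article
static uint64_t hash(char *s, int32_t len)
{
    uint64_t h = 0x100;
    for (int32_t i = 0; i < len; i++) {
        h ^= s[i] & 255;
        h *= 1111111111111111111;
    }
    return h ^ h>>32;
}

// Integer permutation (multiply-xorshift), mixing in the subscript
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= 1111111111111111111u;
    x ^= x >> 32;
    return x;
}

static int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (uint32_t)(hash >> (64 - exp)) | 1;
    return (int32_t)((idx + step) & mask);
}

// Find the slot for {key, idx}: either the matching entry or the empty
// slot where it would go. Compares both the string and the subscript.
struct entry **lookup(struct map *m, char *key, int32_t idx)
{
    uint64_t h = hash64((uint64_t)idx ^ hash(key, (int32_t)strlen(key)));
    for (int32_t i = (int32_t)h;;) {
        i = ht_lookup(h, EXP, i);
        struct entry *e = m->ht[i];
        if (!e || (e->idx == idx && !strcmp(e->key, key))) {
            return &m->ht[i];
        }
    }
}
```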

<h3 id="hash-functions">Hash functions</h3>

<p>The MSI iterator leaves hashing up to the caller, who has better knowledge
about the input and how to hash it, though it requires knowing a bit about
building hash functions. The good news is that it’s easy, and less
is more. Better to do too little than too much, and a faster, weaker hash
function is worth a few extra collisions.</p>

<p>The first rule is to never lose sight of the goal: The purpose of the hash
function is to uniformly distribute entries over a table. The better you
know and exploit your input, the less you need to do in the hash function.
Sometimes your keys already contain random data, and so your hash function
can be the identity function! For example, if your keys are <a href="https://www.rfc-editor.org/rfc/rfc4122#section-4.4">“version 4”
UUIDs</a>, don’t waste time hashing them, just load a few bytes from the
end as an integer and you’re done.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// "Hash" a v4 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid4_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">h</span><span class="p">,</span> <span class="n">uuid</span><span class="o">+</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">h</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A reasonable start for strings is <a href="https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function">FNV-1a</a>, such as this possible
implementation for my <code class="language-plaintext highlighter-rouge">hash()</code> function above:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">hash</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">h</span> <span class="o">=</span> <span class="mh">0x100</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">h</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mi">255</span><span class="p">;</span>
        <span class="n">h</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">h</span> <span class="o">^</span> <span class="n">h</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The hash state is initialized to a <em>basis</em>, some arbitrary value. This is a
useful place to introduce a seed or hash key. It’s best that at least one
bit above the low mix-in bits is set so that it’s not trivially stuck at
zero. Above, I’ve chosen the most trivial basis with reasonable results,
though often I’ll use the digits of π.</p>

<p>Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
<code class="language-plaintext highlighter-rouge">&amp; 255</code>, which inhibits sign extension. <strong>Do not mix sign-extended inputs
into FNV-1a</strong> — a widespread implementation mistake.</p>
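
<p>To see the hazard concretely, consider a byte of 0x80 or higher arriving
through a (typically signed) <code class="language-plaintext highlighter-rouge">char</code>.
A small illustration, not from the article:</p>

```c
#include <assert.h>
#include <stdint.h>

// One FNV-1a round with and without the byte mask. Without it, a char
// that is negative on this platform sign-extends to 64 bits, flipping
// the upper 56 bits of the hash state before the multiply.
uint64_t mix_masked(uint64_t h, char c)
{
    h ^= c & 255;  // input confined to the low 8 bits
    return h * 1111111111111111111;
}

uint64_t mix_unmasked(uint64_t h, char c)
{
    h ^= (uint64_t)c;  // sign-extends when char is signed and c < 0
    return h * 1111111111111111111;
}
```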

<p>Multiply by a large, odd random-ish integer. A prime is a reasonable
choice, and I usually pick my favorite prime, shown above: 19 ones in base
10.</p>

<p>Finally, my own touch, an xorshift finalizer. The high bits are much
better mixed than the low bits, so this improves the overall quality.
Though if you take time to benchmark, you might find that this finalizer
isn’t necessary. Remember, do <em>just</em> enough work to keep the number of
collisions low — not <em>lowest</em> — and no more.</p>

<p>If your input is made of integers, or is a short, fixed length, use an
<a href="/blog/2018/07/31/">integer permutation</a>, particularly multiply-xorshift. It takes very
little to get a sufficient distribution. Sometimes one multiplication does
the trick. Fixed-sized, integer-permutation hashes tend to be the fastest,
easily beating fancier SIMD-based hashes, <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">including AES-NI</a>. For
example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Hash a timestamp-based, version 1 UUID</span>
<span class="kt">uint64_t</span> <span class="nf">uuid1_hash</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">uuid</span><span class="p">[</span><span class="mi">16</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">uuid</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="mh">0x3243f6a8885a308d</span><span class="p">;</span>  <span class="c1">// digits of pi</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*=</span> <span class="mi">1111111111111111111</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">33</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This <code class="language-plaintext highlighter-rouge">memcpy</code> trick works well with floats, too,
especially packing two single-precision floats into one 64-bit integer.</p>
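
<p>For instance, a hypothetical hash for a point made of two single-precision
floats, using the multiply-xorshift permutation described above:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Pack two floats' bit patterns into one 64-bit integer, then apply a
// multiply-xorshift permutation. A sketch: beware that +0.0f and -0.0f
// have distinct bit patterns and so hash differently.
uint64_t point_hash(float x, float y)
{
    float xy[2] = {x, y};
    uint64_t h;
    memcpy(&h, xy, 8);         // bitwise pack, no numeric conversion
    h *= 1111111111111111111;  // multiply by a large, odd constant
    h ^= h >> 33;              // xorshift finalizer
    return h;
}
```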

<p>If you ever <a href="https://mort.coffee/home/tar/">hesitate to build a hash table</a> when the situation
calls, I hope the MSI technique will make the difference next time. I have
more hash table tricks up my sleeve, but since they’re not specific to MSI
I’ll save them for a future article.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>There have been objections to my claims about performance, so <a href="https://gist.github.com/skeeto/8e7934318560ac739c126749d428a413">I’ve
assembled some benchmarks</a>. These demonstrate that:</p>

<ul>
  <li>AES-NI is slower than an integer permutation, at least for short keys.</li>
  <li>A custom, 10-line MSI hash table is easily an order of magnitude faster
than a typical generic hash table from your language’s standard library.
This isn’t because the standard hash table is inferior, but because <a href="https://vimeo.com/644068002">it
wasn’t written for your specific problem</a>.</li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>My take on "where's all the code"</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/22/"/>
    <id>urn:uuid:2eb07dcf-0d4c-44e7-9133-fd9cf8e83227</id>
    <updated>2022-05-22T23:59:59Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://lobste.rs/s/ny4ymx">on Lobsters</a>.</em></p>

<p>Earlier this month Ted Unangst researched <a href="https://flak.tedunangst.com/post/compiling-an-openbsd-kernel-50-faster">compiling the OpenBSD kernel
50% faster</a>, which involved stubbing out the largest, extraneous
branches of the source tree. To find the lowest-hanging fruit, he <a href="https://flak.tedunangst.com/post/watc">wrote a
tool</a> called <a href="https://humungus.tedunangst.com/r/watc">watc</a> — <em>where’s all the code</em> — that displays an
interactive “usage” summary of a source tree oriented around line count. A
followup post <a href="https://flak.tedunangst.com/post/parallel-tree-running">about exploring the tree in parallel</a> got me thinking
about the problem, especially since <a href="/blog/2022/05/14/">I had just written about a concurrent
queue</a>. Turning it over in my mind, I saw opportunities for interesting
data structures and memory management, and so I wanted to write my own
version of the tool, <a href="https://github.com/skeeto/scratch/blob/master/windows/watc.c"><strong><code class="language-plaintext highlighter-rouge">watc.c</code></strong></a>, which is the subject of this
article.</p>

<!--more-->

<p>The original <code class="language-plaintext highlighter-rouge">watc</code> is interactive and written in idiomatic Go. My version
is non-interactive, written in C, and currently only supports Windows. Not
only do I prefer batch programs generally, building an interactive user
interface would be complicated and distract from the actual problem I
wanted to tackle. As for the platform restriction, it has some convenient
constraints (for implementers), and my projects are often about shooting
multiple birds with one stone:</p>

<ul>
  <li>
    <p>The longest path is <code class="language-plaintext highlighter-rouge">MAX_PATH</code>, a meager 260 pseudo-UTF-16 code points,
which is nice and short. Technically users can now opt-in to a <a href="https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation">maximum path
length of 32,767</a>, but so little software supports it, including
much of Windows itself, that it’s not worth considering. Even with the
upper limit, each path component is still restricted by <code class="language-plaintext highlighter-rouge">MAX_PATH</code>. I
can rely on this platform restriction in my design.</p>
  </li>
  <li>
    <p>Symbolic links, an annoying edge case, are outside of consideration.
Technically Windows has them, but they’re sufficiently locked away that
they don’t come up in practice.</p>
  </li>
  <li>
    <p>After years of deliberating, I <a href="https://www.youtube.com/watch?v=r9eQth4Q5jg">was finally convinced</a> to buy and
try <a href="https://remedybg.handmade.network/">RemedyBG</a>, a super slick Windows debugger. I especially wanted
to try out its multi-threading support, and I knew I’d be using multiple
threads in this project. Since it’s incompatible with <a href="/blog/2020/05/15/">my development
kit</a>, my program also supports the MSVC compiler.</p>
  </li>
  <li>
    <p>The very same day I <a href="https://github.com/skeeto/w64devkit/commit/1513aa7">improved GDB support</a> in my development kit,
and this was a great opportunity to dogfood the changes. I’ve used my
kit <em>so much</em> these past two years, especially since both it and I have
matured enough that I’m nearly as productive in it as I am on Linux.</p>
  </li>
  <li>
    <p>It’s practice and experience with <a href="/blog/2021/12/30/">the wide API</a>, and the tool
fully supports Unicode paths. Perhaps a bit unnecessary considering how
few source trees stray beyond ASCII, even just in source text, since too
many things go wrong otherwise.</p>
  </li>
</ul>

<p>Running my tool on nearly the same source tree as the original example
yields:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\openbsd&gt;watc sys
. 6.89MLOC 364.58MiB
├─dev 5.69MLOC 332.75MiB
│ ├─pci 4.46MLOC 293.80MiB
│ │ ├─drm 3.99MLOC 280.25MiB
│ │ │ ├─amd 3.33MLOC 261.24MiB
│ │ │ │ ├─include 2.61MLOC 238.48MiB
│ │ │ │ │ ├─asic_reg 2.53MLOC 235.07MiB
│ │ │ │ │ │ ├─nbio 689.56kLOC 69.33MiB
│ │ │ │ │ │ ├─dcn 583.67kLOC 58.60MiB
│ │ │ │ │ │ ├─gc 290.26kLOC 28.90MiB
│ │ │ │ │ │ ├─dce 210.16kLOC 16.81MiB
│ │ │ │ │ │ ├─mmhub 155.60kLOC 16.03MiB
│ │ │ │ │ │ ├─dpcs 123.90kLOC 12.97MiB
│ │ │ │ │ │ ├─gca 105.91kLOC 5.87MiB
│ │ │ │ │ │ ├─bif 71.45kLOC 4.41MiB
│ │ │ │ │ │ ├─gmc 64.24kLOC 3.41MiB
│ │ │ │ │ │ └─(other) 230.99kLOC 18.73MiB
│ │ │ │ │ └─(other) 2.10kLOC 139.29kiB
│ │ │ │ └─(other) 718.93kLOC 22.76MiB
│ │ │ └─(other) 583.63kLOC 16.86MiB
│ │ └─(other) 8.53kLOC 259.07kiB
│ └─(other) 1.20MLOC 38.34MiB
└─(other) 1.20MLOC 31.83MiB
</code></pre></div></div>

<p>In place of interactivity it has <code class="language-plaintext highlighter-rouge">-n</code> (lines) and <code class="language-plaintext highlighter-rouge">-d</code> (depth) switches to
control tree pruning, where branches are summarized as <code class="language-plaintext highlighter-rouge">(other)</code> entries.
My idea is for users to run the tool repeatedly with different cutoffs and
filters to get a feel for <em>where’s all the code</em>. (It could really use
more such knobs.) Repeated counting makes performance all the more
important. On my machine, with a hot cache, the above takes ~180ms to count
those 6.89 million lines of code across 8,607 source files.</p>

<p>Each directory is treated like one big source file of its recursively
concatenated contents, so the tool only needs to track directories. Each
directory entry comprises a variable-length string name, line and byte
totals, and tree linkage such that it can be later navigated for sorting
and printing. That linkage has a clever solution, which I’ll get to later.
First, let’s deal with strings.</p>

<h3 id="string-management">String management</h3>

<p>It’s important to get out of the null-terminated string business early,
only reverting to their use at system boundaries, such as constructing
paths for the operating system. Better to handle strings as offset/length
pairs into a buffer. Definitely avoid silly things like <a href="https://www.youtube.com/watch?v=f4ioc8-lDc0&amp;t=4407s">allocating many
individual strings</a>, as encouraged by <code class="language-plaintext highlighter-rouge">strdup</code> — and most other
programming language idioms — and certainly avoid <a href="/blog/2021/07/30/">useless functions like
<code class="language-plaintext highlighter-rouge">strcpy</code></a>.</p>

<p>When the operating system provides a path component that I need to track
for later, I intern it into a single, large buffer. That buffer looks like
so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define BUF_MAX  (1 &lt;&lt; 22)
</span><span class="k">struct</span> <span class="n">buf</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">len</span><span class="p">;</span>
    <span class="kt">wchar_t</span> <span class="n">buf</span><span class="p">[</span><span class="n">BUF_MAX</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Empirically I determined that even large source trees cumulatively total
on the order of 10,000 characters of directory names. The OpenBSD kernel
source tree is only 2,992 characters of names.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find sys -type d -printf %f | wc -c
2992
</code></pre></div></div>

<p>The biggest I found was the LLVM source tree at 121,720 characters, not
only because of its sheer volume but also because it generally has
relatively long names. So for my maximum buffer size I just maxed it out
(explained in a moment) and called it good. Even with UTF-16, that’s only
8MiB which is perfectly reasonable to allocate all at once up front. Since
my <a href="https://floooh.github.io/2018/06/17/handles-vs-pointers.html">string handles</a> don’t contain pointers, this buffer could be freely
relocated in the case of <code class="language-plaintext highlighter-rouge">realloc</code>.</p>

<p>The operating system provides a null-terminated string. The buffer makes a
copy and returns a handle. A handle is a 32-bit integer encoding offset
and length.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="nf">buf_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">wchar_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">off</span> <span class="o">=</span> <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">wcslen</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span><span class="o">+</span><span class="n">len</span> <span class="o">&gt;</span> <span class="n">BUF_MAX</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// out of memory</span>
    <span class="p">}</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">buf</span><span class="o">+</span><span class="n">off</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">len</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="p">));</span>
    <span class="n">b</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">len</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">len</span><span class="o">&lt;&lt;</span><span class="mi">22</span> <span class="o">|</span> <span class="n">off</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The negative range is reserved for errors, leaving 31 bits. I allocate 9
to the length — enough for <code class="language-plaintext highlighter-rouge">MAX_PATH</code> of 260 — and the remaining 22 bits
for the buffer offset, exactly matching the range of my <code class="language-plaintext highlighter-rouge">BUF_MAX</code>.
Splitting on a nibble boundary would have displayed more nicely in
hexadecimal during debugging, but oh well.</p>

<p>A couple of helper functions are in order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">str_len</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">22</span><span class="p">;</span>      <span class="p">}</span>
<span class="kt">int32_t</span> <span class="nf">str_off</span><span class="p">(</span><span class="kt">int32_t</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0x3fffff</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>

<p>Rather than allocate the string buffer on the heap, it’s a <code class="language-plaintext highlighter-rouge">static</code> (read:
too big for the stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>. I consistently call it <code class="language-plaintext highlighter-rouge">b</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">buf</span> <span class="n">b</span><span class="p">;</span>
</code></pre></div></div>

<p>That’s string management solved efficiently in a dozen lines of code. I
briefly considered a hash table to de-duplicate strings in the buffer, but
real source trees aren’t redundant enough to make up for the hash table
itself, plus there’s no reason here to make that sort of time/memory
trade-off.</p>

<h3 id="directory-entries">Directory entries</h3>

<p>I settled on 24-byte directory entries:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">dir</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">nbytes</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">nlines</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">name</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">link</span><span class="p">;</span>
    <span class="kt">int32_t</span>  <span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">nbytes</code> I teetered between 32 bits and 64 bits. No
source tree I found overflows an unsigned 32-bit integer, but LLVM comes
close, just barely overflowing a signed 31-bit integer as of this year.
Since I wanted 10x over the worst case I could find, that left me with a
64-bit integer for bytes.</p>

<p>For <code class="language-plaintext highlighter-rouge">nlines</code>, 32 bits has plenty of headroom. More importantly, this field
is updated concurrently and atomically by multiple threads — line counting
is parallelized — and I want this program to work on 32-bit hosts limited
to 32-bit atomics.</p>

<p>The <code class="language-plaintext highlighter-rouge">name</code> is the string handle for that directory’s name.</p>

<p>The <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code> are the tree linkage. The <code class="language-plaintext highlighter-rouge">link</code> field is an
index, and serves two different purposes at different times. Initially it
will identify the directory’s parent directory, and I had originally named
it <code class="language-plaintext highlighter-rouge">parent</code>. <code class="language-plaintext highlighter-rouge">nsubdirs</code> is the number of subdirectories, but there is
initially no link to a directory’s children.</p>

<p>Like with the buffer, I pre-allocate all the directory entries I’ll need:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define DIRS_MAX  (1 &lt;&lt; 17)
</span><span class="kt">int32_t</span> <span class="n">ndirs</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>
</code></pre></div></div>

<p>A directory handle is just an index into <code class="language-plaintext highlighter-rouge">dirs</code>. The <code class="language-plaintext highlighter-rouge">link</code> field is one
such handle. Like string handles, directory entries contain no pointers,
and so this <code class="language-plaintext highlighter-rouge">dirs</code> buffer could be freely relocated, <em>a la</em> <code class="language-plaintext highlighter-rouge">realloc</code>, if
the context called for such flexibility. In my program, rather than
allocate this on the heap, it’s just a <code class="language-plaintext highlighter-rouge">static</code> (read: too big for the
stack) scoped to <code class="language-plaintext highlighter-rouge">main</code>.</p>

<p>For <code class="language-plaintext highlighter-rouge">DIRS_MAX</code>, I again looked at the worst case I could find, LLVM, which
requires 12,163 entries. I had hoped for 16-bit directory handles, but
that would limit source trees to 32,768 directories — not quite 10x over
the worst case. I settled on 131,072 entries: 3MiB. At only 11MiB total so
far, in the very worst case, it hardly matters that I couldn’t shave off
these extra few bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ find llvm-project -type d | wc -l
12163
</code></pre></div></div>

<p>Allocating a directory entry is just a matter of bumping the <code class="language-plaintext highlighter-rouge">ndirs</code>
counter. Reading a directory into <code class="language-plaintext highlighter-rouge">dirs</code> looks roughly like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">glob</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"*"</span><span class="p">);</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">dir</span> <span class="n">dirs</span><span class="p">[</span><span class="n">DIRS_MAX</span><span class="p">];</span>

<span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="p">...;</span>  <span class="c1">// an existing directory handle</span>
<span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
<span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">glob</span><span class="p">);</span>

<span class="n">WIN32_FIND_DATAW</span> <span class="n">fd</span><span class="p">;</span>
<span class="n">HANDLE</span> <span class="n">h</span> <span class="o">=</span> <span class="n">FindFirstFileW</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">);</span>

<span class="k">do</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">FILE_ATTRIBUTE_DIRECTORY</span> <span class="o">&amp;</span> <span class="n">fd</span><span class="p">.</span><span class="n">dwFileAttributes</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">name</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">ndirs</span> <span class="o">==</span> <span class="n">DIRS_MAX</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// out of memory</span>
        <span class="p">}</span>
        <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">parent</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nsubdirs</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="c1">// ... process file ...</span>
    <span class="p">}</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">FindNextFileW</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">fd</span><span class="p">));</span>

<span class="n">CloseHandle</span><span class="p">(</span><span class="n">h</span><span class="p">);</span>
</code></pre></div></div>

<p>Mentally bookmark that “process file” part. It will be addressed later.</p>

<p>The <code class="language-plaintext highlighter-rouge">buildpath</code> function walks the <code class="language-plaintext highlighter-rouge">link</code> fields, copying (<code class="language-plaintext highlighter-rouge">memcpy</code>) path
components from the string buffer into the <code class="language-plaintext highlighter-rouge">path</code>, separated by
backslashes.</p>

<h3 id="breadth-first-tree-traversal">Breadth-first tree traversal</h3>

<p>At the top level the program must first traverse a tree. There are two
strategies for traversing a tree (or any graph):</p>

<ul>
  <li>Depth-first: stack-oriented (lends to recursion)</li>
  <li>Breadth-first: queue-oriented</li>
</ul>

<p>Recursion makes me nervous, and besides, a queue is already a natural
fit for this problem. The tree I build in <code class="language-plaintext highlighter-rouge">dirs</code> is also the breadth-first
processing queue. (Note: This is entirely distinct from the <em>message</em>
queue that I’ll introduce later, and is not a concurrent queue.) Further,
building the tree in <code class="language-plaintext highlighter-rouge">dirs</code> via breadth-first traversal will have useful
properties later.</p>

<p>The queue is initialized with the root directory, then iterated over until
the iterator reaches the end. Additional directories may be added during
iteration, per the last section.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int32_t</span> <span class="n">root</span> <span class="o">=</span> <span class="n">ndirs</span><span class="o">++</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="s">L"."</span><span class="p">);</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">root</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>  <span class="c1">// terminator</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">parent</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">parent</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">parent</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ... FindFirstFileW / FindNextFileW ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When the loop exits, the program has traversed the full tree. Counts are
now propagated up the tree using the <code class="language-plaintext highlighter-rouge">link</code> field, pointing from leaves to
root. In this direction it’s just a linked list. Propagation starts at the
root and works towards leaves to avoid multiple-counting, and the
breadth-first <code class="language-plaintext highlighter-rouge">dirs</code> is already ordered for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span> <span class="n">j</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">link</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nbytes</span><span class="p">;</span>
        <span class="n">dirs</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">nlines</span> <span class="o">+=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nlines</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is really another traversal, it could be done during the
first traversal. However, line counting will be done concurrently, and
it’s easier, and probably more efficient, to propagate concurrent results
after the concurrent part of the code is complete.</p>

<h3 id="inverting-the-tree-links">Inverting the tree links</h3>

<p>Printing the graph will require a depth-first traversal. Given an entry,
the program will iterate over its children. However, the tree links are
currently backwards, pointing from child to parent:</p>

<p><a href="/img/diagram/bfs0.dot"><img src="/img/diagram/bfs0.png" alt="" /></a></p>

<p>To traverse from root to leaves, those links will need to be inverted:</p>

<p><a href="/img/diagram/bfs1.dot"><img src="/img/diagram/bfs1.png" alt="" /></a></p>

<p>There’s only one <code class="language-plaintext highlighter-rouge">link</code> on each node, but potentially multiple
children. The breadth-first traversal comes to the rescue: All child nodes
for a given directory are adjacent in <code class="language-plaintext highlighter-rouge">dirs</code>. If <code class="language-plaintext highlighter-rouge">link</code> points to the
first child, finding the rest is trivial. There’s an implicit link between
siblings by virtue of position:</p>

<p><a href="/img/diagram/bfs2.dot"><img src="/img/diagram/bfs2.png" alt="" /></a></p>

<p>An entry’s first child immediately follows the previous entry’s last
child. So to flip the links around, manually establish the root’s <code class="language-plaintext highlighter-rouge">link</code>
field, then walk the tree breadth-first and hook <code class="language-plaintext highlighter-rouge">link</code> up to each entry’s
children based on the previous entry’s <code class="language-plaintext highlighter-rouge">link</code> and <code class="language-plaintext highlighter-rouge">nsubdirs</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The tree is now restructured for sorting and depth-first traversal.</p>

<h3 id="sort-by-line-count">Sort by line count</h3>

<p>I won’t include it here, but I have a <code class="language-plaintext highlighter-rouge">qsort</code>-compatible comparison
function, <code class="language-plaintext highlighter-rouge">dircmp</code>, that compares by line count descending, then by name
ascending. Since this is a file system tree, siblings cannot have equal names.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">dircmp</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Since child entries are adjacent, it’s trivial to <code class="language-plaintext highlighter-rouge">qsort</code> each entry’s
children. A loop sorts the whole tree:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndirs</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">beg</span> <span class="o">=</span> <span class="n">dirs</span> <span class="o">+</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">link</span><span class="p">;</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">beg</span><span class="p">,</span> <span class="n">dirs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">dirs</span><span class="p">),</span> <span class="n">dircmp</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We’re almost to the finish line.</p>

<h3 id="depth-first-traversal">Depth-first traversal</h3>

<p>As I said, recursion makes me nervous, so I took the slightly more
complicated route of an explicit stack. Path components must be separated
by a backslash delimiter, so the deepest possible stack is <code class="language-plaintext highlighter-rouge">MAX_PATH/2</code>.
Each stack element tracks a directory handle (<code class="language-plaintext highlighter-rouge">d</code>) and a subdirectory
index (<code class="language-plaintext highlighter-rouge">i</code>).</p>

<p>I have a <code class="language-plaintext highlighter-rouge">printstat</code> function to output an entry. It takes an entry, the string
buffer, and a depth for indentation level.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">printstat</span><span class="p">(</span><span class="k">struct</span> <span class="n">dir</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="k">struct</span> <span class="n">buf</span> <span class="o">*</span><span class="n">b</span><span class="p">,</span> <span class="kt">int</span> <span class="n">depth</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s a simplified depth-first traversal calling <code class="language-plaintext highlighter-rouge">printstat</code>. (The real
one has to make decisions about when to stop and summarize, and it’s
dominated by edge cases.) I initialize the stack with the root directory,
then loop until it’s empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// top of stack</span>
<span class="k">struct</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span> <span class="n">stack</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="o">/</span><span class="mi">2</span><span class="p">];</span>

<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="mi">0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>

<span class="k">while</span> <span class="p">(</span><span class="n">n</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">d</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span><span class="o">++</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">nsubdirs</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">n</span><span class="o">--</span><span class="p">;</span>  <span class="c1">// pop</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">dirs</span><span class="p">[</span><span class="n">d</span><span class="p">].</span><span class="n">link</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">printstat</span><span class="p">(</span><span class="n">dirs</span><span class="o">+</span><span class="n">cur</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
        <span class="n">n</span><span class="o">++</span><span class="p">;</span>  <span class="c1">// push</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">d</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
        <span class="n">stack</span><span class="p">[</span><span class="n">n</span><span class="p">].</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="concurrency">Concurrency</h3>

<p>At this point the “process file” part of traversal was straightforward: a
<code class="language-plaintext highlighter-rouge">CreateFile</code>, a <code class="language-plaintext highlighter-rouge">ReadFile</code> loop, then <code class="language-plaintext highlighter-rouge">CloseHandle</code>. I suspected it spent most of
its time in the loop counting newlines since I didn’t do anything special,
<a href="/blog/2021/12/04/">like SIMD</a>, aside from <a href="/blog/2019/12/09/">not over-constraining code
generation</a>.</p>

<p>However, after taking some measurements, I found the program was spending
99.9% of its time waiting on Win32 functions. <code class="language-plaintext highlighter-rouge">CreateFile</code> was the most
expensive at nearly 50% of the total run time, and even <code class="language-plaintext highlighter-rouge">CloseHandle</code> was
a substantial blocker. These two alone meant overlapped I/O wouldn’t help
much, and threads were necessary to run these Win32 blockers concurrently.
Counting newlines, even over gigabytes of data, was practically free, and
so required no further attention.</p>

<p>So I set up <a href="/blog/2022/05/14/">my lock-free work queue</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define QUEUE_LEN (1&lt;&lt;15)
</span><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">q</span><span class="p">;</span>
    <span class="kt">int32_t</span> <span class="n">d</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
    <span class="kt">int32_t</span> <span class="n">f</span><span class="p">[</span><span class="n">QUEUE_LEN</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>As before, <code class="language-plaintext highlighter-rouge">q</code> here is the atomic. A max-size queue of <code class="language-plaintext highlighter-rouge">QUEUE_LEN</code> elements worked
best in my tests: it was rarely full, or empty, except at startup and
shutdown. Queue elements are a pair of directory handle (<code class="language-plaintext highlighter-rouge">d</code>)
and file string handle (<code class="language-plaintext highlighter-rouge">f</code>), stored in separate arrays.</p>

<p>I didn’t need to push the file name strings into the string buffer before,
but now it’s a great way to supply strings to other threads. I push the
string into the buffer, then send the handle through the queue. The
recipient re-constructs the path on its end using the directory tree and
this file name. Unfortunately this puts more stress on the string buffer,
which is why I had to max out the size, but it’s worth it.</p>
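<p>The string-handle scheme can be sketched like so. This is my guess at the
shape of <code class="language-plaintext highlighter-rouge">buf_push</code>, whose definition isn’t shown here: the struct layout and
behavior are assumptions, with only the name and its usage taken from the
surrounding code.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

// Hypothetical flat string buffer: a "handle" is just the offset at
// which a NUL-terminated wide string begins. (Layout assumed, not
// from the article.)
struct buf {
    int32_t len;
    wchar_t buf[1<<20];
};

// Append a NUL-terminated wide string and return its offset handle.
// Error handling for a full buffer is omitted from this sketch.
static int32_t buf_push(struct buf *b, const wchar_t *s)
{
    int32_t handle = b->len;
    do {
        b->buf[b->len++] = *s;
    } while (*s++);
    return handle;
}
```

<p>A recipient can later recover the string as <code class="language-plaintext highlighter-rouge">b-&gt;buf + handle</code>, which is
how a worker would rebuild a path from the directory tree plus this file
name.</p>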

<p>The “process files” part now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeLow</span><span class="p">;</span>
<span class="n">dirs</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">nbytes</span> <span class="o">+=</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">fd</span><span class="p">.</span><span class="n">nFileSizeHigh</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">;</span>

<span class="kt">int32_t</span> <span class="n">name</span> <span class="o">=</span> <span class="n">buf_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">,</span> <span class="n">fd</span><span class="p">.</span><span class="n">cFileName</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_send</span><span class="p">(</span><span class="o">&amp;</span><span class="n">queue</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">))</span> <span class="p">{</span>
    <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
    <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">buf</span><span class="p">.</span><span class="n">buf</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">parent</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">queue_send()</code> returns false then the queue is full, so it processes
the job itself. There might be room later for the next file.</p>
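<p>Here’s a plausible shape for <code class="language-plaintext highlighter-rouge">queue_send</code>, built on the push half of the
lock-free queue, reproduced from its own article. The wrapper itself is an
assumption: only its name, arguments, and false-on-full behavior appear
above, and <code class="language-plaintext highlighter-rouge">q</code> is declared <code class="language-plaintext highlighter-rouge">_Atomic</code> here so the sketch stands alone.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_EXP 15              // QUEUE_LEN == 1<<15

struct queue {
    _Atomic uint32_t q;           // head/tail pair
    int32_t d[1<<QUEUE_EXP];      // directory handles
    int32_t f[1<<QUEUE_EXP];      // file string handles
};

// Push half of the 32-bit lock-free queue (from the queue article).
static int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r       & mask;
    int tail = r >> 16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) {             // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}

static void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1;
}

// Non-blocking send of a (directory, filename) pair. Returns false
// when the queue is full, so the caller processes the file itself.
static _Bool queue_send(struct queue *q, int32_t d, int32_t f)
{
    int i = queue_push(&q->q, QUEUE_EXP);
    if (i < 0) {
        return 0;
    }
    q->d[i] = d;
    q->f[i] = f;
    queue_push_commit(&q->q);
    return 1;
}
```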

<p>Worker threads look similar, spinning until an item arrives in the queue:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int32_t</span> <span class="n">d</span><span class="p">;</span>
        <span class="kt">int32_t</span> <span class="n">name</span><span class="p">;</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_recv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">d</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">name</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">d</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="kt">wchar_t</span> <span class="n">path</span><span class="p">[</span><span class="n">MAX_PATH</span><span class="p">];</span>
        <span class="n">buildpath</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
        <span class="n">processfile</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">dirs</span><span class="p">,</span> <span class="n">d</span><span class="p">);</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>A special directory entry handle of -1 tells the worker to exit. When
traversal completes, the main thread becomes a worker until the queue
empties, pushes one termination handle for each worker thread, then joins
the worker threads — a synchronization point that indicates all work is
complete, and the main thread can move on to propagation and sorting.</p>

<p>This was a substantial performance boost. At least on my system, running
just 4 threads total is enough to saturate the Win32 interface, and
additional threads do not make the program faster despite more available
cores.</p>

<p>Aside from the lack of portability, I’m quite happy with the results.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>A lock-free, concurrent, generic queue in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/05/14/"/>
    <id>urn:uuid:b5a6b85a-19af-4f2f-8a32-0098f6e87edb</id>
    <updated>2022-05-14T04:22:24Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=31384602">on Hacker News</a>.</em></p>

<p>While considering concurrent queue design I came up with a generic,
lock-free queue that fits in a 32-bit integer. The queue is “generic” in
that a single implementation supports elements of any arbitrary type,
despite an implementation in C. It’s lock-free in that there is guaranteed
system-wide progress. It can store up to 32,767 elements at a time — more
than enough for message queues, which <a href="/blog/2020/05/24/">must always be bounded</a>. I
will first present a single-consumer, single-producer queue, then expand
support to multiple consumers at a cost. Like <a href="/blog/2022/03/13/">my lightweight barrier</a>,
I’m not presenting this as a packaged solution, but rather as a technique
you can apply when circumstances call.</p>

<!--more-->

<p>How can the queue store so many elements when it’s just 32 bits? It only
handles the indexes of a circular buffer. The <a href="/blog/2018/06/10/">caller is responsible</a>
for allocating and manipulating the queue’s storage, which, in the
single-consumer case, doesn’t require anything fancy. Synchronization is
managed by the queue.</p>

<p>Like a typical circular buffer, it has a head index and a tail index. The
head is the next element to be pushed, and the tail is the next element to
be popped. The queue storage must have a power-of-two length, but the
capacity is one less than the length: if the head and tail are equal then
the queue is empty, which “wastes” one element. So already there are some
notable constraints imposed by this design, but I believe the main use
case for such a queue — a job queue for CPU-bound jobs — has no problem
with these constraints.</p>

<p>Since this is a concurrent queue it’s worth noting “ownership” of storage
elements. The consumer owns elements from the tail up to, but excluding,
the head. The producer owns everything else. Both pushing and popping
involve a “commit” step that transfers ownership of an element to the
other thread. No elements are accessed concurrently, which makes things
easy for either caller.</p>

<h3 id="queue-usage">Queue usage</h3>

<p>Pushing (to the front) and popping (from the back) are each a three-step
process:</p>

<ol>
  <li>Obtain the element index</li>
  <li>Access that element</li>
  <li>Commit the operation</li>
</ol>

<p>I’ll be using C11 atomics for my implementation, but it should be easy to
translate into whatever programming language you’re using. As
I mentioned, the queue fits in a 32-bit integer, and so it’s represented
by an <code class="language-plaintext highlighter-rouge">_Atomic uint32_t</code>. Here’s the entire interface:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>  <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>

<span class="kt">int</span>  <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">queue</span><span class="p">);</span>
</code></pre></div></div>

<p>Both <code class="language-plaintext highlighter-rouge">queue_pop</code> and <code class="language-plaintext highlighter-rouge">queue_push</code> return -1 if the queue is empty or full, respectively.</p>

<p>To create a queue, initialize an atomic 32-bit integer to zero. Also
choose a size exponent and allocate some storage. Here’s a 63-element
queue of jobs:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define EXP 6  // note: 2**6 == 64
</span><span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
<span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="n">q</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>Rather than a length, the queue functions accept a base-2 exponent, which
is why I’ve defined <code class="language-plaintext highlighter-rouge">EXP</code>. If you don’t like this, you can just accept a
length in your own implementation, though remember it’s constrained to
powers of two. The producer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while full</span>
    <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">job_create</span><span class="p">();</span>
    <span class="n">queue_push_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a busy-wait loop, which makes for a simple illustration but isn’t
ideal. In a <a href="/blog/2022/05/22/">real program</a> I’d have the producer run a job while it
waits for a queue slot, or just have it turn into a consumer (if this
wasn’t a single-consumer queue). Similarly, if the queue is empty, then
maybe a consumer turns into the producer. It all depends on the context.</p>

<p>The consumer might look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">queue_pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="n">queue_pop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">);</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In either case it’s important that neither touches the element after
committing since that transfers ownership away.</p>

<h3 id="pop-operation">Pop operation</h3>

<p>The queue is actually a pair of 16-bit integers, head and tail, each
stored in the low and high halves of the 32-bit integer. So the first
thing to do is atomically load the integer, then extract these “fields.”</p>

<p>If for some reason a capacity of 32,767 is insufficient, you can trivially
upgrade your queue to an Enterprise Queue: a 64-bit integer with a
capacity of over 2 billion elements. I’m going to stick with the 32-bit
queue.</p>
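<p>For concreteness, here’s what the pop half of that 64-bit variant might
look like. The <code class="language-plaintext highlighter-rouge">equeue</code> names are mine, not from the article; it’s the same
logic with the shift widened from 16 to 32.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// Sketch of a 64-bit "Enterprise" variant: 32-bit head/tail fields.
static int64_t equeue_pop(_Atomic uint64_t *q, int exp)
{
    uint64_t r = *q;
    uint64_t mask = ((uint64_t)1 << exp) - 1;
    uint64_t head = r       & mask;
    uint64_t tail = r >> 32 & mask;
    return head == tail ? -1 : (int64_t)tail;
}

static void equeue_pop_commit(_Atomic uint64_t *q)
{
    *q += (uint64_t)1 << 32;  // bump the tail field
}
```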

<p>Starting with the pop operation since it’s simpler:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the indexes are equal, the queue is empty. Otherwise return the tail
field. The <code class="language-plaintext highlighter-rouge">*q</code> is an atomic load since it’s qualified <code class="language-plaintext highlighter-rouge">_Atomic</code>. The load
might be more efficient if this were an explicit “acquire” operation,
which is what I used in some of my tests.</p>

<p>To complete the pop, atomically increment the tail index so that the
element falls out of the range of elements owned by the consumer. The tail
is the high half of the integer so add <code class="language-plaintext highlighter-rouge">0x10000</code> rather than just 1.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_pop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mh">0x10000</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s harmless if this overflows since the 16-bit wraparound is congruent
modulo the power-of-two storage length, and an overflow won’t affect the head index. The increment
might be more efficient if this were an explicit “release” operation,
which, again, is what I used in some of my tests.</p>
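<p>The overflow claim is easy to check directly. Here are the two functions
again, exactly as defined above, reproduced only so the wraparound can be
exercised in isolation:</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// queue_pop and queue_pop_commit as defined in the article.
static int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q;
    int mask = (1u << exp) - 1;
    int head = r       & mask;
    int tail = r >> 16 & mask;
    return head == tail ? -1 : tail;
}

static void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000;
}
```

<p>With the raw tail counter at 0xFFFF and the head at 0, the queue holds one
element at index 63 (for an exponent of 6). Committing the pop overflows the
32-bit integer, wrapping the tail counter to 0 without disturbing the head
field, after which the queue correctly reads as empty.</p>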

<h3 id="push-operation">Push operation</h3>

<p>Pushing is a little more complex. As is typical with circular buffers,
before doing anything it must check that the queue isn’t full: advancing the
head onto the tail would make a full queue indistinguishable from an empty
one.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>  <span class="c1">// consider "acquire"</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="p">(</span><span class="n">head</span> <span class="o">+</span> <span class="mi">1u</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x8000</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// avoid overflow on commit</span>
        <span class="o">*</span><span class="n">q</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="mh">0x8000</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">next</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">head</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that incrementing the head field won’t overflow into the
tail field, so it atomically clears the high bit if set, giving the
increment headroom into which it can overflow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">queue_push_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">q</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>  <span class="c1">// consider "release"</span>
<span class="p">}</span>
</code></pre></div></div>
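<p>That bit trick can be spot-checked: clearing bit 15 subtracts 0x8000,
which is a multiple of any power-of-two storage length with an exponent of
at most 15, so the masked head index is unchanged. A minimal check, with a
helper name of my own choosing:</p>

```c
#include <assert.h>
#include <stdint.h>

// The head index as queue_push computes it from the raw head field.
static uint32_t masked(uint32_t head, int exp)
{
    return head & ((1u << exp) - 1);
}
```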

<h3 id="multiple-consumers">Multiple-consumers</h3>

<p>The single producer and single consumer required neither locks nor atomic
accesses to the storage array since the queue guaranteed that accesses at
the specified index were not concurrent. However, this is not the case
with multiple-consumers. Consumers race when popping. The loser’s access
might occur after the winner’s commit, making its access concurrent with
the producer. Both producer and consumers must account for this.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">job</span> <span class="n">slots</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="n">EXP</span><span class="p">];</span>
</code></pre></div></div>

<p>To prepare for multiple consumers, the array now has an atomic qualifier:
one of the costs of multiple consumers. Fortunately these new atomic
accesses can use a “relaxed” ordering since there are no required ordering
constraints. Even if it wasn’t atomic, and <a href="https://lwn.net/Articles/793253/">the load was torn</a>, we’d
detect it when attempting to commit. It’s simply against the rules to have
a data race, and I don’t know how else to avoid it other than dropping
into assembly.</p>

<p>The next cost is that committing can fail. Another consumer might have won
the race, which means you must start over. Here’s my multiple-consumer
interface, which I’ve uncreatively called <code class="language-plaintext highlighter-rouge">mpop</code> (“multiple-consumer
pop”). Besides a <code class="language-plaintext highlighter-rouge">_Bool</code> for indicating failure, the main change is a new
<code class="language-plaintext highlighter-rouge">save</code> parameter:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>   <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">);</span>
<span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">);</span>
</code></pre></div></div>

<p>The caller must carry some temporary state (<code class="language-plaintext highlighter-rouge">save</code>), which is how failures
are detected, ultimately communicated by that <code class="language-plaintext highlighter-rouge">_Bool</code> return.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">job</span> <span class="n">job</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">do</span> <span class="p">{</span>
            <span class="n">i</span> <span class="o">=</span> <span class="n">queue_mpop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">EXP</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// note: busy-wait while empty</span>
        <span class="n">job</span> <span class="o">=</span> <span class="n">slots</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">queue_mpop_commit</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="p">,</span> <span class="n">save</span><span class="p">));</span>
    <span class="n">job_run</span><span class="p">(</span><span class="n">job</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s important that the consumer doesn’t attempt to use <code class="language-plaintext highlighter-rouge">job</code> until a
successful commit, since it might not be valid. As noted, that load could
be relaxed (what a mouthful):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">job</span> <span class="o">=</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">slots</span><span class="o">+</span><span class="n">i</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
</code></pre></div></div>

<p>Here’s the pop implementation:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="o">*</span><span class="n">q</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">r</span>     <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="n">r</span><span class="o">&gt;&gt;</span><span class="mi">16</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far it’s exactly the same, except it stores a full snapshot of the
queue state in <code class="language-plaintext highlighter-rouge">*save</code>. This is needed for a compare-and-swap (CAS) in the
commit, which checks that the queue hasn’t been modified concurrently
(i.e. by another consumer):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mh">0x10000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As always with CAS, we must be wary of <a href="/blog/2014/09/02/">the ABA problem</a>. Imagine
that between starting to pop and this CAS that the producer and another
consumer looped over the entire queue and ended up back at exactly the
same spot as where we started. The queue would look like we expect, and
the commit would “succeed” despite reading a garbage value.</p>
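<p>To see the wraparound concretely (a sketch of my own, separate from the article’s code): each successful pop commit adds <code class="language-plaintext highlighter-rouge">0x10000</code> to the packed state, so after exactly 2<sup>16</sup> commits the 32-bit word is bit-for-bit the saved snapshot again, and a stale CAS would succeed. (A real ABA run would also need interleaved pushes to keep the queue non-empty; those are omitted here.)</p>

```c
#include <stdint.h>

// Illustration only: replay n pop commits against a packed 32-bit
// queue state. Each commit adds 0x10000, exactly what the CAS in
// queue_mpop_commit installs, so 1<<16 commits wrap back around.
static uint32_t after_commits(uint32_t state, long n)
{
    while (n--) {
        state += 0x10000;  // one committed pop
    }
    return state;          // == original state when n was 1<<16
}
```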

<p>Fortunately the CAS compares against the entire 32-bit state, so a small
queue capacity does not raise the risk. The tail counter is always 16 bits,
and the head counter is 15 bits (the 16th bit is kept clear to catch
overflow). The chance of both counters landing on exactly the same values is
low. If those odds still aren’t low enough, as mentioned you can always
upgrade to the 64-bit Enterprise Queue with larger counters.</p>

<p>There’s a notable performance defect in this particular design. If the
producer concurrently pushes a new value, the commit fails even though there
was no real race: only the head field changed. It would be better if the head
field were isolated from the tail field…</p>

<h3 id="a-less-cheeky-design">A less cheeky design</h3>

<p>You might have noticed that there’s little reason to pack two 16-bit
counters into a 32-bit integer. These could just be fields in a structure:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">queue</span> <span class="p">{</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">head</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">uint16_t</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>While this entire structure can be atomically loaded just like the 32-bit
integer, C11 (and later) do not permit non-atomic accesses to these atomic
fields in an unshared copy loaded from an atomic. So I’d either use
compiler-specific built-ins for atomics — much more flexible, and what I
prefer anyway — or just load them individually:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">queue_mpop</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">int</span> <span class="n">exp</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="o">*</span><span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1u</span> <span class="o">&lt;&lt;</span> <span class="n">exp</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">head</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">tail</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">save</span> <span class="o">=</span> <span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">mask</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">head</span> <span class="o">==</span> <span class="n">tail</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="n">tail</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Technically with two loads this could extract a <code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">tail</code> pair that
were never contemporaneous. The worst case is the queue appears empty even
if it was never actually empty.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">_Bool</span> <span class="nf">queue_mpop_commit</span><span class="p">(</span><span class="k">struct</span> <span class="n">queue</span> <span class="o">*</span><span class="n">q</span><span class="p">,</span> <span class="kt">uint16_t</span> <span class="n">save</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_compare_exchange_strong</span><span class="p">(</span><span class="o">&amp;</span><span class="n">q</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">save</span><span class="p">,</span> <span class="n">save</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the head index isn’t part of the CAS, the producer can’t interfere
with the commit. (Though there’s still certainly false sharing happening.)</p>

<h3 id="real-implementation-and-tests">Real implementation and tests</h3>

<p>If you want to try it out, especially with my tests: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.c"><strong>queue.c</strong></a>.
It has both single-consumer and multiple-consumer queues, and supports at
least:</p>

<ul>
  <li>atomics: C11, GNU, MSC</li>
  <li>threads: pthreads, win32</li>
  <li>compilers: GCC, Clang, MSC</li>
  <li>hosts: Linux, Windows, BSD</li>
</ul>

<p>I wanted to test across a variety of implementations, especially
under Thread Sanitizer (TSan). On a similar note, I also implemented a
concurrent queue shared between C and Go: <a href="https://github.com/skeeto/scratch/blob/master/misc/queue.go"><strong>queue.go</strong></a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Luhn algorithm using SWAR and SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/04/30/"/>
    <id>urn:uuid:2bb8fbd6-4197-4799-8258-861d316a7086</id>
    <updated>2022-04-30T17:53:05Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>Ever been so successful that credit card processing was your bottleneck?
Perhaps you’ve wondered, “If only I could compute check digits three times
faster using the same hardware!” Me neither. But if that ever happens
someday, then this article is for you. I will show how to compute the
<a href="https://en.wikipedia.org/wiki/Luhn_algorithm">Luhn algorithm</a> in parallel using <em>SIMD within a register</em>, or
SWAR.</p>

<p>If you want to skip ahead, here’s the full source, tests, and benchmark:
<a href="https://github.com/skeeto/scratch/blob/master/misc/luhn.c"><code class="language-plaintext highlighter-rouge">luhn.c</code></a></p>

<p>The Luhn algorithm isn’t just for credit card numbers, but they do make a
nice target for a SWAR approach. The major payment processors use <a href="https://www.paypalobjects.com/en_GB/vhelp/paypalmanager_help/credit_card_numbers.htm">16
digit numbers</a> — i.e. 16 ASCII bytes — and typical machines today have
8-byte registers, so the input fits into two machine registers. In this
context, the algorithm works like so:</p>

<ol>
  <li>
    <p>Consider the number as an array of digits, and double every other digit
starting with the first. For example, 6543 becomes 12, 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum individual digits in each element. The example becomes 3 (i.e.
1+2), 5, 8, 3.</p>
  </li>
  <li>
    <p>Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.</p>
  </li>
</ol>
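<p>As a reference point, the straightforward digit-by-digit version (a minimal sketch of the baseline the SWAR code is later compared against) looks like this:</p>

```c
#include <stdint.h>

// Scalar reference: one digit at a time, following the three steps
// above. Assumes s points to exactly 16 ASCII digits.
static int luhn_scalar(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = s[i] - '0';
        if (i%2 == 0) {          // double every other digit, first included
            d *= 2;
            d = d/10 + d%10;     // sum the digits of the doubled value
        }
        sum += d;
    }
    return sum % 10;             // zero for valid numbers
}
```

The classic test number 4111111111111111 sums to zero, for example.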

<p>I will implement this algorithm in C with this prototype:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>It assumes the input is 16 bytes and only contains digits, and it will
return the Luhn sum. Callers either validate a number by comparing the
result to zero, or use it to compute a check digit when generating a
number. (Read: You could use SWAR to rapidly generate valid numbers.)</p>

<p>The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling <code class="language-plaintext highlighter-rouge">hi</code> and <code class="language-plaintext highlighter-rouge">lo</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">hi</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">0</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">1</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">2</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">3</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">4</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">5</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">6</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">7</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">lo</span> <span class="o">=</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">8</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span> <span class="mi">9</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">11</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">24</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">12</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">13</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">40</span> <span class="o">|</span>
    <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">14</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">48</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">56</span><span class="p">;</span>
</code></pre></div></div>

<p>This looks complicated and possibly expensive, but it’s really just an
idiom for loading a little endian 64-bit integer from a buffer. Breaking
it down:</p>

<ul>
  <li>
    <p>The input, <code class="language-plaintext highlighter-rouge">*s</code>, is <code class="language-plaintext highlighter-rouge">char</code>, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though <code class="language-plaintext highlighter-rouge">*s</code> was <code class="language-plaintext highlighter-rouge">unsigned char</code>.</p>
  </li>
  <li>
    <p>The shifts assemble the 64-bit result in little endian byte order
<a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">regardless of the host machine byte order</a>. In other words, this
will produce correct results even on big endian hosts.</p>
  </li>
  <li>
    <p>I chose little endian since it’s the natural byte order for all the
architectures I care about. Big endian hosts may pay a cost on this load
(byte swap instruction, etc.). The rest of the function could just as
easily be computed over a big endian load if I was primarily targeting a
big endian machine instead.</p>
  </li>
  <li>
    <p>I could have used <code class="language-plaintext highlighter-rouge">unsigned long long</code> (i.e. <em>at least</em> 64 bits) since
no part of this function requires <em>exactly</em> 64 bits. I chose <code class="language-plaintext highlighter-rouge">uint64_t</code>
since it’s succinct, and in practice, every implementation supporting
<code class="language-plaintext highlighter-rouge">long long</code> also defines <code class="language-plaintext highlighter-rouge">uint64_t</code>.</p>
  </li>
</ul>

<p>Both GCC and Clang figure this all out and produce perfect code. On
x86-64, just one instruction for each statement:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">0</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">rdx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Or, more impressively, loading both using a <em>single instruction</em> on ARM64:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">ldp</span>  <span class="nv">x0</span><span class="p">,</span> <span class="nv">x1</span><span class="p">,</span> <span class="p">[</span><span class="nv">x0</span><span class="p">]</span>
</code></pre></div></div>

<p>The next step is to decode ASCII into numeric values. This is <a href="https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/">trivial and
common</a> in SWAR, and only requires subtracting <code class="language-plaintext highlighter-rouge">'0'</code> (<code class="language-plaintext highlighter-rouge">0x30</code>). So long
as there is no overflow, this can be done lane-wise.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">-=</span> <span class="mh">0x3030303030303030</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–9. Next, double every
other digit. Multiplication in SWAR is not easy, but doubling just means
adding the odd lanes to themselves. I can mask out the lanes that are not
doubled. Regarding the mask, recall that the least significant byte is the
first byte (little endian).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="n">lo</span> <span class="o">&amp;</span> <span class="mh">0x00ff00ff00ff00ff</span><span class="p">;</span>
</code></pre></div></div>

<p>Each byte of the register now contains values in 0–18. Now for the tricky
problem of folding the tens place into the ones place. Unlike 8 or 16, 10
is not a particularly convenient base for computers, especially since SWAR
lacks lane-wide division or modulo. Perhaps a lane-wise <a href="https://en.wikipedia.org/wiki/Binary-coded_decimal">binary-coded
decimal</a> could solve this. However, I have a better trick up my
sleeve.</p>

<p>Consider that the tens place is either 0 or 1. In other words, we really
only care if the value in the lane is greater than 9. If I add 6 to each
lane, the 5th bit (value 16) will definitely be set in any lanes that were
previously at least 10. I can use that bit as the tens place.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="p">(</span><span class="n">hi</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
<span class="n">lo</span> <span class="o">+=</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mh">0x0006000600060006</span><span class="p">)</span><span class="o">&gt;&gt;</span><span class="mi">4</span> <span class="o">&amp;</span> <span class="mh">0x0001000100010001</span><span class="p">;</span>
</code></pre></div></div>

<p>This code adds 6 to the doubled lanes, shifts the 5th bit to the least
significant position in the lane, masks for just that bit, and adds it
lane-wise to the total. Only applying this to doubled lanes is a style
decision, and I could have applied it to all lanes for free.</p>

<p>The astute might notice I’ve strayed from the stated algorithm. A lane
that was holding, say, 12 now holds 13 rather than 3. Since the final
result of the algorithm is modulo 10, leaving the tens place alone is
harmless, so this is fine.</p>

<p>At this point each lane contains values in 0–19. Now that the tens
processing is done, I can combine the halves into one register with a
lane-wise sum:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">lo</span><span class="p">;</span>
</code></pre></div></div>

<p>Each lane contains values in 0–38. I would have preferred to do this
sooner, but that would have complicated tens place handling. Even if I had
rotated the doubled lanes in one register to even out the sums, some lanes
may still have had a 2 in the tens place.</p>

<p>The final step is a horizontal sum reduction using the typical SWAR
approach. Add the top half of the register to the bottom half, then the
top half of what’s left to the bottom half, etc.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">+=</span> <span class="n">hi</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
</code></pre></div></div>

<p>Before the sum I said each lane was 0–38, so couldn’t this sum be as high
as 304 (8x38)? It would overflow the lane, giving an incorrect result.
Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled
lanes. That’s a maximum of 224, which fits in the result lane without
overflow. Whew! I’ve been tracking the range all along to guard against
overflow like this.</p>
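<p>Spelling out that arithmetic: four lanes came from undoubled digits (at most 9+9=18 each after combining) and four from doubled digits (at most 19+19=38 each):</p>

```c
// Worked check of the overflow bound above: the horizontal reduction
// accumulates four normal lanes and four doubled lanes into one byte.
enum { LUHN_MAX_SUM = 4*18 + 4*38 };  // 224, fits in a byte lane
```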

<p>Finally mask the result lane and return it modulo 10:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">hi</span><span class="o">&amp;</span><span class="mi">255</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<p>On my machine, SWAR is around 3x faster than a straightforward
digit-by-digit implementation.</p>
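<p>For convenience, here are the pieces above assembled into a single function (the article’s own construction; the linked <code class="language-plaintext highlighter-rouge">luhn.c</code> has the tested version):</p>

```c
#include <stdint.h>

// The SWAR steps assembled: little endian loads, ASCII decode, doubling,
// tens-place fold, combining the halves, and the horizontal sum.
static int luhn_swar(const char *s)
{
    uint64_t hi =
        (uint64_t)(s[ 0]&255) <<  0 | (uint64_t)(s[ 1]&255) <<  8 |
        (uint64_t)(s[ 2]&255) << 16 | (uint64_t)(s[ 3]&255) << 24 |
        (uint64_t)(s[ 4]&255) << 32 | (uint64_t)(s[ 5]&255) << 40 |
        (uint64_t)(s[ 6]&255) << 48 | (uint64_t)(s[ 7]&255) << 56;
    uint64_t lo =
        (uint64_t)(s[ 8]&255) <<  0 | (uint64_t)(s[ 9]&255) <<  8 |
        (uint64_t)(s[10]&255) << 16 | (uint64_t)(s[11]&255) << 24 |
        (uint64_t)(s[12]&255) << 32 | (uint64_t)(s[13]&255) << 40 |
        (uint64_t)(s[14]&255) << 48 | (uint64_t)(s[15]&255) << 56;
    hi -= 0x3030303030303030;           // decode ASCII
    lo -= 0x3030303030303030;
    hi += hi & 0x00ff00ff00ff00ff;      // double every other digit
    lo += lo & 0x00ff00ff00ff00ff;
    hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;  // tens place
    lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;
    hi += lo;                           // combine halves
    hi += hi >> 32;                     // horizontal sum reduction
    hi += hi >> 16;
    hi += hi >>  8;
    return (hi&255) % 10;
}
```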

<h3 id="usage-examples">Usage examples</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">is_valid</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">random_credit_card</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="s">"%015llu0"</span><span class="p">,</span> <span class="n">rand64</span><span class="p">()</span><span class="o">%</span><span class="mi">1000000000000000</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">15</span><span class="p">]</span> <span class="o">=</span> <span class="sc">'0'</span> <span class="o">+</span> <span class="p">(</span><span class="mi">10</span> <span class="o">-</span> <span class="n">luhn</span><span class="p">(</span><span class="n">s</span><span class="p">))</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="simd">SIMD</h3>

<p>Conveniently, all the SWAR operations translate directly into SSE2
instructions. If you understand the SWAR version, then this is easy to
follow:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>

    <span class="c1">// decode ASCII</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mh">0x30</span><span class="p">));</span>

    <span class="c1">// double every other digit</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x00ff</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">m</span><span class="p">));</span>

    <span class="c1">// extract and add tens digit</span>
    <span class="n">__m128i</span> <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="mh">0x0006</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_srai_epi32</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">t</span><span class="p">);</span>

    <span class="c1">// horizontal sum</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_set1_epi32</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On my machine, the SIMD version is around another 3x increase over SWAR,
and so nearly an order of magnitude faster than a digit-by-digit
implementation.</p>

<p><em>Update</em>: Const-me on Hacker News <a href="https://news.ycombinator.com/item?id=31320853">suggests a better option</a> for
handling the tens digit in the function above, shaving off 7% of the
function’s run time on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// if (digit &gt; 9) digit -= 9</span>
    <span class="n">__m128i</span> <span class="n">nine</span> <span class="o">=</span> <span class="n">_mm_set1_epi8</span><span class="p">(</span><span class="mi">9</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">gt</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">nine</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_mm_sub_epi8</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">gt</span><span class="p">,</span> <span class="n">nine</span><span class="p">));</span>
</code></pre></div></div>
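<p>The scalar identity the vector correction exploits, shown as a small
sketch (the helper name is illustrative): for a doubled Luhn digit,
subtracting 9 whenever the product exceeds 9 gives the same result as
summing its decimal digits.</p>

```c
// In the Luhn algorithm, every other digit is doubled and the digits
// of the product are summed. For a product 2*d with d in 0..9, that
// digit sum equals "subtract 9 when greater than 9", which is exactly
// what the branchless vector correction above computes.
static int luhn_fold(int doubled)
{
    return doubled > 9 ? doubled - 9 : doubled;
}
```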

<p><em>Update</em>: u/aqrit on reddit has come up with a more optimized SSE2
solution, 12% faster than mine on my machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">luhn</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">__m128i</span> <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_loadu_si128</span><span class="p">((</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">s</span><span class="p">);</span>
    <span class="n">__m128i</span> <span class="n">m</span> <span class="o">=</span> <span class="n">_mm_cmpgt_epi8</span><span class="p">(</span><span class="n">_mm_set1_epi16</span><span class="p">(</span><span class="sc">'5'</span><span class="p">),</span> <span class="n">v</span><span class="p">);</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_slli_epi16</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span>  <span class="c1">// subtract 1 if less than 5</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_sad_epu8</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_setzero_si128</span><span class="p">());</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">_mm_add_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">_mm_shuffle_epi32</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">_mm_cvtsi128_si32</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">)</span> <span class="o">%</span> <span class="mi">10</span><span class="p">;</span>
    <span class="c1">// (('0' * 24) - 8) % 10 == 4</span>
<span class="p">}</span>
</code></pre></div></div>

]]>
    </content>
  </entry>
  <entry>
    <title>A flexible, lightweight, spin-lock barrier</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/03/13/"/>
    <id>urn:uuid:5a72d27a-60f4-4b52-a4c2-f1c3b72e6c85</id>
    <updated>2022-03-13T23:55:08Z</updated>
    <category term="c"/><category term="cpp"/><category term="go"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=30671979">on Hacker News</a>.</em></p>

<p>The other day I wanted to try the famous <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">memory reordering experiment</a>
for myself. It’s the double-slit experiment of concurrency, where a
program can observe an <a href="https://research.swtch.com/hwmm">“impossible” result</a> on common hardware, as
though a thread had time-traveled. While getting thread timing as tight as
possible, I designed a possibly-novel thread barrier. It’s purely
spin-locked, the entire footprint is a zero-initialized integer, it
automatically resets, it can be used across processes, and the entire
implementation is just three to four lines of code.</p>

<!--more-->

<p>Here’s the entire barrier implementation for two threads in C11.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for two threads. Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="mi">2</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or in Go:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">BarrierWait</span><span class="p">(</span><span class="n">barrier</span> <span class="o">*</span><span class="kt">uint32</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">v</span> <span class="o">:=</span> <span class="n">atomic</span><span class="o">.</span><span class="n">AddUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">v</span><span class="o">&amp;</span><span class="m">1</span> <span class="o">==</span> <span class="m">1</span> <span class="p">{</span>
        <span class="n">v</span> <span class="o">&amp;=</span> <span class="m">2</span>
        <span class="k">for</span> <span class="n">atomic</span><span class="o">.</span><span class="n">LoadUint32</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="m">2</span> <span class="o">==</span> <span class="n">v</span> <span class="p">{</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Even more, these two implementations are compatible with each other. C
threads and Go goroutines can synchronize on a common barrier using these
functions. Also note how it only uses two bits.</p>

<p>When I was done with my experiment, I did a quick search online for other
spin-lock barriers to see if anyone came up with the same idea. I found a
couple of <a href="https://web.archive.org/web/20151109230817/https://stackoverflow.com/questions/33598686/spinning-thread-barrier-using-atomic-builtins">subtly-incorrect</a> spin-lock barriers, and some
straightforward barrier constructions using a mutex spin-lock.</p>

<p>Before diving into how this works, and how to generalize it, let’s discuss
the circumstance that led to its design.</p>

<h3 id="experiment">Experiment</h3>

<p>Here’s the setup for the memory reordering experiment, where <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>
are initialized to zero.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>thread#1    thread#2
w0 = 1      w1 = 1
r1 = w1     r0 = w0
</code></pre></div></div>

<p>Considering all the possible orderings, it would seem that at least one of
<code class="language-plaintext highlighter-rouge">r0</code> or <code class="language-plaintext highlighter-rouge">r1</code> is 1. There seems to be no ordering where <code class="language-plaintext highlighter-rouge">r0</code> and <code class="language-plaintext highlighter-rouge">r1</code> could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.</p>

<p>How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use <code class="language-plaintext highlighter-rouge">volatile</code> for <code class="language-plaintext highlighter-rouge">w0</code> and <code class="language-plaintext highlighter-rouge">w1</code>. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
<code class="language-plaintext highlighter-rouge">volatile</code>.</p>

<p>So my first idea was to use a bit of inline assembly for all accesses that
would otherwise be data races. x86-64:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"movl  $1, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"movl  %2, %0</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"=r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>ARM64 (to try on my Raspberry Pi):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kr">__asm</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"str  %w0, %1</span><span class="se">\n</span><span class="s">"</span>
        <span class="s">"ldr  %w0, %2</span><span class="se">\n</span><span class="s">"</span>
        <span class="o">:</span> <span class="s">"+r"</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="s">"=m"</span><span class="p">(</span><span class="o">*</span><span class="n">w0</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"m"</span><span class="p">(</span><span class="o">*</span><span class="n">w1</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">r1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
<code class="language-plaintext highlighter-rouge">static</code>.</p>

<p>Alternatively, I could use C11 atomics with a relaxed memory order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">experiment</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w0</span><span class="p">,</span> <span class="k">_Atomic</span> <span class="kt">int</span> <span class="o">*</span><span class="n">w1</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">atomic_store_explicit</span><span class="p">(</span><span class="n">w0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">atomic_load_explicit</span><span class="p">(</span><span class="n">w1</span><span class="p">,</span> <span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since this is a <em>race</em> and I want both threads to run their two experiment
instructions as simultaneously as possible, it would be wise to use some
sort of <em>starting barrier</em>… exactly the purpose of a thread barrier! It
will hold the threads back until they’re both ready.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">w0</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">r0</span><span class="p">,</span> <span class="n">r1</span><span class="p">;</span>

<span class="c1">// thread#1                   // thread#2</span>
<span class="n">w0</span> <span class="o">=</span> <span class="n">w1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w0</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w1</span><span class="p">);</span>    <span class="n">r0</span> <span class="o">=</span> <span class="n">experiment</span><span class="p">(</span><span class="o">&amp;</span><span class="n">w1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">w0</span><span class="p">);</span>
<span class="n">BARRIER</span><span class="p">;</span>                      <span class="n">BARRIER</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">r0</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">r1</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"impossible!"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The second thread goes straight into the barrier, but the first thread
does a little more work to initialize the experiment and a little more at
the end to check the result. The second barrier ensures they’re both done
before checking.</p>

<p>Running this only once isn’t so useful, so each thread loops a few million
times, hence the re-initialization in thread#1. The barriers keep them
in lockstep.</p>

<h3 id="barrier-selection">Barrier selection</h3>

<p>On my first attempt, I made the obvious decision for the barrier: I used
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_barrier_wait.html"><code class="language-plaintext highlighter-rouge">pthread_barrier_t</code></a>. I was already using pthreads for spawning the
extra thread, including <a href="/blog/2020/05/15/">on Windows</a>, so this was convenient.</p>

<p>However, my initial results were disappointing. I only observed an
“impossible” result around one in a million trials. With some debugging I
determined that the pthreads barrier was just too damn slow, throwing off
the timing. This was especially true with winpthreads, bundled with
Mingw-w64, which in addition to the per-barrier mutex, grabs a <em>global</em>
lock <em>twice</em> per wait to manage the barrier’s reference counter.</p>

<p>All pthreads implementations I used were quick to yield to the system
scheduler. The first thread to arrive at the barrier would go to sleep,
the second thread would wake it up, and it was rare they’d actually race
on the experiment. This is perfectly reasonable for a pthreads barrier
designed for the general case, but I really needed a <em>spin-lock barrier</em>.
That is, the first thread to arrive spins in a loop until the second
thread arrives, and it never interacts with the scheduler. This happens so
frequently and quickly that it should only spin for a few iterations.</p>

<h3 id="barrier-design">Barrier design</h3>

<p>Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to <code class="language-plaintext highlighter-rouge">w0</code>, <code class="language-plaintext highlighter-rouge">w1</code>) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.</p>

<p>I know statically that there are only two threads, simplifying the
implementation. The plan: When threads arrive, they atomically increment a
shared variable to indicate such. The first to arrive will see an odd
number, telling it to atomically read the variable in a loop until the
other thread changes it to an even number.</p>

<p>At first, with just two threads, it might seem like a single bit would
suffice. If the bit is set, the other thread hasn’t arrived. If clear,
both threads have arrived.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait1</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Or to avoid an extra load, use the result directly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">broken_wait2</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">++*</span><span class="n">barrier</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="mi">1</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Neither of these work correctly, and the other mutex-free barriers I found
all have the same defect. Consider the broader picture: Between atomic
loads in the first thread spin-lock loop, suppose the second thread
arrives, passes through the barrier, does its work, hits the next barrier,
and increments the counter. Both threads see an odd counter simultaneously
and deadlock. No good.</p>

<p>To fix this, the wait function must also track the <em>phase</em>. The first
barrier is the first phase, the second barrier is the second phase, etc.
Conveniently <strong>the rest of the integer acts like a phase counter</strong>!
Writing this out more explicitly:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">observed</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">thread_count</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">thread_count</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// not last arrival, watch for phase change</span>
        <span class="kt">unsigned</span> <span class="n">init_phase</span> <span class="o">=</span> <span class="n">observed</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
            <span class="kt">unsigned</span> <span class="n">current_phase</span> <span class="o">=</span> <span class="o">*</span><span class="n">barrier</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">current_phase</span> <span class="o">!=</span> <span class="n">init_phase</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">break</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key: When the last thread arrives, it overflows the thread counter to
zero and increments the phase counter in one operation.</p>
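<p>Concretely, here is a short standalone trace of the two-thread barrier
word (a sketch; bit 0 is the arrival count and the remaining bits are the
phase counter):</p>

```c
#include <assert.h>

// Trace of the two-thread barrier word. Bit 0 counts arrivals and the
// higher bits count phases, so the second arrival's increment carries
// out of bit 0, clearing the count and bumping the phase in one step.
static unsigned trace(void)
{
    unsigned barrier = 0;
    barrier++;   // first arrival:  0b001, phase 0, spins
    assert((barrier & 1) == 1 && (barrier >> 1) == 0);
    barrier++;   // second arrival: 0b010, phase 1, both released
    assert((barrier & 1) == 0 && (barrier >> 1) == 1);
    barrier++;   // next barrier:   0b011, still phase 1, spins
    barrier++;   // last arrival:   0b100, phase 2, both released
    assert((barrier & 1) == 0 && (barrier >> 1) == 2);
    return barrier;
}
```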

<p>By the way, I’m using <code class="language-plaintext highlighter-rouge">unsigned</code> since it may eventually overflow, and
even <code class="language-plaintext highlighter-rouge">_Atomic int</code> overflow is undefined for the <code class="language-plaintext highlighter-rouge">++</code> operator. However,
if you use <code class="language-plaintext highlighter-rouge">atomic_fetch_add</code> or C++ <code class="language-plaintext highlighter-rouge">std::atomic</code> then overflow is
defined and you can use <code class="language-plaintext highlighter-rouge">int</code>.</p>
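<p>A sketch of that alternative for the two-thread case (the function name
is illustrative): <code>atomic_fetch_add</code> returns the value before the
increment, so the first arrival sees an even value and spins.</p>

```c
#include <stdatomic.h>

// Two-thread barrier via atomic_fetch_add, whose overflow is defined
// even for signed types, so a plain int counter is fine. fetch_add
// returns the value before the increment, so the first arrival sees
// an even value and spins until the phase bit (bit 1) flips.
void barrier_wait_fa(_Atomic int *barrier)
{
    int v = atomic_fetch_add(barrier, 1);
    if (!(v & 1)) {
        for (v &= 2; (atomic_load(barrier) & 2) == v;) {}
    }
}
```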

<p>Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (<code class="language-plaintext highlighter-rouge">&gt;&gt;</code>), I
mask (<code class="language-plaintext highlighter-rouge">&amp;</code>) the phase bit with 2.</p>

<p>With this spin-lock barrier, the experiment observes <code class="language-plaintext highlighter-rouge">r0 = r1 = 0</code> in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.</p>

<h3 id="generalizing-to-more-threads">Generalizing to more threads</h3>

<p>Two threads required two bits. This generalizes to <code class="language-plaintext highlighter-rouge">log2(n)+1</code> bits for
<code class="language-plaintext highlighter-rouge">n</code> threads, where <code class="language-plaintext highlighter-rouge">n</code> is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="kt">void</span> <span class="nf">barrier_waitn</span><span class="p">(</span><span class="k">_Atomic</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="n">v</span> <span class="o">=</span> <span class="o">++*</span><span class="n">barrier</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">barrier</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: <strong>It never makes sense for <code class="language-plaintext highlighter-rouge">n</code> to exceed the logical core count!</strong>
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.</p>

<p>If the barrier is used infrequently enough that the barrier integer never
overflows (maybe just use a <code class="language-plaintext highlighter-rouge">uint64_t</code>), an implementation could
support arbitrary thread counts on the same principle, using the modulo
operation instead of the <code class="language-plaintext highlighter-rouge">&amp;</code> operator. The divisor is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.</p>
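<p>A minimal sketch of that arbitrary-count variant (illustrative names; it
assumes the counter never wraps, hence the wide type):</p>

```c
#include <stdint.h>

// Spin-lock barrier for an arbitrary thread count n. The quotient
// counter/n is the phase number and the remainder is the arrival
// count, so the last arrival's increment advances the phase. Assumes
// the 64-bit counter never wraps over the program's lifetime.
void barrier_waitn_any(_Atomic uint64_t *barrier, uint64_t n)
{
    uint64_t v = ++*barrier;
    if (v % n) {                        // not the last arrival
        uint64_t phase = v / n;         // phase being waited out
        while (*barrier / n == phase) {}
    }
}
```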

<p>While C11 <code class="language-plaintext highlighter-rouge">_Atomic</code> seems like it would be useful, unsurprisingly it is
not supported by one major, <a href="/blog/2021/12/30/">stubborn</a> implementation. If you’re
using C++11 or later, then go ahead and use <code class="language-plaintext highlighter-rouge">std::atomic&lt;int&gt;</code> since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#if __GNUC__
#  define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
#  define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
#  define BARRIER_INC(x) _InterlockedIncrement(x)
#  define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif
</span>
<span class="c1">// Spin-lock barrier for n threads, where n is a power of two.</span>
<span class="c1">// Initialize *barrier to zero.</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">barrier_wait</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">barrier</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">BARRIER_INC</span><span class="p">(</span><span class="n">barrier</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">v</span> <span class="o">&amp;=</span> <span class="n">n</span><span class="p">;</span> <span class="p">(</span><span class="n">BARRIER_GET</span><span class="p">(</span><span class="n">barrier</span><span class="p">)</span><span class="o">&amp;</span><span class="n">n</span><span class="p">)</span> <span class="o">==</span> <span class="n">v</span><span class="p">;);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This has the nice bonus that the interface does not have the <code class="language-plaintext highlighter-rouge">_Atomic</code>
qualifier, nor <code class="language-plaintext highlighter-rouge">std::atomic</code> template. It’s just a plain old <code class="language-plaintext highlighter-rouge">int</code>, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.</p>

<p>If you’d like to try the experiment yourself: <a href="https://gist.github.com/skeeto/c63b9ddf2c599eeca86356325b93f3a7"><code class="language-plaintext highlighter-rouge">reorder.c</code></a>. If
you’d like to see a test of Go and C sharing a thread barrier:
<a href="https://gist.github.com/skeeto/bdb5a0d2aa36b68b6f66ca39989e1444"><code class="language-plaintext highlighter-rouge">coop.go</code></a>.</p>

<p>I’m intentionally not providing the spin-lock barrier as a library. First,
it’s too trivial and small for that, and second, I believe <a href="https://vimeo.com/644068002">context is
everything</a>. Now that you understand the principle, you can whip up
your own, custom-tailored implementation when the situation calls for it,
just as the one in my experiment is hard-coded for exactly two threads.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Fast CSV processing with SIMD</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/12/04/"/>
    <id>urn:uuid:ba6e0ccf-1e11-4c5d-bc53-dd11fbc6da6c</id>
    <updated>2021-12-04T01:13:33Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=29439403">on Hacker News</a>.</em></p>

<p>I recently learned of <a href="https://github.com/dbro/csvquote">csvquote</a>, a tool that encodes troublesome
<a href="https://datatracker.ietf.org/doc/html/rfc4180">CSV</a> characters such that unix tools can correctly process them. It
reverses the encoding at the end of the pipeline, recovering the original
input. The original implementation handles CSV quotes using the
straightforward, naive method. However, there’s a better approach that is
not only simpler, but around 3x faster on modern hardware. Even more,
there’s yet another approach using SIMD intrinsics, plus some bit
twiddling tricks, which increases the processing speed by an order of
magnitude. <a href="https://github.com/skeeto/scratch/tree/master/csvquote"><strong>My csvquote implementation</strong></a> includes both
approaches.</p>

<!--more-->

<h3 id="background">Background</h3>

<p>Records in CSV data are separated by line feeds, and fields are separated
by commas. Fields may be quoted.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaa,bbb,ccc
xxx,"yyy",zzz
</code></pre></div></div>

<p>Fields containing a line feed (U+000A), quotation mark (U+0022), or comma
(U+002C) must be quoted, otherwise they would be ambiguous with the CSV
formatting itself. A quotation mark inside a quoted field is escaped by
doubling it. For example, here are two records with two fields apiece:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921, 1923, 1926"
"Frankenstein;
or, The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>A CSV-unaware tool splitting on commas and line feeds (ex. <code class="language-plaintext highlighter-rouge">awk</code>) would
process these records improperly. So csvquote translates quoted line feeds
into record separators (U+001E) and commas into unit separators (U+001F).
These control characters rarely appear in normal text data, and can be
trivially processed in UTF-8-encoded text without decoding or encoding.
The above records become:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1919–1921\x1f 1923\x1f 1926"
"Frankenstein;\x1eor\x1f The Modern Prometheus",Mary Shelley
</code></pre></div></div>

<p>I’ve used <code class="language-plaintext highlighter-rouge">\x1e</code> and <code class="language-plaintext highlighter-rouge">\x1f</code> here to illustrate the control characters.</p>

<p>The data is exactly the same length since it’s a straight byte-for-byte
replacement. Quotes are left entirely untouched. The challenge is parsing
the quotes to track whether the two special characters fall inside or
outside pairs of quotes.</p>

<h3 id="state-machine-improvements">State machine improvements</h3>

<p>The original csvquote walks the input a byte at a time and is in one of
three states:</p>

<ol>
  <li>Outside quotes (initial state)</li>
  <li>Inside quotes</li>
  <li>On a possibly “escaped” quote (the first <code class="language-plaintext highlighter-rouge">"</code> in a <code class="language-plaintext highlighter-rouge">""</code>)</li>
</ol>

<p>Since <a href="/blog/2020/12/31/">I love state machines so much</a>, here it is translated into a
switch-based state machine:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Return the next state given an input character.</span>
<span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">3</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv3.dot"><img src="/img/csv/csv3.png" alt="" /></a></p>

<p>The real program also has more conditions for potentially making a
replacement. It’s an awful lot of <a href="/blog/2017/10/06/">performance-killing branching</a>.</p>

<p>However, in this <a href="https://vimeo.com/644068002">context</a> the goal is finding “in” and “out” — not
validating the CSV — so the “escape” state is unnecessary. I need only
match up pairs of quotes: an “escaped” quote can be treated as terminating
a quoted region and immediately starting a new one. That leaves just the
first two states in a trivial arrangement:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">next</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="k">return</span> <span class="n">c</span> <span class="o">==</span> <span class="sc">'"'</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a href="/img/csv/csv2.dot"><img src="/img/csv/csv2.png" alt="" /></a></p>

<p>Since the text can be processed as bytes, there are only 256 possible
inputs. With 2 states and 256 inputs, this state machine, <em>with</em>
replacement machinery, can be implemented with a 512-byte table and <em>no
branches</em>. Here’s the table initialization:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">256</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">256</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
        <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">'\n'</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x1e</span><span class="p">;</span>
    <span class="n">table</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="sc">','</span><span class="p">]</span>  <span class="o">=</span> <span class="mh">0x1f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the first state, characters map onto themselves. In the second state,
characters map onto their replacements. This is the <em>entire</em> encoder and
decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">encode</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">state</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">len</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span> <span class="o">^=</span> <span class="p">(</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'"'</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Well, strictly speaking, the decoder need not process quotes. By my
benchmark (<code class="language-plaintext highlighter-rouge">csvdump</code> in my implementation) this processes at ~1 GiB/s on
my laptop — 3x faster than the original. However, there’s still
low-hanging fruit to be picked!</p>
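<p>For reference, here is that encoder restated as a self-contained sketch
alongside a decoder. The <code class="language-plaintext highlighter-rouge">decode</code> function is my own addition: it needs
no quote tracking at all, since the control bytes can only have come from
the encoder and map straight back:</p>

```c
#include <stddef.h>

static unsigned char table[2][256];

static void init(void)
{
    /* State 0: identity. State 1 (inside quotes): replace the two
     * troublesome characters with control characters. */
    for (int i = 0; i < 256; i++) {
        table[0][i] = i;
        table[1][i] = i;
    }
    table[1]['\n'] = 0x1e;
    table[1][','] = 0x1f;
}

static void encode(unsigned char *buf, size_t len)
{
    int state = 0;
    for (size_t i = 0; i < len; i++) {
        state ^= (buf[i] == '"');       /* toggle on each quote */
        buf[i] = table[state][buf[i]];  /* branchless replacement */
    }
}

/* My own addition, not from the article: decoding is unconditional
 * because the control bytes appear only where the encoder put them. */
static void decode(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x1e) buf[i] = '\n';
        if (buf[i] == 0x1f) buf[i] = ',';
    }
}
```

<p>Encoding <code class="language-plaintext highlighter-rouge">xxx,"y,y",zzz</code> turns the quoted comma into 0x1f while the
field-separating commas survive, and decoding restores the original
bytes.</p>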

<h3 id="simd-and-twos-complement">SIMD and two’s complement</h3>

<p>Any decent SIMD implementation is going to make use of masking. Find the
quotes, compute a mask over quoted regions, compute another mask for
replacement matches, combine the masks, then use that mask to blend the
input with the replacements. Roughly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quotes    = find_quoted_regions(input)
linefeeds = input == '\n'
commas    = input == ','
output    = blend(input, '\n', quotes &amp; linefeeds)
output    = blend(output, ',', quotes &amp; commas)
</code></pre></div></div>

<p>The hard part is computing the quote mask, somehow handling quoted
regions that straddle SIMD chunks (not pictured), <em>and</em> doing all that
without resorting to slow byte-at-a-time operations. Fortunately there are
some bitwise tricks that can resolve each issue.</p>

<p>Imagine I load 32 bytes into a SIMD register (e.g. AVX2), and I compute a
32-bit mask where each bit corresponds to one byte. If that byte contains
a quote, the corresponding bit is set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001010
</code></pre></div></div>

<p>That last/lowest 1 corresponds to the beginning of a quoted region. For my
mask, I’d like to set all bits following that bit. I can do this by
subtracting 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001001
</code></pre></div></div>

<p>Using the <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan">Kernighan technique</a> I can also remove this bit from the
original input by ANDing them together.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
10000000000000011000011000001000
</code></pre></div></div>

<p>Now I’m left with a new bottom bit. If I repeat this, I build up layers of
masks, one for each input quote.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10000000000000011000011000001001
10000000000000011000011000000111
10000000000000011000010111111111
10000000000000011000001111111111
10000000000000010111111111111111
10000000000000001111111111111111
01111111111111111111111111111111
</code></pre></div></div>

<p>Remember how I use XOR in the state machine above to toggle between
states? If I XOR all these together, I toggle the quotes on and off,
building up quoted regions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
01111111111111100111100111110001
</code></pre></div></div>

<p>However, for reasons I’ll explain shortly, it’s critical that the opening
quote is included in this mask. If I XOR the pre-subtracted value with the
mask when I compute the mask, I can toggle the remaining quotes on and off
such that the opening quotes are included. Here’s my function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which gives me exactly what I want:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"George Herman ""Babe"" Ruth","1
11111111111111101111101111110011
</code></pre></div></div>

<p>It’s important that the opening quote is included because it means a
region that begins on the last byte will have that last bit set. I can use
that last bit to determine if the next chunk begins in a quoted state. If
a region begins in a quoted state, I need only NOT the whole result to
reverse the quoted regions.</p>

<p>How can I “sign extend” a 1 into all bits set, or do nothing for zero?
Negate it!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">uint32_t</span> <span class="n">carry</span>  <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">prev</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">quotes</span> <span class="o">=</span> <span class="n">find_quoted_regions</span><span class="p">(</span><span class="n">input</span><span class="p">)</span> <span class="o">^</span> <span class="n">carry</span><span class="p">;</span>
    <span class="c1">// ...</span>
    <span class="n">prev</span> <span class="o">=</span> <span class="n">quotes</span><span class="p">;</span>
</code></pre></div></div>

<p>That takes care of computing quoted regions and chaining them between
chunks. The loop will unfortunately cause branch prediction penalties if
the input has lots of quotes, but I couldn’t find a way around this.</p>
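<p>As a sanity check, here is the mask function with that carry logic
wrapped around it, in the idealized orientation (first byte in the highest
bit). The wrapper name <code class="language-plaintext highlighter-rouge">quoted_mask</code> is my own; it can be exercised on
the 32-byte example above and on regions that straddle chunks:</p>

```c
#include <stdint.h>

/* The article's mask function: XOR together a "bits at and below" mask
 * for each set bit, so opening quotes land inside the result. */
static uint32_t find_quoted_regions(uint32_t x)
{
    uint32_t r = 0;
    while (x) {
        r ^= x;
        r ^= x - 1;
        x &= x - 1;
    }
    return r;
}

/* My own wrapper: chain chunks by carrying the low bit (the last
 * byte's quoted state) into the next chunk as an all-ones XOR mask. */
static uint32_t quoted_mask(uint32_t quotes, uint32_t *prev)
{
    uint32_t carry = -(*prev & 1);
    uint32_t r = find_quoted_regions(quotes) ^ carry;
    *prev = r;
    return r;
}
```

<p>The first assertion below is the quote mask for
<code class="language-plaintext highlighter-rouge">"George Herman ""Babe"" Ruth","1</code> and its expected quoted-region mask.</p>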

<p>However, I’ve made a serious mistake. I’m using <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code> and
it puts the first byte in the lowest bit. Doh! That means it looks like
this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
01010000011000011000000000000001
</code></pre></div></div>

<p>There’s no efficient way to flip the bits around, so I just need to find a
way to work in the other direction. To flip the bits to the left of a set
bit, negate it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000000000000000000010000000 = +0x00000080
11111111111111111111111110000000 = -0x00000080
</code></pre></div></div>

<p>Unlike before, this keeps the original bit set, so I need to XOR the
original value into the input to flip the quotes. This is as simple as
initializing to the input rather than zero. The new loop:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">find_quoted_regions</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">r</span> <span class="o">^=</span> <span class="o">-</span><span class="n">x</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
        <span class="n">x</span> <span class="o">&amp;=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1","htuR ""ebaB"" namreH egroeG"
11001111110111110111111111111111
</code></pre></div></div>

<p>The carry now depends on the high bit rather than the low bit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uint32_t carry = -(prev &gt;&gt; 31);
</code></pre></div></div>

<h3 id="reversing-movemask">Reversing movemask</h3>

<p>The next problem: for reasons I don’t understand, AVX2 does not include
the inverse of <code class="language-plaintext highlighter-rouge">_mm256_movemask_epi8</code>. Converting the bit-mask back into a
byte-mask requires some clever shuffling. Fortunately <a href="https://web.archive.org/web/20150506071030/https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb">I’m not the first
to have this problem</a>, and so I didn’t have to figure it out from
scratch.</p>

<p>First fill the 32-byte register with repeated copies of the 32-bit mask.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>abcdabcdabcdabcdabcdabcdabcdabcd
</code></pre></div></div>

<p>Shuffle the bytes so that the first 8 register bytes each hold a copy of
the first bit-mask byte, the next 8 a copy of the second, and so on.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aaaaaaaabbbbbbbbccccccccdddddddd
</code></pre></div></div>

<p>In byte 0, I care only about bit 0, in byte 1 only about bit 1, … in
byte N only about bit <code class="language-plaintext highlighter-rouge">N%8</code>. I can pre-compute a mask to isolate each
of these bits and produce a proper byte-wise mask from the bit-mask.
Fortunately all this isn’t too bad: four instructions instead of the one
I had wanted.</p>
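<p>A scalar model (my own, standing in for the vector instructions) makes
the mapping concrete: byte <code class="language-plaintext highlighter-rouge">i</code> reads byte <code class="language-plaintext highlighter-rouge">i/8</code> of the bit-mask (the
shuffle), isolates bit <code class="language-plaintext highlighter-rouge">i%8</code> (the AND with a precomputed mask), and
widens the result to all-ones or all-zeros (the compare):</p>

```c
#include <stdint.h>

/* Scalar model (mine, not the vectorized code) of the inverse
 * movemask: expand a 32-bit mask into 32 bytes of 0x00 or 0xff. */
static void inverse_movemask(uint32_t mask, unsigned char out[32])
{
    for (int i = 0; i < 32; i++) {
        unsigned char byte = mask >> (i / 8 * 8);   /* the shuffle */
        unsigned char bit  = 1 << (i % 8);          /* isolation mask */
        out[i] = (byte & bit) == bit ? 0xff : 0x00; /* the compare */
    }
}
```

<p>Each of the three steps in the loop body corresponds to one of the
shuffle/AND/compare instructions in the vector version.</p>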

<h3 id="results">Results</h3>

<p>In my benchmark, which includes randomly occurring quoted fields, the SIMD
version processes at ~4 GiB/s — 10x faster than the original. I haven’t
profiled, but I expect mispredictions on the bit-mask loop are the main
obstacle preventing the hypothetical 32x speedup.</p>

<p>My version also optionally rejects inputs containing the two special
control characters since the encoding would be irreversible. This is
implemented in SIMD when available, and it slows processing by around 10%.</p>

<h3 id="followup-pclmulqdq">Followup: PCLMULQDQ</h3>

<p>Geoff Langdale and others have <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCABwTFSrDpNkmJs6TpkAfofcZq6e8YWaJUur20xZBz7mDBnvQ2w%40mail.gmail.com%3E">graciously pointed out PCLMULQDQ</a>,
which can <a href="https://wunkolo.github.io/post/2020/05/pclmulqdq-tricks/">compute the quote masks using carryless multiplication</a>
(<a href="https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/">also</a>) entirely in SIMD and without a loop. I haven’t yet worked out
exactly how to apply it, but it should be much faster.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Billions of Code Name Permutations in 32 bits</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2021/09/14/"/>
    <id>urn:uuid:bc17a779-bee1-4a60-80d1-5c5cfd8fd638</id>
    <updated>2021-09-14T21:06:59Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>My friend over at Possibly Wrong <a href="https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/">created a code name generator</a>. By
coincidence I happened to be thinking about code names myself while
recently replaying <a href="https://en.wikipedia.org/wiki/XCOM:_Enemy_Within"><em>XCOM: Enemy Within</em></a> (2012/2013). The game
generates a random code name for each mission, and I wondered how often it
repeats. The <a href="https://www.ufopaedia.org/index.php/Mission_Names_(EU2012)">UFOpaedia page on the topic</a> gives the word lists: 53
adjectives and 76 nouns, for a total of 4028 possible code names. A
typical game has around 60 missions, and if code names are generated
naively on the fly, then per the birthday paradox around half of all games
will see a repeated mission code name! Fortunately this is easy to avoid,
and the particular configuration here lends itself to an interesting
implementation.</p>

<p>Mission code names are built using “<em>adjective</em> <em>noun</em>”. Some examples
from the game’s word list:</p>

<ul>
  <li>Fading Hammer</li>
  <li>Fallen Jester</li>
  <li>Hidden Crown</li>
</ul>

<p>To generate a code name, we could select a random adjective and a random
noun, but as discussed it wouldn’t take long for a collision. The naive
approach is to keep a database of previously-generated names, and to
consult this database when generating new names. That works, but there’s
an even better solution: use a random permutation. Done well, we don’t
need to keep track of previous names, and the generator won’t repeat until
it’s exhausted all possibilities.</p>

<p>Further, the total number of possible code names, 4,028, is suspiciously
shy of 4,096, a power of two (<code class="language-plaintext highlighter-rouge">2**12</code>). That makes designing and
implementing an efficient permutation that much easier.</p>

<h3 id="a-linear-congruential-generator">A linear congruential generator</h3>

<p>A classic, obvious solution is a <a href="/blog/2019/11/19/">linear congruential generator</a>
(LCG). A full-period, 12-bit LCG is nothing more than a permutation of the
numbers 0 to 4,095. When generating names, we can skip over the extra 68
values and pretend it’s a permutation of 4,028 elements. An LCG is
constructed like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(n) = (f(n-1)*A + C) % M
</code></pre></div></div>

<p>Typically the seed is used for <code class="language-plaintext highlighter-rouge">f(0)</code>. M is selected based on the problem
space or implementation efficiency, and is usually a power of two. In this
case it will be 4,096. Then there are some rules for choosing A and C.</p>

<p>Simply choosing a random <code class="language-plaintext highlighter-rouge">f(0)</code> per game isn’t great. The code name order
will always be the same, and we’re only choosing where in the cycle to
start. It would be better to vary the permutation itself, which we can do
by also choosing unique A and C constants per game.</p>

<p>Choosing C is easy: it must be relatively prime with M, i.e. it must be
odd. Because addition is modulo M, there’s no reason to choose <code class="language-plaintext highlighter-rouge">C &gt;= M</code>:
the results are identical to a smaller C. If we think of C as a
12-bit integer, 1 bit is locked in, and the other 11 bits are free to
vary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxx1
</code></pre></div></div>

<p>Choosing A is more complicated: it must be odd, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible
by 4, and <code class="language-plaintext highlighter-rouge">A-1</code> should <em>not</em> be divisible by 8 (better results), i.e.
<code class="language-plaintext highlighter-rouge">A % 8 == 5</code>. Again, thinking of this in terms of a 12-bit number, this
locks in 3 bits and leaves 9 bits free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxx101
</code></pre></div></div>

<p>This ensures all the <em>must</em> and <em>should</em> properties of A.</p>
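<p>These properties are easy to check empirically with a throwaway sketch
(mine, not from the article): count how many steps an LCG takes to return
to its seed. Constants respecting the rules above always yield the full
4,096-step period:</p>

```c
/* Count iterations until a 12-bit LCG returns to its seed. Any odd A
 * makes the map a bijection, so the loop always terminates. */
static int lcg_period(long a, long c, long seed)
{
    long s = seed;
    int n = 0;
    do {
        s = (s*a + c) & 0xfff;
        n++;
    } while (s != seed);
    return n;
}
```

<p>An A of the form xxxxxxxxx101 with any odd C gives period 4,096 from
any seed, while an A with <code class="language-plaintext highlighter-rouge">A-1</code> not divisible by 4 falls short.</p>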

<p>Finally <code class="language-plaintext highlighter-rouge">0 &lt;= f(0) &lt; M</code>. Because of modular arithmetic, larger values are
redundant, and all possible values are valid since the LCG, being
full-period, will cycle through all of them. This is just choosing the
starting point in a particular permutation cycle. As a 12-bit number, all
12 bits are free:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxxxxxxxxxxx
</code></pre></div></div>

<p>That’s <code class="language-plaintext highlighter-rouge">9 + 11 + 12 = 32</code> free bits to fill randomly: again, how
incredibly convenient! Every 32-bit integer defines some unique code name
permutation… <em>almost</em>. Any 32-bit descriptor where <code class="language-plaintext highlighter-rouge">f(0) &gt;= 4028</code> will
collide with at least one other due to skipping, and so around 1.7% of the
state space is redundant. A small loss that should shrink with slightly
better word list planning. I don’t think anyone will notice.</p>

<h3 id="slice-and-dice">Slice and dice</h3>

<p><a href="/blog/2020/12/31/">I love compact state machines</a>, and this is an opportunity to put one
to good use. My code name generator will be just one function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>This takes one of those 32-bit permutation descriptors, writes the first
code name to <code class="language-plaintext highlighter-rouge">buf</code>, and returns a descriptor for another permutation that
starts with the next name. All we have to do is keep track of that 32-bit
number and we’ll never need to worry about repeating code names until all
have been exhausted.</p>

<p>First, let’s extract A, C, and <code class="language-plaintext highlighter-rouge">f(0)</code>, which I’m calling S. The low bits
are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to
lock in the hard-set bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
<span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
<span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>
</code></pre></div></div>
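<p>As a quick sanity check (my own, not part of the article), any 32-bit
descriptor unpacks to constants satisfying the earlier rules:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Restates the extraction above and asserts the LCG constraints: A is
 * locked to the ...101 pattern, C is forced odd, S fits in 12 bits. */
static void check(uint32_t state)
{
    long a = (state <<  3 | 5) & 0xfff;
    long c = (state >>  8 | 1) & 0xfff;
    long s =  state >> 20;
    assert(a % 8 == 5);   /* odd, A-1 divisible by 4 but not 8 */
    assert(c % 2 == 1);   /* relatively prime with 4096 */
    assert(s < 4096);
}
```

<p>Because the locked bits are OR’d in after masking, no choice of
descriptor can produce an invalid A or C.</p>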

<p>Next iterate the LCG until we have a number in range:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">do</span> <span class="p">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="mi">4028</span><span class="p">);</span>
</code></pre></div></div>

<p>Once we have an appropriate LCG state, compute the adjective/noun indexes
and build a code name:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="mi">53</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="mi">53</span><span class="p">;</span>
<span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
</code></pre></div></div>

<p>Finally assemble the next 32-bit state. Since A and C don’t change, these
are passed through while the old S is masked out and replaced with the new
S.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
</code></pre></div></div>

<p>Putting it all together:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">adjvs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">nouns</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>

<span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&lt;&lt;</span>  <span class="mi">3</span> <span class="o">|</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">//  9 bits</span>
    <span class="kt">long</span> <span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span> <span class="o">|</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>  <span class="c1">// 11 bits</span>
    <span class="kt">long</span> <span class="n">s</span> <span class="o">=</span>  <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>               <span class="c1">// 12 bits</span>

    <span class="k">do</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfff</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">s</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="n">s</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">j</span><span class="p">]);</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The caller just needs to generate an initial 32-bit integer. Any 32-bit
integer is valid — even zero — so this could just be, say, the unix epoch
(<code class="language-plaintext highlighter-rouge">time(2)</code>), but adjacent values will have similar-ish permutations. I
intentionally placed S in the high bits, which are least likely to vary,
since it only affects where the cycle begins, while A and C have a much
more dramatic impact and so are placed at more variable locations.</p>

<p>Regardless, it would be better to hash such an input so that adjacent time
values map to distant states. It also helps hide poorer (less random)
choices for A multipliers. I happen to have <a href="/blog/2018/07/31/">designed some great functions
for exactly this purpose</a>. Here’s one of my best:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd168aaadU</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xaf723597U</span><span class="p">;</span> <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would be perfectly reasonable for generating all possible names in a
random order:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">state</span> <span class="o">=</span> <span class="n">hash32</span><span class="p">(</span><span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4028</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">codename</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To further help cover up poorer A multipliers, it’s better for the word
list to be pre-shuffled in its static storage. If that underlying order
happens to show through, at least it will be less obvious (i.e. not in
alphabetical order). Shuffling the string list in my source is just a few
keystrokes in Vim, so this is easy enough.</p>

<h3 id="robustness">Robustness</h3>

<p>If you’re set on making the <code class="language-plaintext highlighter-rouge">codename</code> function easier to use such that
consumers don’t need to think about hashes, you could “encode” and
“decode” the descriptor going in and out of the function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">codename</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">state</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">state</span> <span class="o">+=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x9e485565U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xef1d6b47U</span><span class="p">;</span> <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>

    <span class="c1">// ...</span>

    <span class="n">state</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span> <span class="o">&amp;</span> <span class="mh">0xfffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)</span><span class="n">s</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0xeb00ce77U</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">state</span> <span class="o">*=</span> <span class="mh">0x88ccd46dU</span><span class="p">;</span>
    <span class="n">state</span> <span class="o">^=</span> <span class="n">state</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span> <span class="n">state</span> <span class="o">-=</span> <span class="mh">0x3243f6a8U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">state</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This permutes the state coming in, and reverses that permutation on the
way out (read: inverse hash). This breaks up similar starting points.</p>

<h3 id="a-random-access-code-name-permutation">A random-access code name permutation</h3>

<p>Of course this isn’t the only way to build a permutation. I recently
picked up another trick: <a href="https://andrew-helmer.github.io/permute/">Kensler permutation</a>. The key insight
is cycle-walking, which allows random access to a permutation of a smaller
domain (e.g. 4,028 elements) through a permutation of a larger domain (e.g.
4,096 elements).</p>

<p>Here’s such a code name generator built around a bespoke 12-bit
xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply
is less effective the smaller the permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Generate the nth code name for this seed.</span>
<span class="kt">void</span> <span class="nf">codename_n</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">seed</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x325</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x3f5</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0xa89</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span> <span class="n">i</span> <span class="o">^=</span> <span class="n">seed</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span> <span class="n">i</span> <span class="o">*=</span> <span class="mh">0x85b</span><span class="p">;</span> <span class="n">i</span> <span class="o">&amp;=</span> <span class="mh">0xfff</span><span class="p">;</span>
        <span class="n">i</span> <span class="o">^=</span> <span class="n">i</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">)</span><span class="o">*</span><span class="n">COUNTOF</span><span class="p">(</span><span class="n">nouns</span><span class="p">));</span>

    <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="n">i</span> <span class="o">%</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">b</span> <span class="o">=</span> <span class="n">i</span> <span class="o">/</span> <span class="n">COUNTOF</span><span class="p">(</span><span class="n">adjvs</span><span class="p">);</span>
    <span class="n">snprintf</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="s">"%s %s"</span><span class="p">,</span> <span class="n">adjvs</span><span class="p">[</span><span class="n">a</span><span class="p">],</span> <span class="n">nouns</span><span class="p">[</span><span class="n">b</span><span class="p">]);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>While this is more flexible, avoids poorer permutations, and doesn’t have
state space collisions, I still have a soft spot for my LCG-based state
machine generator.</p>

<h3 id="source-code">Source code</h3>

<p>You can find the complete, working source code with both generators here:
<a href="https://github.com/skeeto/scratch/tree/master/misc/codename.c"><strong><code class="language-plaintext highlighter-rouge">codename.c</code></strong></a>. I used <a href="https://en.wikipedia.org/wiki/Secret_Service_code_name">real US Secret Service code names</a> for
my word list. Some sample outputs:</p>

<ul>
  <li>PLASTIC HUMMINGBIRD</li>
  <li>BLACK VENUS</li>
  <li>SILENT SUNBURN</li>
  <li>BRONZE AUTHOR</li>
  <li>FADING MARVEL</li>
</ul>

]]>
    </content>
  </entry>
  <entry>
    <title>State machines are wonderful tools</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/12/31/"/>
    <id>urn:uuid:c93d7a7b-6ae0-4b7e-afa6-424ef40b9d9c</id>
    <updated>2020-12-31T22:48:13Z</updated>
    <category term="compsci"/><category term="c"/><category term="python"/><category term="lua"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25601821">on Hacker News</a>.</em></p>

<p>I love when my current problem can be solved with a state machine. They’re
fun to design and implement, and I have high confidence about correctness.
They tend to:</p>

<ol>
  <li>Present <a href="/blog/2018/06/10/">minimal, tidy interfaces</a></li>
  <li>Require few, fixed resources</li>
  <li>Hold no opinions about input and output</li>
  <li>Have a compact, concise implementation</li>
  <li>Be easy to reason about</li>
</ol>

<p>State machines are perhaps one of those concepts you heard about in
college but never put into practice. Maybe you use them regularly.
Regardless, you certainly run into them regularly, from <a href="https://swtch.com/~rsc/regexp/">regular
expressions</a> to traffic lights.</p>

<!--more-->

<h3 id="morse-code-decoder-state-machine">Morse code decoder state machine</h3>

<p>Inspired by <a href="https://possiblywrong.wordpress.com/2020/11/21/among-us-morse-code-puzzle/">a puzzle</a>, I came up with this deterministic state
machine for decoding <a href="https://en.wikipedia.org/wiki/Morse_code">Morse code</a>. It accepts a dot (<code class="language-plaintext highlighter-rouge">'.'</code>), dash
(<code class="language-plaintext highlighter-rouge">'-'</code>), or terminator (0) one at a time, advancing through a state
machine step by step:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">morse_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">int</span> <span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">t</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x3f</span><span class="p">,</span> <span class="mh">0x7b</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x2f</span><span class="p">,</span> <span class="mh">0x63</span><span class="p">,</span> <span class="mh">0x5f</span><span class="p">,</span> <span class="mh">0x77</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x72</span><span class="p">,</span>
        <span class="mh">0x87</span><span class="p">,</span> <span class="mh">0x3b</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x67</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x81</span><span class="p">,</span> <span class="mh">0x40</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x68</span><span class="p">,</span> <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x88</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x8c</span><span class="p">,</span> <span class="mh">0x92</span><span class="p">,</span> <span class="mh">0x6c</span><span class="p">,</span> <span class="mh">0x02</span><span class="p">,</span>
        <span class="mh">0x03</span><span class="p">,</span> <span class="mh">0x18</span><span class="p">,</span> <span class="mh">0x14</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x0c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x08</span><span class="p">,</span> <span class="mh">0x1c</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x24</span><span class="p">,</span>
        <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x28</span><span class="p">,</span> <span class="mh">0x04</span><span class="p">,</span> <span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x30</span><span class="p">,</span> <span class="mh">0x31</span><span class="p">,</span> <span class="mh">0x32</span><span class="p">,</span> <span class="mh">0x33</span><span class="p">,</span> <span class="mh">0x34</span><span class="p">,</span> <span class="mh">0x35</span><span class="p">,</span>
        <span class="mh">0x36</span><span class="p">,</span> <span class="mh">0x37</span><span class="p">,</span> <span class="mh">0x38</span><span class="p">,</span> <span class="mh">0x39</span><span class="p">,</span> <span class="mh">0x41</span><span class="p">,</span> <span class="mh">0x42</span><span class="p">,</span> <span class="mh">0x43</span><span class="p">,</span> <span class="mh">0x44</span><span class="p">,</span> <span class="mh">0x45</span><span class="p">,</span> <span class="mh">0x46</span><span class="p">,</span>
        <span class="mh">0x47</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x49</span><span class="p">,</span> <span class="mh">0x4a</span><span class="p">,</span> <span class="mh">0x4b</span><span class="p">,</span> <span class="mh">0x4c</span><span class="p">,</span> <span class="mh">0x4d</span><span class="p">,</span> <span class="mh">0x4e</span><span class="p">,</span> <span class="mh">0x4f</span><span class="p">,</span> <span class="mh">0x50</span><span class="p">,</span>
        <span class="mh">0x51</span><span class="p">,</span> <span class="mh">0x52</span><span class="p">,</span> <span class="mh">0x53</span><span class="p">,</span> <span class="mh">0x54</span><span class="p">,</span> <span class="mh">0x55</span><span class="p">,</span> <span class="mh">0x56</span><span class="p">,</span> <span class="mh">0x57</span><span class="p">,</span> <span class="mh">0x58</span><span class="p">,</span> <span class="mh">0x59</span><span class="p">,</span> <span class="mh">0x5a</span>
    <span class="p">};</span>
    <span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="o">-</span><span class="n">state</span><span class="p">];</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x00</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span> <span class="o">?</span> <span class="n">t</span><span class="p">[(</span><span class="n">v</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">63</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2e</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">2</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">case</span> <span class="mh">0x2d</span><span class="p">:</span> <span class="k">return</span> <span class="n">v</span> <span class="o">&amp;</span>  <span class="mi">1</span> <span class="o">?</span> <span class="n">state</span><span class="o">*</span><span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="nl">default:</span>   <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It typically compiles to under 200 bytes (table included), requires only a
few bytes of memory to operate, and will fit on even the smallest of
microcontrollers. The full source listing, documentation, and
comprehensive test suite:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c">https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c</a></p>

<p>The state machine is trie-shaped, and the 100-byte table <code class="language-plaintext highlighter-rouge">t</code> is the static
<a href="/blog/2016/11/15/">encoding of the Morse code trie</a>:</p>

<p><a href="/img/diagram/morse.dot"><img src="/img/diagram/morse.svg" alt="" /></a></p>

<p>Dots traverse left, dashes right, terminals emit the character at the
current node (terminal state). Stopping on red nodes, or attempting to
take an unlisted edge is an error (invalid input).</p>

<p>Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of <code class="language-plaintext highlighter-rouge">t</code>), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as <a href="https://en.wikipedia.org/wiki/Binary_heap#Heap_implementation">a binary heap in an
array</a>: the left and right children of the node at <code class="language-plaintext highlighter-rouge">i</code> are found at
<code class="language-plaintext highlighter-rouge">i*2+1</code> and <code class="language-plaintext highlighter-rouge">i*2+2</code>. No need to <a href="/blog/2020/10/19/#minimax-costs">waste memory storing edges</a>!</p>

<p>Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No <code class="language-plaintext highlighter-rouge">c</code> input
will produce a bad state.</p>

<p>In the original problem the terminals were missing. Despite being a <em>state
machine</em>, <code class="language-plaintext highlighter-rouge">morse_decode</code> is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.</p>

<h3 id="utf-8-decoder-state-machine">UTF-8 decoder state machine</h3>

<p>The classic UTF-8 decoder state machine is <a href="https://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Bjoern Hoehrmann’s Flexible
and Economical UTF-8 Decoder</a>. It packs the entire state machine into
a relatively small table using clever tricks. It’s easily my favorite
UTF-8 decoder.</p>

<p>I wanted to try my own hand at it, so I re-derived the same canonical
UTF-8 automaton:</p>

<p><a href="/img/diagram/utf8.dot"><img src="/img/diagram/utf8.svg" alt="" /></a></p>

<p>Then I encoded this diagram directly into a much larger (2,064-byte), less
elegant table, too large to display inline here:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c">https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c</a></p>

<p>However, the trade-off is that the executable code is smaller, faster, and
<a href="/blog/2017/10/06/">branchless again</a> (by accident, I swear!):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">int</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">cp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">byte</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">signed</span> <span class="kt">char</span> <span class="n">table</span><span class="p">[</span><span class="mi">8</span><span class="p">][</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
    <span class="kt">int</span> <span class="n">next</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">state</span><span class="p">][</span><span class="n">byte</span><span class="p">];</span>
    <span class="o">*</span><span class="n">cp</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">cp</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">byte</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="o">!</span><span class="n">state</span><span class="p">][</span><span class="n">next</span><span class="o">&amp;</span><span class="mi">7</span><span class="p">]);</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Like Bjoern’s decoder, there’s a code point accumulator. The <em>real</em> state
machine has 1,109,950 terminal states, and many more edges and nodes. The
accumulator is an optimization to track exactly which edge was taken to
which node without having to represent such a monstrosity.</p>

<p>Despite the huge table I’m pretty happy with it.</p>

<h3 id="word-count-state-machine">Word count state machine</h3>

<p>Here’s another state machine I came up with a while back for counting words
one Unicode code point at a time while accounting for Unicode’s various
kinds of whitespace. If your input is bytes, run it through the above
UTF-8 state machine first to convert those bytes into code points! This one uses a
switch instead of a lookup table since the table would be sparse (i.e.
<a href="/blog/2019/12/09/">let the compiler figure it out</a>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* State machine counting words in a sequence of code points.
 *
 * The current word count is the absolute value of the state, so
 * the initial state is zero. Code points are fed into the state
 * machine one at a time, each call returning the next state.
 */</span>
<span class="kt">long</span> <span class="nf">word_count</span><span class="p">(</span><span class="kt">long</span> <span class="n">state</span><span class="p">,</span> <span class="kt">long</span> <span class="n">codepoint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">codepoint</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="mh">0x0009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000a</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000b</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000c</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x000d</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x0020</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x0085</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x00a0</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x1680</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2000</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2001</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2002</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2003</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2004</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2005</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2006</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2007</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2008</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2009</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x200a</span><span class="p">:</span>
    <span class="k">case</span> <span class="mh">0x2028</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x2029</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x202f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x205f</span><span class="p">:</span> <span class="k">case</span> <span class="mh">0x3000</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">state</span> <span class="o">:</span> <span class="n">state</span><span class="p">;</span>
    <span class="nl">default:</span>
        <span class="k">return</span> <span class="n">state</span> <span class="o">&lt;</span> <span class="mi">0</span> <span class="o">?</span> <span class="n">state</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span> <span class="o">-</span> <span class="n">state</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I’m particularly happy with the <em>edge-triggered</em> state transition
mechanism. The sign of the state tracks whether the “signal” is “high”
(inside of a word) or “low” (outside of a word), and so it counts rising
edges.</p>

<p><a href="/img/diagram/wordcount.dot"><img src="/img/diagram/wordcount.svg" alt="" /></a></p>

<p>The counter is not <em>technically</em> part of the state machine — though it
eventually overflows for practical reasons, it isn’t really “finite” — but
is rather an external count of the times the state machine transitions
from low to high, which is the actual, useful output.</p>

<p><em>Reader challenge</em>: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the <code class="language-plaintext highlighter-rouge">switch</code> (chain of branches, jump table?).</p>

<h3 id="coroutines-and-generators-as-state-machines">Coroutines and generators as state machines</h3>

<p>In languages that support them, state machines can be implemented using
coroutines, including generators. I do particularly like the idea of
<a href="/blog/2018/05/31/">compiler-synthesized coroutines</a> as state machines, though this is a
rare treat. The state is implicit in the coroutine at each yield, so the
programmer doesn’t have to manage it explicitly. (Though often that
explicit control is powerful!)</p>

<p>Unfortunately in practice it always feels clunky. The following implements
the word count state machine (albeit in a rather un-Pythonic way). The
generator returns the current count and is continued by sending it another
code point:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mh">0x0009</span><span class="p">,</span> <span class="mh">0x000a</span><span class="p">,</span> <span class="mh">0x000b</span><span class="p">,</span> <span class="mh">0x000c</span><span class="p">,</span> <span class="mh">0x000d</span><span class="p">,</span>
    <span class="mh">0x0020</span><span class="p">,</span> <span class="mh">0x0085</span><span class="p">,</span> <span class="mh">0x00a0</span><span class="p">,</span> <span class="mh">0x1680</span><span class="p">,</span> <span class="mh">0x2000</span><span class="p">,</span>
    <span class="mh">0x2001</span><span class="p">,</span> <span class="mh">0x2002</span><span class="p">,</span> <span class="mh">0x2003</span><span class="p">,</span> <span class="mh">0x2004</span><span class="p">,</span> <span class="mh">0x2005</span><span class="p">,</span>
    <span class="mh">0x2006</span><span class="p">,</span> <span class="mh">0x2007</span><span class="p">,</span> <span class="mh">0x2008</span><span class="p">,</span> <span class="mh">0x2009</span><span class="p">,</span> <span class="mh">0x200a</span><span class="p">,</span>
    <span class="mh">0x2028</span><span class="p">,</span> <span class="mh">0x2029</span><span class="p">,</span> <span class="mh">0x202f</span><span class="p">,</span> <span class="mh">0x205f</span><span class="p">,</span> <span class="mh">0x3000</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">wordcount</span><span class="p">():</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># low signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="k">break</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="c1"># high signal
</span>            <span class="n">codepoint</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">count</span>
            <span class="k">if</span> <span class="n">codepoint</span> <span class="ow">in</span> <span class="n">WHITESPACE</span><span class="p">:</span>
                <span class="k">break</span>
</code></pre></div></div>

<p>However, the generator ceremony dominates the interface, so you’d probably
want to wrap it in something nicer — at which point there’s really no
reason to use the generator in the first place:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="n">wordcount</span><span class="p">()</span>
<span class="nb">next</span><span class="p">(</span><span class="n">wc</span><span class="p">)</span>  <span class="c1"># prime the generator
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'A'</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 1
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">'B'</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span><span class="n">wc</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="nb">ord</span><span class="p">(</span><span class="s">' '</span><span class="p">))</span>  <span class="c1"># =&gt; 2
</span></code></pre></div></div>

<p>Same idea in Lua, which famously has full coroutines:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">local</span> <span class="n">WHITESPACE</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">[</span><span class="mh">0x0009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000b</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x000c</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x000d</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0020</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x0085</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x00a0</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x1680</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2001</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2002</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2003</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2004</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2005</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2006</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2007</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2008</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2009</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x200a</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x2028</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x2029</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x202f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,[</span><span class="mh">0x205f</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span><span class="p">,</span>
    <span class="p">[</span><span class="mh">0x3000</span><span class="p">]</span><span class="o">=</span><span class="kc">true</span>
<span class="p">}</span>

<span class="k">function</span> <span class="nf">wordcount</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- low signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="n">count</span> <span class="o">=</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
        <span class="k">while</span> <span class="kc">true</span> <span class="k">do</span>
            <span class="c1">-- high signal</span>
            <span class="kd">local</span> <span class="n">codepoint</span> <span class="o">=</span> <span class="nb">coroutine.yield</span><span class="p">(</span><span class="n">count</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">WHITESPACE</span><span class="p">[</span><span class="n">codepoint</span><span class="p">]</span> <span class="k">then</span>
                <span class="k">break</span>
            <span class="k">end</span>
        <span class="k">end</span>
    <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Aside from the initial priming call, <code class="language-plaintext highlighter-rouge">coroutine.wrap()</code> at least
hides the fact that it’s a coroutine.</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wc</span> <span class="o">=</span> <span class="nb">coroutine.wrap</span><span class="p">(</span><span class="n">wordcount</span><span class="p">)</span>
<span class="n">wc</span><span class="p">()</span>  <span class="c1">-- prime the coroutine</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'A'</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 1</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">'B'</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
<span class="n">wc</span><span class="p">(</span><span class="nb">string.byte</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>  <span class="c1">-- =&gt; 2</span>
</code></pre></div></div>

<h3 id="extra-examples">Extra examples</h3>

<p>Finally, a couple more examples not worth describing in detail here. First
a Unicode case folding state machine:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/misc/casefold.c">https://github.com/skeeto/scratch/blob/master/misc/casefold.c</a></p>

<p>It’s just an interface to do a lookup into the <a href="https://www.unicode.org/Public/13.0.0/ucd/CaseFolding.txt">official case folding
table</a>. It was an experiment, and I <em>probably</em> wouldn’t use it in a
real program.</p>

<p>Second, I’ve mentioned <a href="https://github.com/skeeto/utf-7">my UTF-7 encoder and decoder</a> before. It’s
not obvious from the interface, but internally it’s just a state machine
for both encoder and decoder, which is what allows it to “pause”
between any pair of input/output bytes.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Improving on QBasic's Random Number Generator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/17/"/>
    <id>urn:uuid:9aba5382-01e4-41fc-bc27-b996b3c17f07</id>
    <updated>2020-11-17T02:51:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25120083">on Hacker News</a>.</em></p>

<p><a href="https://www.pixelships.com/">Pixelmusement</a> produces videos about <a href="/blog/2020/10/19/">MS-DOS games</a> and software.
Each video ends with a short, randomly-selected listing of financial
backers. In <a href="https://www.youtube.com/watch?v=YVV9bkbpaPY">ADG Filler #57</a>, Kris revealed the selection process,
and it absolutely fits the channel’s core theme: a <a href="https://en.wikipedia.org/wiki/QBasic">QBasic</a> program.
His program relies on QBasic’s built-in pseudo random number generator
(PRNG). Even accounting for the platform’s limitations, the PRNG is much
poorer quality than it could be. Let’s discuss these weaknesses and figure
out how to make the selection more fair.</p>

<!--more-->

<p>Kris’s program seeds the PRNG with the system clock (<code class="language-plaintext highlighter-rouge">RANDOMIZE TIMER</code>, a
QBasic idiom), populates an array with the backers represented as integers
(indices), continuously shuffles the list until the user presses a key, then
finally prints out a random selection from the array. Here’s a simplified
version of the program (note: QBasic comments start with apostrophe <code class="language-plaintext highlighter-rouge">'</code>):</p>

<pre><code class="language-qbasic">CONST ntickets = 203  ' input parameter
CONST nresults = 12

RANDOMIZE TIMER

DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
    tickets(i) = i
NEXT

CLS
PRINT "Press any key to stop shuffling..."
DO
    i = INT(RND * ntickets)
    j = INT(RND * ntickets)
    SWAP tickets(i), tickets(j)
LOOP WHILE INKEY$ = ""

FOR i = 0 TO nresults - 1
    PRINT tickets(i)
NEXT
</code></pre>

<p>This should be readable even if you don’t know QBasic. Note: In the real
program, backers at higher tiers get multiple tickets in order to weight
the results. This is accounted for in the final loop such that nobody
appears more than once. It’s mostly irrelevant to the discussion here, so
I’ve omitted it.</p>

<p>The final result is ultimately a function of just three inputs:</p>

<ol>
  <li>The system clock (<code class="language-plaintext highlighter-rouge">TIMER</code>)</li>
  <li>The total number of tickets</li>
  <li>The number of loop iterations until a key press</li>
</ol>

<p>The second item has the nice property that by becoming a backer you influence
the result.</p>

<h3 id="qbasic-rnd">QBasic RND</h3>

<p>QBasic’s PRNG is this 24-bit <a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd24</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The result is the entire 24-bit state. <code class="language-plaintext highlighter-rouge">RND</code> divides this by 2^24 and
returns it as a single precision float, so the caller receives a value in
the half-open interval [0, 1).</p>

<p>Needless to say, this is a very poor PRNG. The <a href="/blog/2019/11/19/">LCG constants are
<em>reasonable</em></a>, but the choice to limit the state to 24 bits is
strange. According to the <a href="https://www.qb64.org/forum/index.php?topic=1414.0">QBasic 16-bit assembly</a> (note: the LCG
constants listed here <a href="http://www.qb64.net/forum/index_topic_10727-0/">are wrong</a>), the implementation is a full
32-bit multiply using 16-bit limbs, and it allocates and writes a full 32
bits when storing the state. As expected for the 8086, there was nothing
gained by using only the lower 24 bits.</p>

<p>To illustrate how poor it is, here’s a <a href="https://www.pcg-random.org/posts/visualizing-the-heart-of-some-prngs.html">randogram</a> for this PRNG,
which shows obvious structure. (This is a small slice of a 4096x4096
randogram where each of the 2^23 24-bit samples is plotted as two 12-bit
coordinates.)</p>

<p><img src="/img/qbasic/rnd-thumb.png" alt="" /></p>

<p>Admittedly this far <a href="https://www.pcg-random.org/paper.html"><em>overtaxes</em></a> the PRNG. With a 24-bit state, it’s
only good for 4,096 (2^12) outputs, after which it no longer follows the
<a href="/blog/2019/07/22/">birthday paradox</a>: No outputs are repeated even though we should
start seeing some. However, as I’ll soon show, this doesn’t actually
matter.</p>

<p>Instead of discarding the high 8 bits — the highest quality output bits —
QBasic’s designers should have discarded the <em>low</em> 8 bits for the output,
turning it into a <em>truncated 32-bit LCG</em>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">rnd32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This LCG would have the same performance, but significantly better
quality. Here’s the randogram for this PRNG, which is <em>also</em> heavily
overtaxed (well past its 65,536, i.e. 2^16, useful outputs).</p>

<p><img src="/img/qbasic/rnd32-thumb.png" alt="" /></p>

<p>It’s a solid upgrade, <em>completely for free</em>!</p>

<h3 id="qbasic-randomize">QBasic RANDOMIZE</h3>

<p>That’s not the end of our troubles. The <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> statement accepts a
double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754
binary representation are XORed with the next highest 16 bits. The high 16
bits of the PRNG state are set to this result. The lowest 8 bits are
preserved.</p>

<p>To make this clearer, here’s a C implementation, verified against QBasic
7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="n">s</span><span class="p">;</span>

<span class="kt">void</span>
<span class="nf">randomize</span><span class="p">(</span><span class="kt">double</span> <span class="n">seed</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">seed</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">24</span> <span class="o">^</span> <span class="n">x</span><span class="o">&gt;&gt;</span><span class="mi">40</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffff00</span> <span class="o">|</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xff</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In other words, <strong><code class="language-plaintext highlighter-rouge">RANDOMIZE</code> only sets the PRNG to one of 65,536 possible
states</strong>.</p>

<p>As the final piece, here’s how <code class="language-plaintext highlighter-rouge">RND</code> is implemented, also verified against
QBasic 7.1:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">rnd</span><span class="p">(</span><span class="kt">float</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">s</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">arg</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">arg</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="n">s</span><span class="o">*</span><span class="mh">0xfd43fd</span> <span class="o">+</span> <span class="mh">0xc39ec3</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffff</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">/</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="mh">0x1000000</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="system-clock-seed">System clock seed</h3>

<p>The <a href="https://www.qb64.org/wiki/TIMER"><code class="language-plaintext highlighter-rouge">TIMER</code> function</a> returns the single precision number of
seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer
interrupt counter). This is strictly time of day, and the current date is
not part of the result, unlike, say, the unix epoch.</p>

<p>This means there are only 1,572,480 distinct values returned by <code class="language-plaintext highlighter-rouge">TIMER</code>.
That’s small even before considering that these map onto only 65,536
possible seeds with <code class="language-plaintext highlighter-rouge">RANDOMIZE</code> — all of which <em>are</em> fortunately
realizable via <code class="language-plaintext highlighter-rouge">TIMER</code>.</p>

<p>Of the three inputs to random selection, this first one is looking pretty
bad.</p>

<h3 id="loop-iterations">Loop iterations</h3>

<p>Kris’s idea of continuously mixing the array until he presses a key makes
up for many of the QBasic PRNG’s weaknesses. He lets it run for over 200,000
array swaps — traversing over 2% of the PRNG’s period — and the array
itself acts like an extended PRNG state, supplementing the 24-bit <code class="language-plaintext highlighter-rouge">RND</code>
state.</p>

<p>Since iterations fly by quickly, the exact number of iterations becomes
another <a href="/blog/2019/04/30/">source of entropy</a>. The results will be quite different if it
runs 214,600 iterations versus 273,500 iterations.</p>

<p>Possible improvement: Only exit the loop when a certain key is pressed. If
any other key is pressed then that input and the <code class="language-plaintext highlighter-rouge">TIMER</code> are mixed into
the PRNG state. Mashing the keyboard during the loop introduces more
entropy.</p>

<h3 id="replacing-the-prng">Replacing the PRNG</h3>

<p>Since the built-in PRNG is so poor, we could improve the situation by
implementing a <a href="/blog/2017/09/21/">new one</a> in QBasic itself. The challenge is that
QBasic has no unsigned integers, nor even unsigned integer operators (e.g.
Java and JavaScript’s <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>), and signed overflow is a run-time error. We
can’t even re-implement QBasic’s own LCG without doing long multiplication
in software, since the intermediate result overflows its 32-bit <code class="language-plaintext highlighter-rouge">LONG</code>.</p>
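<p>To make the overflow concrete, here is that LCG in C with a 64-bit
intermediate. The multiplier and increment are the commonly cited values
for QBasic’s <code class="language-plaintext highlighter-rouge">RND</code> (treat them
as an assumption); the point is that the product alone needs about 48
bits, which no QBasic integer type can hold.</p>

```c
#include <assert.h>
#include <stdint.h>

// Reportedly QBasic's RND recurrence (constants taken on faith):
//   state' = (state * &HFD43FD + &HC39EC3) MOD 2^24
// The 24-bit by 24-bit product reaches ~2^48, overflowing a 32-bit
// LONG, so C needs the 64-bit intermediate that QBasic lacks.
static uint32_t lcg24(uint32_t state)
{
    return (uint32_t)(((uint64_t)state * 0xFD43FD + 0xC39EC3) & 0xFFFFFF);
}
```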

<p>Popular choices under these constraints are the <a href="https://en.wikipedia.org/wiki/Lehmer_random_number_generator">Park–Miller generator</a> (as
we saw <a href="/blog/2018/12/25/">in Bash</a>) or a <a href="https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator">lagged Fibonacci generator</a> (as used by
Emacs, which was for a long time constrained to 29-bit integers).</p>

<p>However, I have a better idea: a PRNG based on <a href="https://en.wikipedia.org/wiki/RC4">RC4</a>. Specifically,
my own design called <a href="https://github.com/skeeto/scratch/tree/master/sp4"><strong>Sponge4</strong></a>, a <a href="https://en.wikipedia.org/wiki/Sponge_function">sponge construction</a>
built atop RC4. In short: Mixing in more input is just a matter of running
the key schedule again. Implementing this PRNG requires just two simple
operations: modular addition over 2^8, and array swap. QBasic has a <code class="language-plaintext highlighter-rouge">SWAP</code>
statement, so it’s a natural fit!</p>

<p>Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I
can mix in more sources of entropy. With its 1,700-bit state, it can
absorb quite a bit of entropy without loss.</p>

<h4 id="learning-qbasic">Learning QBasic</h4>

<p>Until this past weekend, I had not touched QBasic for about 23 years and
had to learn it essentially from scratch. Though within a couple of hours
I probably already understood it better than I ever had. That’s in large
part because I’m far more experienced, but also probably because QBasic
tutorials are universally awful. Not surprisingly they’re written for
beginners, but they also seem to all be written <em>by</em> beginners. I
soon got the impression that the QBasic community has usually been another
case of <a href="/blog/2019/09/25/">the blind leading the blind</a>.</p>

<p>There’s little direct information for experienced programmers, and even
the official documentation tends to be thin in important places. I wanted
documentation that started with the core language semantics:</p>

<ul>
  <li>
    <p>The basic types are INTEGER (int16), LONG (int32), SINGLE (float32),
DOUBLE (float64), and two flavors of STRING, fixed-width and
variable-width. Late versions also had incomplete support for a 64-bit,
10,000x fixed-point CURRENCY type.</p>
  </li>
  <li>
    <p>Variables are SINGLE by default and do not need to be declared ahead of
time. Arrays have 11 elements by default.</p>
  </li>
  <li>
    <p>Variables, constants, and functions may have a suffix if their type is
not SINGLE: INTEGER <code class="language-plaintext highlighter-rouge">%</code>, LONG <code class="language-plaintext highlighter-rouge">&amp;</code>, SINGLE <code class="language-plaintext highlighter-rouge">!</code>, DOUBLE <code class="language-plaintext highlighter-rouge">#</code>, STRING <code class="language-plaintext highlighter-rouge">$</code>,
and CURRENCY <code class="language-plaintext highlighter-rouge">@</code>. For functions, this is the return type.</p>
  </li>
  <li>
    <p>Each variable type has its own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i&amp;</code>. Arrays are also their own namespace, i.e. <code class="language-plaintext highlighter-rouge">i%</code> is distinct from
<code class="language-plaintext highlighter-rouge">i%(0)</code> is distinct from <code class="language-plaintext highlighter-rouge">i&amp;(0)</code>.</p>
  </li>
  <li>
    <p>Variables may be declared explicitly with <code class="language-plaintext highlighter-rouge">DIM</code>. Declaring a variable
with <code class="language-plaintext highlighter-rouge">DIM</code> allows the suffix to be omitted. It also locks that name out
of the other type namespaces, i.e. <code class="language-plaintext highlighter-rouge">DIM i AS LONG</code> makes any use of <code class="language-plaintext highlighter-rouge">i%</code>
invalid in that scope. Though arrays and scalars can still have the same
name even with <code class="language-plaintext highlighter-rouge">DIM</code> declarations.</p>
  </li>
  <li>
    <p>Numeric operations with mixed types implicitly promote like C.</p>
  </li>
  <li>
    <p>Functions and subroutines have a single, common namespace regardless of
function suffix. As a result, the suffix can (usually) be omitted at
function call sites. Built-in functions are special in this case.</p>
  </li>
  <li>
    <p>Despite initial appearances, QBasic is statically-typed.</p>
  </li>
  <li>
    <p>The default is pass-by-reference. Use <code class="language-plaintext highlighter-rouge">BYVAL</code> to pass by value.</p>
  </li>
  <li>
    <p>In array declarations, the parameter is not the <em>size</em> but the largest
index. Multidimensional arrays are supported. Arrays need not be indexed
starting at zero (e.g. <code class="language-plaintext highlighter-rouge">(x TO y)</code>), though this is the default.</p>
  </li>
  <li>
    <p>Strings are not arrays, but their own special thing with special
accessor statements and functions.</p>
  </li>
  <li>
    <p>Scopes are module, subroutine, and function. “Global” variables must be
declared with <code class="language-plaintext highlighter-rouge">SHARED</code>.</p>
  </li>
  <li>
    <p>Users can define custom structures with <code class="language-plaintext highlighter-rouge">TYPE</code>. Functions cannot return
user-defined types and instead rely on pass-by-reference.</p>
  </li>
  <li>
    <p>A crude kind of dynamic allocation is supported with <code class="language-plaintext highlighter-rouge">REDIM</code> to resize
<code class="language-plaintext highlighter-rouge">$DYNAMIC</code> arrays at run-time. <code class="language-plaintext highlighter-rouge">ERASE</code> frees allocations.</p>
  </li>
</ul>

<p><em>These</em> are the semantics I wanted to know getting started. Throw in some
illustrative examples, and then it’s a tutorial for experienced
developers. (Future article perhaps?) Anyway, that’s enough to follow
along below.</p>

<h4 id="implementing-sponge4">Implementing Sponge4</h4>

<p>Like RC4, I need a 256-element byte array, and two 1-byte indices, <code class="language-plaintext highlighter-rouge">i</code> and
<code class="language-plaintext highlighter-rouge">j</code>. Sponge4 also keeps a third 1-byte counter, <code class="language-plaintext highlighter-rouge">k</code>, to count input.</p>

<pre><code class="language-qbasic">TYPE sponge4
    i AS INTEGER
    j AS INTEGER
    k AS INTEGER
    s(0 TO 255) AS INTEGER
END TYPE
</code></pre>

<p>QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would
normally be a good match here, but since they’re not arrays, strings are
not compatible with <code class="language-plaintext highlighter-rouge">SWAP</code> and are not indexed efficiently. So instead I
accept some wasted space and use 16-bit integers for everything.</p>

<p>There are four “methods” for this structure. Three are subroutines since
they don’t return a value, but mutate the sponge. The last, <code class="language-plaintext highlighter-rouge">squeeze</code>,
returns the next byte as an INTEGER (<code class="language-plaintext highlighter-rouge">%</code>).</p>

<pre><code class="language-qbasic">DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)
</code></pre>

<p>Initialization follows RC4:</p>

<pre><code class="language-qbasic">SUB init (r AS sponge4)
    r.i = 0
    r.j = 0
    r.k = 0
    FOR i% = 0 TO 255
        r.s(i%) = i%
    NEXT
END SUB
</code></pre>

<p>Absorbing a byte means running the RC4 key schedule one step. Absorbing a
“stop” symbol, for separating inputs, transforms the state in a way that
absorbing a byte cannot.</p>

<pre><code class="language-qbasic">SUB absorb (r AS sponge4, b AS INTEGER)
    r.j = (r.j + r.s(r.i) + b) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    r.i = (r.i + 1) MOD 256
    r.k = (r.k + 1) MOD 256
END SUB

SUB absorbstop (r AS sponge4)
    r.j = (r.j + 1) MOD 256
END SUB
</code></pre>

<p>Squeezing a byte may involve mixing the state first, then it runs the RC4
generator normally.</p>

<pre><code class="language-qbasic">FUNCTION squeeze% (r AS sponge4)
    IF r.k &gt; 0 THEN
        absorbstop r
        DO WHILE r.k &gt; 0
            absorb r, r.k
        LOOP
    END IF

    r.j = (r.j + r.i) MOD 256
    r.i = (r.i + 1) MOD 256
    SWAP r.s(r.i), r.s(r.j)
    squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION
</code></pre>

<p>That’s the entire generator in QBasic! A couple more helper functions will
be useful, though. One absorbs entire strings, and the second emits 24-bit
results.</p>

<pre><code class="language-qbasic">SUB absorbstr (r AS sponge4, s AS STRING)
    FOR i% = 1 TO LEN(s)
        absorb r, ASC(MID$(s, i%))
    NEXT
END SUB

FUNCTION squeeze24&amp; (r AS sponge4)
    b0&amp; = squeeze%(r)
    b1&amp; = squeeze%(r)
    b2&amp; = squeeze%(r)
    squeeze24&amp; = b2&amp; * &amp;H10000 + b1&amp; * &amp;H100 + b0&amp;
END FUNCTION
</code></pre>

<p>QBasic doesn’t have bit-shift operations, so we must make do with
multiplication. The <code class="language-plaintext highlighter-rouge">&amp;H</code> is hexadecimal notation.</p>
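<p>Since the generator needs only modular addition and swaps, it also
translates almost line for line into C, which is handy for testing
outside DOSBox. This is an illustrative mirror of the QBasic code above,
not the canonical implementation (that lives in the sp4 repository):</p>

```c
#include <assert.h>

struct sponge4 {
    int i, j, k;
    int s[256];
};

static void init(struct sponge4 *r)
{
    r->i = r->j = r->k = 0;
    for (int i = 0; i < 256; i++) r->s[i] = i;
}

static void absorb(struct sponge4 *r, int b)
{
    r->j = (r->j + r->s[r->i] + b) % 256;
    int t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;  // SWAP
    r->i = (r->i + 1) % 256;
    r->k = (r->k + 1) % 256;
}

static void absorbstop(struct sponge4 *r)
{
    r->j = (r->j + 1) % 256;
}

static int squeeze(struct sponge4 *r)
{
    if (r->k > 0) {                        // pending input: pad and mix
        absorbstop(r);
        while (r->k > 0) absorb(r, r->k);  // k wraps to 0 at 256
    }
    r->j = (r->j + r->i) % 256;
    r->i = (r->i + 1) % 256;
    int t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;  // SWAP
    return r->s[(r->s[r->i] + r->s[r->j]) % 256];
}

static void absorbstr(struct sponge4 *r, const char *s)
{
    for (; *s; s++) absorb(r, (unsigned char)*s);
}

static long squeeze24(struct sponge4 *r)
{
    long b0 = squeeze(r), b1 = squeeze(r), b2 = squeeze(r);
    return b2*0x10000L + b1*0x100L + b0;
}
```

<p>Two sponges fed identical input must produce identical streams, and
every output stays within its advertised range; that determinism makes
the port easy to check.</p>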

<h4 id="putting-the-sponge-to-use">Putting the sponge to use</h4>

<p>One of the problems with the original program is that only the time of day
was a seed. Even were it mixed better, if we run the program at exactly
the same instant on two different days, we get the same seed. The <code class="language-plaintext highlighter-rouge">DATE$</code>
function returns the current date, which we can absorb into the sponge to
make the whole date part of the input.</p>

<pre><code class="language-qbasic">DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE$
absorbstr sponge, MKS$(TIMER)
absorbstr sponge, MKI$(ntickets)
</code></pre>

<p>I follow this up with the timer. It’s converted to a string with <code class="language-plaintext highlighter-rouge">MKS$</code>,
which returns the little-endian, single precision binary representation as
a 4-byte string. <code class="language-plaintext highlighter-rouge">MKI$</code> does the same for INTEGER, as a 2-byte string.</p>

<p>One of the problems with the original program was bias: Multiplying <code class="language-plaintext highlighter-rouge">RND</code>
by a constant, then truncating the result to an integer is not uniform in
most cases. Some numbers are selected slightly more often than others
because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all
the shuffling in the original it probably doesn’t make a practical
difference, but I’d like to avoid it.</p>

<p>In my program I account for it by generating another number if it happens
to fall into that extra “tail” part of the input distribution (very
unlikely for small <code class="language-plaintext highlighter-rouge">ntickets</code>). The <code class="language-plaintext highlighter-rouge">squeezen</code> function uniformly
generates a number from 0 (inclusive) to N (exclusive).</p>

<pre><code class="language-qbasic">FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
    DO
       x&amp; = squeeze24&amp;(r) - &amp;H1000000 MOD n
    LOOP WHILE x&amp; &lt; 0
    squeezen% = x&amp; MOD n
END FUNCTION
</code></pre>
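<p>To see why the rejection removes the bias, scale the same scheme down
to an 8-bit generator so the entire input space can be enumerated (an
illustrative analog, not the program’s code). Discarding draws from the
2^8 MOD n “tail” leaves every residue with exactly the same count:</p>

```c
#include <assert.h>

// Apply the squeezen%-style rejection rule to every possible 8-bit
// draw; return the per-residue count if the result is uniform, 0 if not.
static int draws_per_residue(int n)
{
    int count[256] = {0};
    int tail = 256 % n;             // size of the biased tail
    for (int v = 0; v < 256; v++) {
        int x = v - tail;
        if (x < 0) continue;        // reject: this draw would add bias
        count[x % n]++;
    }
    for (int i = 1; i < n; i++) {
        if (count[i] != count[0]) return 0;  // not uniform
    }
    return count[0];
}
```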

<p>Finally a Fisher–Yates shuffle, then print the first N elements:</p>

<pre><code class="language-qbasic">FOR i% = ntickets - 1 TO 1 STEP -1
    j% = squeezen%(sponge, i% + 1)
    SWAP tickets(i%), tickets(j%)
NEXT

FOR i% = 1 TO nresults
    PRINT tickets(i%)
NEXT
</code></pre>

<p>Though if you really love Kris’s loop idea:</p>

<pre><code class="language-qbasic">PRINT "Press Esc to finish, any other key for entropy..."
DO
    c&amp; = c&amp; + 1
    LOCATE 2, 1
    PRINT "cycles ="; c&amp;; "; keys ="; k%

    FOR i% = ntickets - 1 TO 1 STEP -1
        j% = squeezen%(sponge, i% + 1)
        SWAP tickets(i%), tickets(j%)
    NEXT

    k$ = INKEY$
    IF k$ = CHR$(27) THEN
        EXIT DO
    ELSEIF k$ &lt;&gt; "" THEN
        k% = k% + 1
        absorbstr sponge, k$
    END IF
    absorbstr sponge, MKS$(TIMER)
LOOP
</code></pre>

<p>If you want to try it out for yourself in, say, DOSBox, here’s the full
source: <a href="https://github.com/skeeto/scratch/blob/master/sp4/sponge4.bas"><strong><code class="language-plaintext highlighter-rouge">sponge4.bas</code></strong></a></p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>I Solved British Square</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/10/19/"/>
    <id>urn:uuid:c500b91a-046f-4320-8eff-9bc8f8443ef3</id>
    <updated>2020-10-19T19:32:52Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: I <a href="/blog/2022/10/12/">solved another game</a> using essentially the same
technique.</p>

<p><a href="https://boardgamegeek.com/boardgame/3719/british-square">British Square</a> is a 1978 abstract strategy board game which I
recently discovered <a href="https://www.youtube.com/watch?v=PChKZbut3lM&amp;t=10m">from a YouTube video</a>. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is <em>exactly</em> the sort of problem that <a href="https://xkcd.com/356/">nerd snipes</a>, and so I
sunk a couple of evenings building an analysis engine in C — enough to
fully solve the game and play <em>perfectly</em>.</p>

<p><strong>Repository</strong>: <a href="https://github.com/skeeto/british-square"><strong>British Square Analysis Engine</strong></a>
(and <a href="https://github.com/skeeto/british-square/releases">prebuilt binaries</a>)</p>

<p><a href="/img/british-square/british-square.jpg"><img src="/img/british-square/british-square-thumb.jpg" alt="" /></a>
<!-- Photo credit: Kelsey Wellons --></p>

<!--more-->

<p>The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.</p>

<p>In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the <a href="https://en.wikipedia.org/wiki/Minimax">minimax</a> values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.</p>

<h3 id="game-analysis">Game analysis</h3>

<p>Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.</p>

<p>Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.</p>

<p>Most importantly: <strong>The first player can always win by two points.</strong> In
other words, it’s <em>not</em> like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.</p>

<p>Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10-01
12021
11111
</code></pre></div></div>

<p>The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10201
12021
11111
</code></pre></div></div>

<p>Improved alternative special rule: <strong>Bias the score by two in favor of
the second player.</strong> This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.</p>

<p>The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second player win <em>if</em> the second player knows when to stop
mirroring.</p>

<p>One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I <em>was</em> able to prove
that <strong>a greedy heuristic — always claim as much territory as possible —
is often incorrect</strong> and, in some cases, leads to blunders.</p>

<h3 id="engine-implementation">Engine implementation</h3>

<p>As <a href="/blog/2017/04/27/">I’ve done before</a>, my engine represents the game using
<a href="https://www.chessprogramming.org/Bitboards">bitboards</a>. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (<code class="language-plaintext highlighter-rouge">place()</code>, <code class="language-plaintext highlighter-rouge">mask()</code>), as is validating moves
against the mask (<code class="language-plaintext highlighter-rouge">valid()</code>).</p>
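<p>For a feel of how such a mask might be built (a hypothetical sketch;
the engine’s actual <code class="language-plaintext highlighter-rouge">place()</code> and <code class="language-plaintext highlighter-rouge">mask()</code> are in the repository), a piece
on tile <code class="language-plaintext highlighter-rouge">t</code> rules out <code class="language-plaintext highlighter-rouge">t</code> and its 4-neighbors, clipped at the
board edges:</p>

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical sketch, not the engine's code: bits made illegal for
// the opponent by a piece on tile t (0-24, row-major on the 5x5 board).
static uint64_t tilemask(int t)
{
    int r = t / 5, c = t % 5;
    uint64_t m = 1ULL << t;           // the tile itself
    if (r > 0) m |= 1ULL << (t - 5);  // north
    if (r < 4) m |= 1ULL << (t + 5);  // south
    if (c > 0) m |= 1ULL << (t - 1);  // west
    if (c < 4) m |= 1ULL << (t + 1);  // east
    return m;
}
```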

<p>The longest possible game is 32 moves. This would <em>just</em> fit in 5 bits,
except that I needed a special “invalid” turn, making for a total of 33
distinct values. So I use 6 bits to store the turn counter.</p>

<p>Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: <strong>56
bits total</strong>. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the topmost 6 bits. The turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).</p>

<p>In other words, the <em>empty</em> state is <code class="language-plaintext highlighter-rouge">0x4000000000000</code> (<code class="language-plaintext highlighter-rouge">INIT</code>) and zero
is the null (invalid) state.</p>
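<p>The layout is easy to verify: with the turn counter in the top 6 bits
(50–55) and starting at 1, the empty state works out to exactly 2^50.
The accessor names below are mine, but the bit positions come straight
from the description above:</p>

```c
#include <assert.h>
#include <stdint.h>

// One game state in 56 bits of a 64-bit integer:
//   bits  0-24  first player's bitboard
//   bits 25-49  second player's bitboard
//   bits 50-55  turn counter (starts at 1, so all-zero means "null")
#define INIT 0x4000000000000ULL  // empty board, turn counter = 1

static int      turn(uint64_t b) { return (int)(b >> 50 & 0x3f); }
static uint64_t p1(uint64_t b)   { return b & 0x1ffffff; }
static uint64_t p2(uint64_t b)   { return b >> 25 & 0x1ffffff; }
```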

<p>Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Compute bitboard+mask where first play is tile 6</span>
    <span class="c1">// -----</span>
    <span class="c1">// -X---</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="kt">uint64_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">place</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="minimax-costs">Minimax costs</h4>

<p>The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.</p>

<p>When <a href="/blog/2010/10/17/">I’ve used minimax before</a>, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, <em>so I don’t store
them</em>. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.</p>

<p>What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.</p>

<p>There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.</p>

<h4 id="hash-table-memoization">Hash table (memoization)</h4>

<p>The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.</p>

<p>Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use <a href="/blog/2018/07/31/">an
integer hash function</a> to compute a starting index from which to
linearly probe an open addressing hash table. The <em>entire</em> hash table
implementation is literally a dozen lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="o">*</span>
<span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">bitboard</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">table</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="mh">0xffffffffffffff</span><span class="p">;</span> <span class="c1">// sans minimax</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">bitboard</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">*=</span> <span class="mh">0xcca1cee435c5048f</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">^=</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">%</span> <span class="n">N</span><span class="p">;</span> <span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="n">bitboard</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.</p>

<h4 id="canonicalization">Canonicalization</h4>

<p>Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent by symmetry or reflection.
Taking that into account, about 7/8th of the remaining work can still be
eliminated.</p>

<p>Multiple different states that are identical by symmetry must be
somehow “folded” into a single, <em>canonical</em> state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.</p>

<p>I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using <a href="https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating">fancy bit-twiddling hacks</a>.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Transpose a board or mask (flip along the diagonal).</span>
<span class="kt">uint64_t</span>
<span class="nf">transpose</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000020000010</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000410000208</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00008208004104</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00104104082082</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfe082083041041</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x01041040820820</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00820800410400</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00410000208000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00200000100000</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Flip a board or mask vertically.</span>
<span class="kt">uint64_t</span>
<span class="nf">flipv</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0000003e00001f</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000007c00003e0</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfc00f800007c00</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x001f00000f8000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x03e00001f00000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.</p>

<p>Canonicalizing a bitboard now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">canonicalize</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Callers need only use <code class="language-plaintext highlighter-rouge">canonicalize()</code> on values they pass to <code class="language-plaintext highlighter-rouge">lookup()</code>
or store in the table (via the returned pointer).</p>

<h3 id="developing-a-heuristic">Developing a heuristic</h3>

<p>If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>When Parallel: Pull, Don't Push</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/04/30/"/>
    <id>urn:uuid:ac12ef1d-299f-4edb-9eb1-5ed4dac1219c</id>
    <updated>2020-04-30T22:35:51Z</updated>
    <category term="optimization"/><category term="interactive"/><category term="javascript"/><category term="opengl"/><category term="media"/><category term="webgl"/><category term="c"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=23089729">on Hacker News</a>.</em></p>

<p>I’ve noticed a small pattern across a few of my projects where I had
vectorized and parallelized some code. The original algorithm had a
“push” approach, the optimized version instead took a “pull” approach.
In this article I’ll describe what I mean, though it’s mostly just so I
can show off some pretty videos, pictures, and demos.</p>

<!--more-->

<h3 id="sandpiles">Sandpiles</h3>

<p>A good place to start is the <a href="https://en.wikipedia.org/wiki/Abelian_sandpile_model">Abelian sandpile model</a>, which, like
many before me, completely <a href="https://xkcd.com/356/">captured</a> my attention for a while.
It’s a cellular automaton where each cell is a pile of grains of sand —
a sandpile. At each step, any sandpile with four or more grains of
sand spills one grain into each of its four 4-connected neighbors,
regardless of the number of grains in those neighboring cells. Cells at
the edge spill
their grains into oblivion, and those grains no longer exist.</p>

<p>With excess sand falling over the edge, the model eventually hits a
stable state where all piles have three or fewer grains. However, until
it reaches stability, all sorts of interesting patterns ripple through
the cellular automaton. In certain cases, the final pattern itself is
beautiful and interesting.</p>

<p>Numberphile has a great video describing how to <a href="https://www.youtube.com/watch?v=1MtEUErz7Gg">form a group over
recurrent configurations</a> (<a href="https://www.youtube.com/watch?v=hBdJB-BzudU">also</a>). In short, for any given grid
size, there’s a stable <em>identity</em> configuration that, when “added” to
any other element in the group, stabilizes back to that element. The
identity configuration is a fractal itself, and has been a focus of
study on its own.</p>

<p>Computing the identity configuration is really just about running the
simulation to completion a couple times from certain starting
configurations. Here’s an animation of the process for computing the
64x64 identity configuration:</p>

<p><a href="https://nullprogram.com/video/?v=sandpiles-64"><img src="/img/identity-64-thumb.png" alt="" /></a></p>

<p>As a fractal, the larger the grid, the more self-similar patterns there
are to observe. There are lots of samples online, and the biggest I
could find was <a href="https://commons.wikimedia.org/wiki/File:Sandpile_group_identity_on_3000x3000_grid.png">this 3000x3000 on Wikimedia Commons</a>. But I wanted
to see one <em>that’s even bigger, damnit</em>! So, skipping to the end, I
eventually computed this 10000x10000 identity configuration:</p>

<p><a href="/img/identity-10000.png"><img src="/img/identity-10000-thumb.png" alt="" /></a></p>

<p>This took 10 days to compute using my optimized implementation:</p>

<p><a href="https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c">https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c</a></p>

<p>I picked an algorithm described <a href="https://codegolf.stackexchange.com/a/106990">in a code golf challenge</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f(ones(n)*6 - f(ones(n)*6))
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">f()</code> is the function that runs the simulation to a stable state.</p>
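<p>As a concrete sketch of that formula — not the parallel implementation linked below — here's a minimal single-threaded version in C. The grid size <code>N</code> is an illustrative assumption kept tiny so it finishes instantly:</p>

```c
#define N 8

/* Run the simulation to a stable state: repeatedly topple any pile
 * with four or more grains, one grain to each 4-connected neighbor.
 * Grains that fall off the edge simply vanish. */
void stabilize(int g[N][N])
{
    for (int done = 0; !done;) {
        done = 1;
        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) {
                if (g[y][x] >= 4) {
                    done = 0;
                    g[y][x] -= 4;
                    if (y > 0)   g[y-1][x]++;
                    if (y < N-1) g[y+1][x]++;
                    if (x > 0)   g[y][x-1]++;
                    if (x < N-1) g[y][x+1]++;
                }
            }
        }
    }
}

/* identity = f(6 - f(6)), where "6" is the all-sixes grid */
void identity(int g[N][N])
{
    int t[N][N];
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            t[y][x] = 6;
    stabilize(t);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            g[y][x] = 6 - t[y][x];
    stabilize(g);
}
```

<p>A quick sanity check of the result: “adding” the identity to itself and stabilizing yields the identity again.</p>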

<p>I used <a href="/blog/2015/07/10/">OpenMP to parallelize across cores, and SIMD to parallelize
within a thread</a>. Each thread operates on 32 sandpiles at a time.
To compute the identity sandpile, each sandpile only needs 3 bits of
state, so this could potentially be increased to 85 sandpiles at a time
on the same hardware. The output format is my old mainstay, Netpbm,
<a href="/blog/2017/11/03/">including the video output</a>.</p>

<h4 id="sandpile-push-and-pull">Sandpile push and pull</h4>

<p>So, what do I mean about pushing and pulling? The naive approach to
simulating sandpiles looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    if input[i] &lt; 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
        for each j in neighbors {
            output[j] = output[j] + 1
        }
    }
}
</code></pre></div></div>

<p>As the algorithm examines each cell, it <em>pushes</em> results into
neighboring cells. If we’re using concurrency, that means multiple
threads of execution may be mutating the same cell, which requires
synchronization — locks, <a href="/blog/2014/09/02/">atomics</a>, etc. That much
synchronization is the death knell of performance. The threads will
spend all their time contending for the same resources, even if it’s
just false sharing.</p>

<p>The solution is to <em>pull</em> grains from neighbors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each i in sandpiles {
    if input[i] &lt; 4 {
        output[i] = input[i]
    } else {
        output[i] = input[i] - 4
    }
    for each j in neighbors {
        if input[j] &gt;= 4 {
            output[i] = output[i] + 1
        }
    }
}
</code></pre></div></div>

<p>Each thread only modifies one cell — the cell it’s in charge of updating
— so no synchronization is necessary. It’s shader-friendly and should
sound familiar if you’ve seen <a href="/blog/2014/06/10/">my WebGL implementation of Conway’s Game
of Life</a>. It’s essentially the same algorithm. If you chase down
the various Abelian sandpile references online, you’ll eventually come
across a 2017 paper by Cameron Fish about <a href="http://people.reed.edu/~davidp/homepage/students/fish.pdf">running sandpile simulations
on GPUs</a>. He cites my WebGL Game of Life article, bringing
everything full circle. We had spoken by email at the time, and he
<a href="https://people.reed.edu/~davidp/web_sandpiles/">shared his <strong>interactive simulation</strong> with me</a>.</p>

<p>Vectorizing this algorithm is straightforward: Load multiple piles at
once, one per SIMD channel, and use masks to implement the branches. In
my code I’ve also unrolled the loop. To avoid bounds checking in the
SIMD code, I pad the state data structure with zeros so that the edge
cells have static neighbors and are no longer special.</p>
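<p>Here's a scalar sketch of that padded pull update — the real code is SIMD and unrolled, and the dimensions and 8-bit cell type here are illustrative assumptions:</p>

```c
#define W 64
#define H 64

/* One pull step over a zero-padded grid. Real cell (y, x) lives at
 * padded index (y+1, x+1), so even edge cells have four readable
 * neighbors and the inner loop needs no bounds checks. */
void step(unsigned char in[H+2][W+2], unsigned char out[H+2][W+2])
{
    for (int y = 1; y <= H; y++) {
        for (int x = 1; x <= W; x++) {
            unsigned char c = in[y][x];
            c = c < 4 ? c : c - 4;   /* topple self if needed */
            c += in[y-1][x] >= 4;    /* pull a grain from each */
            c += in[y+1][x] >= 4;    /* toppling neighbor      */
            c += in[y][x-1] >= 4;
            c += in[y][x+1] >= 4;
            out[y][x] = c;
        }
    }
}
```

<p>The branchless neighbor pulls are exactly what a masked SIMD comparison computes, one lane per pile.</p>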

<h3 id="webgl-fire">WebGL Fire</h3>

<p>Back in the old days, one of the <a href="http://fabiensanglard.net/doom_fire_psx/">cool graphics tricks was fire
animations</a>. It was so easy to implement on limited hardware. In
fact, the most obvious way to compute it was directly in the
framebuffer, such as in <a href="/blog/2014/12/09/">the VGA buffer</a>, with no outside state.</p>

<p>There’s a heat source at the bottom of the screen, and the algorithm
runs from bottom up, propagating that heat upwards randomly. Here’s the
algorithm using traditional screen coordinates (top-left corner origin):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>func rand(min, max) // random integer in [min, max]

for each x, y from bottom {
    buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
</code></pre></div></div>

<p>As a <em>push</em> algorithm it works fine with a single thread, but
it doesn’t translate well to modern video hardware. So convert it to a
<em>pull</em> algorithm!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each x, y {
    sx = x + rand(-1, 1)
    sy = y + rand(1, 2)
    output[y][x] = input[sy][sx] - rand(0, 1)
}
</code></pre></div></div>

<p>Cells pull the fire upward from the bottom. Though this time there’s a
catch: <em>This algorithm will have subtly different results.</em></p>

<ul>
  <li>
    <p>In the original, there’s a single state buffer and so a flame could
propagate upwards multiple times in a single pass. I’ve compensated
here by allowing flames to propagate further at once.</p>
  </li>
  <li>
    <p>In the original, a flame only propagates to one other cell. In this
version, two cells might pull from the same flame, cloning it.</p>
  </li>
</ul>

<p>In the end it’s hard to tell the difference, so this works out.</p>
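<p>Fleshing the pull pseudocode out into C might look like the following sketch. The dimensions are illustrative, and the stand-in for <code>rand()</code> is a per-cell integer hash (the same contention-free trick discussed at the end of this section):</p>

```c
#include <stdint.h>

#define FW 40   /* framebuffer width  (illustrative) */
#define FH 24   /* framebuffer height (illustrative) */

/* Stand-in for rand(): hash x, y, and the frame number into 32
 * independent-looking bits so every cell draws its own random
 * values with no shared state. */
uint32_t cell_bits(uint32_t x, uint32_t y, uint32_t frame)
{
    uint32_t h = x*0x9e3779b9u ^ y*0x85ebca6bu ^ frame*0xc2b2ae35u;
    h ^= h >> 16;  h *= 0x7feb352du;
    h ^= h >> 15;  h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

/* One pull pass: each cell samples heat from a cell 1-2 rows below,
 * jittered horizontally, and randomly decays it. The bottom row is
 * the heat source and is copied through unchanged. */
void fire_step(uint8_t in[FH][FW], uint8_t out[FH][FW], uint32_t frame)
{
    for (int y = 0; y < FH - 1; y++) {
        for (int x = 0; x < FW; x++) {
            uint32_t r = cell_bits(x, y, frame);
            int sx = x + (int)(r % 3) - 1;        /* rand(-1, +1) */
            int sy = y + 1 + (int)(r >> 2 & 1);   /* rand(+1, +2) */
            sx = sx < 0 ? 0 : sx >= FW ? FW - 1 : sx;
            sy = sy > FH - 1 ? FH - 1 : sy;
            uint8_t v = in[sy][sx];
            out[y][x] = v && (r >> 3 & 1) ? v - 1 : v;  /* rand(0, 1) */
        }
    }
    for (int x = 0; x < FW; x++) {
        out[FH-1][x] = in[FH-1][x];
    }
}
```

<p>Each output cell is written exactly once, by exactly one iteration, so the passes can run in parallel — or as a fragment shader — with no synchronization.</p>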

<p><a href="https://nullprogram.com/webgl-fire/"><img src="/img/fire-thumb.png" alt="" /></a></p>

<p><a href="https://github.com/skeeto/webgl-fire/">source code and instructions</a></p>

<p>There’s still potentially contention in that <code class="language-plaintext highlighter-rouge">rand()</code> function, but this
can be resolved <a href="https://www.shadertoy.com/view/WttXWX">with a hash function</a> that takes <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as
inputs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Purgeable Memory Allocations for Linux</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/29/"/>
    <id>urn:uuid:50300bbe-0939-4bcf-96ff-8fb96a9b12d5</id>
    <updated>2019-12-29T00:25:49Z</updated>
    <category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I saw (part of) a video, <a href="https://www.youtube.com/watch?v=9l0nWEUpg7s">OS hacking: Purgeable memory</a>, by
Andreas Kling who’s writing an operating system called <a href="https://github.com/SerenityOS/serenity">Serenity</a>
and recording videos of his progress. In the video he implements
<em>purgeable memory</em> as <a href="https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html">found on some Apple platforms</a> by adding
special support in the kernel. A process tells the kernel that a
particular range of memory isn’t important, and so the kernel can
reclaim it if the system is under memory pressure — the memory is
purgeable.</p>

<p>Linux has a mechanism like this, <a href="http://man7.org/linux/man-pages/man2/madvise.2.html"><code class="language-plaintext highlighter-rouge">madvise(2)</code></a>, that allows
processes to provide hints to the kernel on how memory is expected to be
used. The flag of interest is <code class="language-plaintext highlighter-rouge">MADV_FREE</code>:</p>

<blockquote>
  <p>The application no longer requires the pages in the range specified by
<code class="language-plaintext highlighter-rouge">addr</code> and <code class="language-plaintext highlighter-rouge">len</code>. The kernel can thus free these pages, but the
freeing could be delayed until memory pressure occurs. For each of the
pages that has been marked to be freed but has not yet been freed, the
free operation will be canceled if the caller writes into the page.</p>
</blockquote>

<p>So, given this, I built a proof of concept / toy on top of <code class="language-plaintext highlighter-rouge">MADV_FREE</code>
that provides this functionality for Linux:</p>

<p><strong><a href="https://github.com/skeeto/purgeable">https://github.com/skeeto/purgeable</a></strong></p>

<p>It <a href="/blog/2018/11/15/">allocates anonymous pages</a> using <code class="language-plaintext highlighter-rouge">mmap(2)</code>. When the allocation
is “unlocked” — i.e. the process isn’t actively using it — its pages are
marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code> so that the kernel can reclaim them at any time.
To lock the allocation so that the process can safely make use of them,
the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> is canceled. This is all a little trickier than it sounds,
and that’s the subject of this article.</p>

<p>Note: There’s also <code class="language-plaintext highlighter-rouge">MADV_DONTNEED</code> which seems like it would fit the
bill, but <a href="https://www.youtube.com/watch?v=bg6-LVCHmGM#t=58m23s">it’s implemented incorrectly in Linux</a>. It <em>immediately</em>
frees the pages, and so it’s useless for implementing purgeable memory.</p>

<h3 id="purgeable-api">Purgeable API</h3>

<p>Before diving into the implementation, here’s the API. It’s <a href="/blog/2018/06/10/">just four
functions</a> with no structure definitions. The pointer used by the
API is the memory allocation itself. All the bookkeeping <a href="/blog/2017/01/08/">associated
with that pointer</a> is hidden away, out of sight from the API’s
consumer. The full documentation is in <a href="https://github.com/skeeto/purgeable/blob/master/purgeable.h"><code class="language-plaintext highlighter-rouge">purgeable.h</code></a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_alloc</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_unlock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">purgeable_lock</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">void</span>  <span class="nf">purgeable_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>The semantics are much like a C++ <code class="language-plaintext highlighter-rouge">weak_ptr</code> in that locking both
validates that the allocation is still available and creates a “strong”
reference to it that prevents it from being purged. Though unlike a weak
reference, the allocation is stickier. It will remain until the system is
actually under pressure, not just when the garbage collector happens to
run or the last strong reference is gone.</p>

<p>Here’s how it might be used to, say, store decoded PNG data that can
be decompressed again if needed:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="o">*</span><span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">png</span> <span class="o">*</span><span class="n">png</span> <span class="o">=</span> <span class="n">png_load</span><span class="p">(</span><span class="s">"texture.png"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">png</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>

<span class="cm">/* ... */</span>

<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span> <span class="o">*</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span> <span class="o">*</span> <span class="mi">4</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">texture</span><span class="p">)</span> <span class="n">die</span><span class="p">();</span>
        <span class="n">png_decode_rgba</span><span class="p">(</span><span class="n">png</span><span class="p">,</span> <span class="n">texture</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">purgeable_lock</span><span class="p">(</span><span class="n">texture</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">purgeable_free</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
        <span class="n">texture</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="k">continue</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">glTexImage2D</span><span class="p">(</span>
        <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">width</span><span class="p">,</span> <span class="n">png</span><span class="o">-&gt;</span><span class="n">height</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
        <span class="n">GL_RGBA</span><span class="p">,</span> <span class="n">GL_UNSIGNED_BYTE</span><span class="p">,</span> <span class="n">texture</span>
    <span class="p">);</span>
    <span class="n">purgeable_unlock</span><span class="p">(</span><span class="n">texture</span><span class="p">);</span>
    <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Memory is allocated in a locked state since it’s very likely to be
immediately filled with data. The application should unlock it before
moving on with other tasks. The purgeable memory must always be freed
using <code class="language-plaintext highlighter-rouge">purgeable_free()</code>, even if <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> failed. This not only
frees the bookkeeping, but also releases the now-zero pages and the
mapping itself. Originally I had <code class="language-plaintext highlighter-rouge">purgeable_lock()</code> free the purgeable
memory on failure, but I felt this was clearer. There’s no technical
reason it couldn’t, though.</p>

<h3 id="purgeable-implementation">Purgeable Implementation</h3>

<p>The main challenge is that the kernel doesn’t necessarily treat the
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> range as a unit. It might reclaim just some pages, and do
so in an arbitrary order. In order to lock the region, each page must be
handled individually. Per the man page quoted above, reversing
<code class="language-plaintext highlighter-rouge">MADV_FREE</code> requires a write to each page — to either trigger a page
fault or set <a href="https://en.wikipedia.org/wiki/Dirty_bit">a dirty bit</a>.</p>

<p>The only way to tell if a page has been purged is to check if it’s been
filled with zeros. That’s easy if we’re sure a particular byte in the
page should be zero, but, since this is a library, the caller might just
store <em>anything</em> on these pages.</p>

<p>So here’s my solution: To unlock a page, look at the first byte on the
page. Remember whether or not it’s zero. If it’s zero, write a 1 into
that byte. Once this has been done for all pages, use <code class="language-plaintext highlighter-rouge">madvise(2)</code> to
mark them all <code class="language-plaintext highlighter-rouge">MADV_FREE</code>.</p>

<p>With this approach, the library only needs to track one bit of information
per page regardless of the page’s contents. Assuming 4kB pages, each 32kB
of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead.
Not too bad!</p>

<p>Locking purgeable memory is a little trickier. Again, each page must be
visited in turn, and if any page was purged, then the whole allocation is
considered lost. If the first byte was non-zero when unlocking, the
library checks that it’s still non-zero. If the first byte was zero when
unlocking, then it prepares to write a zero back into that byte, which
must currently be non-zero.</p>

<p>In either case, the <code class="language-plaintext highlighter-rouge">MADV_FREE</code> needs to be canceled using a write, so
the library <a href="/blog/2014/09/02/">does an atomic compare-and-swap</a> (CAS) to write the
correct byte into the page, <em>even if it’s the same value</em> in the
non-zero case. The atomic CAS is essential because <strong>it ensures the page
wasn’t purged between the check and the write, as both are done
together, atomically</strong>. If every page has the expected first byte, and
every CAS succeeded, then the purgeable memory has been successfully
locked.</p>
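<p>A sketch of the locking pass using C11 atomics, again with an illustrative <code>bits</code> table rather than the library's exact layout:</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* For each page: load the first byte, fail if it was purged to zero,
 * then CAS the correct value back into place. The CAS is the write
 * that cancels MADV_FREE, and its success proves the page wasn't
 * purged between the load and the store. */
bool try_lock(unsigned char *buf, size_t numpages,
              size_t pagesize, const unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        _Atomic unsigned char *first =
            (_Atomic unsigned char *)(buf + i*pagesize);
        unsigned char c = atomic_load(first);
        if (!c) {
            return false;               /* page was purged */
        }
        /* a set bit means the byte was originally zero: restore it */
        unsigned char want = bits[i/8] >> (i%8) & 1 ? 0 : c;
        if (!atomic_compare_exchange_strong(first, &c, want)) {
            return false;               /* purged during the CAS */
        }
    }
    return true;
}
```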

<p>As an optimization, the library could consider more than just the first
byte, and look at, say, the first <code class="language-plaintext highlighter-rouge">long int</code> on each page. The library
does less work when the page contains a non-zero value, and the chance of
an arbitrary 8-byte value being zero is much lower. However, I wanted to
avoid <a href="/blog/2018/07/20/#strict-aliasing">potential aliasing issues</a>, especially if this library were
to be embedded, so I passed on the idea.</p>

<h4 id="bookkeeping">Bookkeeping</h4>

<p>The bookkeeping data is stored just before the buffer returned as the
purgeable memory, and it’s never marked with <code class="language-plaintext highlighter-rouge">MADV_FREE</code>. Assuming 4kB
pages, for each 128MB of purgeable memory the library allocates one extra
anonymous page to track it. The number of pages in the allocation is
stored just before the purgeable memory as a <code class="language-plaintext highlighter-rouge">size_t</code>, and the rest is the
per-page bit table described above.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">purgeable_alloc</span><span class="p">(</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">);</span>
<span class="kt">size_t</span> <span class="n">numpages</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span>
</code></pre></div></div>

<p>So the library can immediately find it starting from the purgeable memory
address. Here’s an illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
</code></pre></div></div>

<p>The downside is that buffer underflows in the application would easily
trample the <code class="language-plaintext highlighter-rouge">numpages</code> value because it’s located immediately adjacent. It
would be safer to move it to the <em>beginning</em> of the first page before the
purgeable memory, but this would have made bit table access more
complicated. While the region is locked, the contents of the bit table
don’t matter, so it won’t be damaged by an underflow. Another idea: put a
checksum alongside <code class="language-plaintext highlighter-rouge">numpages</code>. It could just be a simple <a href="/blog/2018/07/31/">integer
hash</a>.</p>

<p>This makes for a really slick API since the consumer doesn’t need to track
anything more than a single pointer, the address of the purgeable memory
allocation itself.</p>

<h3 id="worth-using">Worth using?</h3>

<p>I’m not quite sure how often I’d actually use purgeable memory in real
programs, especially in software intended to be portable. Each operating
system needs its own implementation, and this library is not portable
since it relies on interfaces and behaviors specific to Linux.</p>

<p>It also has a not-so-unlikely pathological case: Imagine a program that
makes two purgeable memory allocations, and they’re large enough that one
always evicts the other. The program would thrash back and forth
fighting itself as it used each allocation. Detecting this situation
might be difficult, especially as the number of purgeable memory
allocations increases.</p>

<p>Regardless, it’s another tool for my software toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Efficient Alias of a Built-In Emacs Lisp Function</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/10/"/>
    <id>urn:uuid:15421609-2681-4b75-99b2-b2d6aaa835fe</id>
    <updated>2019-12-10T02:32:04Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Suppose you don’t like the names <code class="language-plaintext highlighter-rouge">car</code> and <code class="language-plaintext highlighter-rouge">cdr</code>, the traditional
identifiers for two halves of a lisp cons cell. <a href="https://irreal.org/blog/?p=8500">This is
misguided.</a> A cons is really just a 2-tuple, and the halves
don’t have any particular meaning on their own, even as “head” and
“tail.” However, maybe this is really important to you so you want to
do it anyway. What’s the best way to go about it?</p>

<h3 id="defalias">defalias</h3>

<p>Emacs Lisp has a built-in function just for this, <code class="language-plaintext highlighter-rouge">defalias</code>, which
is the obvious choice.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'car-alias</span> <span class="nf">#'</span><span class="nb">car</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">car</code> built-in function is so fundamental to the language that <a href="/blog/2014/01/04/">it
gets its own byte-code opcode</a>. When you call <code class="language-plaintext highlighter-rouge">car</code> in your code,
the byte-compiler doesn’t generate a function call, but instead uses a
single instruction. For example, here’s an <code class="language-plaintext highlighter-rouge">add</code> function that sums
the <code class="language-plaintext highlighter-rouge">car</code> of its two arguments. I’ve followed the definition with its
disassembly (Emacs 26.3, <a href="/blog/2016/12/22/">lexical scope</a>):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">add</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       car</span>
<span class="c1">;; 2       stack-ref 1</span>
<span class="c1">;; 3       car</span>
<span class="c1">;; 4       plus</span>
<span class="c1">;; 5       return</span>
</code></pre></div></div>

<p>There are zero function calls because of the dedicated <code class="language-plaintext highlighter-rouge">car</code> opcode, and
it has the optimal six byte-code instructions.</p>

<p>The problem with <code class="language-plaintext highlighter-rouge">defalias</code> is that the definition is permitted to change
— or <a href="/blog/2013/01/22/">be advised</a> — and that robs the byte-compiler of
optimization opportunities. It’s <a href="/blog/2019/12/09/">a constraint</a>. When the
byte-code compiler sees <code class="language-plaintext highlighter-rouge">car-alias</code>, it <em>must</em> emit a function call:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">add-alias</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">car-alias</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">car-alias</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       constant  car-alias</span>
<span class="c1">;; 1       stack-ref 2</span>
<span class="c1">;; 2       call      1</span>
<span class="c1">;; 3       constant  car-alias</span>
<span class="c1">;; 4       stack-ref 2</span>
<span class="c1">;; 5       call      1</span>
<span class="c1">;; 6       plus</span>
<span class="c1">;; 7       return</span>
</code></pre></div></div>

<p>This has two function calls and eight byte-code instructions. Those
function calls are significantly more expensive than a <code class="language-plaintext highlighter-rouge">car</code>
instruction, which will show in the benchmark later.</p>

<h3 id="defsubst">defsubst</h3>

<p>An alternative is <code class="language-plaintext highlighter-rouge">defsubst</code>, an inlined function definition, which
will inline an actual <code class="language-plaintext highlighter-rouge">car</code>. The semantics for <code class="language-plaintext highlighter-rouge">defsubst</code> are, like
macros, explicit that re-definitions may not affect previous uses, so
the constraint is gone. Unfortunately <a href="/blog/2019/02/24/">the byte-code compiler is
pretty dumb</a>, and does a poor job inlining <code class="language-plaintext highlighter-rouge">car-subst</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defsubst</span> <span class="nv">car-subst</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">car</span> <span class="nv">x</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">add-subst</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">car-subst</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">car-subst</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       dup</span>
<span class="c1">;; 2       car</span>
<span class="c1">;; 3       stack-set 1</span>
<span class="c1">;; 5       stack-ref 1</span>
<span class="c1">;; 6       dup</span>
<span class="c1">;; 7       car</span>
<span class="c1">;; 8       stack-set 1</span>
<span class="c1">;; 10      plus</span>
<span class="c1">;; 11      return</span>
</code></pre></div></div>

<p>There are zero function calls and ten byte-code instructions. The
<code class="language-plaintext highlighter-rouge">car</code> opcode <em>is</em> in use, but there are four unnecessary instructions.
This is still faster than making the function calls, though. If the
byte-code compiler was just a little smarter and could compile this to
the ideal case, then this would be the end of the discussion.</p>

<h3 id="cl-first">cl-first</h3>

<p>The built-in <code class="language-plaintext highlighter-rouge">cl-lib</code> package has a <code class="language-plaintext highlighter-rouge">cl-first</code> alias for <code class="language-plaintext highlighter-rouge">car</code>. This was
written by someone with intimate knowledge of Emacs Lisp, so how
well did they do?</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">require</span> <span class="ss">'cl-lib</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">add-cl-first</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nv">cl-first</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nv">cl-first</span> <span class="nv">b</span><span class="p">)))</span>
<span class="c1">;; 0       stack-ref 1</span>
<span class="c1">;; 1       car</span>
<span class="c1">;; 2       stack-ref 1</span>
<span class="c1">;; 3       car</span>
<span class="c1">;; 4       plus</span>
<span class="c1">;; 5       return</span>
</code></pre></div></div>

<p>It’s just like plain old <code class="language-plaintext highlighter-rouge">car</code>! How did they manage this? By using a
byte-compiler hint:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'cl-first</span> <span class="ss">'car</span><span class="p">)</span>
<span class="p">(</span><span class="nv">put</span> <span class="ss">'cl-first</span> <span class="ss">'byte-optimizer</span> <span class="ss">'byte-compile-inline-expand</span><span class="p">)</span>
</code></pre></div></div>

<p>They used <code class="language-plaintext highlighter-rouge">defalias</code>, but they also manually told the byte-compiler to
inline the definition like <code class="language-plaintext highlighter-rouge">defsubst</code>. In fact, <code class="language-plaintext highlighter-rouge">defsubst</code> expands to an
expression that sets this same <code class="language-plaintext highlighter-rouge">byte-optimizer</code> property, but, as seen above,
the inlined function’s argument-handling overhead gets copied in rather than eliminated.</p>

<h3 id="benchmark">Benchmark</h3>

<p>So how do the alternatives perform? (<a href="https://gist.github.com/skeeto/36baa3b1493f53eab4e082b449448a96">benchmark source</a>)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add           (0.594811299 0 0.0)
add-alias     (1.232037132 0 0.0)
add-subst     (0.700044324 0 0.0)
add-cl-first  (0.58332882 0 0.0)
</code></pre></div></div>

<p>(The <code class="language-plaintext highlighter-rouge">car</code> of the list is the running time.) Since <code class="language-plaintext highlighter-rouge">add</code> and
<code class="language-plaintext highlighter-rouge">add-cl-first</code> have the same byte-codes, we shouldn’t, and didn’t, see
a significant difference. The simple use of <code class="language-plaintext highlighter-rouge">defalias</code> doubles the
running time, and using <code class="language-plaintext highlighter-rouge">defsubst</code> is about 18% slower.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Chunking Optimizations: Let the Knife Do the Work</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/12/09/"/>
    <id>urn:uuid:961086fa-46af-42d4-bd69-6f4a326a1505</id>
    <updated>2019-12-09T22:37:55Z</updated>
    <category term="c"/><category term="cpp"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>There’s an old saying, <a href="https://www.youtube.com/watch?v=bTee6dKpDB0"><em>let the knife do the work</em></a>. Whether
preparing food in the kitchen or whittling a piece of wood, don’t push
your weight into the knife. Not only is it tiring, you’re much more
likely to hurt yourself. Use the tool properly and little force will be
required.</p>

<p>The same advice also often applies to compilers.</p>

<p>Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of
data. The simplest approach would be to do it a byte at a time:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* XOR src into dst */</span>
<span class="kt">void</span>
<span class="nf">xor512a</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did <em>exactly</em> what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, <code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512a:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="p">],</span> <span class="nb">cl</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">64</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The target architecture has wide registers so it could be doing <em>at
least</em> 8 bytes at a time. Since your compiler isn’t doing it, you
decide to chunk the work into 8-byte blocks yourself, a manual
<em>chunking operation</em>. Here’s some <a href="https://old.reddit.com/r/C_Programming/comments/e83jzk/strange_gcc_compiler_bug_when_using_o2_or_higher/">real world
code</a> that does so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* WARNING: Broken, do not use! */</span>
<span class="kt">void</span>
<span class="nf">xor512b</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>You check the assembly output of this function, and it looks much
better. It’s now processing 8 bytes at a time, so it should be about 8
times faster than before.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512b:</span>
        <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nl">.L0:</span>    <span class="nf">mov</span>    <span class="nb">rcx</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
        <span class="nf">xor</span>    <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="nb">rax</span><span class="o">*</span><span class="mi">8</span><span class="p">],</span> <span class="nb">rcx</span>
        <span class="nf">inc</span>    <span class="nb">rax</span>
        <span class="nf">cmp</span>    <span class="nb">rax</span><span class="p">,</span> <span class="mi">8</span>
        <span class="nf">jne</span>    <span class="nv">.L0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Still, this machine has 16-byte wide registers (SSE2 <code class="language-plaintext highlighter-rouge">xmm</code>), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: <strong>The output
is now wrong!</strong></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">dst</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span>
        <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">16</span>
    <span class="p">};</span>
    <span class="kt">uint32_t</span> <span class="n">src</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">36</span><span class="p">,</span> <span class="mi">49</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span>
        <span class="mi">81</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">121</span><span class="p">,</span> <span class="mi">144</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="mi">196</span><span class="p">,</span> <span class="mi">225</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span>
    <span class="p">};</span>
    <span class="n">xor512b</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">src</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">dst</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Your program prints 1..16 as if <code class="language-plaintext highlighter-rouge">xor512b()</code> was never called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!</p>

<p>Investigating a bit more, you learn that the <code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of <code class="language-plaintext highlighter-rouge">uint32_t</code> was accessed as a <code class="language-plaintext highlighter-rouge">uint64_t</code>. As
an <a href="/blog/2018/07/20/#strict-aliasing">important optimization</a>, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.</p>

<p>The original version is fine because <code class="language-plaintext highlighter-rouge">char *</code>, including both <code class="language-plaintext highlighter-rouge">signed</code>
and <code class="language-plaintext highlighter-rouge">unsigned</code>, has a special exemption and may alias with anything. For
the same reason, using <code class="language-plaintext highlighter-rouge">char *</code> unnecessarily can also make your
programs slower.</p>

<p>What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use <code class="language-plaintext highlighter-rouge">memcpy()</code>. Copy
the chunks into legitimate, local <code class="language-plaintext highlighter-rouge">uint64_t</code> variables, do the work, and
copy the result back out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512c</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">8</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">uint64_t</span> <span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">0</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">src</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
        <span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">^=</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
        <span class="n">memcpy</span><span class="p">((</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">dst</span> <span class="o">+</span> <span class="n">i</span><span class="o">*</span><span class="mi">8</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">memcpy()</code> is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for <code class="language-plaintext highlighter-rouge">xor512c</code> is identical to <code class="language-plaintext highlighter-rouge">xor512b</code>, but it won’t go haywire
when integrated into a real program.</p>

<p>It works and it’s correct, but you can still do much better than this!</p>

<h3 id="letting-your-compiler-do-the-work">Letting your compiler do the work</h3>

<p>The problem is you’re forcing the knife and not letting it do the work.
There’s a constraint on your compiler that hasn’t been considered: It
must work correctly for overlapping inputs.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">74</span><span class="p">]</span> <span class="o">=</span> <span class="p">{...};</span>
<span class="n">xor512a</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>

<p>In this situation, the byte-by-byte and chunked versions of the function
will have different results. That’s exactly why your compiler can’t do
the chunking operation itself. However, <em>you don’t care about this
situation</em> because the inputs never overlap.</p>

<p>Let’s revisit the first, simple implementation, but this time being
smarter about it. The <code class="language-plaintext highlighter-rouge">restrict</code> keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">xor512d</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">dst</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">src</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">dst</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ps</span> <span class="o">=</span> <span class="n">src</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">64</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">pd</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">^=</span> <span class="n">ps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>(Side note: Adding <code class="language-plaintext highlighter-rouge">restrict</code> to the manually chunked function,
<code class="language-plaintext highlighter-rouge">xor512b()</code>, will not fix it. Using <code class="language-plaintext highlighter-rouge">restrict</code> can never make an
incorrect program correct.)</p>

<p>Compiled with GCC 9.2.0 and <code class="language-plaintext highlighter-rouge">-O3</code>, the resulting unrolled code
processes 16-byte chunks at a time (<code class="language-plaintext highlighter-rouge">pxor</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm2</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm3</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm4</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x00</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm2</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x10</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm3</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">movdqu</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">]</span>
        <span class="nf">pxor</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm4</span>
        <span class="nf">movups</span>  <span class="p">[</span><span class="nb">rdi</span><span class="o">+</span><span class="mh">0x30</span><span class="p">],</span> <span class="nv">xmm0</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(<code class="language-plaintext highlighter-rouge">-mavx512bw</code>), <em>it does the entire operation in a single, big chunk!</em></p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">xor512d:</span>
        <span class="nf">vmovdqu64</span>   <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">vpxorq</span>      <span class="nv">zmm0</span><span class="p">,</span> <span class="nv">zmm0</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">vmovdqu64</span>   <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">zmm0</span>
        <span class="nf">vzeroupper</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>“Letting the knife do the work” means writing a correct program and
lifting unnecessary constraints so that the compiler can use whatever
chunk size is appropriate for the target.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>On-the-fly Linear Congruential Generator Using Emacs Calc</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/11/19/"/>
    <id>urn:uuid:13e56720-ef3a-4fa4-a4ff-0a6fef914504</id>
    <updated>2019-11-19T01:17:50Z</updated>
    <category term="emacs"/><category term="crypto"/><category term="optimization"/><category term="c"/><category term="java"/><category term="javascript"/>
    <content type="html">
      <![CDATA[<p>I regularly make throwaway “projects” and do a surprising amount of
programming in <code class="language-plaintext highlighter-rouge">/tmp</code>. For Emacs Lisp, the equivalent is the
<code class="language-plaintext highlighter-rouge">*scratch*</code> buffer. These are places where I can make a mess, and the
mess usually gets cleaned up before it becomes a problem. A lot of my
established projects (<a href="/blog/2019/03/22/">ex</a>.) start out in volatile storage and
only graduate to more permanent storage once the concept has proven
itself.</p>

<p>Throughout my whole career, this sort of throwaway experimentation has
been an important part of my personal growth, and I try to <a href="/blog/2016/09/02/">encourage it
in others</a>. Even if the idea I’m trying doesn’t pan out, I usually
learn something new, and occasionally it translates into an article here.</p>

<p>I also enjoy small programming challenges. One of the most abused
tools in my mental toolbox is the Monte Carlo method, and I readily
apply it to solve toy problems. Even beyond this, random number
generators are frequently a useful tool (<a href="/blog/2017/04/27/">1</a>, <a href="/blog/2019/07/22/">2</a>), so I
find myself reaching for one all the time.</p>

<p>Nearly every programming language comes with a pseudo-random number
generation function or library. Unfortunately the language’s standard
PRNG is usually a poor choice (C, <a href="https://arvid.io/2018/06/30/on-cxx-random-number-generator-quality/">C++</a>, <a href="https://lowleveldesign.org/2018/08/15/randomness-in-net/">C#</a>, <a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">Go</a>).
It’s probably mediocre quality, <a href="/blog/2018/05/27/">slower than it needs to be</a>
(<a href="https://grokbase.com/t/gg/golang-nuts/155f6kbb7a/go-nuts-why-are-high-bits-used-by-math-rand-helpers-instead-of-low-ones">also</a>), <a href="https://lists.freebsd.org/pipermail/svn-src-head/2013-July/049068.html">lacks reliable semantics or behavior between
implementations</a>, or is missing some other property I want. So I’ve
long been a fan of <em>BYOPRNG:</em> Bring Your Own Pseudo-random Number
Generator. Just embed a generator with the desired properties directly
into the program. The <a href="/blog/2017/09/21/">best non-cryptographic PRNGs today</a> are
tiny and exceptionally friendly to embedding. Though, depending on what
you’re doing, you might <a href="/blog/2019/04/30/">need to be creative about seeding</a>.</p>

<h3 id="crafting-a-prng">Crafting a PRNG</h3>

<p>On occasion I don’t have an established, embeddable PRNG in reach, and
I have yet to commit xoshiro256** to memory. Or maybe I want to use
a totally unique PRNG for a particular project. In these cases I make
one up. With just a bit of know-how it’s not too difficult.</p>

<p>Probably the easiest decent PRNG to code from scratch is the venerable
<a href="https://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> (LCG). It’s a simple recurrence
relation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x[1] = (x[0] * A + C) % M
</code></pre></div></div>

<p>That’s trivial to remember once you know the details. You only need to
choose appropriate values for <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">C</code>, and <code class="language-plaintext highlighter-rouge">M</code>. Done correctly, it
will be a <em>full-period</em> generator — one that visits every number
from 0 to <code class="language-plaintext highlighter-rouge">M - 1</code> exactly once, in some permuted order, before
repeating. The seed — the value of <code class="language-plaintext highlighter-rouge">x[0]</code> — chooses a starting
position in this (looping) permutation.</p>
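<p>To make “full period” concrete, here is a toy check of my own (not from
the original article) at <code class="language-plaintext highlighter-rouge">M = 8</code>, small enough to verify
exhaustively:</p>

```c
/* Returns 1 if x -> (x*a + c) % 8 visits all 8 residues before
 * repeating, i.e. the LCG is full-period for M = 8. */
int full_period8(unsigned a, unsigned c)
{
    unsigned seen = 0;
    unsigned x = 0;  /* any seed gives the same answer */
    for (int i = 0; i < 8; i++) {
        seen |= 1u << x;
        x = (x*a + c) % 8;
    }
    return seen == 0xFFu;
}
```

<p>With <code class="language-plaintext highlighter-rouge">A = 5</code> and <code class="language-plaintext highlighter-rouge">C = 1</code> the generator is full-period, while
<code class="language-plaintext highlighter-rouge">A = 3</code> (where <code class="language-plaintext highlighter-rouge">A - 1</code> isn’t divisible by four) cycles early.</p>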

<p><code class="language-plaintext highlighter-rouge">M</code> has a natural, obvious choice: a power of two matching the range of
operands, such as 2^32 or 2^64. With this the modulo operation is free
as a natural side effect of the computer architecture.</p>
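<p>In C, for instance, the reduction takes no code at all. A small
illustration of my own: unsigned arithmetic is defined to wrap, so the
modulo by 2^64 happens automatically.</p>

```c
#include <stdint.h>

/* One LCG step at M = 2^64. The "% M" is implicit: uint64_t
 * arithmetic wraps modulo 2^64 by definition. */
uint64_t lcg_step(uint64_t x, uint64_t a, uint64_t c)
{
    return x*a + c;
}
```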

<p>Choosing <code class="language-plaintext highlighter-rouge">C</code> also isn’t difficult. It must be co-prime with <code class="language-plaintext highlighter-rouge">M</code>, and
since <code class="language-plaintext highlighter-rouge">M</code> is a power of two, any odd number is valid. Even 1. In
theory choosing a small value like 1 is faster since the compiler
won’t need to embed a large integer in the code, but this difference
doesn’t show up in any micro-benchmarks I tried. If you want a cool,
unique generator, then choose a large random integer. More on that
below.</p>

<p>The tricky value is <code class="language-plaintext highlighter-rouge">A</code>, and getting it right is the linchpin of the
whole LCG. It must be coprime with <code class="language-plaintext highlighter-rouge">M</code> (i.e. not even), and, for a
full-period generator, <code class="language-plaintext highlighter-rouge">A-1</code> must be divisible by four. For better
results, <code class="language-plaintext highlighter-rouge">A-1</code> should not be divisible by 8. A good choice is a prime
number that satisfies these properties.</p>
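<p>For <code class="language-plaintext highlighter-rouge">M</code> = 2^64, all of these conditions collapse into a one-line
screen. This is my own sketch, not from the article:</p>

```c
#include <stdint.h>

/* For M = 2^64: A must be odd, with A-1 divisible by 4 but ideally
 * not by 8. Together these reduce to A % 8 == 5. */
int good_multiplier(uint64_t a)
{
    return a % 8 == 5;
}
```

<p>This is also why the Calc recipe below looks for a trailing hex digit of
<code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code>: both are congruent to 5 modulo 8.</p>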

<p>If your operands are 64-bit integers, or larger, how are you going to
generate a prime number?</p>

<h4 id="primes-from-emacs-calc">Primes from Emacs Calc</h4>

<p>Emacs Calc can solve this problem. I’ve <a href="/blog/2009/06/23/">noted before</a> how
featureful it is. It has arbitrary precision, random number
generation, and primality testing. It’s everything we need to choose
<code class="language-plaintext highlighter-rouge">A</code>. (In fact, this is nearly identical to <a href="/blog/2015/10/30/">the process I used to
implement RSA</a>.) For this example I’m going to generate a 64-bit
LCG for the C programming language, but it’s easy to use whatever
width you like and mostly whatever language you like. If you wanted a
<a href="http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html">minimal standard 128-bit LCG</a>, this will still work.</p>

<p>Start by opening up Calc with <code class="language-plaintext highlighter-rouge">M-x calc</code>, then:</p>

<ol>
  <li>Push <code class="language-plaintext highlighter-rouge">2</code> on the stack</li>
  <li>Push <code class="language-plaintext highlighter-rouge">64</code> on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">^</code>, computing 2^64 and pushing it on the stack</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k r</code> to generate a random number in this range</li>
  <li>Press <code class="language-plaintext highlighter-rouge">d r 16</code> to switch to hexadecimal display</li>
  <li>Press <code class="language-plaintext highlighter-rouge">k n</code> to find the next prime following the random value</li>
  <li>Repeat step 6 until you get a number that ends with <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code></li>
  <li>Press <code class="language-plaintext highlighter-rouge">k p</code> a few times to avoid false positives</li>
</ol>

<p>What’s left on the stack is your <code class="language-plaintext highlighter-rouge">A</code>! If you want a random value for
<code class="language-plaintext highlighter-rouge">C</code>, you can follow a similar process. Heck, make it prime, too!</p>

<p>The reason for using hexadecimal (step 5) and looking for <code class="language-plaintext highlighter-rouge">5</code> or <code class="language-plaintext highlighter-rouge">D</code>
(step 7) is that such numbers satisfy both of the important properties
for <code class="language-plaintext highlighter-rouge">A-1</code>.</p>

<p>Calc doesn’t try to factor your random integer. Instead it uses the
<a href="https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test">Miller–Rabin primality test</a>, a probabilistic test that, itself,
requires random numbers. It has false positives but no false negatives.
The false positives can be mitigated by repeating the test multiple
times, hence step 8.</p>
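<p>The test itself is compact enough to sketch from memory. Here is a
minimal version of my own, assuming the GCC/Clang
<code class="language-plaintext highlighter-rouge">unsigned __int128</code> extension for overflow-free modular
multiplication, and using fixed small witnesses where a real implementation
would draw random ones:</p>

```c
#include <stdint.h>

/* (a * b) % m without overflow, via a 128-bit intermediate. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((unsigned __int128)a * b % m);
}

/* (b ^ e) % m by square-and-multiply. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, m);
        b = mulmod(b, b, m);
    }
    return r;
}

/* One Miller-Rabin round: 0 means n is certainly composite, 1 means
 * n is probably prime with respect to witness a. */
static int mr_round(uint64_t n, uint64_t a)
{
    uint64_t d = n - 1;
    int s = 0;
    while (!(d & 1)) { d >>= 1; s++; }
    uint64_t x = powmod(a, d, n);
    if (x == 1 || x == n - 1) return 1;
    for (int i = 1; i < s; i++) {
        x = mulmod(x, x, n);
        if (x == n - 1) return 1;
    }
    return 0;
}

int is_probable_prime(uint64_t n, int rounds)
{
    static const uint64_t witness[] = {2, 3, 5, 7, 11, 13, 17};
    if (n < 4) return n == 2 || n == 3;
    if (!(n & 1)) return 0;
    for (int i = 0; i < rounds && i < 7; i++) {
        uint64_t a = witness[i] % n;
        if (a != 0 && !mr_round(n, a)) return 0;
    }
    return 1;
}
```

<p>Calc’s own test is more sophisticated, but the structure is the same:
repeat the round until the residual false-positive probability is
negligible.</p>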

<p>Trying this all out right now, I got this implementation (in C):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="nf">lcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, we can still do a little better. Outputting the entire state
doesn’t produce great results, so instead it’s better to create a
<em>truncated</em> LCG and return only a portion of the most significant
bits.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">lcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This won’t quite pass <a href="http://simul.iro.umontreal.ca/testu01/tu01.html">BigCrush</a> in 64-bit form, but the results
are pretty reasonable for most purposes.</p>

<p>But we can still do better without needing to remember much more than
this.</p>

<h3 id="appending-permutation">Appending permutation</h3>

<p>A <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) is really just a
truncated LCG with a permutation applied to its output. Like LCGs
themselves, there are arbitrarily many variations. The “official”
implementation has a <a href="/blog/2018/02/07/">data-dependent shift</a>, for which I can
never remember the details. Fortunately a couple of simple, easy-to-remember
transformations are sufficient: basically anything I used
<a href="/blog/2018/07/31/">while prospecting for hash functions</a>. I love xorshifts, so
let’s add one of those:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This is a big improvement, but it still fails one BigCrush test. As
they say, when xorshift isn’t enough, use xorshift-multiply! Below I
generated a 32-bit prime for the multiply, but any odd integer is a
valid permutation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span> <span class="nf">pcg2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x7c3c3267d015ceb5</span><span class="p">)</span> <span class="o">+</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x24bd2d95276253a9</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="n">UINT32_C</span><span class="p">(</span><span class="mh">0x60857ba9</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This passes BigCrush, and I can reliably build a new one entirely from
scratch using Calc any time I need it.</p>

<h3 id="bonus-adapting-to-other-languages">Bonus: Adapting to other languages</h3>

<p>Sometimes it’s not so straightforward to adapt this technique to other
languages. For example, JavaScript has limited support for 32-bit
integer operations (enough for a poor 32-bit LCG) and no 64-bit
integer operations. Though <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt">BigInt</a> is now a thing, and should
make a great 96- or 128-bit LCG easy to build.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">lcg</span><span class="p">(</span><span class="nx">seed</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">let</span> <span class="nx">s</span> <span class="o">=</span> <span class="nx">BigInt</span><span class="p">(</span><span class="nx">seed</span><span class="p">);</span>
    <span class="k">return</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
        <span class="nx">s</span> <span class="o">*=</span> <span class="mh">0xef725caa331524261b9646cd</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">+=</span> <span class="mh">0x213734f2c0c27c292d814385</span><span class="nx">n</span><span class="p">;</span>
        <span class="nx">s</span> <span class="o">&amp;=</span> <span class="mh">0xffffffffffffffffffffffff</span><span class="nx">n</span><span class="p">;</span>
        <span class="k">return</span> <span class="nb">Number</span><span class="p">(</span><span class="nx">s</span> <span class="o">&gt;&gt;</span> <span class="mi">64</span><span class="nx">n</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Java doesn’t have unsigned integers, so how could you build the above
PCG in Java? Easy! First, remember that Java has two’s complement
semantics, including wrap around, and that two’s complement doesn’t
care about unsigned or signed for multiplication (or addition, or
subtraction). The result is identical. Second, the oft-forgotten <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code>
operator does an unsigned right shift. With these two tips:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>

<span class="kt">int</span> <span class="nf">pcg2</span><span class="o">()</span> <span class="o">{</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">*</span><span class="mh">0x7c3c3267d015ceb5</span><span class="no">L</span> <span class="o">+</span> <span class="mh">0x24bd2d95276253a9</span><span class="no">L</span><span class="o">;</span>
    <span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="o">(</span><span class="kt">int</span><span class="o">)(</span><span class="n">s</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">32</span><span class="o">);</span>
    <span class="n">r</span> <span class="o">^=</span> <span class="n">r</span> <span class="o">&gt;&gt;&gt;</span> <span class="mi">16</span><span class="o">;</span>
    <span class="n">r</span> <span class="o">*=</span> <span class="mh">0x60857ba9</span><span class="o">;</span>
    <span class="k">return</span> <span class="n">r</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>

<p>So, in addition to the Calc step list above, you may need to know some
of the finer details of your target language.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Legitimate-ish Use of alloca()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/28/"/>
    <id>urn:uuid:ce906d6f-b228-4dc6-bd02-34b845d3c5e2</id>
    <updated>2019-10-28T00:42:23Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21374863">on Hacker News</a></em>.</p>

<p>Yesterday <a href="/blog/2019/10/27/">I wrote about a legitimate use for variable length
arrays</a>. While recently discussing this topic with <a href="/blog/2016/09/02/">a
co-worker</a>, I also thought of a semi-legitimate use for
<a href="http://man7.org/linux/man-pages/man3/alloca.3.html"><code class="language-plaintext highlighter-rouge">alloca()</code></a>, a non-standard “function” for dynamically allocating
memory on the stack.</p>

<!--more-->

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
</code></pre></div></div>

<p>I say “function” in quotes because it’s not truly a function and cannot
be implemented as a function or by a library. It’s implemented in the
compiler and is essentially part of the language itself. It’s a tool
allowing a function to manipulate its own stack frame.</p>

<p>Like VLAs, it has the problem that if you’re able to use <code class="language-plaintext highlighter-rouge">alloca()</code>
safely, then you really don’t need it in the first place. Allocation
failures are undetectable and once they happen it’s already too late.</p>

<h3 id="opaque-structs">Opaque structs</h3>

<p>To set the scene, let’s talk about opaque structs. Suppose you’re
writing a C library with <a href="/blog/2018/06/10/">a clean interface</a>. It’s set up so that
changing your struct fields won’t break the Application Binary Interface
(ABI), and callers are largely unable to depend on implementation
details, even by accident. To achieve this, it’s likely you’re making
use of <em>opaque structs</em> in your interface. Callers only ever receive
pointers to library structures, which are handed back into the interface
when they’re used. The internal details are hidden away.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* opaque float stack API */</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="nf">stack_create</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>          <span class="nf">stack_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
<span class="kt">int</span>           <span class="nf">stack_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">,</span> <span class="kt">float</span> <span class="n">v</span><span class="p">);</span>
<span class="kt">float</span>         <span class="nf">stack_pop</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Callers can use the API above without ever knowing the layout or even
the size of <code class="language-plaintext highlighter-rouge">struct stack</code>. Only a pointer to the struct is ever needed.
However, in order for this to work, the library must allocate the struct
itself. If this is a concern, then the library will typically allow the
caller to supply an allocator via function pointers. To see a really
slick version of this in practice, check out <a href="https://www.lua.org/manual/5.3/manual.html#lua_Alloc">Lua’s <code class="language-plaintext highlighter-rouge">lua_Alloc</code></a>, a
single function allocator API.</p>
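<p>Such an interface fits in a few lines. The following is my own
illustration in the spirit of <code class="language-plaintext highlighter-rouge">lua_Alloc</code>, not its exact
definition: a single function covers allocate, resize, and free depending
on its arguments:</p>

```c
#include <stddef.h>
#include <stdlib.h>

/* Single-function allocator: nsize == 0 frees ptr; otherwise ptr is
 * (re)allocated to nsize bytes (ptr == NULL means a fresh allocation).
 * ud is an opaque context pointer supplied by the caller. */
typedef void *(*allocator_fn)(void *ud, void *ptr, size_t osize, size_t nsize);

/* A malloc-backed default implementation. */
void *default_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    (void)ud;
    (void)osize;  /* malloc/realloc track block sizes internally */
    if (nsize == 0) {
        free(ptr);
        return NULL;
    }
    return realloc(ptr, nsize);
}
```

<p>A library that accepts such a function pointer, plus the opaque
<code class="language-plaintext highlighter-rouge">ud</code> context, never needs to call
<code class="language-plaintext highlighter-rouge">malloc()</code> itself.</p>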

<p>Suppose we wanted to support something simpler: The library will
advertise the size of the struct so the caller can allocate it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="kt">size_t</span> <span class="nf">stack_sizeof</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">void</span>   <span class="nf">stack_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_create()</span>
<span class="kt">void</span>   <span class="nf">stack_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">);</span>  <span class="c1">// like stack_destroy()</span>
</code></pre></div></div>

<p>The implementation of <code class="language-plaintext highlighter-rouge">stack_sizeof()</code> would literally just be <code class="language-plaintext highlighter-rouge">return
sizeof(struct stack)</code>. The caller might use it like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="n">stack_free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, that’s still a heap allocation. If this wasn’t an opaque
struct, the caller could very naturally use automatic (i.e. stack)
allocation, which is likely even preferred in this case. Is this still
possible? Idea: Allocate it via a generic <code class="language-plaintext highlighter-rouge">char</code> array (VLA in this
case).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">;</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>However, this is technically undefined behavior. While a <code class="language-plaintext highlighter-rouge">char</code> pointer
is special and permitted to alias with anything, the inverse isn’t true.
Pointers to other types don’t get a free pass to alias with a <code class="language-plaintext highlighter-rouge">char</code>
array. Accessing a <code class="language-plaintext highlighter-rouge">char</code> value as if it were a different type just
isn’t allowed. Why? Because the standard says so. If you want one of the
practical reasons: the alignment might be incorrect.</p>

<p>Hmmm, is there another option? Maybe with <code class="language-plaintext highlighter-rouge">alloca()</code>!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">len</span> <span class="o">=</span> <span class="n">stack_sizeof</span><span class="p">();</span>
<span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">len</code> is expected to be small, it’s not any less safe than the
non-opaque alternative. It doesn’t undermine the type system, either,
since <code class="language-plaintext highlighter-rouge">alloca()</code> has the same semantics as <code class="language-plaintext highlighter-rouge">malloc()</code>. The downsides
are:</p>

<ul>
  <li>It’s not portable: <code class="language-plaintext highlighter-rouge">alloca()</code> is only a common extension, never
standardized, and for good reason.</li>
  <li>This is still a dynamic stack allocation, so, like I showed in the
last article, the function making this allocation becomes more
complex. It must manage its own stack frame dynamically.</li>
</ul>

<h3 id="optimizing-out-alloca">Optimizing out <code class="language-plaintext highlighter-rouge">alloca()</code>?</h3>

<p>The second issue can possibly be resolved if the size is available as a
compile time constant. This starts to break the abstraction provided by
opaque structs, but they’re still <em>mostly</em> opaque. For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* API additions */</span>
<span class="cp">#define STACK_SIZE 24
</span>
<span class="cm">/* In practice, this would likely be horrific #ifdef spaghetti! */</span>
</code></pre></div></div>

<p>The caller might use it like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">stack</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">STACK_SIZE</span><span class="p">);</span>
<span class="n">stack_init</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
</code></pre></div></div>

<p>Now the compiler can see the allocation size, and potentially optimize
away the <code class="language-plaintext highlighter-rouge">alloca()</code>. As of this writing, Clang (all versions) can
optimize these fixed-size <code class="language-plaintext highlighter-rouge">alloca()</code> usages, but GCC (9.2) still does
not. Here’s a simple example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;alloca.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="cp">#ifdef ALLOCA
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="mi">64</span><span class="p">);</span>
<span class="cp">#else
</span>    <span class="k">volatile</span> <span class="kt">char</span> <span class="n">s</span><span class="p">[</span><span class="mi">64</span><span class="p">];</span>
<span class="cp">#endif
</span>    <span class="n">s</span><span class="p">[</span><span class="mi">63</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With the <code class="language-plaintext highlighter-rouge">char</code> array version, both GCC and Clang produce optimal code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	c6 44 24 f8 00       	mov    BYTE PTR [rsp-0x1],0x0
   5:	c3                   	ret
</code></pre></div></div>

<p>Side note: This is on x86-64 Linux, which uses the System V ABI. The
entire array falls within the <a href="https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/">red zone</a>, so it doesn’t need to be
explicitly allocated.</p>

<p>With <code class="language-plaintext highlighter-rouge">-DALLOCA</code>, Clang does the same, but GCC does the allocation
inefficiently as if it were dynamic:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;foo&gt;:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 83 ec 50          	sub    rsp,0x50
   8:	48 8d 44 24 0f       	lea    rax,[rsp+0xf]
   d:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  11:	c6 40 3f 00          	mov    BYTE PTR [rax+0x3f],0x0
  15:	c9                   	leave
  16:	c3                   	ret
</code></pre></div></div>

<p>It would make a slightly better case for <code class="language-plaintext highlighter-rouge">alloca()</code> here if GCC were
better about optimizing it. Regardless, this is another neat little
trick that I probably wouldn’t use in practice.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Legitimate Use of Variable Length Arrays</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/10/27/"/>
    <id>urn:uuid:acf6af69-f18c-49a6-b3ae-a23ae537da6d</id>
    <updated>2019-10-27T19:58:00Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=21375580">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/dz1fau/variable_length_arrays_in_c_are_nearly_always_the/">on reddit</a></em>.</p>

<p>The C99 standard (ISO/IEC 9899:1999) introduced a new, powerful
feature called Variable Length Arrays (VLAs). The size of an array with
automatic storage duration (i.e. stack allocated) can be determined at
run time. Each instance of the array may even have a different length.
Unlike <code class="language-plaintext highlighter-rouge">alloca()</code>, they’re a sanctioned form of dynamic stack
allocation.</p>

<!--more-->

<p>At first glance, VLAs seem convenient, useful, and efficient. Heap
allocations have a small cost because the allocator needs to do some
work to find or request some free memory, and typically the operation
must be synchronized since there may be other threads also making
allocations. Stack allocations are trivial and fast by comparison:
Allocation is a matter of bumping the stack pointer, and no
synchronization is needed.</p>

<p>For example, here’s a function that non-destructively finds the median
of a buffer of floats:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: nmemb must be non-zero */</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It uses a VLA, <code class="language-plaintext highlighter-rouge">copy</code>, as a temporary copy of the input for sorting. The
function doesn’t know at compile time how big the input will be, so it
cannot just use a fixed size. With a VLA, it efficiently allocates
exactly as much memory as needed on the stack.</p>

<p>Well, sort of. If <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, then the VLA will <em>silently</em>
overflow the stack. By silent I mean that the program has no way to
detect it and avoid it. In practice, it can be a lot louder, from a
segmentation fault in the best case, to an exploitable vulnerability in
the worst case: <a href="/blog/2017/06/21/"><strong>stack clashing</strong></a>. If an attacker can control
<code class="language-plaintext highlighter-rouge">nmemb</code>, they might choose a value that causes <code class="language-plaintext highlighter-rouge">copy</code> to overlap with
other allocations, giving them control over those values as well.</p>

<p>If there’s any risk that <code class="language-plaintext highlighter-rouge">nmemb</code> is too large, it must be guarded.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define COPY_MAX 4096
</span>
<span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>  <span class="cm">/* or whatever */</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, if <code class="language-plaintext highlighter-rouge">median</code> is expected to safely accommodate <code class="language-plaintext highlighter-rouge">COPY_MAX</code>
elements, it may as well <em>always</em> allocate an array of this size. If it
can’t, then that’s not a safe maximum.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And rather than abort, you might still want to support arbitrary input
sizes:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">buf</span><span class="p">[</span><span class="n">COPY_MAX</span><span class="p">];</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">copy</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">COPY_MAX</span><span class="p">)</span>
        <span class="n">copy</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">result</span> <span class="o">=</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">copy</span> <span class="o">!=</span> <span class="n">buf</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then small inputs are fast, but large inputs still work correctly. This
is called <a href="/blog/2016/10/07/"><strong>small size optimization</strong></a>.</p>

<p>If the correct solution ultimately didn’t use a VLA, then what good are
they? In general, VLAs are not useful. They’re <a href="https://www.phoronix.com/scan.php?page=news_item&amp;px=Linux-Kills-The-VLA">time bombs</a>. <strong>VLAs
are nearly always the wrong choice.</strong> You must be careful to check that
they don’t exceed some safe maximum, and there’s no reason not to always
use the maximum. This problem was realized for the C11 standard (ISO/IEC
9899:2011) where VLAs were made optional. A program containing a VLA
will not necessarily compile on a C11 compiler.</p>

<p>Some purists also object to a special exception required for VLAs: The
<code class="language-plaintext highlighter-rouge">sizeof</code> operator may evaluate its operand, and so it does not always
evaluate to a compile-time constant. If the operand contains a VLA, then
the result depends on a run-time value.</p>

<p>Because they’re optional, it’s best to avoid even <em>trivial</em> VLAs like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span>
<span class="nf">median</span><span class="p">(</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">max</span> <span class="o">=</span> <span class="mi">4096</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="n">max</span><span class="p">)</span>
        <span class="n">abort</span><span class="p">();</span>
    <span class="kt">float</span> <span class="n">copy</span><span class="p">[</span><span class="n">max</span><span class="p">];</span>
    <span class="n">memcpy</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">nmemb</span><span class="p">);</span>
    <span class="n">qsort</span><span class="p">(</span><span class="n">copy</span><span class="p">,</span> <span class="n">nmemb</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">copy</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">floatcmp</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">copy</span><span class="p">[</span><span class="n">nmemb</span> <span class="o">/</span> <span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s easy to prove that the array length is always 4096, but technically
this is still a VLA. That would still be true even if <code class="language-plaintext highlighter-rouge">max</code> were <code class="language-plaintext highlighter-rouge">const
int</code>, because the array length still isn’t an integer constant
expression.</p>

<h3 id="vla-overhead">VLA overhead</h3>

<p>Finally, there’s also the problem that VLAs just aren’t as efficient as
you might hope. A function that does dynamic stack allocation requires
additional stack management. It must track additional memory addresses
and will require extra instructions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">fixed</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">void</span>
<span class="nf">dynamic</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">14</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">volatile</span> <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>
        <span class="n">buf</span><span class="p">[</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiled with <code class="language-plaintext highlighter-rouge">gcc -Os</code> and viewed with <code class="language-plaintext highlighter-rouge">objdump -d -Mintel</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000000 &lt;fixed&gt;:
   0:	81 ff 00 40 00 00    	cmp    edi,0x4000
   6:	7f 19                	jg     21 &lt;fixed+0x21&gt;
   8:	ff cf                	dec    edi
   a:	48 81 ec 88 3f 00 00 	sub    rsp,0x3f88
  11:	48 63 ff             	movsxd rdi,edi
  14:	c6 44 3c 88 00       	mov    BYTE PTR [rsp+rdi*1-0x78],0x0
  19:	48 81 c4 88 3f 00 00 	add    rsp,0x3f88
  20:	c3                   	ret    
  21:	c3                   	ret    

0000000000000022 &lt;dynamic&gt;:
  22:	81 ff 00 40 00 00    	cmp    edi,0x4000
  28:	7f 23                	jg     4d &lt;dynamic+0x2b&gt;
  2a:	55                   	push   rbp
  2b:	48 63 c7             	movsxd rax,edi
  2e:	ff cf                	dec    edi
  30:	48 83 c0 0f          	add    rax,0xf
  34:	48 63 ff             	movsxd rdi,edi
  37:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  3b:	48 89 e5             	mov    rbp,rsp
  3e:	48 89 e2             	mov    rdx,rsp
  41:	48 29 c4             	sub    rsp,rax
  44:	c6 04 3c 00          	mov    BYTE PTR [rsp+rdi*1],0x0
  48:	48 89 d4             	mov    rsp,rdx
  4b:	c9                   	leave  
  4c:	c3                   	ret    
  4d:	c3                   	ret    
</code></pre></div></div>

<p>Note the use of a base pointer, <code class="language-plaintext highlighter-rouge">rbp</code> and <code class="language-plaintext highlighter-rouge">leave</code>, in the second
function in order to dynamically track the stack frame. (Hmm, in both
cases GCC could easily shave off the extra <code class="language-plaintext highlighter-rouge">ret</code> at the end of each
function. Missed optimization?)</p>

<p>The story is even worse when stack clash protection is enabled
(<code class="language-plaintext highlighter-rouge">-fstack-clash-protection</code>). The compiler generates extra code to probe
every page of allocation in case one of those pages is a guard page.
That’s also more complex when the allocation is dynamic. The VLA version
more than doubles in size (from 44 bytes to 101 bytes)!</p>

<h3 id="safe-and-useful-variable-length-arrays">Safe and useful variable length arrays</h3>

<p>There is one convenient, useful, and safe form of VLAs: a pointer to a
VLA. It’s convenient and useful because it makes some expressions
simpler. It’s safe because there’s no arbitrary stack allocation.</p>

<p>Pointers to arrays are a rare sight in C code, whether variable length
or not. That’s because, the vast majority of the time, C programmers
implicitly rely on <em>array decay</em>: arrays quietly “decay” into pointers
to their first element the moment you do almost anything with them. Also
because they’re really awkward to use.</p>

<p>For example, the function <code class="language-plaintext highlighter-rouge">sum3</code> takes a pointer to an array of exactly
three elements.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">sum3</span><span class="p">(</span><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">])</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The parentheses are necessary because, without them, <code class="language-plaintext highlighter-rouge">array</code> would be an
array of pointers — a type far more common than a pointer to an array.
To index into the array, first the pointer to the array must be
dereferenced to the array value itself, then this intermediate array is
indexed, triggering array decay. Conceptually there’s quite a bit to it,
but, in practice, it’s all as efficient as the conventional approach to
<code class="language-plaintext highlighter-rouge">sum3</code> that accepts a plain <code class="language-plaintext highlighter-rouge">int *</code>.</p>

<p>The caller must take the address of an array of exactly the right
length:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">buf</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">};</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="o">&amp;</span><span class="n">buf</span><span class="p">);</span>
</code></pre></div></div>

<p>Or if dynamically allocating the array:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">));</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">(</span><span class="o">*</span><span class="n">array</span><span class="p">)[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">r</span> <span class="o">=</span> <span class="n">sum3</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">array</span><span class="p">);</span>
</code></pre></div></div>

<p>The mandatory parentheses and strict type requirements make this awkward
and rarely useful. However, with VLAs perhaps it’s worth the trouble!
Consider an NxN matrix expressed using a pointer to a VLA:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. See note. */</span>
<span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">][</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When indexing, the parentheses are weird, but the indices have the
convenient <code class="language-plaintext highlighter-rouge">[y][x]</code> format. The non-VLA alternative is to compute a 1D
index manually from 2D indices (<code class="language-plaintext highlighter-rouge">y*n+x</code>):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">n</span> <span class="o">=</span> <span class="cm">/* run-time value */</span><span class="p">;</span>
<span class="cm">/* TODO: Check for integer overflow. */</span>
<span class="kt">float</span> <span class="o">*</span><span class="n">identity</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="o">*</span><span class="n">n</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: What’s the behavior in the VLA version when <code class="language-plaintext highlighter-rouge">n</code> is so large that
<code class="language-plaintext highlighter-rouge">sizeof(*identity)</code> doesn’t fit in a <code class="language-plaintext highlighter-rouge">size_t</code>? I couldn’t find anything
in the standard about it, though I bet it’s undefined behavior. Neither
GCC nor Clang checks for overflow and, when it occurs, the overflow is
silent. Neither the undefined behavior sanitizer nor address sanitizer
complain when this happens.</p>

<p><strong>Update</strong>: <a href="https://lists.sr.ht/~skeeto/public-inbox/%3CCAP-ht1CQKVByZt1EXOb3J7TF%3DMcCKi%3DEtzjEH+CaEsPtvY5%3Djg%40mail.gmail.com%3E">bru del pointed out</a> that these multi-dimensional
VLAs can be simplified such that the parentheses may be omitted when
indexing. The trick is to omit the first dimension from the VLA
expression:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">identity</span><span class="p">)</span> <span class="o">*</span> <span class="n">n</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">identity</span><span class="p">[</span><span class="n">y</span><span class="p">][</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So VLAs <em>might</em> be worth the trouble when using pointers to
multi-dimensional, dynamically-allocated arrays. However, I’m still
judicious about their use due to reduced portability. As a practical
example, MSVC famously does not, and likely never will, support
VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>The CPython Bytecode Compiler is Dumb</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2019/02/24/"/>
    <id>urn:uuid:4348d611-858b-4f48-a6f5-6e4b93f71a34</id>
    <updated>2019-02-24T21:56:35Z</updated>
    <category term="python"/><category term="lua"/><category term="lang"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was <a href="https://news.ycombinator.com/item?id=19241545">discussed on Hacker News</a>.</em></p>

<p>Due to sheer coincidence of several unrelated tasks converging on
Python at work, I recently needed to brush up on my Python skills. So
far for me, Python has been little more than <a href="/blog/2017/05/15/">a fancy extension
language for BeautifulSoup</a>, though I also used it to participate
in the recent tradition of <a href="https://github.com/skeeto/qualbum">writing one’s own static site
generator</a>, in this case for <a href="http://photo.nullprogram.com/">my wife’s photo blog</a>.
I’ve been reading through <em>Fluent Python</em> by Luciano Ramalho, and it’s
been quite effective at getting me up to speed.</p>

<!--more-->

<p>As I write Python, <a href="/blog/2014/01/04/">like with Emacs Lisp</a>, I can’t help but
consider what exactly is happening inside the interpreter. I wonder if
the code I’m writing is putting undue constraints on the bytecode
compiler and limiting its options. Ultimately I’d like the code I
write <a href="/blog/2017/01/30/">to drive the interpreter efficiently and effectively</a>.
<a href="https://www.python.org/dev/peps/pep-0020/">The Zen of Python</a> says there should be “only one obvious way to do
it,” but in practice there’s a lot of room for expression. Given
multiple ways to express the same algorithm or idea, I tend to prefer
the one that compiles to the more efficient bytecode.</p>

<p>Fortunately CPython, the main and most widely used implementation of
Python, is very transparent about its bytecode, making it easy to
inspect and reason about. The disassembly listing is simple to read
and understand, and I can follow it without consulting the
documentation. This contrasts sharply with modern JavaScript engines
and their opaque use of JIT compilation, where performance is guided
by obeying certain patterns (<a href="https://www.youtube.com/watch?v=UJPdhx5zTaw">hidden classes</a>, etc.), helping the
compiler <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">understand my program’s types</a>, and being careful
not to unnecessarily constrain the compiler.</p>

<p>So, besides just catching up with Python the language, I’ve been
studying the bytecode disassembly of the functions that I write. One
fact has become quite apparent: <strong>the CPython bytecode compiler is
pretty dumb</strong>. With a few exceptions, it’s a very literal translation
of a Python program, and there is almost <a href="https://legacy.python.org/workshops/1998-11/proceedings/papers/montanaro/montanaro.html">no optimization</a>.
Below I’ll demonstrate a case where it’s possible to detect one of the
missed optimizations without inspecting the bytecode disassembly
thanks to a small abstraction leak in the optimizer.</p>

<p>To be clear: This isn’t to say CPython is bad, or even that it should
necessarily change. In fact, as I’ll show, <strong>dumb bytecode compilers
are par for the course</strong>. In the past I’ve lamented how the Emacs Lisp
compiler could do a better job, but CPython and Lua are operating at
the same level. There are benefits to a dumb and straightforward
bytecode compiler: the compiler itself is simpler, easier to maintain,
and more amenable to modification (e.g. as Python continues to
evolve). It’s also easier to debug Python (<code class="language-plaintext highlighter-rouge">pdb</code>) because it’s such a
close match to the source listing.</p>

<p><em>Update</em>: <a href="https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler">Darius Bacon points out</a> that Guido van Rossum
himself said, “<a href="https://books.google.com/books?id=bIxWAgAAQBAJ&amp;pg=PA26&amp;lpg=PA26&amp;dq=%22Python+is+about+having+the+simplest,+dumbest+compiler+imaginable.%22&amp;source=bl&amp;ots=2OfDoWX321&amp;sig=ACfU3U32jKZBE3VkJ0gvkKbxRRgD0bnoRg&amp;hl=en&amp;sa=X&amp;ved=2ahUKEwjZ1quO89bgAhWpm-AKHfckAxUQ6AEwAHoECAkQAQ#v=onepage&amp;q=%22Python%20is%20about%20having%20the%20simplest%2C%20dumbest%20compiler%20imaginable.%22&amp;f=false">Python is about having the simplest, dumbest compiler
imaginable.</a>” So this is all very much by design.</p>

<p>The consensus seems to be that if you want or need better performance,
use something other than Python. (And if you can’t do that, at least use
<a href="https://pypy.org/">PyPy</a>.) That’s a fairly reasonable and healthy goal. Still, if
I’m writing Python, I’d like to do the best I can, which means
exploiting the optimizations that <em>are</em> available when possible.</p>

<h3 id="disassembly-examples">Disassembly examples</h3>

<p>I’m going to compare three bytecode compilers in this article: CPython
3.7, Lua 5.3, and Emacs 26.1. Each of these languages is dynamically
typed, executes primarily on a bytecode virtual machine, and makes its
disassembly listing easy to access. One caveat: CPython and Emacs
use a stack-based virtual machine while Lua uses a register-based
virtual machine.</p>

<p>For CPython I’ll be using the <code class="language-plaintext highlighter-rouge">dis</code> module. For Emacs Lisp I’ll use <code class="language-plaintext highlighter-rouge">M-x
disassemble</code>, and all code will use lexical scoping. In Lua I’ll use
<code class="language-plaintext highlighter-rouge">lua -l</code> on the command line.</p>

<h3 id="local-variable-elimination">Local variable elimination</h3>

<p>Will the bytecode compiler eliminate local variables? Keeping the
variable around potentially involves allocating memory for it, assigning
to it, and accessing it. Take this example:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>

<p>This function is equivalent to:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">0</span>
</code></pre></div></div>

<p>Despite this, CPython completely misses this optimization for both <code class="language-plaintext highlighter-rouge">x</code>
and <code class="language-plaintext highlighter-rouge">y</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               0 (x)
  3           4 LOAD_CONST               2 (1)
              6 STORE_FAST               1 (y)
  4           8 LOAD_FAST                0 (x)
             10 RETURN_VALUE
</code></pre></div></div>

<p>It assigns both variables, and even loads again from <code class="language-plaintext highlighter-rouge">x</code> for the return.
Missed optimizations, but, as I said, by keeping these variables around,
debugging is more straightforward. Users can always inspect variables.</p>

<p>How about Lua?</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="kd">local</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="kd">local</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">x</span>
<span class="k">end</span>
</code></pre></div></div>

<p>It also misses this optimization, though it matters a little less due to
its architecture (the return instruction references a register
regardless of whether or not that register is allocated to a local
variable):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 0
        2       [3]     LOADK           1 -2    ; 1
        3       [4]     RETURN          0 2
        4       [5]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp also misses it:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">x</span> <span class="mi">0</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">y</span> <span class="mi">1</span><span class="p">))</span>
    <span class="nv">x</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  0
1	constant  1
2	stack-ref 1
3	return
</code></pre></div></div>

<p>All three are on the same page.</p>

<h3 id="constant-folding">Constant folding</h3>

<p>Does the bytecode compiler evaluate simple constant expressions at
compile time? This is simple and everyone does it.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (2.5)
              2 RETURN_VALUE
</code></pre></div></div>

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">4</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     LOADK           0 -1    ; 2.5
        2       [2]     RETURN          0 2
        3       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">+</span> <span class="mi">1</span> <span class="p">(</span><span class="nb">/</span> <span class="p">(</span><span class="nb">*</span> <span class="mi">2</span> <span class="mi">3</span><span class="p">)</span> <span class="mf">4.0</span><span class="p">))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  2.5
1	return
</code></pre></div></div>

<p>That’s something we can count on so long as the operands are all
numeric literals (or also, for Python, string literals) that are
visible to the compiler. Don’t count on your operator overloads to
work here, though.</p>
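<p>A sketch of how to observe the folding with <code>dis</code>: a folded constant
shows up directly in the instruction stream, while an expression
involving a variable does not (again, exact opcodes vary by CPython
version):</p>

```python
import dis

def folded():
    return 1 + 2 * 3 / 4        # all literals: folded to 2.5

def unfolded(n):
    return n + 2 * 3 / 4        # only the literal subexpression folds

folded_vals = [ins.argval for ins in dis.get_instructions(folded)]
unfolded_vals = [ins.argval for ins in dis.get_instructions(unfolded)]
print(2.5 in folded_vals)       # the whole expression became a constant
print(1.5 in unfolded_vals)     # 2 * 3 / 4 folded to 1.5 on its own
print(2.5 in unfolded_vals)     # the addition itself is not folded
```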

<h3 id="allocation-optimization">Allocation optimization</h3>

<p>Optimizers often perform <em>escape analysis</em>, to determine if objects
allocated in a function ever become visible outside of that function. If
they don’t then these objects could potentially be stack-allocated
(instead of heap-allocated) or even be eliminated entirely.</p>

<p>None of the bytecode compilers are this sophisticated. However CPython
does have a trick up its sleeve: tuple optimization. Since tuples are
immutable, in certain circumstances CPython will reuse them and avoid
both the constructor and the allocation.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Check it out, the tuple is used as a constant:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 ((1, 2, 3))
              2 RETURN_VALUE
</code></pre></div></div>

<p>We can detect this by evaluating <code class="language-plaintext highlighter-rouge">foo() is foo()</code>, which is <code class="language-plaintext highlighter-rouge">True</code>.
Deviate from this pattern too much, though, and the optimization is
disabled. Remember how CPython can’t optimize away variables, and how
they break constant folding? They break this, too:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1)
              2 STORE_FAST               0 (x)
  3           4 LOAD_FAST                0 (x)
              6 LOAD_CONST               2 (2)
              8 LOAD_CONST               3 (3)
             10 BUILD_TUPLE              3
             12 RETURN_VALUE
</code></pre></div></div>

<p>This function might document that it always returns a simple tuple,
but we can tell whether it’s being optimized using <code class="language-plaintext highlighter-rouge">is</code> as before:
<code class="language-plaintext highlighter-rouge">foo() is foo()</code> is now <code class="language-plaintext highlighter-rouge">False</code>! In some future version of Python with
a cleverer bytecode compiler, that expression might evaluate to
<code class="language-plaintext highlighter-rouge">True</code>. (Unless the <a href="https://docs.python.org/3/reference/">Python language specification</a> is specific
about this case, which I didn’t check.)</p>
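<p>This abstraction leak is easy to poke at directly. A small sketch, with
behavior as observed under current CPython (other implementations or
future versions may differ):</p>

```python
def const_tuple():
    return (1, 2, 3)        # stored as a single shared constant

def built_tuple():
    x = 1
    return (x, 2, 3)        # BUILD_TUPLE constructs a fresh tuple

print(const_tuple() is const_tuple())   # True: same cached object
print(built_tuple() is built_tuple())   # False: new object per call
```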

<p>Note: Curiously PyPy replicates this exact behavior when examined with
<code class="language-plaintext highlighter-rouge">is</code>. Was that deliberate? I’m impressed that PyPy matches CPython’s
semantics so closely here.</p>

<p>Putting a mutable value, such as a list, in the tuple will also break
this optimization. But that’s not the compiler being dumb. That’s a
hard constraint on the compiler: the caller might change the mutable
component of the tuple, so it must always return a fresh copy.</p>

<p>Neither Lua nor Emacs Lisp has a language-level equivalent of an
immutable tuple, so there’s nothing to compare.</p>

<p>Other than the tuple optimization in CPython, none of the bytecode
compilers eliminate unnecessary intermediate objects.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">[</span><span class="mi">1024</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  2           0 LOAD_CONST               1 (1024)
              2 BUILD_LIST               1
              4 LOAD_CONST               2 (0)
              6 BINARY_SUBSCR
              8 RETURN_VALUE
</code></pre></div></div>

<p>Lua:</p>

<div class="language-lua highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="nf">foo</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">({</span><span class="mi">1024</span><span class="p">})[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        1       [2]     NEWTABLE        0 1 0
        2       [2]     LOADK           1 -1    ; 1024
        3       [2]     SETLIST         0 1 1   ; 1
        4       [2]     GETTABLE        0 0 -2  ; 1
        5       [2]     RETURN          0 2
        6       [3]     RETURN          0 1
</code></pre></div></div>

<p>Emacs Lisp:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">foo</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">car</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1024</span><span class="p">)))</span>
</code></pre></div></div>

<p>Disassembly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0	constant  1024
1	list1
2	car
3	return
</code></pre></div></div>

<h3 id="dont-expect-too-much">Don’t expect too much</h3>

<p>I could go on with lots of examples, looking at loop optimizations and
so on, and each case is almost certainly unoptimized. The general rule
of thumb is to simply not expect much from these bytecode compilers.
They’re very literal in their translation.</p>

<p>Working so much in C has put me in the habit of expecting all obvious
optimizations from the compiler. This frees me to be more expressive
in my code. Lots of things are cost-free thanks to these
optimizations, such as breaking a complex expression up into several
variables, naming my constants, or not using a local variable to
manually cache memory accesses. I’m confident the compiler will
optimize away my expressiveness. The catch is that <a href="/blog/2018/05/01/">clever compilers
can take things too far</a>, so I’ve got to be mindful of how it might
undermine my intentions — i.e. when I’m doing something unusual or not
strictly permitted.</p>

<p>These bytecode compilers will never truly surprise me. The cost is
that being more expressive in Python, Lua, or Emacs Lisp may reduce
performance at run time, because the extra work shows up directly in
the bytecode. Usually this doesn’t matter, but sometimes it does.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Prospecting for Hash Functions</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/31/"/>
    <id>urn:uuid:e865266a-2896-30c5-3f7d-cfad767b1ae2</id>
    <updated>2018-07-31T22:32:45Z</updated>
    <category term="c"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update 2022</em>: <a href="https://github.com/skeeto/hash-prospector/issues/19">TheIronBorn has found even better permutations</a> using
a smarter technique. That thread completely eclipses my efforts in this
article.</p>

<p>I recently got an itch to design my own non-cryptographic integer hash
function. Firstly, I wanted to <a href="/blog/2017/09/15/">better understand</a> how hash
functions work, and the best way to learn is to do. For years I’d been
treating them like magic, shoving input into it and seeing
<a href="/blog/2018/02/07/">random-looking</a>, but deterministic, output come out the other
end. Just how is the avalanche effect achieved?</p>

<p>Secondly, could I apply my own particular strengths to craft a hash
function better than the handful of functions I could find online?
Especially the classic ones from <a href="https://gist.github.com/badboy/6267743">Thomas Wang</a> and <a href="http://burtleburtle.net/bob/hash/integer.html">Bob
Jenkins</a>. Instead of struggling with the mathematics, maybe I
could software engineer my way to victory, working from the advantage
of access to the excessive computational power of today.</p>

<p>Suppose, for example, I wrote a tool to generate a <strong>random hash
function definition</strong>, then <strong>JIT compile it</strong> to a native function in
memory, then execute that function across various inputs to <strong>evaluate
its properties</strong>. My tool could rapidly repeat this process in a loop
until it stumbled upon an incredible hash function the world had never
seen. That’s what I actually did. I call it the <strong>Hash Prospector</strong>:</p>

<p><strong><a href="https://github.com/skeeto/hash-prospector">https://github.com/skeeto/hash-prospector</a></strong></p>

<p>It only works on x86-64 because it uses the same <a href="/blog/2015/03/19/">JIT compiling
technique I’ve discussed before</a>: allocate a page of memory, write
some machine instructions into it, set the page to executable, cast the
page pointer to a function pointer, then call the generated code through
the function pointer.</p>
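<p>For the curious, the same trick can be sketched from Python itself
using <code>ctypes</code> and an executable <code>mmap</code> buffer. This is only a toy
illustration for x86-64 Unix systems, not the prospector’s actual C
implementation:</p>

```python
import ctypes
import mmap

# x86-64 System V machine code for: mov eax, edi; ret
# i.e. a function that returns its first 32-bit argument unchanged
code = bytes([0x89, 0xf8, 0xc3])

# Allocate one readable, writable, executable page and copy the code in
page = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
page.write(code)

# Cast the page address to a function pointer and call the machine code
ftype = ctypes.CFUNCTYPE(ctypes.c_uint32, ctypes.c_uint32)
func = ftype(ctypes.addressof(ctypes.c_char.from_buffer(page)))
print(func(42))
```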

<h3 id="generating-a-hash-function">Generating a hash function</h3>

<p>My focus is on integer hash functions: a function that accepts an
<em>n</em>-bit integer and returns an <em>n</em>-bit integer. One of the important
properties of an <em>integer</em> hash function is that it maps its inputs to
outputs 1:1. In other words, there are <strong>no collisions</strong>. If there’s a
collision, then some outputs aren’t possible, and the function isn’t
making efficient use of its entropy.</p>

<p>This is actually a lot easier than it sounds. As long as every <em>n</em>-bit
integer operation used in the hash function is <em>reversible</em>, then the
hash function has this property. An operation is reversible if, given
its output, you can unambiguously compute its input.</p>

<p>For example, XOR with a constant is trivially reversible: XOR the
output with the same constant to reverse it. Addition with a constant
is reversed by subtraction with the same constant. Since the integer
operations are modular arithmetic, modulo 2^n for <em>n</em>-bit integers,
multiplication by an <em>odd</em> number is reversible. Odd numbers are
coprime with the power-of-two modulus, so there is some <em>modular
multiplicative inverse</em> that reverses the operation.</p>

<p><a href="http://papa.bretmulvey.com/post/124027987928/hash-functions">Bret Mulvey’s hash function article</a> provides a convenient list
of some reversible operations available for constructing integer hash
functions. This list was the catalyst for my little project. Here are
the ones used by the hash prospector:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">*=</span> <span class="n">constant</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// e.g. only odd constants</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">;</span>
<span class="n">x</span> <span class="o">&lt;&lt;&lt;=</span> <span class="n">constant</span><span class="p">;</span> <span class="c1">// left rotation</span>
</code></pre></div></div>

<p>I’ve come across a couple more useful operations while studying existing
integer hash functions, but I didn’t put these in the prospector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hash</span> <span class="o">+=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
<span class="n">hash</span> <span class="o">-=</span> <span class="o">~</span><span class="p">(</span><span class="n">hash</span> <span class="o">&lt;&lt;</span> <span class="n">constant</span><span class="p">);</span>
</code></pre></div></div>

<p>The prospector picks some operations at random and fills in their
constants randomly within their proper constraints. For example,
here’s an awful hash function I made it generate as an example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// do NOT use this!</span>
<span class="kt">uint32_t</span>
<span class="nf">badhash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1eca7d79U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">;</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">8</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">);</span>
    <span class="n">x</span>  <span class="o">=</span> <span class="o">~</span><span class="n">x</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">5</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="mh">0x10afe4e7U</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>That function is reversible, and it would be <a href="https://naml.us/post/inverse-of-a-hash-function/">relatively
straightforward</a> to <a href="http://c42f.github.io/2015/09/21/inverting-32-bit-wang-hash.html">define its inverse</a>. However, it has
awful biases and poor avalanche. How do I know this?</p>

<h3 id="the-measure-of-a-hash-function">The measure of a hash function</h3>

<p>There are two key properties I’m looking for in randomly generated hash
functions.</p>

<ol>
  <li>
    <p>High avalanche effect. When I flip one input bit, the output bits
should each flip with a 50% chance.</p>
  </li>
  <li>
    <p>Low bias. Ideally there is no correlation between which output bits
flip for a particular flipped input bit.</p>
  </li>
</ol>

<p>Initially I screwed up and only measured the first property. This led
to some hash functions that <em>seemed</em> to be amazing before close
inspection, since, for a 32-bit hash function, it was flipping over 15
output bits on average. However, the particular bits being flipped
were heavily biased, resulting in obvious patterns in the output.</p>

<p>For example, when hashing a counter starting from zero, the high bits
would follow a regular pattern. 15 to 16 bits were being flipped each
time, but it was always the same bits.</p>

<p>Conveniently it’s easy to measure both properties at the same time. For
an <em>n</em>-bit integer hash function, create an <em>n</em> by <em>n</em> table initialized
to zero. The rows are input bits and the columns are output bits. The
element at the <em>i</em>th row and <em>j</em>th column tracks the correlation between
the <em>i</em>th input bit and the <em>j</em>th output bit.</p>

<p>Then exhaustively iterate over all 2^n inputs, and flip each bit one at
a time. Increment the appropriate element in the table if the output bit
flips.</p>

<p>When you’re done, ideally each element in the table is exactly 2^(n-1).
That is, each output bit was flipped exactly half the time by each input
bit. Therefore the <em>bias</em> of the hash function is the distance (the
error) of the computed table from the ideal table.</p>

<p>For example, the ideal bias table for an 8-bit hash function would be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
</code></pre></div></div>

<p>The hash prospector computes the standard deviation in order to turn
this into a single, normalized measurement. Lower scores are better.</p>
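<p>The whole measurement fits in a few lines. Here’s a sketch for a toy
8-bit hash, which is small enough to test exhaustively. The hash itself
is made up for illustration and is not one of the prospector’s
candidates:</p>

```python
import statistics

N = 8                        # bits; 2^8 inputs is trivially exhaustive

def hash8(x):
    # A made-up reversible 8-bit hash, purely for demonstration
    x ^= x >> 4
    x = (x * 0x9b) & 0xff    # odd multiplier, hence reversible
    x ^= x >> 3
    return x

# table[i][j]: how often flipping input bit i flipped output bit j
table = [[0] * N for _ in range(N)]
for x in range(1 << N):
    h = hash8(x)
    for i in range(N):
        diff = h ^ hash8(x ^ (1 << i))
        for j in range(N):
            table[i][j] += (diff >> j) & 1

# The ideal count is 2^(n-1) = 128 everywhere; score the deviation
bias = statistics.pstdev(count for row in table for count in row)
print(bias)                  # lower is better; 0.0 is a perfect table
```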

<p>However, there’s still one problem: the input space for a 32-bit hash
function is over 4 billion values. The full test takes my computer about
an hour and a half. Evaluating a 64-bit hash function is right out.</p>

<p>Again, <a href="/blog/2017/09/21/">Monte Carlo to the rescue</a>! Rather than sample the entire
space, just sample a random subset. This provides a good estimate in
less than a second, allowing lots of terrible hash functions to be
discarded early. The full test can be saved only for the known good
32-bit candidates. 64-bit functions will only ever receive the estimate.</p>
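<p>The estimator is the same loop run over a random sample instead of the
full input space. A sketch, where the hash under test is a hypothetical
stand-in and the returned score is only an estimate:</p>

```python
import random

MASK = 0xffffffff

def toy_hash32(x):
    # A hypothetical stand-in for a candidate 32-bit hash function
    x ^= x >> 16
    x = (x * 0x45d9f3b) & MASK
    x ^= x >> 16
    return x

def estimate_bias32(h, samples=10000):
    n = 32
    table = [[0] * n for _ in range(n)]
    for _ in range(samples):
        x = random.getrandbits(32)
        y = h(x)
        for i in range(n):
            diff = y ^ h(x ^ (1 << i))
            for j in range(n):
                table[i][j] += (diff >> j) & 1
    # Ideal count per cell is samples/2; score the root-mean-square error
    ideal = samples / 2
    var = sum((c - ideal) ** 2 for row in table for c in row) / (n * n)
    return var ** 0.5

print(estimate_bias32(toy_hash32, samples=2000))
```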

<h3 id="what-did-i-find">What did I find?</h3>

<p>Once I got the bias issue sorted out, and after hours and hours of
running, followed up with some manual tweaking on my part, the
<strong>prospector stumbled across this little gem</strong>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// DO use this one!</span>
<span class="kt">uint32_t</span>
<span class="nf">prospector32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x2c1b3c6dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x297a2d39U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>According to a full (i.e. not estimated) bias evaluation, this function
beats <em>the snot</em> out of most of the 32-bit hash functions I could find. It
even comes out ahead of this well known hash function that I <em>believe</em>
originates from the H2 SQL Database. (Update: Thomas Mueller has
confirmed that, indeed, this is his hash function.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">^</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s still an excellent hash function, just slightly more biased than
mine.</p>

<p>Very briefly, <code class="language-plaintext highlighter-rouge">prospector32()</code> was the best 32-bit hash function I could
find, and I thought I had a major breakthrough. Then I noticed the
finalizer function for <a href="https://en.wikipedia.org/wiki/MurmurHash#Algorithm">the 32-bit variant of MurmurHash3</a>. It’s
also a 32-bit hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">murmurhash32_mix32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x85ebca6bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">13</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xc2b2ae35U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This one is just <em>barely</em> less biased than mine. So I still haven’t
discovered the best 32-bit hash function, only the <em>second</em> best one.
:-)</p>

<h3 id="a-pattern-emerges">A pattern emerges</h3>

<p>If you’re paying close enough attention, you may have noticed that all
three functions above have the same structure. The prospector had
stumbled upon it all on its own without knowledge of the existing
functions. It may not be so obvious for the second function, but here it
is refactored:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">hash32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x45d9f3bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I hadn’t noticed this until after the prospector had come across it on
its own. The pattern for all three is XOR-right-shift, multiply,
XOR-right-shift, multiply, XOR-right-shift. There’s something
particularly useful about this <a href="http://www.pcg-random.org/posts/developing-a-seed_seq-alternative.html#multiplyxorshift">multiply-xorshift construction</a>
(<a href="http://ticki.github.io/blog/designing-a-good-non-cryptographic-hash-function/#designing-a-diffusion-function--by-example">also</a>). The XOR-right-shift diffuses bits rightward and the
multiply diffuses bits leftward. I like to think it’s “sloshing” the
bits right, left, right, left.</p>

<p>It seems that multiplication is particularly good at diffusion, so it
makes perfect sense to exploit it in non-cryptographic hash functions,
especially since modern CPUs are so fast at it. Despite this, it’s not
used much in cryptography due to <a href="http://cr.yp.to/snuffle/design.pdf">issues with completing it in constant
time</a>.</p>

<p>I like to think of this construction in terms of a five-tuple. For the
three functions it’s the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()
</code></pre></div></div>

<p>The prospector actually found lots of decent functions following this
pattern, especially where the middle shift is smaller than the outer
shift. Thinking of it in terms of this tuple, I specifically directed
it to try different tuple constants. That’s what I meant by
“tweaking.” Eventually my new function popped out with its really low
bias.</p>

<p>The prospector has a template option (<code class="language-plaintext highlighter-rouge">-p</code>) if you want to try it
yourself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul,xorr,mul,xorr
</code></pre></div></div>

<p>If you really have your heart set on certain constants, such as my
specific selection of shifts, you can lock those in while randomizing
the other constants:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15
</code></pre></div></div>

<p>Or the other way around:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr
</code></pre></div></div>

<p>My function may seem a little strange in using shifts of 15 bits rather
than a nice, round 16 bits. However, changing those constants to 16
increases the bias. Similarly, neither of the two 32-bit constants is
a prime number, but <strong>nudging those constants to the nearest prime
increases the bias</strong>. These parameters really do seem to be a local
minimum in the bias, and using prime numbers isn’t important.</p>

<h3 id="what-about-64-bit-integer-hash-functions">What about 64-bit integer hash functions?</h3>

<p>So far I haven’t been able to improve on 64-bit hash functions. The main
function to beat is SplittableRandom / <a href="http://xoshiro.di.unimi.it/splitmix64.c">SplitMix64</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xbf58476d1ce4e5b9U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x94d049bb133111ebU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here’s its inverse since it’s sometimes useful:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">splittable64_r</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">31</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">62</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x319642b2d24d8ec3U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">54</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x96de1b173f119089U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">60</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I also came across <a href="https://gist.github.com/degski/6e2069d6035ae04d5d6f64981c995ec2">this function</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">hash64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xd6e8feb86659fd93U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Again, these follow the same construction as before. There really is
something special about it, and many other people have noticed, too.</p>

<p>Both functions have about the same bias. (Remember, I can only estimate
the bias for 64-bit hash functions.) The prospector has found lots of
functions with about the same bias, but nothing provably better. Until
it does, I have no new 64-bit integer hash functions to offer.</p>

<h3 id="beyond-random-search">Beyond random search</h3>

<p>Right now the prospector does a completely random, unstructured search
hoping to stumble upon something good by chance. Perhaps it would be
worth using a genetic algorithm to breed those 5-tuples toward an
optimum? Others have had <a href="https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html">success in this area with simulated
annealing</a>.</p>

<p>There’s probably more to exploit from the multiply-xorshift construction
that keeps popping up. If anything, the prospector is searching too
broadly, looking at constructions that could never really compete no
matter what the constants. In addition to everything above, I’ve been
looking for good 32-bit hash functions that don’t use any 32-bit
constants, but I’m really not finding any with a competitively low bias.</p>

<h3 id="update-after-one-week">Update after one week</h3>

<p>About one week after publishing this article I found an even better hash
function. I believe <strong>this is the least biased 32-bit integer hash
function <em>of this form</em> ever devised</strong>. It’s even less biased than the
MurmurHash3 finalizer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.17353355999581582</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x7feb352dU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x846ca68bU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// inverse</span>
<span class="kt">uint32_t</span>
<span class="nf">lowbias32_r</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x43021123U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span> <span class="o">^</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">30</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x1d69e2a5U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you’re willing to use an additional round of multiply-xorshift, this
next function actually reaches the theoretical bias limit (bias =
~0.021) as exhibited by a perfect integer hash function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// exact bias: 0.020888578919738908</span>
<span class="kt">uint32_t</span>
<span class="nf">triple32</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xed5ad4bbU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0xac4c1b51U</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">15</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">*=</span> <span class="mh">0x31848babU</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">14</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><del>It’s statistically indistinguishable from a random permutation of all
32-bit integers.</del>(<em>Update 2025</em>: Peter Schmidt-Nielsen has provided a
second-order characteristic test that quickly identifies statistically
significant biases in <code class="language-plaintext highlighter-rouge">triple32</code>.)</p>

<h3 id="update-february-2020">Update, February 2020</h3>

<p>Some people have been experimenting with using my hash functions in GLSL
shaders, and the results are looking good:</p>

<ul>
  <li><a href="https://www.shadertoy.com/view/WttXWX">https://www.shadertoy.com/view/WttXWX</a></li>
  <li><a href="https://www.shadertoy.com/view/ttVGDV">https://www.shadertoy.com/view/ttVGDV</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>The Value of Undefined Behavior</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/07/20/"/>
    <id>urn:uuid:9758e9ea-46b6-3904-5166-52c7e6922892</id>
    <updated>2018-07-20T21:31:18Z</updated>
    <category term="c"/><category term="cpp"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: <em>undefined behavior</em>. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing <a href="http://www.catb.org/jargon/html/N/nasal-demons.html">anything to happen</a>. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.</p>

<p>Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.</p>

<p>In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or <em>ab</em>use it depending on your
point of view — to lift <a href="/blog/2016/12/22/">constraints</a> that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.</p>

<p>The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
<a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">compilers assume programs never invoke undefined behavior</a> and
use those assumptions to prove their optimizations.</p>

<p>Under this newer interpretation, mistakes involving undefined behavior
are more <a href="https://kristerw.blogspot.com/2017/09/why-undefined-behavior-may-call-never.html">punishing</a> and <a href="/blog/2018/05/01/">surprising</a> than before. Programs
that <em>seem</em> to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.</p>

<p>This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really <em>could</em> be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
<em>abstract machine</em> rather than the hardware you actually care about.
However, <strong>even in this situation undefined behavior can still be
desirable</strong>. I will provide a couple of examples in this article.</p>

<h3 id="signed-integer-overflow">Signed integer overflow</h3>

<p>To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.</p>

<p>This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act of a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.</p>

<p>On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, <em>not</em> overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.</p>

<p>For example, here’s a fairly common situation, much like what <a href="https://www.youtube.com/watch?v=yG1OZ69H_-o&amp;t=38m18s">actually
happened in bzip2</a>. Consider these two functions that perform substring
comparison:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">cmp_signed</span><span class="p">(</span><span class="kt">int</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">cmp_unsigned</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">i1</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">i2</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">c1</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i1</span><span class="p">];</span>
        <span class="kt">int</span> <span class="n">c2</span> <span class="o">=</span> <span class="n">buf</span><span class="p">[</span><span class="n">i2</span><span class="p">];</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">c1</span> <span class="o">-</span> <span class="n">c2</span><span class="p">;</span>
        <span class="n">i1</span><span class="o">++</span><span class="p">;</span>
        <span class="n">i2</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In these functions, the indices <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> will always be small,
non-negative values. Since they’re non-negative, they should be <code class="language-plaintext highlighter-rouge">unsigned</code>,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually <em>don’t</em> want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.</p>

<p>The constraint is that <strong>the behavior of <code class="language-plaintext highlighter-rouge">i1</code> or <code class="language-plaintext highlighter-rouge">i2</code> overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior.</strong> On x86-64, where <code class="language-plaintext highlighter-rouge">int</code> is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.</p>

<p>In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.</p>

<p>Here’s the output of Clang 6.0.0 with <code class="language-plaintext highlighter-rouge">-Os</code> on x86-64. Pay close
attention to the main loop, which I named <code class="language-plaintext highlighter-rouge">.loop</code>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">cmp_signed:</span>
        <span class="nf">movsxd</span> <span class="nb">rdi</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; use i1 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span><span class="p">]</span>
        <span class="nf">movsxd</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; use i2 as a 64-bit integer</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span><span class="p">]</span>
        <span class="nf">jmp</span>    <span class="nv">.check</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rdi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rsi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">rdx</span>                  <span class="c1">; increment only the base pointer</span>
<span class="nl">.check:</span> <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

        <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>             <span class="c1">; return c1 - c2</span>
        <span class="nf">ret</span>

<span class="nl">cmp_unsigned:</span>
        <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">jne</span>    <span class="nv">.ret</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>

<span class="nl">.loop:</span>  <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">edi</span>             <span class="c1">; truncated i1 overflow</span>
        <span class="nf">mov</span>    <span class="nb">al</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rax</span><span class="p">]</span>
        <span class="nf">mov</span>    <span class="nb">ecx</span><span class="p">,</span> <span class="nb">esi</span>             <span class="c1">; truncated i2 overflow</span>
        <span class="nf">mov</span>    <span class="nb">cl</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdx</span> <span class="o">+</span> <span class="nb">rcx</span><span class="p">]</span>
        <span class="nf">inc</span>    <span class="nb">edi</span>                  <span class="c1">; increment i1</span>
        <span class="nf">inc</span>    <span class="nb">esi</span>                  <span class="c1">; increment i2</span>
        <span class="nf">cmp</span>    <span class="nb">al</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">je</span>     <span class="nv">.loop</span>

<span class="nl">.ret:</span>   <span class="nf">movzx</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">al</span>
        <span class="nf">movzx</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="nb">cl</span>
        <span class="nf">sub</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ecx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>As unsigned values, <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, <code class="language-plaintext highlighter-rouge">buf</code>, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).</p>

<p>The signed integer helps to communicate the <em>narrow contract</em> of the
function — the limited range of <code class="language-plaintext highlighter-rouge">i1</code> and <code class="language-plaintext highlighter-rouge">i2</code> — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. <code class="language-plaintext highlighter-rouge">-fwrapv</code>),
this capability is lost. In fact, using <code class="language-plaintext highlighter-rouge">-fwrapv</code> deoptimizes the signed
version of this function.</p>

<p>Side note: Using <code class="language-plaintext highlighter-rouge">size_t</code> (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.</p>
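<p>For illustration, here's a hypothetical C rendering of the kind of
comparison loop compiled above — the name and signature are
illustrative, not the original source:</p>

```c
/* Illustrative sketch: compare the bytes starting at two offsets into
 * buf. Narrow contract: the runs must eventually differ and the indices
 * must stay in bounds. Because signed overflow is undefined, the
 * compiler may widen i1 and i2 to 64-bit pointer offsets. */
static int
cmp_at(const unsigned char *buf, int i1, int i2)
{
    while (buf[i1] == buf[i2]) {
        i1++;
        i2++;
    }
    return buf[i1] - buf[i2];
}
```

<p>With <code>unsigned</code> indices instead, each increment must wrap
modulo 2<sup>32</sup>, producing the longer loop shown above.</p>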

<h3 id="strict-aliasing">Strict aliasing</h3>

<p>Another controversial undefined behavior is <a href="https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8"><em>strict aliasing</em></a>.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.</p>

<p>Here’s the classic example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>    <span class="c1">// store</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">b</span><span class="p">;</span> <span class="c1">// load</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Naively one might assume the <code class="language-plaintext highlighter-rouge">return *b</code> could be optimized to a simple
<code class="language-plaintext highlighter-rouge">return 0</code>. However, since <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.</p>

<p>If <code class="language-plaintext highlighter-rouge">foo</code> has a narrow contract that forbids <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> to alias, we
have a couple of options for helping our compiler.</p>

<p>First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Second, C99 introduced a <code class="language-plaintext highlighter-rouge">restrict</code> qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to <code class="language-plaintext highlighter-rouge">memcpy()</code> are qualified with <code class="language-plaintext highlighter-rouge">restrict</code> as of C99.
Passing aliasing pointers through <code class="language-plaintext highlighter-rouge">restrict</code> parameters is undefined
behavior, i.e. as far as the compiler is concerned, it never
happens.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="kr">restrict</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>
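<p>Under that no-alias contract, <code>restrict</code> lets the compiler
fold the final load in the earlier example (a sketch of the expected
optimization; actual codegen varies by compiler):</p>

```c
int
foo(int *restrict a, int *restrict b)
{
    *b = 0;
    *a = 1;
    return *b;  /* a and b cannot alias, so this may compile to "return 0" */
}
```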

<p>The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code> are never compatible even
when they have the same representation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">foo</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">a</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>If you use an extended or modified version of C without strict
aliasing (<code class="language-plaintext highlighter-rouge">-fno-strict-aliasing</code>), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.</p>

<p>What <a href="https://lkml.org/lkml/2003/2/26/158">irritates</a> a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* note: forbidden */</span>
<span class="kt">long</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">b</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">a</span><span class="p">;</span>
</code></pre></div></div>

<p>It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.</p>

<p>Besides, there are probably <a href="https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html">conforming, portable solutions</a>
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.</p>

<p>There <em>is</em> one special exception for strict aliasing: <code class="language-plaintext highlighter-rouge">char *</code> is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing and when you want to avoid
it. Writing through a <code class="language-plaintext highlighter-rouge">char *</code> pointer could force the compiler to
generate additional, unnecessary loads.</p>
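<p>For instance, in this hypothetical function the compiler must assume
<code>dst</code> might alias <code>*len</code>, so it may reload
<code>*len</code> on every iteration:</p>

```c
/* Hypothetical example: because dst is a char *, the stores through it
 * may alias *len, forcing a reload of *len each time around the loop. */
void
zero_fill(char *dst, int *len)
{
    for (int i = 0; i < *len; i++) {
        dst[i] = 0;
    }
}
```

<p>Copying <code>*len</code> into a local variable up front removes the
reload.</p>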

<p>In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: <code class="language-plaintext highlighter-rouge">uint8_t</code> is not necessarily <code class="language-plaintext highlighter-rouge">unsigned char</code>.
That’s just one possible <code class="language-plaintext highlighter-rouge">typedef</code> definition for it. It could instead
<code class="language-plaintext highlighter-rouge">typedef</code> to, say, some internal <code class="language-plaintext highlighter-rouge">__byte</code> type.</p>

<p>In other words, technically speaking, <code class="language-plaintext highlighter-rouge">uint8_t</code> does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
the compiler worrying about aliasing with other pointers, this would
be the tool to accomplish it. Unfortunately, so much existing code
violates this part of strict aliasing that no toolchain is
<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110">willing to exploit it</a> for optimization purposes.</p>

<h3 id="other-undefined-behaviors">Other undefined behaviors</h3>

<p>Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (e.g. a pointer cast to a float).</p>

<p>Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.</p>
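<p>The one-past-the-end exception is what makes the usual iteration
idiom legal — computing the end pointer is fine so long as it is never
dereferenced (a minimal sketch):</p>

```c
/* Computing a pointer one past the end of an array is defined behavior;
 * dereferencing it is not. */
static void
clear4(int a[4])
{
    int *end = a + 4;                 /* OK: one past the end */
    for (int *p = a; p < end; p++) {
        *p = 0;                       /* never dereferences end itself */
    }
}
```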

<p>Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s <em>too</em>
dangerous to use in practice, but the aggressive use of undefined
behavior is <a href="http://thoughtmesh.net/publish/367.php">not without merit</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  <entry>
    <title>When FFI Function Calls Beat Native C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/27/"/>
    <id>urn:uuid:cb339e3b-382e-3762-4e5c-10cf049f7627</id>
    <updated>2018-05-27T20:03:15Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update: There’s a good discussion on <a href="https://news.ycombinator.com/item?id=17171252">Hacker News</a>.</em></p>

<p>Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (<a href="https://en.wikipedia.org/wiki/Foreign_function_interface">FFI</a>):</p>

<p><a href="https://github.com/dyu/ffi-overhead">https://github.com/dyu/ffi-overhead</a></p>

<p>He created a shared object (<code class="language-plaintext highlighter-rouge">.so</code>) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.</p>

<p>For the C “FFI” he used standard dynamic linking, not <code class="language-plaintext highlighter-rouge">dlopen()</code>. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.</p>

<p>The most surprising result of the benchmark is that
<strong><a href="http://luajit.org/">LuaJIT’s</a> FFI is substantially faster than C</strong>. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?</p>

<p>It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes from the <em>Procedure Linkage Table</em>
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:</p>

<p><a href="https://github.com/skeeto/dynamic-function-benchmark">https://github.com/skeeto/dynamic-function-benchmark</a></p>

<p>Here are the results on an Intel i7-6700 (Skylake):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
</code></pre></div></div>

<p>These are three different types of function calls:</p>

<ol>
  <li>Through the PLT</li>
  <li>An indirect function call (via <code class="language-plaintext highlighter-rouge">dlsym(3)</code>)</li>
  <li>A direct function call (via a JIT-compiled function)</li>
</ol>

<p>As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.</p>

<p>In my benchmark, the function being called is named <code class="language-plaintext highlighter-rouge">empty()</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">empty</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
</code></pre></div></div>

<p>And to compile it into a shared object:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cc -shared -fPIC -Os -o empty.so empty.c
</code></pre></div></div>

<p>Just as in my <a href="/blog/2017/09/21/">PRNG shootout</a>, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.</p>

<h3 id="procedure-linkage-tables">Procedure Linkage Tables</h3>

<p>When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. <em>Address Space Layout Randomization</em>
(ASLR).</p>

<p>How is this resolved? Well, there are a couple of options.</p>

<p>One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then <em>patch</em> in the correct
address at each call site. How exactly this would work depends on the
particular <a href="https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">code model</a> used when compiling the binary.</p>

<p>The downsides to this approach are slower loading, larger binaries, and
less <a href="/blog/2016/04/10/">sharing of code pages</a> between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.</p>

<p>On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.</p>

<p>The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s <a href="/blog/2016/12/23/">trivially shared</a> between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed <em>lazily</em>, on
the first function call, making the load time even faster.</p>

<p>On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
<em>Global Offset Table</em> (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and <em>indirectly</em> jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.</p>

<p><img src="/img/diagram/plt.svg" alt="" /></p>

<p>The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark <em>only</em>
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Cleared by an alarm signal. */</span>
<span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">plt_benchmark</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">empty</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">empty()</code> is in the shared object, that call goes through the PLT.</p>

<h3 id="indirect-dynamic-calls">Indirect dynamic calls</h3>

<p>Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
<code class="language-plaintext highlighter-rouge">dlsym(3)</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="n">h</span> <span class="o">=</span> <span class="n">dlopen</span><span class="p">(</span><span class="s">"path/to/lib.so"</span><span class="p">,</span> <span class="n">RTLD_NOW</span><span class="p">);</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">)</span> <span class="o">=</span> <span class="n">dlsym</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="s">"f"</span><span class="p">);</span>
<span class="n">f</span><span class="p">();</span>
</code></pre></div></div>

<p>Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then <code class="language-plaintext highlighter-rouge">dlsym(3)</code> may actually return the address of
the PLT stub.)</p>

<p>However, this is still an <em>indirect</em> function call. On conventional
architectures, <em>direct</em> function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.</p>

<p>An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.</p>

<p>Here’s the benchmark:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">volatile</span> <span class="kt">sig_atomic_t</span> <span class="n">running</span><span class="p">;</span>

<span class="k">static</span> <span class="kt">long</span>
<span class="nf">indirect_benchmark</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">count</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">running</span><span class="p">;</span> <span class="n">count</span><span class="o">++</span><span class="p">)</span>
        <span class="n">f</span><span class="p">();</span>
    <span class="k">return</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function passed to this benchmark is fetched with <code class="language-plaintext highlighter-rouge">dlsym(3)</code> so the
compiler can’t <a href="/blog/2018/05/01/">do something tricky</a> like convert that indirect
call back into a direct call.</p>

<p>If the body of the loop was complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.</p>

<h3 id="direct-function-calls">Direct function calls</h3>

<p>The first two types of dynamic function calls are simple and easy to
use. <em>Direct</em> calls to dynamic functions are trickier business since they
require modifying code at run time. In my benchmark I put together a
<a href="/blog/2015/03/19/">little JIT compiler</a> to generate the direct call.</p>

<p>There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually near the target function, <code class="language-plaintext highlighter-rouge">empty()</code>. If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.</p>

<p>To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this <a href="/blog/2016/09/03/">requires parsing virtual files under <code class="language-plaintext highlighter-rouge">/proc</code></a>.</p>

<p>Here’s what my JIT’s memory allocation looks like. It assumes
<a href="/blog/2016/05/30/">reasonable behavior for <code class="language-plaintext highlighter-rouge">uintptr_t</code> casts</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="n">jit_func</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">empty</span><span class="p">)(</span><span class="kt">void</span><span class="p">))</span>
<span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span><span class="p">;</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">desired</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="n">addr</span> <span class="o">-</span> <span class="n">SAFETY_MARGIN</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">PAGEMASK</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">mmap</span><span class="p">(</span><span class="n">desired</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="cm">/* ... */</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It allocates two pages, one writable and the other containing
non-writable code. Similar to <a href="/blog/2017/01/08/">my closure library</a>, the lower
page is writable and holds the <code class="language-plaintext highlighter-rouge">running</code> variable that gets cleared by
the alarm. It needed to be near the JIT code so that accessing it is an
efficient RIP-relative load, just like in the other two benchmark
functions. The upper page contains this assembly:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">jit_benchmark:</span>
        <span class="nf">push</span>  <span class="nb">rbx</span>
        <span class="nf">xor</span>   <span class="nb">ebx</span><span class="p">,</span> <span class="nb">ebx</span>
<span class="nl">.loop:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">running</span><span class="p">]</span>
        <span class="nf">test</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
        <span class="nf">je</span>    <span class="nv">.done</span>
        <span class="nf">call</span>  <span class="nv">empty</span>
        <span class="nf">inc</span>   <span class="nb">ebx</span>
        <span class="nf">jmp</span>   <span class="nv">.loop</span>
<span class="nl">.done:</span>  <span class="nf">mov</span>   <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebx</span>
        <span class="nf">pop</span>   <span class="nb">rbx</span>
        <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">call empty</code> is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the <em>end</em> of the instruction):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// call empty</span>
    <span class="kt">uintptr_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">empty</span> <span class="o">-</span> <span class="p">(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">p</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="mh">0xe8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">;</span>
    <span class="o">*</span><span class="n">p</span><span class="o">++</span> <span class="o">=</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">;</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">empty()</code> wasn’t in a shared object and instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for <code class="language-plaintext highlighter-rouge">plt_benchmark()</code>, assuming somehow it didn’t inline
<code class="language-plaintext highlighter-rouge">empty()</code>.</p>

<p>Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.</p>

<h3 id="its-no-mystery">It’s no mystery</h3>

<p>Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, <em>even if they still
end up being indirect calls</em>. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>

<p>Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been taken <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, e.g. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the test reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
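<p>One way I might defend such a test — a sketch of my own, not something from the original code — is to make the allocation observable by touching the buffer through a <code>volatile</code> pointer. A volatile access is a side effect the compiler must preserve, so it cannot simulate the allocation away:</p>

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of new_image() hardened against elimination. */
static unsigned char *
new_image_checked(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p && w && h) {
            memset(p, shade, w * h);
            volatile unsigned char *q = p;
            (void)q[0]; // volatile read: the buffer must really exist
        }
    }
    return p;
}
```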

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>
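<p>If I were writing this defensively — a sketch, not code from any particular project — I’d repeat the overflow check explicitly instead of hoping <code>calloc()</code>’s internal test survives optimization:</p>

```c
#include <stdint.h>
#include <stdlib.h>

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init_checked(struct table *t, size_t n)
{
    /* Check overflow ourselves; don't rely on calloc()'s test,
     * which the compiler may optimize away along with the call. */
    if (n > SIZE_MAX / sizeof(*t->values)) {
        return 0; // fail: n * sizeof(*t->values) would overflow
    }
    t->counter = calloc(n, sizeof(*t->counter));
    if (!t->counter) {
        return 0; // fail
    }
    t->values = malloc(n * sizeof(*t->values));
    if (!t->values) {
        free(t->counter);
        return 0; // fail
    }
    return 1; // success
}
```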

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Branchless UTF-8 Decoder</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/10/06/"/>
    <id>urn:uuid:d62a6a1f-0e34-325e-9196-d66a354bc9b1</id>
    <updated>2017-10-06T23:29:02Z</updated>
    <category term="c"/><category term="optimization"/><category term="x86"/>
    <content type="html">
      <![CDATA[<p>This week I took a crack at writing a branchless <a href="https://tools.ietf.org/html/rfc3629">UTF-8</a> decoder:
a function that decodes a single UTF-8 code point from a byte stream
without any <code class="language-plaintext highlighter-rouge">if</code> statements, loops, short-circuit operators, or other
sorts of conditional jumps. You can find the source code here along
with a test suite and benchmark:</p>

<ul>
  <li><a href="https://github.com/skeeto/branchless-utf8">https://github.com/skeeto/branchless-utf8</a></li>
</ul>

<p>In addition to decoding the next code point, it detects any errors and
returns a pointer to the next code point. It’s the complete package.</p>

<p>Why branchless? Because high performance CPUs are pipelined. That is,
a single instruction is executed over a series of stages, and many
instructions are executed in overlapping time intervals, each at a
different stage.</p>

<p>The usual analogy is laundry. You can have more than one load of
laundry in process at a time because laundry is typically a pipelined
process. There’s a washing machine stage, dryer stage, and folding
stage. One load can be in the washer, a second in the dryer, and a
third being folded, all at once. This greatly increases throughput
because, under ideal circumstances with a full pipeline, an
instruction is completed each clock cycle despite any individual
instruction taking many clock cycles to complete.</p>

<p>Branches are the enemy of pipelines. The CPU can’t begin work on the
next instruction if it doesn’t know which instruction will be executed
next. It must finish computing the branch condition before it can
know. To deal with this, pipelined CPUs are also equipped with <em>branch
predictors</em>. The predictor guesses which branch will be taken and begins
executing instructions on that branch. The prediction is initially
made using static heuristics, and later those predictions are improved
<a href="http://www.agner.org/optimize/microarchitecture.pdf">by learning from previous behavior</a>. This even includes
predicting the number of iterations of a loop so that the final
iteration isn’t mispredicted.</p>

<p>A mispredicted branch has two dire consequences. First, all the
progress on the incorrect branch will need to be discarded. Second,
the pipeline will be flushed, and the CPU will be inefficient until
the pipeline fills back up with instructions on the correct branch.
With a sufficiently deep pipeline, it can easily be <strong>more efficient
to compute and discard an unneeded result than to avoid computing it
in the first place</strong>. Eliminating branches means eliminating the
hazards of misprediction.</p>
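<p>A tiny illustration of that trade-off — computing both outcomes and selecting without a jump — is the classic branchless minimum (a generic sketch, not part of the decoder itself):</p>

```c
#include <stdint.h>

/* Branchless min: -(a < b) is all ones when a < b and all zeros
 * otherwise, so the mask either selects a's bits or leaves b alone. */
static int32_t
min32(int32_t a, int32_t b)
{
    int32_t mask = -(a < b);
    return b ^ ((a ^ b) & mask);
}
```

Modern compilers often emit a conditional move for the naive ternary anyway, but the idiom makes the no-branch intent explicit.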

<p>Another hazard for pipelines is <em>dependencies</em>. If an instruction
depends on the result of a previous instruction, it may have to wait for
the previous instruction to make sufficient progress before it can
complete one of its stages. This is known as a <em>pipeline stall</em>, and it
is an important consideration in instruction set architecture (ISA)
design.</p>

<p>For example, on the x86-64 architecture, storing a 32-bit result in a
64-bit register will automatically clear the upper 32 bits of that
register. Any further use of that destination register cannot depend on
prior instructions since all bits have been set. This particular
optimization was missed in the design of the i386: Writing a 16-bit
result to a 32-bit register leaves the upper 16 bits intact, creating
false dependencies.</p>

<p>Dependency hazards are mitigated using <em>out-of-order execution</em>.
Rather than execute two dependent instructions back to back, which
would result in a stall, the CPU may instead execute an independent
instruction further away in between. A good compiler will also try to
spread out dependent instructions in its own instruction scheduling.</p>

<p>The effects of out-of-order execution are typically not visible to a
single thread, where everything will appear to have executed in order.
However, when multiple processes or threads can access the same memory
<a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">out-of-order execution can be observed</a>. It’s one of the
many <a href="/blog/2014/09/02/">challenges of writing multi-threaded software</a>.</p>

<p>The focus of my UTF-8 decoder was to be branchless, but there was one
interesting dependency hazard that neither GCC nor Clang were able to
resolve themselves. More on that later.</p>

<h3 id="what-is-utf-8">What is UTF-8?</h3>

<p>Without getting into the history of it, you can generally think of
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> as a method for encoding a series of 21-bit integers
(<em>code points</em>) into a stream of bytes.</p>

<ul>
  <li>
    <p>Shorter integers encode to fewer bytes than larger integers. The
shortest available encoding must be chosen, meaning there is one
canonical encoding for a given sequence of code points.</p>
  </li>
  <li>
    <p>Certain code points are off limits: <em>surrogate halves</em>. These are
code points <code class="language-plaintext highlighter-rouge">U+D800</code> through <code class="language-plaintext highlighter-rouge">U+DFFF</code>. Surrogates are used in UTF-16
to represent code points above U+FFFF and serve no purpose in UTF-8.
This has <a href="https://simonsapin.github.io/wtf-8/">interesting consequences</a> for pseudo-Unicode
strings, such as “wide” strings in the Win32 API, where surrogates may
appear unpaired. Such sequences cannot legally be represented in
UTF-8.</p>
  </li>
</ul>

<p>Keeping in mind these two rules, the entire format is summarized by
this table:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">x</code> placeholders are the bits of the encoded code point.</p>

<p>UTF-8 has some really useful properties:</p>

<ul>
  <li>
    <p>It’s backwards compatible with ASCII, which never used the highest
bit.</p>
  </li>
  <li>
    <p>Sort order is preserved. Sorting a set of code point sequences has the
same result as sorting their UTF-8 encoding.</p>
  </li>
  <li>
    <p>No additional zero bytes are introduced. In C we can continue using
null terminated <code class="language-plaintext highlighter-rouge">char</code> buffers, often without even realizing they
hold UTF-8 data.</p>
  </li>
  <li>
    <p>It’s self-synchronizing. A leading byte will never be mistaken for a
continuation byte. This allows for byte-wise substring searches,
meaning UTF-8 unaware functions like <code class="language-plaintext highlighter-rouge">strstr(3)</code> continue to work
without modification (except for normalization issues). It also
allows for unambiguous recovery of a damaged stream.</p>
  </li>
</ul>

<p>A straightforward approach to decoding might look something like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">utf8_simple</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">c</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mh">0x80</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x1f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">2</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf0</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xe0</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x0f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">3</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xf8</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0xf0</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mh">0xf4</span><span class="p">))</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">)</span> <span class="o">|</span>
             <span class="p">((</span><span class="kt">long</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">);</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">4</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// invalid</span>
        <span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// skip this byte</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;=</span> <span class="mh">0xd800</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">c</span> <span class="o">&lt;=</span> <span class="mh">0xdfff</span><span class="p">)</span>
        <span class="o">*</span><span class="n">c</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// surrogate half</span>
    <span class="k">return</span> <span class="n">next</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It branches off on the highest bits of the leading byte, extracts all of
those <code class="language-plaintext highlighter-rouge">x</code> bits from each byte, concatenates those bits, checks if it’s a
surrogate half, and returns a pointer to the next character. (This
implementation does <em>not</em> check that the highest two bits of each
continuation byte are correct.)</p>

<p>The CPU must correctly predict the length of the code point or else it
will suffer a hazard. An incorrect guess will stall the pipeline and
slow down decoding.</p>

<p>In real world text this is probably not a serious issue. For the
English language, the encoded length is nearly always a single byte.
However, even for non-English languages, text is <a href="http://utf8everywhere.org/">usually accompanied
by markup from the ASCII range of characters</a>, and, overall,
the encoded lengths will still be fairly consistent. As I said, the CPU
predicts branches based on the program’s previous behavior, so this
means it will temporarily learn some of the statistical properties of
the language being actively decoded. Pretty cool, eh?</p>

<p>Eliminating branches from the decoder side-steps any issues with
mispredicting encoded lengths. Only errors in the stream will cause
stalls. Since that’s probably the unusual case, the branch predictor
will be very successful by continually predicting success. That’s one
optimistic CPU.</p>

<h3 id="the-branchless-decoder">The branchless decoder</h3>

<p>Here’s the interface to my branchless decoder:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">utf8_decode</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="o">*</span><span class="n">c</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">e</span><span class="p">);</span>
</code></pre></div></div>

<p>I chose <code class="language-plaintext highlighter-rouge">void *</code> for the buffer so that it doesn’t care what type was
actually chosen to represent the buffer. It could be a <code class="language-plaintext highlighter-rouge">uint8_t</code>,
<code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">unsigned char</code>, etc. Doesn’t matter. The decoder accesses it
only as bytes.</p>

<p>On the other hand, with this interface you’re forced to use <code class="language-plaintext highlighter-rouge">uint32_t</code>
to represent code points. You could always change the function to suit
your own needs, though.</p>

<p>Errors are returned in <code class="language-plaintext highlighter-rouge">e</code>. It’s zero for success and non-zero when an
error was detected, without any particular meaning for different values.
Error conditions are mixed into this integer, so a zero simply means the
absence of error.</p>

<p>This is where you could accuse me of “cheating” a little bit. The
caller probably wants to check for errors, and so <em>they</em> will have to
branch on <code class="language-plaintext highlighter-rouge">e</code>. It seems I’ve just smuggled the branches outside of the
decoder.</p>

<p>However, as I pointed out, unless you’re expecting lots of errors, the
real cost is branching on encoded lengths. Furthermore, the caller
could instead accumulate the errors: count them, or make the error
“sticky” by ORing all <code class="language-plaintext highlighter-rouge">e</code> values together. Neither of these require a
branch. The caller could decode a huge stream and only check for
errors at the very end. The only branch would be the main loop (“are
we done yet?”), which is trivial to predict with high accuracy.</p>

<p>The first thing the function does is extract the encoded length of the
next code point:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">lengths</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
        <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span>
    <span class="p">};</span>

    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">lengths</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;&gt;</span> <span class="mi">3</span><span class="p">];</span>
</code></pre></div></div>

<p>Looking back to the UTF-8 table above, only the highest 5 bits determine
the length. That’s 32 possible values. The zeros are for invalid
prefixes. This will later cause a bit to be set in <code class="language-plaintext highlighter-rouge">e</code>.</p>

<p>With the length in hand, it can compute the position of the next code
point in the buffer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">s</span> <span class="o">+</span> <span class="n">len</span> <span class="o">+</span> <span class="o">!</span><span class="n">len</span><span class="p">;</span>
</code></pre></div></div>

<p>Originally this expression was the return value, computed at the very
end of the function. However, after inspecting the compiler’s assembly
output, I decided to move it up, and the result was a solid performance
boost. That’s because it spreads out dependent instructions. With the
address of the next code point known so early, <a href="https://www.youtube.com/watch?v=2EWejmkKlxs">the instructions that
decode the next code point can get started early</a>.</p>

<p>The reason for the <code class="language-plaintext highlighter-rouge">!len</code> is so that the pointer is advanced one byte
even in the face of an error (length of zero). Adding that <code class="language-plaintext highlighter-rouge">!len</code> is
actually somewhat costly, though I couldn’t figure out why.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">masks</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mh">0x00</span><span class="p">,</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="mh">0x1f</span><span class="p">,</span> <span class="mh">0x0f</span><span class="p">,</span> <span class="mh">0x07</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shiftc</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">c</span>  <span class="o">=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">masks</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">18</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">|=</span> <span class="p">(</span><span class="kt">uint32_t</span><span class="p">)(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0x3f</span><span class="p">)</span> <span class="o">&lt;&lt;</span>  <span class="mi">0</span><span class="p">;</span>
    <span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;=</span> <span class="n">shiftc</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>This reads four bytes regardless of the actual length. Avoiding the
unneeded reads would require a branch, so it can’t be helped. The unneeded bits are
shifted out based on the length. That’s all it takes to decode UTF-8
without branching.</p>

<p>One important consequence of always reading four bytes is that <strong>the
caller <em>must</em> zero-pad the buffer to at least four bytes</strong>. In practice,
this means padding the entire buffer with three bytes in case the last
character is a single byte.</p>

<p>The padding must be zero in order to detect errors. Otherwise the
padding might look like legal continuation bytes.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">mins</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">4194304</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">2048</span><span class="p">,</span> <span class="mi">65536</span><span class="p">};</span>
    <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">shifte</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">};</span>

    <span class="o">*</span><span class="n">e</span>  <span class="o">=</span> <span class="p">(</span><span class="o">*</span><span class="n">c</span> <span class="o">&lt;</span> <span class="n">mins</span><span class="p">[</span><span class="n">len</span><span class="p">])</span> <span class="o">&lt;&lt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">((</span><span class="o">*</span><span class="n">c</span> <span class="o">&gt;&gt;</span> <span class="mi">11</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x1b</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="mi">7</span><span class="p">;</span>  <span class="c1">// surrogate half?</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">2</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&amp;</span> <span class="mh">0xc0</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">|=</span> <span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>       <span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">6</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">^=</span> <span class="mh">0x2a</span><span class="p">;</span>
    <span class="o">*</span><span class="n">e</span> <span class="o">&gt;&gt;=</span> <span class="n">shifte</span><span class="p">[</span><span class="n">len</span><span class="p">];</span>
</code></pre></div></div>

<p>The first line checks if the shortest encoding was used, setting a bit
in <code class="language-plaintext highlighter-rouge">e</code> if it wasn’t. For a length of 0, this always fails.</p>

<p>The second line checks for a surrogate half by checking for a certain
prefix.</p>

<p>The next three lines accumulate the highest two bits of each
continuation byte into <code class="language-plaintext highlighter-rouge">e</code>. Each should be the bits <code class="language-plaintext highlighter-rouge">10</code>. These bits are
“compared” to <code class="language-plaintext highlighter-rouge">101010</code> (<code class="language-plaintext highlighter-rouge">0x2a</code>) using XOR. The XOR clears these bits as
long as they exactly match.</p>

<p><img src="/img/diagram/utf8-bits.svg" alt="" /></p>

<p>Finally the continuation prefix bits that don’t matter are shifted out.</p>

<h3 id="the-goal">The goal</h3>

<p>My primary — and totally arbitrary — goal was to beat the performance of
<a href="http://bjoern.hoehrmann.de/utf-8/decoder/dfa/">Björn Höhrmann’s DFA-based decoder</a>. Under favorable (and
artificial) benchmark conditions I had moderate success. You can try it
out on your own system by cloning the repository and running <code class="language-plaintext highlighter-rouge">make
bench</code>.</p>

<p>With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the
DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.</p>

<p><em>Update</em>: <a href="https://github.com/skeeto/branchless-utf8/issues/1">Björn pointed out</a> that his site includes a faster
variant of his DFA decoder. It is only 10% slower than the branchless
decoder with GCC, and it’s 20% faster than the branchless decoder with
Clang. So, in a sense, it’s still faster on average, even on a
benchmark that favors a branchless decoder.</p>

<p>The benchmark operates very similarly to <a href="/blog/2017/09/21/">my PRNG shootout</a> (e.g.
<code class="language-plaintext highlighter-rouge">alarm(2)</code>). First a buffer is filled with random UTF-8 data, then the
decoder decodes it again and again until the alarm fires. The
measurement is the number of bytes decoded.</p>

<p>The number of errors is printed at the end (always 0) in order to force
errors to actually get checked for each code point. Otherwise the sneaky
compiler omits the error checking from the branchless decoder, making it
appear much faster than it really is — a serious letdown once I noticed
my error. Since the other decoder is a DFA and error checking is built
into its graph, the compiler can’t really omit its error checking.</p>

<p>I called this “favorable” because the buffer being decoded isn’t
anything natural. Each time a code point is generated, first a length is
chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that
length is generated. The <strong>even distribution of lengths greatly favors a
branchless decoder</strong>. The random distribution inhibits branch
prediction. Real text has a far more favorable distribution.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">randchar</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">r</span> <span class="o">=</span> <span class="n">rand32</span><span class="p">(</span><span class="n">s</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">r</span> <span class="o">&amp;</span> <span class="mh">0x3</span><span class="p">);</span>
    <span class="n">r</span> <span class="o">&gt;&gt;=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">len</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">r</span> <span class="o">%</span> <span class="mi">128</span><span class="p">;</span>
        <span class="k">case</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">128</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">2048</span> <span class="o">-</span> <span class="mi">128</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">3</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">2048</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">65536</span> <span class="o">-</span> <span class="mi">2048</span><span class="p">);</span>
        <span class="k">case</span> <span class="mi">4</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">65536</span> <span class="o">+</span> <span class="n">r</span> <span class="o">%</span> <span class="p">(</span><span class="mi">131072</span> <span class="o">-</span> <span class="mi">65536</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="n">abort</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Given the odd input zero-padding requirement and the artificial
parameters of the benchmark, despite the supposed 20% speed boost
under GCC, my branchless decoder is not really any better than the DFA
decoder in practice. It’s just a different approach. In practice I’d
prefer Björn’s DFA decoder.</p>

<p><em>Update</em>: Bryan Donlan has followed up with <a href="https://github.com/bdonlan/branchless-utf8/commit/3802d3b0e10ea16810dd40f8116243971ff7603d">a SIMD UTF-8 decoder</a>.</p>

<p><em>Update 2024</em>: NRK has followed up with a <a href="https://nrk.neocities.org/articles/utf8-pext.html">parallel extract decoder</a>.</p>

<p><em>Update 2025</em>: Charles Eckman followed up by <a href="https://cceckman.com/writing/branchless-utf8-encoding/">sharing a branchless
encoder</a>, which inspired me to <a href="https://github.com/skeeto/scratch/blob/master/misc/utf8_branchless.c">give it a shot</a>.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Finding the Best 64-bit Simulation PRNG</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/09/21/"/>
    <id>urn:uuid:637af55f-6e33-31e5-25fa-edb590a16d44</id>
    <updated>2017-09-21T21:25:00Z</updated>
    <category term="c"/><category term="compsci"/><category term="x86"/><category term="crypto"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>August 2018 Update</strong>: <em>xoroshiro128+ fails <a href="http://pracrand.sourceforge.net/">PractRand</a> very
badly. Since this article was published, its authors have supplanted it
with <strong>xoshiro256**</strong>. It has essentially the same performance, but
better statistical properties. xoshiro256** is now my preferred PRNG.</em></p>

<p>I use pseudo-random number generators (PRNGs) a whole lot. They’re an
essential component in lots of algorithms and processes.</p>

<ul>
  <li>
    <p><strong>Monte Carlo simulations</strong>, where PRNGs are used to <a href="https://possiblywrong.wordpress.com/2015/09/15/kanoodle-iq-fit-and-dancing-links/">compute
numeric estimates</a> for problems that are difficult or impossible
to solve analytically.</p>
  </li>
  <li>
    <p><a href="/blog/2017/04/27/"><strong>Monte Carlo tree search AI</strong></a>, where massive numbers of games
are played out randomly in search of an optimal move. This is a
specific application of the previous item.</p>
  </li>
  <li>
    <p><a href="https://github.com/skeeto/carpet-fractal-genetics"><strong>Genetic algorithms</strong></a>, where a PRNG creates the initial
population, and then later guides in mutation and breeding of selected
solutions.</p>
  </li>
  <li>
    <p><a href="https://blog.cr.yp.to/20140205-entropy.html"><strong>Cryptography</strong></a>, where cryptographically-secure PRNGs
(CSPRNGs) produce output that is predictable for recipients who know
a particular secret, but not for anyone else. This article is only
concerned with plain PRNGs.</p>
  </li>
</ul>

<p>For the first three “simulation” uses, there are two primary factors
that drive the selection of a PRNG. These factors can be at odds with
each other:</p>

<ol>
  <li>
    <p>The PRNG should be <em>very</em> fast. The application should spend its
time running the actual algorithms, not generating random numbers.</p>
  </li>
  <li>
    <p>PRNG output should have robust statistical qualities. Bits should
appear to be independent and the output should closely follow the
desired distribution. Poor quality output will negatively affect
the algorithms using it. Also just as important is <a href="http://mumble.net/~campbell/2014/04/28/uniform-random-float">how you use
it</a>, but this article will focus only on generating bits.</p>
  </li>
</ol>

<p>In other situations, such as in cryptography or online gambling,
another important property is that an observer can’t learn anything
meaningful about the PRNG’s internal state from its output. For the
three simulation cases I care about, this is not a concern. Only speed
and quality properties matter.</p>

<p>Depending on the programming language, the PRNGs found in various
standard libraries may be of dubious quality. They’re slower than they
need to be, or have poorer quality than required. In some cases, such
as <code class="language-plaintext highlighter-rouge">rand()</code> in C, the algorithm isn’t specified, and you can’t rely on
it for anything outside of trivial examples. In other cases the
algorithm and behavior <em>is</em> specified, but you could easily do better
yourself.</p>

<p>My preference is to BYOPRNG: <em>Bring Your Own Pseudo-random Number
Generator</em>. You get reliable, identical output everywhere. Also, in
the case of C and C++ — and if you do it right — by embedding the PRNG
in your project, it will get inlined and unrolled, making it far more
efficient than a <a href="/blog/2016/10/27/">slow call into a dynamic library</a>.</p>

<p>A fast PRNG is going to be small, making it a great candidate for
embedding as, say, a header library. That leaves just one important
question, “Can the PRNG be small <em>and</em> have high quality output?” In
the 21st century, the answer to this question is an emphatic “yes!”</p>

<p>For the past few years my main go-to for a drop-in PRNG has been
<a href="https://en.wikipedia.org/wiki/Xorshift">xorshift*</a>. The body of the function is 6 lines of C, and its
entire state is a 64-bit integer, directly seeded. However, there are a
number of choices here, including other variants of Xorshift. How do I
know which one is best? The only way to know is to test it, hence my
64-bit PRNG shootout:</p>

<ul>
  <li><a href="https://github.com/skeeto/prng64-shootout"><strong>64-bit PRNG Shootout</strong></a></li>
</ul>

<p>Sure, there <a href="http://xoroshiro.di.unimi.it/">are other such shootouts</a>, but they’re all missing
something I want to measure. I also want to test in an environment very
close to how I’d use these PRNGs myself.</p>

<h3 id="shootout-results">Shootout results</h3>

<p>Before getting into the details of the benchmark and each generator,
here are the results. These tests were run on an i7-6700 (Skylake)
running Linux 4.9.0.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060
</code></pre></div></div>

<p><strong>The clear winner is <a href="http://xoroshiro.di.unimi.it/">xoroshiro128+</a></strong>, with a function body of
just 7 lines of C. It’s clearly the fastest, and the output had no
observed statistical failures. However, that’s not the whole story. A
couple of the other PRNGs have advantages that situationally make
them better suited than xoroshiro128+. I’ll go over these in the
discussion below.</p>

<p>These two versions of GCC and Clang were chosen because these are the
latest available in Debian 9 “Stretch.” It’s easy to build and run the
benchmark yourself if you want to try a different version.</p>

<h3 id="speed-benchmark">Speed benchmark</h3>

<p>In the speed benchmark, the PRNG is initialized, a 1-second <code class="language-plaintext highlighter-rouge">alarm(1)</code>
is set, then the PRNG fills a large <code class="language-plaintext highlighter-rouge">volatile</code> buffer of 64-bit unsigned
integers again and again as quickly as possible until the alarm fires.
The amount of memory written is measured as the PRNG’s speed.</p>

<p>The baseline “PRNG” writes zeros into the buffer. This represents the
absolute speed limit that no PRNG can exceed.</p>

<p>The purpose for making the buffer <code class="language-plaintext highlighter-rouge">volatile</code> is to force the entire
output to actually be “consumed” as far as the compiler is concerned.
Otherwise the compiler plays nasty tricks to make the program do as
little work as possible. Another way to deal with this would be to
<code class="language-plaintext highlighter-rouge">write(2)</code> the buffer, but of course I didn’t want to introduce
unnecessary I/O into a benchmark.</p>

<p>On Linux, SIGALRM was impressively consistent between runs, meaning it
was perfectly suitable for this benchmark. To account for any process
scheduling wonkiness, the benchmark was run 8 times and only the
fastest time was kept.</p>

<p>The SIGALRM handler sets a <code class="language-plaintext highlighter-rouge">volatile</code> global variable that tells the
generator to stop. The PRNG call was unrolled 8 times to keep the
alarm check from significantly impacting the benchmark. You can see
the effect for yourself by changing <code class="language-plaintext highlighter-rouge">UNROLL</code> to 1 (i.e. “don’t
unroll”) in the code. Unrolling beyond 8 times had no measurable
effect in my tests.</p>

<p>Due to the PRNGs being inlined, this unrolling makes the benchmark
less realistic, and it shows in the results. Using <code class="language-plaintext highlighter-rouge">volatile</code> for the
buffer helped to counter this effect and reground the results. This is
a fuzzy problem, and there’s not really any way to avoid it, but I
will also discuss this below.</p>

<h3 id="statistical-benchmark">Statistical benchmark</h3>

<p>To measure the statistical quality of each PRNG — mostly as a sanity
check — the raw binary output was run through <a href="http://webhome.phy.duke.edu/~rgb/General/dieharder.php">dieharder</a> 3.31.1:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prng | dieharder -g200 -a -m4
</code></pre></div></div>

<p>This statistical analysis has no timing characteristics and the
results should be the same everywhere. You would only need to re-run
it to test with a different version of dieharder, or a different
analysis tool.</p>

<p>There’s not much information to glean from this part of the shootout.
It mostly confirms that all of these PRNGs would work fine for
simulation purposes. The WEAK results are not very significant and are
only useful for breaking ties. Even a true RNG will get some WEAK
results. For example, the <a href="https://en.wikipedia.org/wiki/RdRand">x86 RDRAND</a> instruction (not
included in actual shootout) got 7 WEAK results in my tests.</p>

<p>The FAIL results are more significant, but a single failure doesn’t
mean much. A non-failing PRNG should be preferred to an otherwise
equal PRNG with a failure.</p>

<h3 id="individual-prngs">Individual PRNGs</h3>

<p>Admittedly the definition for “64-bit PRNG” is rather vague. My high
performance targets are all 64-bit platforms, so the highest PRNG
throughput will be built on 64-bit operations (<a href="/blog/2015/07/10/">if not wider</a>).
The original plan was to focus on PRNGs built from 64-bit operations.</p>

<p>Curiosity got the best of me, so I included some PRNGs that don’t use
<em>any</em> 64-bit operations. I just wanted to see how they stacked up.</p>

<h4 id="blowfish">Blowfish</h4>

<p>One of the reasons I <a href="/blog/2017/09/15/">wrote a Blowfish implementation</a> was to
evaluate its performance and statistical qualities, so naturally I
included it in the benchmark. It only uses 32-bit addition and 32-bit
XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit
integer. There are two different properties that combine to make four
variants in the benchmark: number of rounds and block mode.</p>

<p>Blowfish normally uses 16 rounds. This makes it a lot slower than a
non-cryptographic PRNG but gives it a <em>security margin</em>. I don’t care
about the security margin, so I included a 4-round variant. As
expected, it’s about four times faster.</p>

<p>The other feature I tested is the block mode: <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#CBC">Cipher Block
Chaining</a> (CBC) versus <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29">Counter</a> (CTR) mode. In CBC mode it
encrypts zeros as plaintext. This just means it’s encrypting its last
output. The ciphertext is the PRNG’s output.</p>

<p>In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster
than CBC in the 16-round variant and 23% faster in the 4-round variant.
The reason is simple, and it’s in part an artifact of unrolling the
generation loop in the benchmark.</p>

<p>In CBC mode, each output depends on the previous, but in CTR mode all
blocks are independent. Work can begin on the next output before the
previous output is complete. The x86 architecture uses out-of-order
execution to achieve many of its performance gains: Instructions may
be executed in a different order than they appear in the program,
though their observable effects must <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act/">generally be ordered
correctly</a>. Breaking dependencies between instructions allows
out-of-order execution to be fully exercised. It also gives the
compiler more freedom in instruction scheduling, though the <code class="language-plaintext highlighter-rouge">volatile</code>
accesses cannot be reordered with respect to each other (hence it
helping to reground the benchmark).</p>

<p>Statistically, the 4-round cipher was not significantly worse than the
16-round cipher. For simulation purposes the 4-round cipher would be
perfectly sufficient, though xoroshiro128+ is still more than 9 times
faster without sacrificing quality.</p>

<p>On the other hand, CTR mode had a single failure in both the 4-round
(dab_filltree2) and 16-round (dab_filltree) variants. Is there
something about CTR mode that makes it less suitable than CBC mode as
a PRNG, at least for Blowfish?</p>

<p>In the end Blowfish is too slow and too complicated to serve as a
simulation PRNG. This was entirely expected, but it’s interesting to see
how it stacks up.</p>

<h4 id="mersenne-twister-mt19937-64">Mersenne Twister (MT19937-64)</h4>

<p>Nobody ever got fired for choosing <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a>. It’s the
classical choice for simulations, and is still usually recommended to
this day. However, Mersenne Twister’s best days are behind it. I
tested the 64-bit variant, MT19937-64, and there are four problems:</p>

<ul>
  <li>
    <p>It’s between 1/4 and 1/5 the speed of xoroshiro128+.</p>
  </li>
  <li>
    <p>It’s got a large state: 2,500 bytes, versus xoroshiro128+’s 16 bytes.</p>
  </li>
  <li>
    <p>Its implementation is three times bigger than xoroshiro128+, and much
more complicated.</p>
  </li>
  <li>
    <p>It had one statistical failure (dab_filltree2).</p>
  </li>
</ul>

<p>Curiously my implementation is 16% faster with Clang than GCC. Since
Mersenne Twister isn’t seriously in the running, I didn’t take time to
dig into why.</p>

<p>Ultimately I would never choose Mersenne Twister for anything anymore.
This was also not surprising.</p>

<h4 id="permuted-congruential-generator-pcg">Permuted Congruential Generator (PCG)</h4>

<p>The <a href="http://www.pcg-random.org/">Permuted Congruential Generator</a> (PCG) has some really
interesting history behind it, particularly with its somewhat <a href="http://www.pcg-random.org/paper.html">unusual
paper</a>, controversial for both its excessive length (58 pages)
and informal style. It’s in close competition with Xorshift and
xoroshiro128+. I was really interested in seeing how it stacked up.</p>

<p>PCG is really just a Linear Congruential Generator (LCG) that doesn’t
output the lowest bits (too poor quality), and has an extra
permutation step to make up for the LCG’s other weaknesses. I included
two variants in my benchmark: the official PCG and a “simplified” PCG
(sPCG) with a simple permutation step. sPCG is just the first PCG
presented in the paper (34 pages in!).</p>

<p>Here’s essentially what the simplified version looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint32_t</span>
<span class="nf">spcg32</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="o">*</span><span class="n">s</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">*</span><span class="n">s</span> <span class="o">&gt;&gt;</span> <span class="n">shift</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The third line with the modular multiplication and addition is the
LCG. The bit shift is the permutation. This PCG uses the most
significant three bits of the result to determine which 32 bits to
output. That’s <em>the</em> novel component of PCG.</p>

<p>The two constants are entirely my own devising. It’s two 64-bit primes
generated using Emacs’ <code class="language-plaintext highlighter-rouge">M-x calc</code>: <code class="language-plaintext highlighter-rouge">2 64 ^ k r k n k p k p k p</code>.</p>

<p>Heck, that’s so simple that I could easily memorize this and code it
from scratch on demand. Key takeaway: This is <strong>one way that PCG is
situationally better than xoroshiro128+</strong>. In a pinch I could use Emacs
to generate a couple of primes and code the rest from memory. If you
participate in coding competitions, take note.</p>

<p>However, you probably also noticed PCG only generates 32-bit integers
despite using 64-bit operations. To properly generate a 64-bit value
we’d need 128-bit operations, which would need to be implemented in
software.</p>

<p>Instead, I doubled up on everything to run two PRNGs in parallel.
Despite the doubling in state size, the period doesn’t get any larger
since the PRNGs don’t interact with each other. We get something in
return, though. Remember what I said about out-of-order execution?
Except for the last step combining their results, since the two PRNGs
are independent, doubling up shouldn’t <em>quite</em> halve the performance,
particularly with the benchmark loop unrolling business.</p>

<p>Here’s my doubled-up version:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">spcg64</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">m</span>  <span class="o">=</span> <span class="mh">0x9b60933458e17d7d</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a0</span> <span class="o">=</span> <span class="mh">0xd737232eeccdf7ed</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">a1</span> <span class="o">=</span> <span class="mh">0x8b260b70b8e98891</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">p0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p1</span> <span class="o">*</span> <span class="n">m</span> <span class="o">+</span> <span class="n">a1</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">r0</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">int</span> <span class="n">r1</span> <span class="o">=</span> <span class="mi">29</span> <span class="o">-</span> <span class="p">(</span><span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="mi">61</span><span class="p">);</span>
    <span class="kt">uint64_t</span> <span class="n">high</span> <span class="o">=</span> <span class="n">p0</span> <span class="o">&gt;&gt;</span> <span class="n">r0</span><span class="p">;</span>
    <span class="kt">uint32_t</span> <span class="n">low</span>  <span class="o">=</span> <span class="n">p1</span> <span class="o">&gt;&gt;</span> <span class="n">r1</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">high</span> <span class="o">&lt;&lt;</span> <span class="mi">32</span><span class="p">)</span> <span class="o">|</span> <span class="n">low</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The “full” PCG has some extra shifts that makes it 25% (GCC) to 50%
(Clang) slower than the “simplified” PCG, but it does halve the WEAK
results.</p>

<p>In this 64-bit form, both are significantly slower than xoroshiro128+.
However, if you find yourself only needing 32 bits at a time (always
throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is
faster than using xoroshiro128+ and throwing away half its output.</p>

<h4 id="rc4">RC4</h4>

<p>This is another CSPRNG where I was curious how it would stack up. It
only uses 8-bit operations, and it generates a 64-bit integer one byte
at a time. It’s the slowest after 16-round Blowfish and generally not
useful as a simulation PRNG.</p>

<h4 id="xoroshiro128">xoroshiro128+</h4>

<p>xoroshiro128+ is the obvious winner in this benchmark and it seems to be
the best 64-bit simulation PRNG available. If you need a fast, quality
PRNG, just drop these 11 lines into your C or C++ program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xoroshiro128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">s0</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">s1</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">result</span> <span class="o">=</span> <span class="n">s0</span> <span class="o">+</span> <span class="n">s1</span><span class="p">;</span>
    <span class="n">s1</span> <span class="o">^=</span> <span class="n">s0</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">s0</span> <span class="o">&lt;&lt;</span> <span class="mi">55</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s0</span> <span class="o">&gt;&gt;</span> <span class="mi">9</span><span class="p">))</span> <span class="o">^</span> <span class="n">s1</span> <span class="o">^</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">14</span><span class="p">);</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&lt;&lt;</span> <span class="mi">36</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">s1</span> <span class="o">&gt;&gt;</span> <span class="mi">28</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s one important caveat: <strong>That 16-byte state must be
well-seeded.</strong> Having lots of zero bytes will lead to <em>terrible</em> initial
output until the generator mixes it all up. Having all zero bytes will
completely break the generator. If you’re going to seed from, say, the
unix epoch, then XOR it with 16 static random bytes.</p>

<h4 id="xorshift128-and-xorshift64">xorshift128+ and xorshift64*</h4>

<p>These generators are closely related and, like I said, xorshift64* was
what I used for years. Looks like it’s time to retire it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift64star</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">25</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">27</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">UINT64_C</span><span class="p">(</span><span class="mh">0x2545f4914f6cdd1d</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will
tolerate weak seeding so long as it’s not literally zero. Zero will also
break this generator.</p>

<p>If it weren’t for xoroshiro128+, then xorshift128+ would have been the
winner of the benchmark and my new favorite choice.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">xorshift128plus</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">x</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">y</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
    <span class="n">x</span> <span class="o">^=</span> <span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="mi">23</span><span class="p">;</span>
    <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span> <span class="o">^</span> <span class="n">y</span> <span class="o">^</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="mi">17</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">&gt;&gt;</span> <span class="mi">26</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a lot like xoroshiro128+, including the need to be well-seeded,
but it’s just slow enough to lose out. There’s no reason to use
xorshift128+ instead of xoroshiro128+.</p>

<h3 id="conclusion">Conclusion</h3>

<p>My own takeaway (until I re-evaluate some years in the future):</p>

<ul>
  <li>The best 64-bit simulation PRNG is xoroshiro128+.</li>
  <li>“Simplified” PCG can be useful in a pinch.</li>
  <li>When only 32-bit integers are necessary, use PCG.</li>
</ul>

<p>Things can change significantly between platforms, though. Here’s the
shootout on an ARM Cortex-A53:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080
</code></pre></div></div>

<p>LLVM is not as mature on this platform, but, with GCC, both
xoroshiro128+ and xorshift128+ matched the baseline! It seems memory
is the bottleneck.</p>

<p>So don’t necessarily take my word for it. You can run this shootout in
your own environment — perhaps even tossing in more PRNGs — to find
what’s appropriate for your own situation.</p>

]]>
    </content>
  </entry>
    
  <entry>
    <title>OpenMP and pwrite()</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/03/01/"/>
    <id>urn:uuid:dfdf8ca6-51aa-3a15-6bf0-98b39f20652a</id>
    <updated>2017-03-01T21:22:24Z</updated>
    <category term="c"/><category term="posix"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>The most common way I introduce multi-threading to <a href="/blog/2015/07/10/">small C
programs</a> is with OpenMP (Open Multi-Processing). It’s typically
used as compiler pragmas to parallelize computationally expensive
loops — iterations are processed by different threads in some
arbitrary order.</p>

<p>Here’s an example that computes the <a href="/blog/2011/11/28/">frames of a video</a> in
parallel. Despite being computed out of order, each frame is written
in order to a large buffer, then written to standard output all at
once at the end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_frames</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">output</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">DEFAULT_BETA</span><span class="p">;</span>

<span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="o">&amp;</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
</code></pre></div></div>

<p>Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without <code class="language-plaintext highlighter-rouge">#ifdef</code>.</p>

<p>There’s real value in this pragma API: <strong>The above example would still
compile and run correctly even when OpenMP isn’t available.</strong> The
pragma is ignored and the program just uses a single core like it
normally would. It’s a slick fallback.</p>
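
<p>A cheap way to see this fallback in action is the standard
<code class="language-plaintext highlighter-rouge">_OPENMP</code> macro, which compilers define only
when OpenMP is enabled (e.g. GCC’s <code class="language-plaintext highlighter-rouge">-fopenmp</code>).
A minimal sketch — the message strings are my own:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#ifdef _OPENMP
#  include &lt;omp.h&gt;
#endif

int main(void)
{
#ifdef _OPENMP
    /* OpenMP build: report how many threads the runtime may use */
    printf("OpenMP %d: up to %d threads\n", _OPENMP, omp_get_max_threads());
#else
    /* Pragmas were ignored; the program is still correct, just serial */
    puts("single-threaded fallback");
#endif
    return 0;
}
</code></pre></div></div>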

<p>When a program really <em>does</em> require synchronization there’s
<code class="language-plaintext highlighter-rouge">omp_lock_t</code> (mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer <code class="language-plaintext highlighter-rouge">#pragma omp critical</code>. It nicely maintains the
OpenMP-unsupported fallback.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* schedule(dynamic, 1): treat the loop like a work queue */</span>
<span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="cp">#pragma omp critical
</span>    <span class="p">{</span>
        <span class="n">write</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">frame</span><span class="p">));</span>
    <span class="p">}</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This would append the output to some output file in an arbitrary
order. The critical section <a href="/blog/2016/08/03/">prevents interleaving of
outputs</a>.</p>

<p>There are a couple of problems with this example:</p>

<ol>
  <li>
    <p>Only one thread can write at a time. If the write takes too long,
other threads will queue up behind the critical section and wait.</p>
  </li>
  <li>
    <p>The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with <code class="language-plaintext highlighter-rouge">lseek()</code>, but that only makes the critical section
even more important.</p>
  </li>
</ol>

<p>There’s an easy fix for both problems, one that also eliminates the
need for a critical section: POSIX <code class="language-plaintext highlighter-rouge">pwrite()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">pwrite</span><span class="p">(</span><span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>

<p>It’s like <code class="language-plaintext highlighter-rouge">write()</code> but has an offset parameter. Unlike <code class="language-plaintext highlighter-rouge">lseek()</code>
followed by a <code class="language-plaintext highlighter-rouge">write()</code>, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that <strong>the output must be a file, not a pipe</strong>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma omp parallel for schedule(dynamic, 1)
</span><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">num_frames</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">frame</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="kt">float</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">compute_theta</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
    <span class="n">compute_frame</span><span class="p">(</span><span class="n">frame</span><span class="p">,</span> <span class="n">theta</span><span class="p">,</span> <span class="n">beta</span><span class="p">);</span>
    <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">frame</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span> <span class="o">*</span> <span class="n">i</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">frame</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s no critical section, the writes can interleave, and the output
is in order.</p>

<p>If you’re concerned about standard output not being seekable (it often
isn’t), keep in mind that it will work just fine when invoked like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./compute_frames &gt; frames.ppm
</code></pre></div></div>

<h3 id="windows-portability">Windows Portability</h3>

<p>I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 <code class="language-plaintext highlighter-rouge">WriteFile()</code> function has an
“overlapped” parameter that works just like <code class="language-plaintext highlighter-rouge">pwrite()</code>. Typically
rather than call either directly, I’d wrap the write like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">HANDLE</span> <span class="n">out</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">DWORD</span> <span class="n">written</span><span class="p">;</span>
    <span class="n">OVERLAPPED</span> <span class="n">offset</span> <span class="o">=</span> <span class="p">{.</span><span class="n">Offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">};</span>
    <span class="k">return</span> <span class="n">WriteFile</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">written</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">offset</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#else </span><span class="cm">/* POSIX */</span><span class="cp">
#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span>
<span class="k">static</span> <span class="kt">int</span>
<span class="nf">write_frame</span><span class="p">(</span><span class="k">struct</span> <span class="n">frame</span> <span class="o">*</span><span class="n">f</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">offset</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">pwrite</span><span class="p">(</span><span class="n">STDOUT_FILENO</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">offset</span><span class="p">)</span> <span class="o">==</span> <span class="n">count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span></code></pre></div></div>

<p>Except for switching to <code class="language-plaintext highlighter-rouge">write_frame()</code>, the OpenMP part remains
untouched.</p>

<h3 id="real-world-example">Real World Example</h3>

<p>Here’s an example in a real program:</p>

<p><a href="https://gist.github.com/skeeto/d7e17bb2aa40907a3405c3933cb1f936" class="download">julia.c</a></p>

<p>Notice that, because <code class="language-plaintext highlighter-rouge">pwrite()</code> requires a
seekable output, there’s no piping directly into
<code class="language-plaintext highlighter-rouge">ppmtoy4m</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./julia &gt; output.ppm
$ ppmtoy4m -F 60:1 &lt; output.ppm &gt; output.y4m
$ x264 -o output.mp4 output.y4m
</code></pre></div></div>

<p><a href="/video/?v=julia-256" class="download">output.mp4</a></p>

<video src="https://skeeto.s3.amazonaws.com/share/julia-256.mp4" controls="" loop="" crossorigin="anonymous">
</video>

]]>
    </content>
  </entry>
  <entry>
    <title>How to Write Fast(er) Emacs Lisp</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/01/30/"/>
    <id>urn:uuid:cee07e3d-08cc-3465-1a29-c1e30b5bd0e2</id>
    <updated>2017-01-30T21:08:19Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Not everything written in Emacs Lisp needs to be fast. Most of Emacs
itself — around 82% — is written in Emacs Lisp <em>because</em> those parts
are generally not performance-critical. Otherwise these functions
would be built-ins written in C. Extensions to Emacs don’t have a
choice and — outside of a few exceptions like <a href="/blog/2016/11/05/">dynamic modules</a>
and inferior processes — must be written in Emacs Lisp, including
their performance-critical bits. Common performance hot spots are
automatic indentation, <a href="https://github.com/mooz/js2-mode">AST parsing</a>, and <a href="/blog/2016/12/11/">interactive
completion</a>.</p>

<p>Here are 5 guidelines, each very specific to Emacs Lisp, that will
result in faster code. The non-intrusive guidelines could be applied
at all times as a matter of style — choosing one equally expressive
and maintainable form over another just because it performs better.</p>

<p>There’s one caveat: These guidelines are focused on Emacs 25.1 and
“nearby” versions. Emacs is constantly evolving. Changes to the
<a href="/blog/2014/01/04/">virtual machine</a> and byte-code compiler may transform
currently-slow expressions into fast code, obsoleting some of these
guidelines. In the future I’ll add notes to this article for anything
that changes.</p>

<h3 id="1-use-lexical-scope">(1) Use lexical scope</h3>

<p>This guideline refers to the following being the first line of every
Emacs Lisp source file you write:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;;; -*- lexical-binding: t; -*-</span>
</code></pre></div></div>

<p>This point is worth mentioning again and again. Not only will <a href="/blog/2016/12/22/">your
code be more correct</a>, it will be measurably faster. Dynamic
scope is still opt-in through the explicit use of <em>special variables</em>,
so there’s absolutely no reason not to be using lexical scope. If
you’ve written clean, dynamic scope code, then switching to lexical
scope won’t have any effect on its behavior.</p>

<p>Along similar lines, special variables are a lot slower than local,
lexical variables. Only use them when necessary.</p>
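
<p>You can measure the gap yourself with <code class="language-plaintext highlighter-rouge">benchmark-run</code>.
A rough sketch with made-up names — byte-compile both and the lexical
version should come out well ahead:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;;; -*- lexical-binding: t; -*-
(defvar my-acc 0)  ; defvar makes this a special (dynamic) variable

(defun sum-special (n)
  (setq my-acc 0)
  (dotimes (i n my-acc)
    (setq my-acc (+ my-acc i))))

(defun sum-lexical (n)
  (let ((acc 0))  ; plain lexical local
    (dotimes (i n acc)
      (setq acc (+ acc i)))))

;; Compare:
;; (benchmark-run 100 (sum-special 10000))
;; (benchmark-run 100 (sum-lexical 10000))
</code></pre></div></div>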

<h3 id="2-prefer-built-in-functions">(2) Prefer built-in functions</h3>

<p>Built-in functions are written in C and are, as expected,
significantly faster than the equivalent written in Emacs Lisp.
Complete as much work as possible inside built-in functions, even if
it might mean taking more conceptual steps overall.</p>

<p>For example, what’s the fastest way to accumulate a list of items?
That is, new items go on the tail but, for algorithmic reasons, the list
must be constructed from the head.</p>

<p>You might be tempted to keep track of the tail of the list, appending
new elements directly to the tail with <code class="language-plaintext highlighter-rouge">setcdr</code> (via <code class="language-plaintext highlighter-rouge">setf</code> below).</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fib-track-tail</span> <span class="p">(</span><span class="nv">n</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">a</span> <span class="mi">0</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">b</span> <span class="mi">1</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">head</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span><span class="p">))</span>
         <span class="p">(</span><span class="nv">tail</span> <span class="nv">head</span><span class="p">))</span>
    <span class="p">(</span><span class="nb">dotimes</span> <span class="p">(</span><span class="nv">_</span> <span class="nv">n</span> <span class="nv">head</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">psetf</span> <span class="nv">a</span> <span class="nv">b</span>
             <span class="nv">b</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">setf</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">tail</span><span class="p">)</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">b</span><span class="p">)</span>
            <span class="nv">tail</span> <span class="p">(</span><span class="nb">cdr</span> <span class="nv">tail</span><span class="p">)))))</span>

<span class="p">(</span><span class="nv">fib-track-tail</span> <span class="mi">8</span><span class="p">)</span>
<span class="c1">;; =&gt; (1 1 2 3 5 8 13 21 34)</span>
</code></pre></div></div>

<p>Actually, it’s much faster to construct the list in reverse, then
destructively reverse it at the end.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">fib-nreverse</span> <span class="p">(</span><span class="nv">n</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">a</span> <span class="mi">0</span><span class="p">)</span>
         <span class="p">(</span><span class="nv">b</span> <span class="mi">1</span><span class="p">)</span>
         <span class="p">(</span><span class="nb">list</span> <span class="p">(</span><span class="nb">list</span> <span class="mi">1</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">dotimes</span> <span class="p">(</span><span class="nv">_</span> <span class="nv">n</span> <span class="p">(</span><span class="nb">nreverse</span> <span class="nb">list</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">psetf</span> <span class="nv">a</span> <span class="nv">b</span>
             <span class="nv">b</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">push</span> <span class="nv">b</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>It might not look it, but <code class="language-plaintext highlighter-rouge">nreverse</code> is <em>very</em> fast. Not only is it a
built-in, it’s got its own opcode. Using <code class="language-plaintext highlighter-rouge">push</code> in a loop, then
finishing with <code class="language-plaintext highlighter-rouge">nreverse</code> is the canonical and fastest way to
accumulate a list of items.</p>

<p>In <code class="language-plaintext highlighter-rouge">fib-track-tail</code>, the added complexity of tracking the tail in
Emacs Lisp is much slower than zipping over the entire list a second
time in C.</p>

<h3 id="3-avoid-unnecessary-lambda-functions">(3) Avoid unnecessary lambda functions</h3>

<p>I’m talking about <code class="language-plaintext highlighter-rouge">mapcar</code> and friends.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; Slower</span>
<span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">mapcar</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">))</span> <span class="nb">list</span><span class="p">))</span>
</code></pre></div></div>

<p>Listen, I know you love <a href="https://github.com/magnars/dash.el">dash.el</a> and higher order functions,
but <em>this habit ain’t cheap</em>. The byte-code compiler does not know how
to inline these lambdas, so there’s an additional per-element function
call overhead.</p>

<p>Worse, if you’re using lexical scope like I told you, the above
example forms a <em>closure</em> over <code class="language-plaintext highlighter-rouge">e</code>. This means a new function object
is created (e.g. <code class="language-plaintext highlighter-rouge">make-byte-code</code>) each time <code class="language-plaintext highlighter-rouge">expt-list</code> is called. To
be clear, I don’t mean that the lambda is recompiled each time — the
same byte-code string is shared between all instances of the same
lambda. A unique function vector (<code class="language-plaintext highlighter-rouge">#[...]</code>) and constants vector are
allocated and initialized each time <code class="language-plaintext highlighter-rouge">expt-list</code> is invoked.</p>

<p>Related mini-guideline: Don’t create any more garbage than strictly
necessary in performance-critical code.</p>

<p>Compare to an implementation with an explicit loop, using the
<code class="language-plaintext highlighter-rouge">nreverse</code> list-accumulation technique.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list-fast</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">result</span> <span class="p">()))</span>
    <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">x</span> <span class="nb">list</span> <span class="p">(</span><span class="nb">nreverse</span> <span class="nv">result</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">push</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">)</span> <span class="nv">result</span><span class="p">))))</span>
</code></pre></div></div>

<ul>
  <li>No unnecessary garbage is created.</li>
  <li>No unnecessary per-element function calls.</li>
</ul>

<p>This is the fastest possible definition for this function, and it’s
what you need to use in performance-critical code.</p>

<p>Personally I prefer the list comprehension approach, using <code class="language-plaintext highlighter-rouge">cl-loop</code>
from <code class="language-plaintext highlighter-rouge">cl-lib</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">expt-list-fast</span> <span class="p">(</span><span class="nb">list</span> <span class="nv">e</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">x</span> <span class="nv">in</span> <span class="nb">list</span>
           <span class="nv">collect</span> <span class="p">(</span><span class="nb">expt</span> <span class="nv">x</span> <span class="nv">e</span><span class="p">)))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">cl-loop</code> macro will expand into essentially the previous
definition, making them practically equivalent. It takes some getting
used to, but writing efficient loops is a whole lot less tedious with
<code class="language-plaintext highlighter-rouge">cl-loop</code>.</p>

<p>In Emacs 24.4 and earlier, <code class="language-plaintext highlighter-rouge">catch</code>/<code class="language-plaintext highlighter-rouge">throw</code> is implemented by
converting the body of the <code class="language-plaintext highlighter-rouge">catch</code> into a lambda function and calling
it. If code inside the <code class="language-plaintext highlighter-rouge">catch</code> accesses a variable outside the <code class="language-plaintext highlighter-rouge">catch</code>
(very likely), then, in lexical scope, it turns into a closure,
resulting in the garbage function object like before.</p>

<p>In Emacs 24.5 and later, the byte-code compiler uses a new opcode,
<code class="language-plaintext highlighter-rouge">pushcatch</code>. It’s a whole lot more efficient, and there’s no longer a
reason to shy away from <code class="language-plaintext highlighter-rouge">catch</code>/<code class="language-plaintext highlighter-rouge">throw</code> in performance-critical code.
This is important because it’s often the only way to perform an early
bailout.</p>
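
<p>For instance, an early bailout from a list scan — an illustrative
sketch, not from the benchmarks above:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun first-even (list)
  "Return the first even number in LIST, or nil if there is none."
  (catch 'found
    (dolist (x list)
      (when (zerop (% x 2))
        (throw 'found x)))))

;; (first-even '(1 3 4 7))  ; =&gt; 4
</code></pre></div></div>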

<h3 id="4-prefer-using-functions-with-dedicated-opcodes">(4) Prefer using functions with dedicated opcodes</h3>

<p>When following the guideline about using built-in functions, you might
have several to pick from. Some built-in functions have dedicated
virtual machine opcodes, making them much faster to invoke. Prefer
these functions when possible.</p>

<p>How can you tell when a function has an assigned opcode? Take a peek
at the <code class="language-plaintext highlighter-rouge">byte-defop</code> listings in <a href="https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/bytecomp.el">bytecomp.el</a>. Optimization often
involves getting into the weeds, so don’t be shy.</p>
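
<p>Another way to check, without reading the compiler sources, is to
disassemble a tiny test function and look for a named instruction
rather than a generic function call:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; If assq has its own opcode, the listing shows an assq
;; instruction instead of a call through the constants vector.
(disassemble (byte-compile (lambda (alist) (assq 'key alist))))
</code></pre></div></div>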

<p>For example, the <code class="language-plaintext highlighter-rouge">assq</code> and <code class="language-plaintext highlighter-rouge">assoc</code> functions search for a matching
key in an association list (alist). Both are built-in functions, and
the only difference is that the former compares keys with <code class="language-plaintext highlighter-rouge">eq</code> (e.g.
symbol or integer keys) and the latter with <code class="language-plaintext highlighter-rouge">equal</code> (typically string
keys). The difference in performance between <code class="language-plaintext highlighter-rouge">eq</code> and <code class="language-plaintext highlighter-rouge">equal</code> isn’t as
important as another factor: <code class="language-plaintext highlighter-rouge">assq</code> has its own opcode (158).</p>

<p>This means in performance-critical code you should prefer <code class="language-plaintext highlighter-rouge">assq</code>,
perhaps even going as far as restructuring your alists specifically to
have <code class="language-plaintext highlighter-rouge">eq</code> keys. That last step is probably a trade-off, which means
you’ll want to make some benchmarks to help with that decision.</p>

<p>Another example is <code class="language-plaintext highlighter-rouge">eq</code>, <code class="language-plaintext highlighter-rouge">=</code>, <code class="language-plaintext highlighter-rouge">eql</code>, and <code class="language-plaintext highlighter-rouge">equal</code>. Some macros and
functions use <code class="language-plaintext highlighter-rouge">eql</code>, especially <code class="language-plaintext highlighter-rouge">cl-lib</code> which inherits <code class="language-plaintext highlighter-rouge">eql</code> as a
default from Common Lisp. Take <code class="language-plaintext highlighter-rouge">cl-case</code>, which is like <code class="language-plaintext highlighter-rouge">switch</code> from
the C family of languages. It compares elements with <code class="language-plaintext highlighter-rouge">eql</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-case</span> <span class="nv">op</span>
    <span class="p">(</span><span class="ss">:norm</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
    <span class="p">(</span><span class="ss">:disp</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
    <span class="p">(</span><span class="ss">:isin</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">cl-case</code> expands into a <code class="language-plaintext highlighter-rouge">cond</code>. Since Emacs byte-code lacks
support for jump tables, there’s not much room for cleverness.</p>

<p><strong>Update</strong>: Emacs 26.1, released May 2018, introduced a jump table
opcode.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">cond</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:norm</span><span class="p">)</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:disp</span><span class="p">)</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eql</span> <span class="nv">op</span> <span class="ss">:isin</span><span class="p">)</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>It turns out <code class="language-plaintext highlighter-rouge">eql</code> is pretty much always the worst choice for
<code class="language-plaintext highlighter-rouge">cl-case</code>. Of the four equality functions I listed, the only one
lacking an opcode is <code class="language-plaintext highlighter-rouge">eql</code>. A faster definition would use <code class="language-plaintext highlighter-rouge">eq</code>. (In
theory, <code class="language-plaintext highlighter-rouge">cl-case</code> <em>could</em> have done this itself because it knows all
the keys are symbols.)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">op-apply</span> <span class="p">(</span><span class="nv">op</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">cond</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:norm</span><span class="p">)</span> <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">a</span> <span class="nv">a</span><span class="p">)</span> <span class="p">(</span><span class="nb">*</span> <span class="nv">b</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:disp</span><span class="p">)</span> <span class="p">(</span><span class="nb">abs</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
   <span class="p">((</span><span class="nb">eq</span> <span class="nv">op</span> <span class="ss">:isin</span><span class="p">)</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">b</span> <span class="p">(</span><span class="nb">sin</span> <span class="nv">a</span><span class="p">)))))</span>
</code></pre></div></div>

<p>Fortunately <code class="language-plaintext highlighter-rouge">eq</code> can safely compare integers in Emacs Lisp. You only
need <code class="language-plaintext highlighter-rouge">eql</code> when comparing symbols, integers, and floats all at once,
which is unusual.</p>
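<p>A quick illustration of the distinction (my own sketch, not from the article):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Symbols are interned, so eq compares them reliably.
(eq 'foo 'foo)    ; t
;; Fixnums also compare reliably with eq in Emacs Lisp.
(eq 100 100)      ; t
;; Floats are the case that genuinely needs eql; eq on floats
;; compares object identity and is unreliable.
(eql 1.5 1.5)     ; t
</code></pre></div></div>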

<h3 id="5-unroll-loops-using-andor">(5) Unroll loops using and/or</h3>

<p>Consider the following function which checks its argument against a
list of numbers, bailing out on the first match. I used <code class="language-plaintext highlighter-rouge">%</code> instead of
<code class="language-plaintext highlighter-rouge">mod</code> since the former has an opcode (166) and the latter does not.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">detect</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="k">catch</span> <span class="ss">'found</span>
    <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">f</span> <span class="o">'</span><span class="p">(</span><span class="mi">2</span> <span class="mi">3</span> <span class="mi">5</span> <span class="mi">7</span> <span class="mi">11</span> <span class="mi">13</span> <span class="mi">17</span> <span class="mi">19</span> <span class="mi">23</span> <span class="mi">29</span> <span class="mi">31</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="nv">f</span><span class="p">))</span>
        <span class="p">(</span><span class="k">throw</span> <span class="ss">'found</span> <span class="nv">f</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The byte-code compiler doesn’t know how to unroll loops. Fortunately
that’s something we can do for ourselves using <code class="language-plaintext highlighter-rouge">and</code> and <code class="language-plaintext highlighter-rouge">or</code>. The
compiler will turn this into clean, efficient jumps in the byte-code.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">detect-unrolled</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">2</span><span class="p">))</span> <span class="mi">2</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">3</span><span class="p">))</span> <span class="mi">3</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">5</span><span class="p">))</span> <span class="mi">5</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">7</span><span class="p">))</span> <span class="mi">7</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">11</span><span class="p">))</span> <span class="mi">11</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">13</span><span class="p">))</span> <span class="mi">13</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">17</span><span class="p">))</span> <span class="mi">17</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">19</span><span class="p">))</span> <span class="mi">19</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">23</span><span class="p">))</span> <span class="mi">23</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">29</span><span class="p">))</span> <span class="mi">29</span><span class="p">)</span>
      <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="nv">x</span> <span class="mi">31</span><span class="p">))</span> <span class="mi">31</span><span class="p">)))</span>
</code></pre></div></div>

<p>In Emacs 24.4 and earlier with the old-fashioned lambda-based <code class="language-plaintext highlighter-rouge">catch</code>,
the unrolled definition is seven times faster. With the faster
<code class="language-plaintext highlighter-rouge">pushcatch</code>-based <code class="language-plaintext highlighter-rouge">catch</code> it’s about twice as fast. This means the
loop overhead accounts for about half the work of the first definition
of this function.</p>

<p>Update: It was pointed out in the comments that this particular
example is equivalent to a <code class="language-plaintext highlighter-rouge">cond</code>. That’s literally true all the way
down to the byte-code, and it would be a clearer way to express the
unrolled code. In real code it’s often not <em>quite</em> equivalent.</p>
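<p>For reference, that <code class="language-plaintext highlighter-rouge">cond</code> version would look like this (abbreviated to a few divisors; the rest follow the same pattern):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun detect-cond (x)
  (cond ((= 0 (% x 2)) 2)
        ((= 0 (% x 3)) 3)
        ((= 0 (% x 5)) 5)
        ;; ... remaining divisors elided ...
        ((= 0 (% x 31)) 31)))
</code></pre></div></div>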

<p>Unlike some of the other guidelines, this is certainly something you’d
only want to do in code you know for sure is performance-critical.
Maintaining unrolled code is tedious and error-prone.</p>

<p>I’ve had the most success with this approach not by unrolling these
loops myself, but by <a href="/blog/2016/12/27/">using a macro</a>, or <a href="/blog/2016/12/11/">similar</a>, to
generate the unrolled form.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defmacro</span> <span class="nv">with-detect</span> <span class="p">(</span><span class="nv">var</span> <span class="nb">list</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">e</span> <span class="nv">in</span> <span class="nb">list</span>
           <span class="nv">collect</span> <span class="o">`</span><span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">=</span> <span class="mi">0</span> <span class="p">(</span><span class="nv">%</span> <span class="o">,</span><span class="nv">var</span> <span class="o">,</span><span class="nv">e</span><span class="p">))</span> <span class="o">,</span><span class="nv">e</span><span class="p">)</span> <span class="nv">into</span> <span class="nv">conditions</span>
           <span class="nv">finally</span> <span class="nb">return</span> <span class="o">`</span><span class="p">(</span><span class="nb">or</span> <span class="o">,@</span><span class="nv">conditions</span><span class="p">)))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">detect-unrolled</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">with-detect</span> <span class="nv">x</span> <span class="p">(</span><span class="mi">2</span> <span class="mi">3</span> <span class="mi">5</span> <span class="mi">7</span> <span class="mi">11</span> <span class="mi">13</span> <span class="mi">17</span> <span class="mi">19</span> <span class="mi">23</span> <span class="mi">29</span> <span class="mi">31</span><span class="p">)))</span>
</code></pre></div></div>

<h3 id="how-can-i-find-more-optimization-opportunities-myself">How can I find more optimization opportunities myself?</h3>

<p>Use <code class="language-plaintext highlighter-rouge">M-x disassemble</code> to inspect the byte-code for your own hot spots.
Observe how the byte-code changes in response to changes in your
functions. Take note of the sorts of forms that allow the byte-code
compiler to produce the best code, and then exploit it where you can.</p>
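<p>For example, to byte-compile and then inspect one of the functions above:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Compile the function, then pop up a buffer listing its byte-code.
(byte-compile 'detect-unrolled)
(disassemble 'detect-unrolled)
</code></pre></div></div>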

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>Domain-Specific Language Compilation in Elfeed</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/27/"/>
    <id>urn:uuid:6a6cd6a2-b44d-35b5-503c-c496d9094ac0</id>
    <updated>2016-12-27T21:46:30Z</updated>
    <category term="elfeed"/><category term="emacs"/><category term="elisp"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Last night I pushed another performance enhancement for Elfeed, this
time reducing the time spent parsing feeds. It’s accomplished by
compiling, during macro expansion, a jQuery-like domain-specific
language within Elfeed.</p>

<h3 id="heuristic-parsing">Heuristic parsing</h3>

<p>Given the nature of the domain — <a href="/blog/2013/09/23/">an under-specified standard</a>
and a lack of robust adherence — feed parsing is much more heuristic
than strict. Sure, everyone’s feed XML is strictly conforming since
virtually no feed reader tolerates invalid XML (thank you, XML
libraries), but, for the schema, the situation resembles the <em>de
facto</em> looseness of HTML. Sometimes important or required information
is missing, or is only available in <a href="https://www.intertwingly.net/wiki/pie/DublinCore">a different namespace</a>.
Sometimes, especially in the case of timestamps, it’s in the wrong
format, or encoded incorrectly, or ambiguous. It’s real world data.</p>

<p>To get a particular piece of information, Elfeed looks in a number of
different places within the feed, starting with the preferred source
and stopping when the information is found. For example, to find the
date of an Atom entry, Elfeed first searches for elements in this
order:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">&lt;published&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;updated&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;date&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;modified&gt;</code></li>
  <li><code class="language-plaintext highlighter-rouge">&lt;issued&gt;</code></li>
</ol>

<p>Failing to find any of these elements, or if no parsable date is
found, it settles on the current time. Only the <code class="language-plaintext highlighter-rouge">updated</code> element is
required, but <code class="language-plaintext highlighter-rouge">published</code> usually has the desired information, so it
goes first. The last three are only valid for another namespace, but
are useful fallbacks.</p>

<p>Before Elfeed even starts this search, the XML text is parsed into an
s-expression using <code class="language-plaintext highlighter-rouge">xml-parse-region</code> — a pure Elisp XML parser
included in Emacs. The search is made over the resulting s-expression.</p>

<p>For example, here’s a sample <a href="https://tools.ietf.org/html/rfc4287">from the Atom specification</a>.</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">&lt;?xml version="1.0" encoding="utf-8"?&gt;</span>
<span class="nt">&lt;feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">&gt;</span>

  <span class="nt">&lt;title&gt;</span>Example Feed<span class="nt">&lt;/title&gt;</span>
  <span class="nt">&lt;link</span> <span class="na">href=</span><span class="s">"http://example.org/"</span><span class="nt">/&gt;</span>
  <span class="nt">&lt;updated&gt;</span>2003-12-13T18:30:02Z<span class="nt">&lt;/updated&gt;</span>
  <span class="nt">&lt;author&gt;</span>
    <span class="nt">&lt;name&gt;</span>John Doe<span class="nt">&lt;/name&gt;</span>
  <span class="nt">&lt;/author&gt;</span>
  <span class="nt">&lt;id&gt;</span>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6<span class="nt">&lt;/id&gt;</span>

  <span class="nt">&lt;entry&gt;</span>
    <span class="nt">&lt;title&gt;</span>Atom-Powered Robots Run Amok<span class="nt">&lt;/title&gt;</span>
    <span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"alternate"</span> <span class="na">href=</span><span class="s">"http://example.org/2003/12/13/atom03"</span><span class="nt">/&gt;</span>
    <span class="nt">&lt;id&gt;</span>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a<span class="nt">&lt;/id&gt;</span>
    <span class="nt">&lt;updated&gt;</span>2003-12-13T18:30:02Z<span class="nt">&lt;/updated&gt;</span>
    <span class="nt">&lt;summary&gt;</span>Some text.<span class="nt">&lt;/summary&gt;</span>
  <span class="nt">&lt;/entry&gt;</span>

<span class="nt">&lt;/feed&gt;</span>
</code></pre></div></div>

<p>Which is parsed into this s-expression.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="nv">feed</span> <span class="p">((</span><span class="nv">xmlns</span> <span class="o">.</span> <span class="s">"http://www.w3.org/2005/Atom"</span><span class="p">))</span>
       <span class="p">(</span><span class="nv">title</span> <span class="p">()</span> <span class="s">"Example Feed"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">link</span> <span class="p">((</span><span class="nv">href</span> <span class="o">.</span> <span class="s">"http://example.org/"</span><span class="p">)))</span>
       <span class="p">(</span><span class="nv">updated</span> <span class="p">()</span> <span class="s">"2003-12-13T18:30:02Z"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">author</span> <span class="p">()</span> <span class="p">(</span><span class="nv">name</span> <span class="p">()</span> <span class="s">"John Doe"</span><span class="p">))</span>
       <span class="p">(</span><span class="nv">id</span> <span class="p">()</span> <span class="s">"urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6"</span><span class="p">)</span>
       <span class="p">(</span><span class="nv">entry</span> <span class="p">()</span>
              <span class="p">(</span><span class="nv">title</span> <span class="p">()</span> <span class="s">"Atom-Powered Robots Run Amok"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">link</span> <span class="p">((</span><span class="nv">rel</span> <span class="o">.</span> <span class="s">"alternate"</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">href</span> <span class="o">.</span> <span class="s">"http://example.org/2003/12/13/atom03"</span><span class="p">)))</span>
              <span class="p">(</span><span class="nv">id</span> <span class="p">()</span> <span class="s">"urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">updated</span> <span class="p">()</span> <span class="s">"2003-12-13T18:30:02Z"</span><span class="p">)</span>
              <span class="p">(</span><span class="nv">summary</span> <span class="p">()</span> <span class="s">"Some text."</span><span class="p">))))</span>
</code></pre></div></div>

<p>Each XML element is converted to a list. The first item is a symbol
that is the element’s name. The second item is an alist of attributes
— cons pairs of symbols and strings. And the rest are its children,
both string nodes and other elements. I’ve trimmed the extraneous
string nodes from the sample s-expression.</p>

<p>A subtle detail is that <code class="language-plaintext highlighter-rouge">xml-parse-region</code> doesn’t just return the
root element. It returns a <em>list of elements</em>, which always happens to
be a one-element list containing the root element. I don’t know why
this is, but I’ve built everything to assume this structure as input.</p>

<p>Elfeed strips all namespaces from both elements and attributes to
make parsing simpler. As I said, it’s heuristic rather than strict, so
namespaces are treated as noise.</p>

<h3 id="a-domain-specific-language">A domain-specific language</h3>

<p>Coding up Elfeed’s s-expression searches in straight Emacs Lisp would
be tedious, error-prone, and difficult to understand. It’s a lot of
loops, <code class="language-plaintext highlighter-rouge">assoc</code>, etc. So instead I invented a jQuery-like, CSS
selector-like, domain-specific language (DSL) to express these
searches concisely and clearly.</p>

<p>For example, all of the entry links are “selected” using this
expression:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span>
</code></pre></div></div>

<p>Reading right-to-left, this matches every <code class="language-plaintext highlighter-rouge">href</code> attribute under every
<code class="language-plaintext highlighter-rouge">link</code> element with the <code class="language-plaintext highlighter-rouge">rel="alternate"</code> attribute, under every
<code class="language-plaintext highlighter-rouge">entry</code> element, under the <code class="language-plaintext highlighter-rouge">feed</code> root element. Symbols match element
names, two-element vectors match elements with a particular attribute
pair, and keywords (which must come last) narrow the selection to a
specific attribute value.</p>

<p>Imagine hand-writing the code to navigate all these conditions for
each piece of information that Elfeed requires. The RSS parser makes
up to 16 such queries, and the Atom parser makes as many as 24. That
would add up to a lot of tedious code.</p>

<p>The package (included with Elfeed) that executes this query is called
“xml-query.” It comes in two flavors: <code class="language-plaintext highlighter-rouge">xml-query</code> and <code class="language-plaintext highlighter-rouge">xml-query-all</code>.
The former returns just the first match, and the latter returns all
matches. The naming parallels the <code class="language-plaintext highlighter-rouge">querySelector()</code> and
<code class="language-plaintext highlighter-rouge">querySelectorAll()</code> DOM methods in JavaScript.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">xml</span> <span class="p">(</span><span class="nv">elfeed-xml-parse-region</span><span class="p">)))</span>
  <span class="p">(</span><span class="nv">xml-query-all</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span> <span class="nv">xml</span><span class="p">))</span>

<span class="c1">;; =&gt; ("http://example.org/2003/12/13/atom03")</span>
</code></pre></div></div>

<p>That date search I mentioned before looks roughly like this. The <code class="language-plaintext highlighter-rouge">*</code>
matches text nodes within the selected element. It must come last just
like the keyword matcher.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">published</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">updated</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">date</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">modified</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">xml-query</span> <span class="o">'</span><span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">issued</span> <span class="nb">*</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">current-time</span><span class="p">))</span>
</code></pre></div></div>

<p>Over the past three years, Elfeed has gained more and more of these
selectors as it collects more and more information from feeds. Most
recently, Elfeed collects author and category information provided by
feeds. Each new query slows feed parsing a little bit, and it’s a
perfect example of a program slowing down as it gains more features
and capabilities.</p>

<p>But I don’t want Elfeed to slow down. I want it to get <em>faster</em>!</p>

<h3 id="optimizing-the-domain-specific-language">Optimizing the domain-specific language</h3>

<p>Just like the primary jQuery function (<code class="language-plaintext highlighter-rouge">$</code>), both <code class="language-plaintext highlighter-rouge">xml-query</code> and
<code class="language-plaintext highlighter-rouge">xml-query-all</code> are functions. The xml-query engine processes the
selector from scratch on each invocation. It examines the first
element, dispatches on its type/value to apply it to the input, and
then recurses on the rest of selector with the narrowed input,
stopping when it hits the end of the list. That’s the way it’s worked
from the start.</p>

<p>However, every selector argument in Elfeed is a static, quoted list.
<a href="/blog/2016/12/11/">Unlike user-supplied filters</a>, I know exactly what I want to
execute ahead of time. It would be much better if the engine didn’t
have to waste time reparsing the DSL for each query.</p>

<p>This is the classic split between interpreters and compilers. An
interpreter reads input and immediately executes it, doing what the
input tells it to do. A compiler reads input and, rather than execute
it, produces output, usually in a simpler language, that, when
evaluated, has the same effect as executing the input.</p>

<p>Rather than interpret the selector, it would be better to compile it
into Elisp code, compile that <a href="/blog/2014/01/04/">into byte-code</a>, and then have the
Emacs byte-code virtual machine (VM) execute the query each time it’s
needed. The extra work of parsing the DSL is performed ahead of time,
the dispatch is entirely static, and the selector ultimately executes
on a much faster engine (byte-code VM). This should be a lot faster!</p>

<p>So I wrote a function that accepts a selector expression and emits
Elisp source that implements that selector: a compiler for my DSL.
Having a readily-available syntax tree is one of the <a href="https://en.wikipedia.org/wiki/Homoiconicity">big advantages
of homoiconicity</a>, and this sort of function makes perfect sense
in a lisp. For the external interface, this compiler function is
called by a new pair of macros, <code class="language-plaintext highlighter-rouge">xml-query*</code> and <code class="language-plaintext highlighter-rouge">xml-query-all*</code>.
These macros consume a static selector and expand into the compiled
Elisp form of the selector.</p>

<p>To demonstrate, remember that link query from before? Here’s the macro
version of that selection, but only returning the first match. Notice
the selector is no longer quoted. This is because it’s consumed by the
macro, not evaluated.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">xml-query*</span> <span class="p">(</span><span class="nv">feed</span> <span class="nv">entry</span> <span class="nv">link</span> <span class="nv">[rel</span> <span class="s">"alternate"</span><span class="nv">]</span> <span class="ss">:href</span><span class="p">)</span> <span class="nv">xml</span><span class="p">)</span>
</code></pre></div></div>

<p>This will expand into the following code.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">catch</span> <span class="ss">'done</span>
  <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="nv">xml</span><span class="p">)</span>
    <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'feed</span><span class="p">))</span>
      <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cddr</span> <span class="nv">v</span><span class="p">))</span>
        <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'entry</span><span class="p">))</span>
          <span class="p">(</span><span class="nb">dolist</span> <span class="p">(</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cddr</span> <span class="nv">v</span><span class="p">))</span>
            <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">and</span> <span class="p">(</span><span class="nb">consp</span> <span class="nv">v</span><span class="p">)</span> <span class="p">(</span><span class="nb">eq</span> <span class="p">(</span><span class="nb">car</span> <span class="nv">v</span><span class="p">)</span> <span class="ss">'link</span><span class="p">))</span>
              <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">value</span> <span class="p">(</span><span class="nb">cdr</span> <span class="p">(</span><span class="nv">assq</span> <span class="ss">'rel</span> <span class="p">(</span><span class="nb">cadr</span> <span class="nv">v</span><span class="p">)))))</span>
                <span class="p">(</span><span class="nb">when</span> <span class="p">(</span><span class="nb">equal</span> <span class="nv">value</span> <span class="s">"alternate"</span><span class="p">)</span>
                  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">v</span> <span class="p">(</span><span class="nb">cdr</span> <span class="p">(</span><span class="nv">assq</span> <span class="ss">'href</span> <span class="p">(</span><span class="nb">cadr</span> <span class="nv">v</span><span class="p">)))))</span>
                    <span class="p">(</span><span class="nb">when</span> <span class="nv">v</span>
                      <span class="p">(</span><span class="k">throw</span> <span class="ss">'done</span> <span class="nv">v</span><span class="p">))))))))))))</span>
</code></pre></div></div>

<p>As soon as it finds a match, it’s thrown to the top level and
returned. Without the DSL, the expansion is essentially what would
have to be written by hand. <strong>This is <em>exactly</em> the sort of leverage
you should be getting from a compiler.</strong> It compiles to around 130
byte-code instructions.</p>

<p>The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> form is nearly the same, but instead of a
<code class="language-plaintext highlighter-rouge">throw</code>, it pushes the result into the return list. Only the prologue
(the outermost part) and the epilogue (the innermost part) are
different.</p>
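<p>As a rough sketch of that shape (my own approximation, not the exact expansion), the all-matches version accumulates into a list instead of throwing:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(let ((matches ()))                       ; prologue: an accumulator
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      ;; ... the same nested dolist/when chain as above ...
      (let ((v (cdr (assq 'href (cadr v)))))
        (when v
          (push v matches)))))            ; epilogue: push, don't throw
  (nreverse matches))
</code></pre></div></div>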

<p>Parsing feeds is a hot spot for Elfeed, so I wanted the compiler’s
output to be as efficient as possible. I had three goals for this:</p>

<ul>
  <li>
    <p><strong>No extraneous code.</strong> It’s easy for the compiler to emit
unnecessary code. The byte-code compiler might be able to eliminate
some of it, but I don’t want to rely on that. Except for the
identifiers, it should basically look like a human wrote it.</p>
  </li>
  <li>
    <p><strong>Avoid function calls.</strong> I don’t want to pay function call
overhead, and, with some care, it’s easy to avoid. In the
<code class="language-plaintext highlighter-rouge">xml-query*</code> expansion, the only function call is <code class="language-plaintext highlighter-rouge">throw</code>, which is
unavoidable. The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> version makes no function calls
whatsoever. Notice that I used <code class="language-plaintext highlighter-rouge">assq</code> rather than <code class="language-plaintext highlighter-rouge">assoc</code>. First, it
only needs to match symbols, so it should be faster. Second, <code class="language-plaintext highlighter-rouge">assq</code>
has its own byte-code instruction (158) and <code class="language-plaintext highlighter-rouge">assoc</code> does not.</p>
  </li>
  <li>
    <p><strong>No unnecessary memory allocations</strong>. The <code class="language-plaintext highlighter-rouge">xml-query*</code> expansion
makes <em>no</em> allocations. The <code class="language-plaintext highlighter-rouge">xml-query-all*</code> version only conses
once per output, which is the minimum possible.</p>
  </li>
</ul>

<p>The end result is at least as efficient as hand-written code, but
without the chance of human error (typos, fat-fingering), and it is
sourced from an easy-to-read DSL.</p>

<h3 id="performance">Performance</h3>

<p>In my tests, the <strong>xml-query macros are a full order of magnitude
faster than the functions</strong>. Yes, ten times faster! It’s an even
bigger gain than I expected.</p>

<p>In the full picture, xml-query is only one part of parsing a feed.
Measuring the time starting from raw XML text (as <a href="/blog/2016/06/16/">delivered by
cURL</a>) to a list of database entry objects, I’m seeing an
<strong>overall 25% speedup</strong> with the macros. The remaining time is
dominated by <code class="language-plaintext highlighter-rouge">xml-parse-region</code>, which is mostly out of my control.</p>

<p>With xml-query so computationally cheap, I don’t need to worry about
using it more often. Compared to parsing XML text, it’s virtually
free.</p>

<p>When it came time to validate my DSL compiler, I was <em>really</em> happy
that Elfeed had a test suite. I essentially rewrote a core component
from scratch, and passing all of the unit tests was a strong sign that
it was correct. Many times that test suite has provided confidence in
changes made both by me and by others.</p>

<p>I’ll end by describing another possible application: Apply this
technique to regular expressions, such that static strings containing
regular expressions are compiled into Elisp/byte-code via macro
expansion. I wonder if situationally this would be faster than Emacs’
own regular expression engine.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Relocatable Global Data on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/23/"/>
    <id>urn:uuid:56be19e0-ce9a-3f37-dc85-578f397ed3e1</id>
    <updated>2016-12-23T22:50:51Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="linux"/>
    <content type="html">
      <![CDATA[<p>Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?</p>

<p>Consider this simple C code sample.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function needs the base address of <code class="language-plaintext highlighter-rouge">values</code> in order to
dereference it for <code class="language-plaintext highlighter-rouge">values[x]</code>. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -Os -fPIC get_value.c
</code></pre></div></div>

<p>I optimized for size (<code class="language-plaintext highlighter-rouge">-Os</code>) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with <code class="language-plaintext highlighter-rouge">objdump</code>. Alternatively I
could have asked for the compiler’s assembly output with <code class="language-plaintext highlighter-rouge">-S</code>, but
this will be good reverse engineering practice.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -d -Mintel get_value.o
0000000000000000 &lt;get_value&gt;:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 &lt;get_value+0x16&gt;
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret
</code></pre></div></div>

<p>There are a couple of interesting things going on, but let’s start
from the beginning.</p>

<ol>
  <li>
    <p><a href="https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-secure.pdf">The ABI</a> specifies that the first integer/pointer argument
(the 32-bit integer <code class="language-plaintext highlighter-rouge">x</code>) is passed through the <code class="language-plaintext highlighter-rouge">edi</code> register. The
function compares <code class="language-plaintext highlighter-rouge">x</code> to 3, to satisfy <code class="language-plaintext highlighter-rouge">x &lt; 4</code>.</p>
  </li>
  <li>
    <p>The ABI specifies that floating point values are returned through
the <a href="/blog/2015/07/10/">SSE2 SIMD register</a> <code class="language-plaintext highlighter-rouge">xmm0</code>. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of <code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>It then uses the result of the previous comparison to perform a
jump, <code class="language-plaintext highlighter-rouge">ja</code> (“jump if after”). That is, jump to the relative address
specified by the jump’s operand if the first operand to <code class="language-plaintext highlighter-rouge">cmp</code>
(<code class="language-plaintext highlighter-rouge">edi</code>) comes after the second operand (<code class="language-plaintext highlighter-rouge">0x3</code>) as <em>unsigned</em> values.
Its cousin, <code class="language-plaintext highlighter-rouge">jg</code> (“jump if greater”), is for signed values. If <code class="language-plaintext highlighter-rouge">x</code>
is outside the array bounds, it jumps straight to <code class="language-plaintext highlighter-rouge">ret</code>, returning
<code class="language-plaintext highlighter-rouge">0.0f</code>.</p>
  </li>
  <li>
    <p>If <code class="language-plaintext highlighter-rouge">x</code> was in bounds, it uses a <code class="language-plaintext highlighter-rouge">lea</code> (“load effective address”) to
load <em>something</em> into the 64-bit <code class="language-plaintext highlighter-rouge">rax</code> register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into <code class="language-plaintext highlighter-rouge">rax</code> is the address of the <code class="language-plaintext highlighter-rouge">values</code> array. More on
this in a moment.</p>
  </li>
  <li>
    <p>Finally it uses <code class="language-plaintext highlighter-rouge">x</code> as an index into address in <code class="language-plaintext highlighter-rouge">rax</code>. The <code class="language-plaintext highlighter-rouge">movss</code>
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of <code class="language-plaintext highlighter-rouge">xmm0</code>, where the caller expects to find the
return value. This is all preceded by a <code class="language-plaintext highlighter-rouge">mov edi, edi</code> which
<a href="/blog/2016/03/31/"><em>looks</em> like a hotpatch nop</a>, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses <code class="language-plaintext highlighter-rouge">rdi</code> not <code class="language-plaintext highlighter-rouge">edi</code>.
All 32-bit register assignments clear the upper 32 bits, and so
this <code class="language-plaintext highlighter-rouge">mov</code> zero-extends <code class="language-plaintext highlighter-rouge">edi</code> into <code class="language-plaintext highlighter-rouge">rdi</code>. This guards against
the unlikely case that the caller left garbage in those upper bits.</p>
  </li>
</ol>

<h3 id="clearing-xmm0">Clearing <code class="language-plaintext highlighter-rouge">xmm0</code></h3>

<p>The first interesting part: <code class="language-plaintext highlighter-rouge">xmm0</code> is cleared even when its first lane
is loaded with a value. There are two reasons to do this.</p>

<p>The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra <code class="language-plaintext highlighter-rouge">ret</code> or a conditional <code class="language-plaintext highlighter-rouge">jmp</code> over the “else” branch.</p>

<p>The less obvious reason is that it breaks a <em>data dependency</em>. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called <a href="https://en.wikipedia.org/wiki/Register_renaming">register renaming</a>. <em>Architectural
registers</em> (<code class="language-plaintext highlighter-rouge">rax</code>, <code class="language-plaintext highlighter-rouge">edi</code>, etc.) are just temporary names for
underlying <em>physical registers</em>. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.</p>

<p>For example, take this assembly sample. It assembles to 9 bytes of
machine code.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">ecx</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>This reads a 32-bit value from the address stored in <code class="language-plaintext highlighter-rouge">rcx</code>, then
assigns <code class="language-plaintext highlighter-rouge">ecx</code> and uses <code class="language-plaintext highlighter-rouge">cl</code> (the lowest byte of <code class="language-plaintext highlighter-rouge">rcx</code>) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of <code class="language-plaintext highlighter-rouge">rcx</code>, wiping the unused parts of the
register.</p>

<p>So after the second instruction, it’s guaranteed that the value in
<code class="language-plaintext highlighter-rouge">rcx</code> has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, <em>before</em> the load. Ingenious!</p>

<p>Compare it to this example, where the second instruction assigns to
<code class="language-plaintext highlighter-rouge">cl</code> instead of <code class="language-plaintext highlighter-rouge">ecx</code>. This assembles to just 6 bytes.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="nf">mov</span>  <span class="nb">edi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
    <span class="nf">mov</span>  <span class="nb">cl</span><span class="p">,</span> <span class="mi">7</span>
    <span class="nf">shl</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">cl</span>
</code></pre></div></div>

<p>The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of <code class="language-plaintext highlighter-rouge">rcx</code> still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.</p>

<p>By clearing <code class="language-plaintext highlighter-rouge">xmm0</code>, instructions in <code class="language-plaintext highlighter-rouge">get_value</code> involving <code class="language-plaintext highlighter-rouge">xmm0</code> have
the opportunity to be executed prior to instructions in the caller
that use <code class="language-plaintext highlighter-rouge">xmm0</code>.</p>

<h3 id="rip-relative-addressing">RIP-relative addressing</h3>

<p>Going back to the instruction that computes the address of <code class="language-plaintext highlighter-rouge">values</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
</code></pre></div></div>

<p>Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
<em>RIP-relative addressing</em> that’s still the case, but the register with
the absolute address is <code class="language-plaintext highlighter-rouge">rip</code>, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.</p>

<p>That means this instruction copies the instruction pointer (pointing
to the next instruction) into <code class="language-plaintext highlighter-rouge">rax</code>, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you <em>want</em> a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a <em>relocation entry</em> to the object file so that the linker knows how
to do this.</p>

<p>On platforms that <a href="/blog/2016/11/17/">use ELF</a> we can inspect these relocations with
<code class="language-plaintext highlighter-rouge">readelf</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
</code></pre></div></div>

<p>The relocation type is <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. In the <a href="http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_amd64.pdf">AMD64 Architecture
Processor Supplement</a>, this is defined as “S + A - P”.</p>

<ul>
  <li>
    <p>S: Represents the value of the symbol whose index resides in the
relocation entry.</p>
  </li>
  <li>
    <p>A: Represents the addend used to compute the value of the
relocatable field.</p>
  </li>
  <li>
    <p>P: Represents the place of the storage unit being relocated.</p>
  </li>
</ul>

<p>The symbol, S, is <code class="language-plaintext highlighter-rouge">.rodata</code> — the final address for this object file’s
portion of <code class="language-plaintext highlighter-rouge">.rodata</code> (where <code class="language-plaintext highlighter-rouge">values</code> resides). The addend, A, is <code class="language-plaintext highlighter-rouge">-4</code>
since the instruction pointer points at the <em>next</em> instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of the last four
bytes of the <code class="language-plaintext highlighter-rouge">lea</code> instruction. These values are all known at
link-time, so no run-time support is necessary.</p>

<p>Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.</p>

<h3 id="32-bit-relocation">32-bit relocation</h3>

<p>Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the <code class="language-plaintext highlighter-rouge">-m32</code> flag for a 32-bit target, and <code class="language-plaintext highlighter-rouge">-fomit-frame-pointer</code> to
make it simpler for explanatory purposes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 &lt;get_value&gt;:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 &lt;get_value+0x7&gt;
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f &lt;get_value+0x1f&gt;
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 &lt;__x86.get_pc_thunk.cx&gt;:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret
</code></pre></div></div>

<p>Hmm, this one includes an extra function.</p>

<ol>
  <li>
    <p>In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, <code class="language-plaintext highlighter-rouge">x</code>, into <code class="language-plaintext highlighter-rouge">eax</code>.</p>
  </li>
  <li>
<p>The <code class="language-plaintext highlighter-rouge">fldz</code> instruction clears the x87 floating point return
register, just like clearing <code class="language-plaintext highlighter-rouge">xmm0</code> in the x86-64 version.</p>
  </li>
  <li>
    <p>Next it calls <code class="language-plaintext highlighter-rouge">__x86.get_pc_thunk.cx</code>. The call pushes the
instruction pointer, <code class="language-plaintext highlighter-rouge">eip</code>, onto the stack. This function reads
that value off the stack into <code class="language-plaintext highlighter-rouge">ecx</code> and returns. In other words,
calling this function copies <code class="language-plaintext highlighter-rouge">eip</code> into <code class="language-plaintext highlighter-rouge">ecx</code>. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores — a name reserved exactly for
these sorts of implementation purposes.</p>
  </li>
  <li>
    <p>Next a 32-bit displacement is added to <code class="language-plaintext highlighter-rouge">ecx</code>. In this case it’s
<code class="language-plaintext highlighter-rouge">2</code>, but, like before, this is actually going to be filled in later by
the linker.</p>
  </li>
  <li>
    <p>Then it’s just like before: a branch to optionally load a value.
The floating point load (<code class="language-plaintext highlighter-rouge">fld</code>) is another relocation.</p>
  </li>
</ol>

<p>Let’s look at the relocations. There are three this time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata
</code></pre></div></div>

<p>The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is <code class="language-plaintext highlighter-rouge">R_386_PC32</code>: a code-relative
displacement just like on x86-64.</p>

<p>The next is of type <code class="language-plaintext highlighter-rouge">R_386_GOTPC</code> and sets the second operand in that
<code class="language-plaintext highlighter-rouge">add ecx</code>. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since <code class="language-plaintext highlighter-rouge">values</code> is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.</p>

<p>The final relocation is of type <code class="language-plaintext highlighter-rouge">R_386_GOTOFF</code>. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, <code class="language-plaintext highlighter-rouge">fld</code>. Ultimately the load adds these last
two relocations together, canceling the GOT:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
</code></pre></div></div>

<p>So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.</p>

<h3 id="branch-optimization">Branch optimization</h3>

<p>Notice in the x86 version the thunk is called before checking the
argument. What if <code class="language-plaintext highlighter-rouge">x</code> is most likely to be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.</p>

<p>The typical way to provide such a compiler hint is with a pair of
macros, <code class="language-plaintext highlighter-rouge">likely()</code> and <code class="language-plaintext highlighter-rouge">unlikely()</code>. With GCC and Clang, these would
be defined to use <code class="language-plaintext highlighter-rouge">__builtin_expect</code>. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)
</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">float</span> <span class="n">values</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">,</span> <span class="mi">1</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="p">};</span>

<span class="kt">float</span> <span class="nf">get_value</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">unlikely</span><span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">values</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for <a href="http://www.agner.org/optimize/microarchitecture.pdf">static branch
prediction</a>), but still always calls the thunk. It seems
compilers <a href="https://ewontfix.com/18/">have difficulty</a> with <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232">optimizing relocatable
code</a> on x86.</p>

<h3 id="x86-64-isnt-just-about-more-memory">x86-64 isn’t just about more memory</h3>

<p>It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Some Performance Advantages of Lexical Scope</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/22/"/>
    <id>urn:uuid:21bc4afa-caa8-37ed-a912-a35f35d0e432</id>
    <updated>2016-12-22T02:33:36Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="optimization"/><category term="compsci"/>
    <content type="html">
      <![CDATA[<p>I recently had a discussion with <a href="http://ergoemacs.org/">Xah Lee</a> about lexical scope in
Emacs Lisp. The topic was why <code class="language-plaintext highlighter-rouge">lexical-binding</code> exists at a file-level
when there was already <code class="language-plaintext highlighter-rouge">lexical-let</code> (from <code class="language-plaintext highlighter-rouge">cl-lib</code>), prompted by my
previous article on <a href="/blog/2016/12/11/">JIT byte-code compilation</a>. The specific
context is Emacs Lisp, but these concepts apply to language design in
general.</p>

<p>Until Emacs 24.1 (June 2012), Elisp only had dynamically scoped
variables — a feature, mostly by accident, common to old lisp
dialects. While dynamic scope has some selective uses, it’s widely
regarded as a mistake for local variables, and virtually no other
languages have adopted it.</p>

<p>Way back in 1993, Dave Gillespie’s deviously clever <code class="language-plaintext highlighter-rouge">lexical-let</code>
macro <a href="http://git.savannah.gnu.org/cgit/emacs.git/commit/?h=fcd73769&amp;id=fcd737693e8e320acd70f91ec8e0728563244805">was committed</a> to the <code class="language-plaintext highlighter-rouge">cl</code> package, providing a rudimentary
form of opt-in lexical scope. The macro walks its body replacing local
variable names with guaranteed-unique gensym names: the exact same
technique used in macros to create “hygienic” bindings that aren’t
visible to the macro body. It essentially “fakes” lexical scope within
Elisp’s dynamic scope by preventing variable name collisions.</p>

<p>For example, here’s one of the consequences of dynamic scope.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">inner</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">setq</span> <span class="nv">v</span> <span class="ss">:inner</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">outer</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">v</span> <span class="ss">:outer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">inner</span><span class="p">)</span>
    <span class="nv">v</span><span class="p">))</span>

<span class="p">(</span><span class="nv">outer</span><span class="p">)</span>
<span class="c1">;; =&gt; :inner</span>
</code></pre></div></div>

<p>The “local” variable <code class="language-plaintext highlighter-rouge">v</code> in <code class="language-plaintext highlighter-rouge">outer</code> is visible to its callee, <code class="language-plaintext highlighter-rouge">inner</code>,
which can access and manipulate it. The meaning of the <em>free variable</em>
<code class="language-plaintext highlighter-rouge">v</code> in <code class="language-plaintext highlighter-rouge">inner</code> depends entirely on the run-time call stack. It might
be a global variable, or it might be a local variable for a caller,
direct or indirect.</p>

<p>Using <code class="language-plaintext highlighter-rouge">lexical-let</code> deconflicts these names, giving the effect of
lexical scope.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defvar</span> <span class="nv">v</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">lexical-outer</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">lexical-let</span> <span class="p">((</span><span class="nv">v</span> <span class="ss">:outer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">inner</span><span class="p">)</span>
    <span class="nv">v</span><span class="p">))</span>

<span class="p">(</span><span class="nv">lexical-outer</span><span class="p">)</span>
<span class="c1">;; =&gt; :outer</span>
</code></pre></div></div>

<p>But there’s more to lexical scope than this. Closures only make sense
in the context of lexical scope, and the most useful feature of
<code class="language-plaintext highlighter-rouge">lexical-let</code> is that lambda expressions evaluate to closures. The
macro implements this using a technique called <a href="https://en.wikipedia.org/wiki/Lambda_lifting"><em>closure
conversion</em></a>. Additional parameters are added to the original
lambda function, one for each lexical variable (and not just each
closed-over variable), and the whole thing is wrapped in <em>another</em>
lambda function that invokes the original lambda function with the
additional parameters filled with the closed-over variables — yes, the
variables (i.e. the symbols) themselves, <em>not</em> just their values
(i.e. pass-by-reference). The last point means different closures can
properly close over the same variables, and they can bind new values.</p>

<p>To roughly illustrate how this works, the first lambda expression
below, which closes over the lexical variables <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>, would be
converted into the latter by <code class="language-plaintext highlighter-rouge">lexical-let</code>. The <code class="language-plaintext highlighter-rouge">#:</code> is Elisp’s syntax
for uninterned variables. So <code class="language-plaintext highlighter-rouge">#:x</code> is <em>a</em> symbol <code class="language-plaintext highlighter-rouge">x</code>, but not <em>the</em>
symbol <code class="language-plaintext highlighter-rouge">x</code> (see <code class="language-plaintext highlighter-rouge">print-gensym</code>).</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; Before conversion:</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">()</span>
  <span class="p">(</span><span class="nb">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">))</span>

<span class="c1">;; After conversion:</span>
<span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="k">&amp;rest</span> <span class="nv">args</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">apply</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span> <span class="nv">y</span><span class="p">)</span>
           <span class="p">(</span><span class="nb">+</span> <span class="p">(</span><span class="nb">symbol-value</span> <span class="nv">x</span><span class="p">)</span>
              <span class="p">(</span><span class="nb">symbol-value</span> <span class="nv">y</span><span class="p">)))</span>
         <span class="o">'</span><span class="ss">#:x</span> <span class="o">'</span><span class="ss">#:y</span> <span class="nv">args</span><span class="p">))</span>
</code></pre></div></div>

<p>I’ve said on multiple occasions that <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> has
significant advantages, both in performance and static analysis, and
so it should be used for all future Elisp code. The only reason it’s
not the default is because it breaks some old (badly written) code.
However, <strong><code class="language-plaintext highlighter-rouge">lexical-let</code> doesn’t realize any of these advantages</strong>! In
fact, it has worse performance than straightforward dynamic scope with
<code class="language-plaintext highlighter-rouge">let</code>.</p>

<ol>
  <li>
    <p>New symbol objects are allocated and initialized (<code class="language-plaintext highlighter-rouge">make-symbol</code>) on
each run-time evaluation, one per lexical variable.</p>
  </li>
  <li>
    <p>Since it’s just faking it, <code class="language-plaintext highlighter-rouge">lexical-let</code> still uses dynamic
bindings, which are more expensive than lexical bindings. It varies
depending on the C compiler that built Emacs, but dynamic variable
accesses (opcode <code class="language-plaintext highlighter-rouge">varref</code>) take around 30% longer than lexical
variable accesses (opcode <code class="language-plaintext highlighter-rouge">stack-ref</code>). Assignment is far worse,
where dynamic variable assignment (<code class="language-plaintext highlighter-rouge">varset</code>) takes 650% longer than
lexical variable assignment (<code class="language-plaintext highlighter-rouge">stack-set</code>). How I measured all this
is a topic for another article.</p>
  </li>
  <li>
    <p>The “lexical” variables are accessed using <code class="language-plaintext highlighter-rouge">symbol-value</code>, a full
function call, so they’re even slower than normal dynamic
variables.</p>
  </li>
  <li>
    <p>Because converted lambda expressions are constructed dynamically at
run-time within the body of <code class="language-plaintext highlighter-rouge">lexical-let</code>, the resulting closure is
only partially byte-compiled even if the code as a whole has been
byte-compiled. In contrast, <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> closures are fully
compiled. How this works is worth <a href="/blog/2017/12/14/">its own article</a>.</p>
  </li>
  <li>
    <p>Converted lambda expressions include the additional internal
function invocation, making them slower.</p>
  </li>
</ol>

<p>While <code class="language-plaintext highlighter-rouge">lexical-let</code> is clever, and was occasionally useful prior to
Emacs 24, it comes at a hefty performance cost when evaluated
frequently. There’s no reason to use it anymore.</p>
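
<p>To see the overhead firsthand, here’s a rough sketch of a comparison
(absolute numbers vary by machine and Emacs build, and
<code class="language-plaintext highlighter-rouge">lexical-let</code> requires the deprecated <code class="language-plaintext highlighter-rouge">cl</code> library, so this only
runs on an Emacs old enough to still ship it):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(require 'cl)  ; provides lexical-let

;; Same work in both functions: one plain dynamic `let',
;; one `lexical-let'.
(defun dynamic-sum ()
  (let ((acc 0))
    (dotimes (i 1000 acc)
      (setq acc (+ acc i)))))

(defun lexical-let-sum ()
  (lexical-let ((acc 0))
    (dotimes (i 1000 acc)
      (setq acc (+ acc i)))))

(byte-compile 'dynamic-sum)
(byte-compile 'lexical-let-sum)

;; Compare the elapsed times; the lexical-let version loses.
(benchmark-run 1000 (dynamic-sum))
(benchmark-run 1000 (lexical-let-sum))
</code></pre></div></div>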

<h3 id="constraints-on-code-generation">Constraints on code generation</h3>

<p>Another reason to be wary of dynamic scope is that it puts needless
constraints on the compiler, preventing a number of important
optimization opportunities. For example, consider the following
function, <code class="language-plaintext highlighter-rouge">bar</code>:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">bar</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">x</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">(</span><span class="nv">y</span> <span class="mi">2</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">foo</span><span class="p">)</span>
    <span class="p">(</span><span class="nb">+</span> <span class="nv">x</span> <span class="nv">y</span><span class="p">)))</span>
</code></pre></div></div>

<p>Byte-compile this function under dynamic scope (<code class="language-plaintext highlighter-rouge">lexical-binding:
nil</code>) and <a href="/blog/2014/01/04/">disassemble it</a> to see what it looks like.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">byte-compile</span> <span class="nf">#'</span><span class="nv">bar</span><span class="p">)</span>
<span class="p">(</span><span class="nb">disassemble</span> <span class="nf">#'</span><span class="nv">bar</span><span class="p">)</span>
</code></pre></div></div>

<p>That pops up a buffer with the disassembly listing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  1
1       constant  2
2       varbind   y
3       varbind   x
4       constant  foo
5       call      0
6       discard
7       varref    x
8       varref    y
9       plus
10      unbind    2
11      return
</code></pre></div></div>

<p>It’s 12 instructions, 5 of which deal with dynamic bindings. The
byte-compiler doesn’t always produce optimal byte-code, but this just
so happens to be <em>nearly</em> optimal byte-code. The <code class="language-plaintext highlighter-rouge">discard</code> (a very
fast instruction) isn’t necessary, but otherwise no more compiler
smarts can improve on this. Since the variables <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are
visible to <code class="language-plaintext highlighter-rouge">foo</code>, they must be bound before the call and <a href="/blog/2016/07/25/">loaded after
the call</a>. While generally this function will return 3, the
compiler cannot assume so since it ultimately depends on the behavior
of <code class="language-plaintext highlighter-rouge">foo</code>. Its hands are tied.</p>

<p>Compare this to the lexical scope version (<code class="language-plaintext highlighter-rouge">lexical-binding: t</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  1
1       constant  2
2       constant  foo
3       call      0
4       discard
5       stack-ref 1
6       stack-ref 1
7       plus
8       return
</code></pre></div></div>

<p>It’s only 9 instructions, none of which are expensive dynamic variable
instructions. And this isn’t even close to the optimal byte-code. In
fact, as of Emacs 25.1 the byte-compiler often doesn’t produce the
optimal byte-code for lexical scope code and still needs some work.
<strong>Despite not firing on all cylinders, lexical scope still manages to
beat dynamic scope in performance benchmarks.</strong></p>

<p>Here’s the optimal byte-code, should the byte-compiler become smarter
someday:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  foo
1       call      0
2       constant  3
3       return
</code></pre></div></div>

<p>It’s down to 4 instructions due to computing the math operation at
compile time. Emacs’ byte-compiler only has rudimentary constant
folding, so it doesn’t notice that <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are constants and
misses this optimization. I speculate this is due to its roots
compiling under dynamic scope. Since <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are no longer exposed
to <code class="language-plaintext highlighter-rouge">foo</code>, the compiler has the opportunity to optimize them out of
existence. I haven’t measured it, but I would expect this to be
significantly faster than the dynamic scope version of this function.</p>

<h3 id="optional-dynamic-scope">Optional dynamic scope</h3>

<p>You might be thinking, “What if I really <em>do</em> want <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> to be
dynamically bound for <code class="language-plaintext highlighter-rouge">foo</code>?” This is often useful. Many of Emacs’ own
functions are designed to have certain variables dynamically bound
around them. For example, the print family of functions use the global
variable <code class="language-plaintext highlighter-rouge">standard-output</code> to determine where to send output by
default.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">standard-output</span> <span class="p">(</span><span class="nv">current-buffer</span><span class="p">)))</span>
  <span class="p">(</span><span class="nb">princ</span> <span class="s">"value = "</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">prin1</span> <span class="nv">value</span><span class="p">))</span>
</code></pre></div></div>

<p>Have no fear: <strong>With <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> you can have your cake and
eat it too.</strong> Variables declared with <code class="language-plaintext highlighter-rouge">defvar</code>, <code class="language-plaintext highlighter-rouge">defconst</code>, or
<code class="language-plaintext highlighter-rouge">defvaralias</code> are marked as “special” with an internal bit flag
(<code class="language-plaintext highlighter-rouge">declared_special</code> in C). When the compiler detects one of these
variables (<code class="language-plaintext highlighter-rouge">special-variable-p</code>), it uses a classical dynamic binding.</p>

<p>Declaring both <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> as special restores the original semantics,
reverting <code class="language-plaintext highlighter-rouge">bar</code> back to its old byte-code definition (next time it’s
compiled, that is). But it would be poor form to mark <code class="language-plaintext highlighter-rouge">x</code> or <code class="language-plaintext highlighter-rouge">y</code> as
special: You’d de-optimize all code (compiled <em>after</em> the declaration)
anywhere in Emacs that uses these names. As a package author, only do
this with the namespace-prefixed variables that belong to you.</p>

<p>The only way to unmark a special variable is with the undocumented
function <code class="language-plaintext highlighter-rouge">internal-make-var-non-special</code>. I expected <code class="language-plaintext highlighter-rouge">makunbound</code> to
do this, but as of Emacs 25.1 it does not. This could possibly be
considered a bug.</p>
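
<p>Here’s the special flag in action. The variable name below is
illustrative, and <code class="language-plaintext highlighter-rouge">internal-make-var-non-special</code> is, again,
undocumented and may change between Emacs versions:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; defvar marks the symbol as special:
(defvar my-pkg-limit 10)
(special-variable-p 'my-pkg-limit)
;; =&gt; t

;; The undocumented escape hatch clears the flag:
(internal-make-var-non-special 'my-pkg-limit)
(special-variable-p 'my-pkg-limit)
;; =&gt; nil
</code></pre></div></div>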

<h3 id="accidental-closures">Accidental closures</h3>

<p>I’ve said there are absolutely no advantages to <code class="language-plaintext highlighter-rouge">lexical-binding: nil</code>.
It’s only the default for the sake of backwards-compatibility. However,
there <em>is</em> one case where <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> introduces a subtle issue
that would otherwise not exist. Take this code for example (and
never mind <code class="language-plaintext highlighter-rouge">prin1-to-string</code> for a moment):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; -*- lexical-binding: t; -*-</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">function-as-string</span> <span class="p">()</span>
  <span class="p">(</span><span class="nv">with-temp-buffer</span>
    <span class="p">(</span><span class="nb">prin1</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">()</span> <span class="ss">:example</span><span class="p">)</span> <span class="p">(</span><span class="nv">current-buffer</span><span class="p">))</span>
    <span class="p">(</span><span class="nv">buffer-string</span><span class="p">)))</span>
</code></pre></div></div>

<p>This creates and serializes a closure, which is <a href="/blog/2013/12/30/">one of Elisp’s unique
features</a>. It doesn’t close over any variables, so it should be
pretty simple. However, this function will only work correctly under
<code class="language-plaintext highlighter-rouge">lexical-binding: t</code> when byte-compiled.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">function-as-string</span><span class="p">)</span>
<span class="c1">;; =&gt; "(closure ((temp-buffer . #&lt;buffer  *temp*&gt;) t) nil :example)"</span>
</code></pre></div></div>

<p>The interpreter doesn’t analyze the closure, so it just closes over
everything. This includes the hidden variable <code class="language-plaintext highlighter-rouge">temp-buffer</code> created by
the <code class="language-plaintext highlighter-rouge">with-temp-buffer</code> macro, resulting in an abstraction leak.
Buffers aren’t readable, so this will signal an error if an attempt is
made to read this function back into an s-expression. The
byte-compiler fixes this by noticing <code class="language-plaintext highlighter-rouge">temp-buffer</code> isn’t actually
closed over and so doesn’t include it in the closure, making it work
correctly.</p>

<p>Under <code class="language-plaintext highlighter-rouge">lexical-binding: nil</code> it works correctly either way:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">function-as-string</span><span class="p">)</span>
<span class="c1">;; -&gt; "(lambda nil :example)"</span>
</code></pre></div></div>

<p>This may seem contrived — it’s certainly unlikely — but <a href="https://github.com/jwiegley/emacs-async/issues/17">it has come
up in practice</a>. Still, it’s no reason to avoid <code class="language-plaintext highlighter-rouge">lexical-binding: t</code>.</p>

<h3 id="use-lexical-scope-in-all-new-code">Use lexical scope in all new code</h3>

<p>As I’ve said again and again, always use <code class="language-plaintext highlighter-rouge">lexical-binding: t</code>. Use
dynamic variables judiciously. And <code class="language-plaintext highlighter-rouge">lexical-let</code> is no replacement. It
has virtually none of the benefits, performs <em>worse</em>, and it only
applies to <code class="language-plaintext highlighter-rouge">let</code>, not any of the other places bindings are created:
function parameters, <code class="language-plaintext highlighter-rouge">dotimes</code>, <code class="language-plaintext highlighter-rouge">dolist</code>, and <code class="language-plaintext highlighter-rouge">condition-case</code>.</p>
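
<p>For instance, under <code class="language-plaintext highlighter-rouge">lexical-binding: t</code> function parameters
themselves are lexical, so a plain <code class="language-plaintext highlighter-rouge">defun</code> can return a closure over
its argument, something <code class="language-plaintext highlighter-rouge">lexical-let</code> never covered:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; -*- lexical-binding: t; -*-

;; The lambda closes over the parameter n.
(defun make-adder (n)
  (lambda (x) (+ x n)))

(funcall (make-adder 2) 3)
;; =&gt; 5
</code></pre></div></div>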

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Faster Elfeed Search Through JIT Byte-code Compilation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/12/11/"/>
    <id>urn:uuid:47002cc3-816a-3cb8-b462-327364e3f943</id>
    <updated>2016-12-11T23:16:42Z</updated>
    <category term="emacs"/><category term="elfeed"/><category term="optimization"/><category term="elisp"/>
    <content type="html">
      <![CDATA[<p>Today I pushed an update for <a href="https://github.com/skeeto/elfeed">Elfeed</a> that doubles the speed
of the search filter in the worst case. This is the user-entered
expression that dynamically narrows the entry listing to a subset that
meets certain criteria: published after a particular date,
with/without particular tags, and matching/non-matching zero or more
regular expressions. The filter is live, applied to the database as
the expression is edited, so it’s important for usability that this
search completes under a threshold that the user might notice.</p>

<p><img src="/img/elfeed/filter.gif" alt="" /></p>

<p>The typical workaround for these kinds of interfaces is to make
filtering/searching asynchronous. It’s possible to do this well, but
it’s usually a terrible, broken design. If the user acts upon the
asynchronous results — say, by typing the query and hitting enter to
choose the current or expected top result — then the final behavior is
non-deterministic, a race between the user’s typing speed and the
asynchronous search. Elfeed will keep its synchronous live search.</p>

<p>For anyone not familiar with Elfeed, here’s a filter that finds all
entries from within the past year tagged “youtube” (<code class="language-plaintext highlighter-rouge">+youtube</code>) that
mention Linux or Linus (<code class="language-plaintext highlighter-rouge">linu[xs]</code>), but aren’t tagged “bsd” (<code class="language-plaintext highlighter-rouge">-bsd</code>),
limited to the most recent 15 entries (<code class="language-plaintext highlighter-rouge">#15</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@1-year-old +youtube linu[xs] -bsd #15
</code></pre></div></div>

<p>The database is primarily indexed over publication date, so filters on
publication dates are the most efficient filters. Entries are visited
in order starting with the most recently published, and the search can
bail out early once it crosses the filter threshold. Time-oriented
filters have been encouraged as the solution to keep the live search
feeling lively.</p>
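
<p>The bailout itself is simple. Roughly, in illustrative terms (these
accessor names are made up for the sketch, not Elfeed’s real API):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(require 'cl-lib)

;; Entries are visited newest-first, so the first entry older than
;; the cut-off ends the scan: nothing after it can match.
(defun my-filter-since (entries after-time)
  (cl-loop for entry in entries
           until (&lt; (my-entry-date entry) after-time)
           collect entry))
</code></pre></div></div>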

<h3 id="filtering-overview">Filtering Overview</h3>

<p>The first step in filtering is parsing the filter text entered by the
user. This string is broken into its components using the
<code class="language-plaintext highlighter-rouge">elfeed-search-parse-filter</code> function. Date filter components are
converted into a unix epoch interval, tags are interned into symbols,
regular expressions are gathered up as strings, and the entry limit is
parsed into a plain integer. Absence of a filter component is
indicated by nil.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">elfeed-search-parse-filter</span> <span class="s">"@1-year-old +youtube linu[xs] -bsd #15"</span><span class="p">)</span>
<span class="c1">;; =&gt; (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)</span>
</code></pre></div></div>

<p>Previously, the next step was to apply the <code class="language-plaintext highlighter-rouge">elfeed-search-filter</code>
function with this structured filter representation to the database.
Except for special early-bailout situations, it works left-to-right
across the filter, checking each condition against each entry. This is
analogous to an interpreter, with the filter being a program.</p>

<p>Thinking about it that way, what if the filter was instead compiled
into an Emacs byte-code function and executed directly by the Emacs
virtual machine? That’s what this latest update does.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>With six different filter components, the actual filtering routine is
a bit too complicated for an article, so I’ll set up a simpler, but
roughly equivalent, scenario. With a reasonable cut-off date, the
filter was already sufficiently fast, so for benchmarking I’ll focus
on the worst case: no early bailout opportunities. An entry will be
just a list of tags (symbols), and the filter will have to test every
entry.</p>

<p>My <a href="/blog/2016/08/12/">real-world Elfeed database</a> currently has 46,772 entries with
36 distinct tags. For my benchmark I’ll round this up to a nice
100,000 entries, and use 26 distinct tags (A–Z), which maps neatly
onto the alphabet and more closely reflects the number of tags I still
care about.</p>

<p>First, here’s <code class="language-plaintext highlighter-rouge">make-random-entry</code> to generate a random list of 1–5
tags (i.e. an entry). The <code class="language-plaintext highlighter-rouge">state</code> parameter is the random state,
allowing for deterministic benchmarks on a randomly-generated
database.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">make-random-entry</span> <span class="p">(</span><span class="k">&amp;key</span> <span class="nv">state</span> <span class="p">(</span><span class="nb">min</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="nb">max</span> <span class="mi">5</span><span class="p">))</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="p">(</span><span class="nb">+</span> <span class="nb">min</span> <span class="p">(</span><span class="nv">cl-random</span> <span class="p">(</span><span class="nb">1+</span> <span class="p">(</span><span class="nb">-</span> <span class="nb">max</span> <span class="nb">min</span><span class="p">))</span> <span class="nv">state</span><span class="p">))</span>
           <span class="nv">for</span> <span class="nv">letter</span> <span class="nb">=</span> <span class="p">(</span><span class="nb">+</span> <span class="nv">?A</span> <span class="p">(</span><span class="nv">cl-random</span> <span class="mi">26</span> <span class="nv">state</span><span class="p">))</span>
           <span class="nv">collect</span> <span class="p">(</span><span class="nb">intern</span> <span class="p">(</span><span class="nb">format</span> <span class="s">"%c"</span> <span class="nv">letter</span><span class="p">))))</span>
</code></pre></div></div>

<p>The database is just a big list of entries. In Elfeed this is actually
an AVL tree. Without dates, the order doesn’t matter.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">make-random-database</span> <span class="p">(</span><span class="k">&amp;key</span> <span class="nv">state</span> <span class="p">(</span><span class="nb">count</span> <span class="mi">100000</span><span class="p">))</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="nb">count</span> <span class="nv">collect</span> <span class="p">(</span><span class="nv">make-random-entry</span> <span class="ss">:state</span> <span class="nv">state</span><span class="p">)))</span>
</code></pre></div></div>

<p>Here’s <a href="/blog/2009/05/28/">my old time macro</a>. An important change since the
original is to call <code class="language-plaintext highlighter-rouge">garbage-collect</code> before starting the clock,
eliminating bad samples from unlucky garbage collection events.
Depending on what you want to measure, it may even be worth disabling
garbage collection during the measurement by setting
<code class="language-plaintext highlighter-rouge">gc-cons-threshold</code> to a high value.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defmacro measure-time (&amp;rest body)
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(progn
       (garbage-collect)  ; run before each timing, not at expansion time
       (let ((,start (float-time)))
         ,@body
         (- (float-time) ,start)))))
</code></pre></div></div>
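
<p>When suppressing garbage collection entirely is the right call, the
macro above can be wrapped like so (a sketch; the <code class="language-plaintext highlighter-rouge">let</code> binding
restores the old threshold when it unwinds):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defmacro measure-time-no-gc (&amp;rest body)
  (declare (indent defun))
  ;; Raising the allocation threshold effectively disables GC
  ;; for the duration of the measurement.
  `(let ((gc-cons-threshold most-positive-fixnum))
     (measure-time ,@body)))
</code></pre></div></div>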

<p>Finally, the benchmark harness. It uses a hard-coded seed to generate
the same pseudo-random database. The test runs a filter
function, <code class="language-plaintext highlighter-rouge">f</code>, 100 times searching for the same 6 tags, and the timing
results are averaged.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">cl-defun</span> <span class="nv">benchmark</span> <span class="p">(</span><span class="nv">f</span> <span class="k">&amp;optional</span> <span class="p">(</span><span class="nv">n</span> <span class="mi">100</span><span class="p">)</span> <span class="p">(</span><span class="nv">tags</span> <span class="o">'</span><span class="p">(</span><span class="nv">A</span> <span class="nv">B</span> <span class="nv">C</span> <span class="nv">D</span> <span class="nv">E</span> <span class="nv">F</span><span class="p">)))</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">state</span> <span class="p">(</span><span class="nv">copy-sequence</span> <span class="nv">[cl-random-state-tag</span> <span class="mi">-1</span> <span class="mi">30</span> <span class="nv">267466518]</span><span class="p">))</span>
         <span class="p">(</span><span class="nv">db</span> <span class="p">(</span><span class="nv">make-random-database</span> <span class="ss">:state</span> <span class="nv">state</span><span class="p">)))</span>
    <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">repeat</span> <span class="nv">n</span>
             <span class="nv">sum</span> <span class="p">(</span><span class="nv">measure-time</span>
                   <span class="p">(</span><span class="nb">funcall</span> <span class="nv">f</span> <span class="nv">db</span> <span class="nv">tags</span><span class="p">))</span>
             <span class="nv">into</span> <span class="nv">total</span>
             <span class="nv">finally</span> <span class="nb">return</span> <span class="p">(</span><span class="nb">/</span> <span class="nv">total</span> <span class="p">(</span><span class="nb">float</span> <span class="nv">n</span><span class="p">)))))</span>
</code></pre></div></div>

<p>The baseline will be <code class="language-plaintext highlighter-rouge">memq</code> (test for membership using identity,
<code class="language-plaintext highlighter-rouge">eq</code>). There are two lists of tags to compare: the list that is the
entry, and the list from the filter. This requires a nested loop for
each entry, one explicit (<code class="language-plaintext highlighter-rouge">cl-loop</code>) and one implicit (<code class="language-plaintext highlighter-rouge">memq</code>), both
with early bailout.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">memq-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">memq</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>After byte-compiling everything and running the benchmark on my
laptop, I get:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.041 seconds</span>
</code></pre></div></div>

<p>That’s actually not too bad. One of the advantages of this definition
is that there are no function calls. The <code class="language-plaintext highlighter-rouge">memq</code> built-in function has
its own opcode (62), and the rest of the definition is special forms
and macros expanding to special forms (<code class="language-plaintext highlighter-rouge">cl-loop</code>). It’s exactly the
thing I need to exploit to make filters faster.</p>

<p>As a sanity check, what would happen if I used <code class="language-plaintext highlighter-rouge">member</code> instead of
<code class="language-plaintext highlighter-rouge">memq</code>? In theory it should be slower because it uses <code class="language-plaintext highlighter-rouge">equal</code> for
tests instead of <code class="language-plaintext highlighter-rouge">eq</code>.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">member-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nb">member</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>It’s only slightly slower because <code class="language-plaintext highlighter-rouge">member</code>, <a href="/blog/2013/01/22/">like many other
built-ins</a>, also has an opcode (157). It’s just a tiny bit
more overhead.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">member-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.047 seconds</span>
</code></pre></div></div>

<p>To test function call overhead while still using the built-in (e.g.
written in C) <code class="language-plaintext highlighter-rouge">memq</code>, I’ll alias it so that the byte-code compiler is
forced to emit a function call.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">defalias</span> <span class="ss">'memq-alias</span> <span class="ss">'memq</span><span class="p">)</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">memq-alias-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">memq-alias</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>To verify that this is doing what I expect, I <code class="language-plaintext highlighter-rouge">M-x disassemble</code> the
function and inspect the byte-code disassembly. Here’s a simple
example.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">disassemble</span>
 <span class="p">(</span><span class="nv">byte-compile</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nb">list</span><span class="p">)</span> <span class="p">(</span><span class="nv">memq</span> <span class="ss">:foo</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>When compiled under lexical scope (<code class="language-plaintext highlighter-rouge">lexical-binding</code> is true), here’s
the disassembly. To understand what this means, see <a href="/blog/2014/01/04/"><em>Emacs Byte-code
Internals</em></a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  :foo
1       stack-ref 1
2       memq
3       return
</code></pre></div></div>

<p>Notice the <code class="language-plaintext highlighter-rouge">memq</code> instruction. Try using <code class="language-plaintext highlighter-rouge">memq-alias</code> instead:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">disassemble</span>
 <span class="p">(</span><span class="nv">byte-compile</span> <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nb">list</span><span class="p">)</span> <span class="p">(</span><span class="nv">memq-alias</span> <span class="ss">:foo</span> <span class="nb">list</span><span class="p">))))</span>
</code></pre></div></div>

<p>Resulting in a function call:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  memq-alias
1       constant  :foo
2       stack-ref 2
3       call      2
4       return
</code></pre></div></div>

<p>And the benchmark:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-alias-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.052 seconds</span>
</code></pre></div></div>

<p>So the function call adds about 27% overhead. This means it would be a
good idea to <strong>avoid calling functions in the filter</strong> if I can help
it. I should rely on these special opcodes.</p>

<p>Suppose <code class="language-plaintext highlighter-rouge">memq</code> was written in Emacs Lisp rather than C. How much would
that hurt performance? My version of <code class="language-plaintext highlighter-rouge">my-memq</code> below isn’t quite the
same since it returns t rather than the sublist, but it’s good enough
for this purpose. (I’m using <code class="language-plaintext highlighter-rouge">cl-loop</code> because writing early bailout
in plain Elisp without recursion is, in my opinion, ugly.)</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">my-memq</span> <span class="p">(</span><span class="nv">needle</span> <span class="nv">haystack</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">element</span> <span class="nv">in</span> <span class="nv">haystack</span>
           <span class="nb">when</span> <span class="p">(</span><span class="nb">eq</span> <span class="nv">needle</span> <span class="nv">element</span><span class="p">)</span>
           <span class="nb">return</span> <span class="no">t</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">my-memq-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span> <span class="nb">count</span>
           <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                    <span class="nb">when</span> <span class="p">(</span><span class="nv">my-memq</span> <span class="nv">tag</span> <span class="nv">entry</span><span class="p">)</span>
                    <span class="nb">return</span> <span class="no">t</span><span class="p">)))</span>
</code></pre></div></div>

<p>And the benchmark:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">my-memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.137 seconds</span>
</code></pre></div></div>

<p>Oof! It’s more than 3 times slower than the opcode. This means <strong>I
should use built-ins as much as possible</strong> in the filter.</p>

<h3 id="dynamic-vs-lexical-scope">Dynamic vs. lexical scope</h3>

<p>There’s one last thing to watch out for. Everything so far has been
compiled with lexical scope. You should really turn this on by default
for all new code that you write. It has three important advantages:</p>

<ol>
  <li>It allows the compiler to catch more mistakes.</li>
  <li>It eliminates a class of bugs caused by dynamic scope, where local
variables are exposed to manipulation by callees.</li>
  <li><a href="/blog/2016/12/22/">Lexical scope has better performance</a>.</li>
</ol>

<p>Here are all the benchmarks with the default dynamic scope:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.065 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">member-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.070 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">memq-alias-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.074 seconds</span>

<span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">my-memq-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.256 seconds</span>
</code></pre></div></div>

<p>It costs half again to twice as much time across these benchmarks, for
no benefit. Under
dynamic scope, local variables use the <code class="language-plaintext highlighter-rouge">varref</code> opcode — a global
variable lookup — instead of the <code class="language-plaintext highlighter-rouge">stack-ref</code> opcode — a simple array
index.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">norm</span> <span class="p">(</span><span class="nv">a</span> <span class="nv">b</span><span class="p">)</span>
  <span class="p">(</span><span class="nb">*</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)</span> <span class="p">(</span><span class="nb">-</span> <span class="nv">a</span> <span class="nv">b</span><span class="p">)))</span>
</code></pre></div></div>

<p>Under dynamic scope, this compiles to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       varref    a
1       varref    b
2       diff
3       varref    a
4       varref    b
5       diff
6       mult
7       return
</code></pre></div></div>

<p>And under lexical scope (notice the variable names disappear):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       stack-ref 1
1       stack-ref 1
2       diff
3       stack-ref 2
4       stack-ref 2
5       diff
6       mult
7       return
</code></pre></div></div>

<h3 id="jit-compiled-filters">JIT-compiled filters</h3>

<p>So far I’ve been moving in the wrong direction, making things slower
rather than faster. How can I make it faster than the straight <code class="language-plaintext highlighter-rouge">memq</code>
version? By compiling the filter into byte-code.</p>

<p>I won’t write the byte-code directly, but instead generate Elisp code
and use the byte-code compiler on it. This is safer, will work
correctly in future versions of Emacs, and leverages the optimizations
performed by the byte-compiler. This sort of thing recently <a href="http://emacshorrors.com/posts/when-data-becomes-code.html">got a bad
rap on Emacs Horrors</a>, but I was happy to see that this
technique is already established.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">jit-count</span> <span class="p">(</span><span class="nv">db</span> <span class="nv">tags</span><span class="p">)</span>
  <span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">memq-list</span> <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">tag</span> <span class="nv">in</span> <span class="nv">tags</span>
                             <span class="nv">collect</span> <span class="o">`</span><span class="p">(</span><span class="nv">memq</span> <span class="ss">',tag</span> <span class="nv">entry</span><span class="p">)))</span>
         <span class="p">(</span><span class="k">function</span> <span class="o">`</span><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">db</span><span class="p">)</span>
                      <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span>
                               <span class="nb">count</span> <span class="p">(</span><span class="nb">or</span> <span class="o">,@</span><span class="nv">memq-list</span><span class="p">))))</span>
         <span class="p">(</span><span class="nv">compiled</span> <span class="p">(</span><span class="nv">byte-compile</span> <span class="k">function</span><span class="p">)))</span>
    <span class="p">(</span><span class="nb">funcall</span> <span class="nv">compiled</span> <span class="nv">db</span><span class="p">)))</span>
</code></pre></div></div>

<p>It dynamically builds the code as an s-expression, runs that through
the byte-code compiler, executes it, and throws it away. It’s
“just-in-time,” though compiling to byte-code and not <a href="/blog/2015/03/19/">native
code</a>. For the benchmark tags of <code class="language-plaintext highlighter-rouge">(A B C D E F)</code>, this builds
the following:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">db</span><span class="p">)</span>
  <span class="p">(</span><span class="nv">cl-loop</span> <span class="nv">for</span> <span class="nv">entry</span> <span class="nv">in</span> <span class="nv">db</span>
           <span class="nb">count</span> <span class="p">(</span><span class="nb">or</span> <span class="p">(</span><span class="nv">memq</span> <span class="ss">'A</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'B</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'C</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'D</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'E</span> <span class="nv">entry</span><span class="p">)</span>
                     <span class="p">(</span><span class="nv">memq</span> <span class="ss">'F</span> <span class="nv">entry</span><span class="p">))))</span>
</code></pre></div></div>

<p>Due to its short-circuiting behavior, <code class="language-plaintext highlighter-rouge">or</code> is a special form, so this
function is just special forms and <code class="language-plaintext highlighter-rouge">memq</code> in its opcode form. It’s as
fast as Elisp can get.</p>

<p>Having s-expressions is a real strength for lisp, since the
alternative (in, say, JavaScript) would be to assemble the function by
concatenating code strings. By contrast, this looks a lot like a
regular lisp macro. Invoking the byte-code compiler does add some
overhead compared to the interpreted filter, but it’s insignificant.</p>

<p>How much faster is this?</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">benchmark</span> <span class="nf">#'</span><span class="nv">jit-count</span><span class="p">)</span>
<span class="c1">;; =&gt; 0.017 seconds</span>
</code></pre></div></div>

<p><strong>It’s more than twice as fast!</strong> The big gain here is through <em>loop
unrolling</em>. The outer loop has been unrolled into the <code class="language-plaintext highlighter-rouge">or</code> expression.
That section of byte-code looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       constant  A
1       stack-ref 1
2       memq
3       goto-if-not-nil-else-pop 1
6       constant  B
7       stack-ref 1
8       memq
9       goto-if-not-nil-else-pop 1
12      constant  C
13      stack-ref 1
14      memq
15      goto-if-not-nil-else-pop 1
18      constant  D
19      stack-ref 1
20      memq
21      goto-if-not-nil-else-pop 1
24      constant  E
25      stack-ref 1
26      memq
27      goto-if-not-nil-else-pop 1
30      constant  F
31      stack-ref 1
32      memq
33:1    return
</code></pre></div></div>

<p>In Elfeed, filter compilation not only unrolls these loops, it
completely eliminates the overhead of unused filter components. Compared
to this benchmark, I’m seeing roughly matching gains in Elfeed’s worst
case. In Elfeed, I also bind <code class="language-plaintext highlighter-rouge">lexical-binding</code> around the
<code class="language-plaintext highlighter-rouge">byte-compile</code> call to force lexical scope, since otherwise it just
uses the buffer-local value (usually nil).</p>
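<p>That binding looks roughly like this (a sketch, not Elfeed’s exact
code):</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; Sketch: force lexical scope for the generated function,
;; regardless of the buffer-local value of `lexical-binding'.
(let ((lexical-binding t))
  (byte-compile function))
</code></pre></div></div>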

<p>Filter compilation can be toggled on and off by setting
<code class="language-plaintext highlighter-rouge">elfeed-search-compile-filter</code>. If you’re up to date, try out live
filters with it both enabled and disabled. See if you can notice the
difference.</p>
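<p>For example, to disable filter compilation:</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(setq elfeed-search-compile-filter nil)
</code></pre></div></div>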

<h3 id="result-summary">Result summary</h3>

<p>Here are the results in a table, all run with Emacs 24.4 on x86-64.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(ms)      memq      member    memq-alias my-memq   jit
lexical   41        47        52         137       17
dynamic   65        70        74         256       21
</code></pre></div></div>

<p>And the same benchmarks on AArch64 (Emacs 24.5, ARM Cortex-A53), where
I also occasionally use Elfeed, and where I have been very interested
in improving performance.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(ms)      memq      member    memq-alias my-memq   jit
lexical   170       235       242        614       79
dynamic   274       340       345        1130      92
</code></pre></div></div>

<p>And here’s how you can run the benchmarks for yourself, perhaps with
different parameters:</p>

<ul>
  <li><a href="/download/jit-bench.el">jit-bench.el</a></li>
</ul>

<p>The header explains how to run the benchmark in batch mode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emacs -Q -batch -f batch-byte-compile jit-bench.el
$ emacs -Q -batch -l jit-bench.elc -f benchmark-batch
</code></pre></div></div>

]]>
    </content>
  </entry>
  <entry>
    <title>Baking Data with Serialization</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/11/15/"/>
    <id>urn:uuid:365d1301-72b9-39d1-8023-20fb83e046ab</id>
    <updated>2016-11-15T05:27:53Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Suppose you want to bake binary data directly into a program’s
executable. It could be image pixel data (PNG, BMP, JPEG), a text
file, or some sort of complex data structure. Perhaps the purpose is
to build a single executable with no extraneous data files — easier to
install and manage, though harder to modify. Or maybe you’re lazy and
don’t want to worry about handling the various complications and
errors that arise when reading external data: Where to find it, and
what to do if you can’t find it or can’t read it. This article is
about two different approaches I’ve used a number of times for C
programs.</p>

<h3 id="the-linker-approach">The linker approach</h3>

<p>The simpler, less portable option is to have the linker do it. Both
the GNU linker and the <a href="http://www.airs.com/blog/archives/38">gold linker</a> (ELF only) can create
object files from arbitrary files using the <code class="language-plaintext highlighter-rouge">--format</code> (<code class="language-plaintext highlighter-rouge">-b</code>) option
set to <code class="language-plaintext highlighter-rouge">binary</code> (raw data). It’s combined with <code class="language-plaintext highlighter-rouge">--relocatable</code> (<code class="language-plaintext highlighter-rouge">-r</code>)
to make it linkable with the rest of the program. MinGW supports all
of this, too, so it’s fairly portable so long as you stick to GNU
Binutils.</p>

<p>For example, to create an object file, <code class="language-plaintext highlighter-rouge">my_msg.o</code>, with the
contents of the text file <code class="language-plaintext highlighter-rouge">my_msg.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ld -r -b binary -o my_file.o my_msg.txt
</code></pre></div></div>

<p>(<em>Update</em>: <a href="/blog/2019/11/15/">You probably also want to use <code class="language-plaintext highlighter-rouge">-z noexecstack</code></a>.)</p>

<p>The object file will have three symbols, each named after the input
file. Unfortunately there’s no control over the symbol names, section
(.data), alignment, or protections (e.g. read-only). You’re completely
at the whim of the linker, short of objcopy tricks.</p>
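<p>One such objcopy trick is renaming the section, which also lets you
mark the data read-only. Something like this should work with GNU
Binutils, though I haven’t verified every flag combination a given target
might need:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objcopy --rename-section .data=.rodata,alloc,load,readonly,data,contents my_msg.o
</code></pre></div></div>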

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nm my_msg.o
000000000000000e D _binary_my_msg_txt_end
000000000000000e A _binary_my_msg_txt_size
0000000000000000 D _binary_my_msg_txt_start
</code></pre></div></div>

<p>To access these in C, declare them as global variables like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>The size symbol, <code class="language-plaintext highlighter-rouge">_binary_my_msg_txt_size</code>, is misleading. The “A”
from nm means it’s an absolute symbol, not relocated. It doesn’t refer
to an integer that holds the size of the raw data. The value of the
symbol itself is the size of the data. That is, take the address of it
and cast it to an integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">&amp;</span><span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>
</code></pre></div></div>

<p>Alternatively — and this is my own preference — just subtract the
other two symbols. It’s cleaner and easier to understand.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
</code></pre></div></div>

<p>Here’s the “Hello, world” for this approach (<code class="language-plaintext highlighter-rouge">hello.c</code>).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_end</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">char</span> <span class="n">_binary_my_msg_txt_size</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">_binary_my_msg_txt_end</span> <span class="o">-</span> <span class="n">_binary_my_msg_txt_start</span><span class="p">;</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">_binary_my_msg_txt_start</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The program has to use <code class="language-plaintext highlighter-rouge">fwrite()</code> rather than <code class="language-plaintext highlighter-rouge">fputs()</code> because the
data won’t necessarily be null-terminated. That is, unless a null is
intentionally put at the end of the text file itself.</p>
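<p>If a null-terminated string is really needed, the safe option is to
make a terminated copy at runtime. A minimal sketch:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Sketch: build a null-terminated copy of the embedded text. */
size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;
char *copy = malloc(size + 1);
memcpy(copy, _binary_my_msg_txt_start, size);
copy[size] = 0;
</code></pre></div></div>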

<p>And for the build:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat my_msg.txt
Hello, world!
$ ld -r -b binary -o my_msg.o my_msg.txt
$ gcc -o hello hello.c my_msg.o
$ ./hello
Hello, world!
</code></pre></div></div>

<p>If this was binary data, such as an image file, the program would
instead read the array as if it were a memory mapped file. In fact,
that’s what it really is: the raw data memory mapped by the loader
before the program started.</p>

<h4 id="how-about-a-data-structure-dump">How about a data structure dump?</h4>

<p>This could be taken further to dump out some kinds of data structures.
For example, this program (<code class="language-plaintext highlighter-rouge">table_gen.c</code>) fills out a table of the
first 90 Fibonacci numbers and dumps it to standard output.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="cp">#define TABLE_SIZE 90
</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[</span><span class="n">TABLE_SIZE</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">};</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">2</span><span class="p">];</span>
    <span class="n">fwrite</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">table</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">stdout</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Build and run this intermediate helper program as part of the overall
build.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.bin
$ ld -r -b binary -o table.o table.bin
</code></pre></div></div>

<p>And then the main program (<code class="language-plaintext highlighter-rouge">print_fib.c</code>) might look like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_start</span><span class="p">[];</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">_binary_table_bin_end</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">start</span> <span class="o">=</span> <span class="n">_binary_table_bin_start</span><span class="p">;</span>
    <span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">end</span>   <span class="o">=</span> <span class="n">_binary_table_bin_end</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">long</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="o">*</span><span class="n">x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, there are some good reasons not to use this feature in this
way:</p>

<ol>
  <li>
    <p>The format of <code class="language-plaintext highlighter-rouge">table.bin</code> is specific to the host architecture
(byte order, size, padding, etc.). If the host is the same as the
target then this isn’t a problem, but it will prohibit
cross-compilation.</p>
  </li>
  <li>
    <p>The linker has no information about the alignment requirements of
the data. To the linker it’s just a byte buffer. In the final
program the <code class="language-plaintext highlighter-rouge">long long</code> array will not necessarily be aligned properly
for its type, meaning the above program might crash. The Right Way
is to never dereference the data directly but rather <code class="language-plaintext highlighter-rouge">memcpy()</code> it
into a properly-aligned variable, just as if the data was an
unaligned buffer.</p>
  </li>
  <li>
    <p>The data structure cannot use any pointers. Pointer values are
meaningless to other processes and will be no different than
garbage.</p>
  </li>
</ol>
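<p>For the second item, the fix is to declare the symbols as plain byte
arrays and copy into aligned storage. A sketch:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Sketch: memcpy the possibly-unaligned bytes into properly-aligned
   storage before reading them as an array of long long. */
extern char _binary_table_bin_start[];
extern char _binary_table_bin_end[];

long long table[90];
memcpy(table, _binary_table_bin_start,
       _binary_table_bin_end - _binary_table_bin_start);
</code></pre></div></div>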

<h3 id="towards-a-more-portable-approach">Towards a more portable approach</h3>

<p>There’s an easy way to address all three of these problems <em>and</em>
eliminate the reliance on GNU linkers: serialize the data into C code.
<em>It’s metaprogramming, baby.</em></p>

<p>In the Fibonacci example, change the <code class="language-plaintext highlighter-rouge">fwrite()</code> in <code class="language-plaintext highlighter-rouge">table_gen.c</code> to
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">printf</span><span class="p">(</span><span class="s">"int table_size = %d;</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">TABLE_SIZE</span><span class="p">);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"long long table[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TABLE_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    %lldLL,</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
</code></pre></div></div>

<p>The output of the program becomes text:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">table_size</span> <span class="o">=</span> <span class="mi">90</span><span class="p">;</span>
<span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">1LL</span><span class="p">,</span>
    <span class="mi">2LL</span><span class="p">,</span>
    <span class="mi">3LL</span><span class="p">,</span>
    <span class="cm">/* ... */</span>
    <span class="mi">1779979416004714189LL</span><span class="p">,</span>
    <span class="mi">2880067194370816120LL</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">print_fib.c</code> is changed to:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="n">table_size</span><span class="p">;</span>
<span class="k">extern</span> <span class="kt">long</span> <span class="kt">long</span> <span class="n">table</span><span class="p">[];</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">table_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"%lld</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Putting it all together:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen &gt; table.c
$ gcc -std=c99 -o print_fib print_fib.c table.c
</code></pre></div></div>

<p>Any C compiler and linker can handle all of this, no problem, making it
more portable. The intermediate metaprogram isn’t a barrier to
cross-compilation: it would be compiled for the host (typically
identified through <code class="language-plaintext highlighter-rouge">HOST_CC</code>) while the rest is compiled for the target
(e.g. <code class="language-plaintext highlighter-rouge">CC</code>).</p>

<p>The output of <code class="language-plaintext highlighter-rouge">table_gen.c</code> isn’t dependent on any architecture,
making it cross-compiler friendly. There are also no alignment
problems because it’s all visible to the compiler. The type system isn’t
being undermined.</p>

<h3 id="dealing-with-pointers">Dealing with pointers</h3>

<p>The Fibonacci example doesn’t address the pointer problem — it has no
pointers to speak of. So let’s step it up using the <a href="/blog/2016/11/13/">trie
from the previous article</a>. As a reminder, here it is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define TRIE_ALPHABET_SIZE  4
#define TRIE_TERMINAL_FLAG  (1U &lt;&lt; 0)
</span>
<span class="k">struct</span> <span class="n">trie</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span><span class="p">[</span><span class="n">TRIE_ALPHABET_SIZE</span><span class="p">];</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Dumping these structures out raw would definitely be useless since
they’re almost entirely pointer data. So instead, fill out an array of
these structures, referencing the array itself to build up the
pointers (later filled in by either the linker or the loader). This
code uses the <a href="/blog/2016/11/13/">in-place breadth-first traversal technique</a> from
the previous article.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">trie_serialize</span><span class="p">(</span><span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"struct trie %s[] = {</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="n">t</span><span class="p">;</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">head</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"    {​{"</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">TRIE_ALPHABET_SIZE</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">struct</span> <span class="n">trie</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">comma</span> <span class="o">=</span> <span class="n">i</span> <span class="o">?</span> <span class="s">", "</span> <span class="o">:</span> <span class="s">""</span><span class="p">;</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Add child to the queue. */</span>
                <span class="n">tail</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">tail</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
                <span class="n">next</span><span class="o">-&gt;</span><span class="n">p</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
                <span class="cm">/* Print the pointer to the child. */</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s%s + %zu"</span><span class="p">,</span> <span class="n">comma</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="o">++</span><span class="n">count</span><span class="p">);</span>
            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                <span class="n">printf</span><span class="p">(</span><span class="s">"%s0"</span><span class="p">,</span> <span class="n">comma</span><span class="p">);</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">printf</span><span class="p">(</span><span class="s">"}, 0, 0, %u},</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">TRIE_TERMINAL_FLAG</span><span class="p">);</span>
        <span class="n">head</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">p</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"};</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remember that list of strings from before?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AAAAA
ABCD
CAA
CAD
CDBD
</code></pre></div></div>

<p>It looks like this:</p>

<p><img src="/img/trie/trie.svg" alt="" /></p>

<p>That serializes to this C code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">trie</span> <span class="n">root</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">3</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">6</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">10</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">13</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">root</span> <span class="o">+</span> <span class="mi">14</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="n">root</span> <span class="o">+</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">{</span><span class="err">​</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This trie can be immediately used at program startup without
initialization, and it can even have new nodes inserted into it. It’s
not without its downsides, particularly because it’s a trie:</p>

<ol>
  <li>
    <p>It’s <em>really</em> going to blow up the size of the binary, especially
when it holds lots of strings. These nodes are anything but
compact.</p>
  </li>
  <li>
    <p>If the code is compiled to be position-independent (<code class="language-plaintext highlighter-rouge">-fPIC</code>), each
of those nodes is going to hold multiple dynamic relocations,
further exploding the size of the binary and <a href="/blog/2016/10/27/">preventing the trie
from being shared between processes</a>. It’s 24 bytes per
relocation on x86-64. This will also slow down program start up
time. With just a few thousand strings, the simple test program was
taking 5x longer to start (25ms instead of 5ms) than with an empty
trie.</p>
  </li>
  <li>
    <p>Even without being position-independent, the linker will have to
resolve all the compile-time relocations. I was able to overwhelm
the linker and run it out of memory with just some tens of thousands of
strings. This would make for a decent linker stress test.</p>
  </li>
</ol>

<p>This technique obviously doesn’t scale well with trie data. You’re
better off baking in the flat string list and building the trie at run
time — though you <em>could</em> compute the exact number of needed nodes at
compile time and statically allocate them (in .bss). I’ve personally
had much better luck with <a href="https://github.com/skeeto/yavalath">other sorts of lookup tables</a>.
It’s a useful tool for the C programmer’s toolbelt.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  <entry>
    <title>An Array of Pointers vs. a Multidimensional Array</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/27/"/>
    <id>urn:uuid:d1302ff9-f958-3486-134d-01c8ab84aa51</id>
    <updated>2016-10-27T21:01:33Z</updated>
    <category term="c"/><category term="linux"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="o">*</span><span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The other is a two-dimensional <code class="language-plaintext highlighter-rouge">char</code> array.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"red"</span><span class="p">,</span>
    <span class="s">"orange"</span><span class="p">,</span>
    <span class="s">"yellow"</span><span class="p">,</span>
    <span class="s">"green"</span><span class="p">,</span>
    <span class="s">"blue"</span><span class="p">,</span>
    <span class="s">"violet"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I have a <code class="language-plaintext highlighter-rouge">lookup()</code> function that
searches the table for a particular color.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lookup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">color</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">ncolors</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ncolors</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">strcmp</span><span class="p">(</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">color</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">i</span><span class="p">;</span>
    <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is <code class="language-plaintext highlighter-rouge">char
colors[][7]</code> or <code class="language-plaintext highlighter-rouge">char *colors[]</code>. It’s a little bit misleading because
the compiler generates different code depending on the type.</p>

<h3 id="memory-layout">Memory Layout</h3>

<p>Here’s what <code class="language-plaintext highlighter-rouge">colors_ptr</code>, a <em>jagged array</em>, typically looks like in
memory.</p>

<p><img src="/img/colortab/pointertab.png" alt="" /></p>

<p>The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression <code class="language-plaintext highlighter-rouge">colors_ptr[n]</code> is <code class="language-plaintext highlighter-rouge">char *</code>.</p>

<p>On x86-64, suppose the base of the table is in <code class="language-plaintext highlighter-rouge">rax</code>, the index of the
string I want to retrieve is <code class="language-plaintext highlighter-rouge">rcx</code>, and I want to put the string’s
address back into <code class="language-plaintext highlighter-rouge">rax</code>. It’s one load instruction.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<p>Contrast this with <code class="language-plaintext highlighter-rouge">colors_2d</code>: six 7-byte elements in a row. No
pointers or addresses. Only strings.</p>

<p><img src="/img/colortab/arraytab.png" alt="" /></p>

<p>The strings are in their defined order, packed together. The type of
the expression <code class="language-plaintext highlighter-rouge">colors_2d[n]</code> is <code class="language-plaintext highlighter-rouge">char [7]</code>, an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.</p>

<p>In the same scenario before with x86-64, it takes two instructions to
put the string’s address in <code class="language-plaintext highlighter-rouge">rax</code>, but neither is a load.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">imul</span>  <span class="nb">rcx</span><span class="p">,</span> <span class="nb">rcx</span><span class="p">,</span> <span class="mi">7</span>
<span class="nf">add</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>

<p>In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. <code class="language-plaintext highlighter-rouge">char colors_2d[][8]</code>). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">lea</span>   <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="nb">rcx</span><span class="o">*</span><span class="mi">8</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="relocation">Relocation</h3>

<p>There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.</p>

<p>If the program is compiled with GCC or Clang configured for position
independent code — <code class="language-plaintext highlighter-rouge">-fPIC</code> (for libraries) or <code class="language-plaintext highlighter-rouge">-fpie</code> + <code class="language-plaintext highlighter-rouge">-pie</code> (for
programs) — extra work has to be done to support <code class="language-plaintext highlighter-rouge">colors_ptr</code>. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.</p>

<p>However, <code class="language-plaintext highlighter-rouge">colors_2d</code> doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.</p>

<p>On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be <em>RIP-relative</em> (in the <a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">small code
model</a>). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).</p>

<p>In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in <code class="language-plaintext highlighter-rouge">rax</code> looks
like so:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; NASM:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">address</span><span class="p">]</span>
<span class="c1">;; Some others:</span>
<span class="nf">lea</span>    <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">rip</span> <span class="o">+</span> <span class="nv">address</span><span class="p">]</span>
</code></pre></div></div>

<p>Or in AT&amp;T syntax:</p>

<pre><code class="language-gas">lea    address(%rip), %rax
</code></pre>

<h3 id="virtual-memory">Virtual Memory</h3>

<p>Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.</p>

<p>Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means <code class="language-plaintext highlighter-rouge">colors_2d</code> has the possibility of being shared
between processes, but <code class="language-plaintext highlighter-rouge">colors_ptr</code> (and its entire page) definitely
does not. Shucks.</p>

<p>This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
<em>could</em> go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker <a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">lazily supplies jump addresses</a> for PLT
function stubs, one per accessed library function.</p>

<p>However, as I’ve written it above, it’s unlikely that even <code class="language-plaintext highlighter-rouge">colors_2d</code>
will be shared. It’s still missing an important ingredient: const.</p>

<h3 id="const">Const</h3>

<p>They say <a href="/blog/2016/07/25/">const isn’t for optimization</a> but, darnit, this
situation keeps coming up. Since <code class="language-plaintext highlighter-rouge">colors_ptr</code> and <code class="language-plaintext highlighter-rouge">colors_2d</code> are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
<code class="language-plaintext highlighter-rouge">colors_2d</code> to being a local copy.</p>

<p>Fortunately it’s trivial to fix by adding a const:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="n">colors_2d</span><span class="p">[][</span><span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (<code class="language-plaintext highlighter-rouge">.rodata</code>) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.</p>

<p>Note, the equivalent for <code class="language-plaintext highlighter-rouge">colors_ptr</code> requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="k">const</span> <span class="n">colors_ptr</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">};</span>
</code></pre></div></div>

<p>String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.</p>
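<p>A minimal illustration (the variable names are mine):</p>

```c
#include <string.h>

const char *example(void)
{
    char *implicit = "red";       /* legal C: a literal converts to char *,
                                     but writing implicit[0] is undefined
                                     behavior */
    const char *official = "red"; /* declaring it const makes the compiler
                                     reject writes outright */
    (void)implicit;
    return official;
}
```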

<h3 id="its-just-micro-optimization">It’s just micro-optimization</h3>

<p>These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Small-Size Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/10/07/"/>
    <id>urn:uuid:1e249621-40bb-39f9-7e47-17fbe37c9fa4</id>
    <updated>2016-10-07T01:43:12Z</updated>
    <category term="c"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve worked on many programs that frequently require small,
short-lived buffers for use as a temporary workspace, perhaps to
construct a string or array. In C this is often accomplished with
arrays of <a href="/blog/2016/10/02/">automatic storage duration</a> (i.e. allocated on the
stack). This is dirt cheap — much cheaper than a heap allocation —
and, unlike a typical general-purpose allocator, involves no thread
contention. However, the catch is that there may be no hard bound on the
buffer. For correctness, the scratch space must scale appropriately to
match its input. Whatever arbitrary buffer size I pick may be too small.</p>

<p>A widespread extension to C is the alloca() pseudo-function. It’s like
malloc(), but allocates memory on the stack, just like an automatic
variable. The allocation is automatically freed when the function (not
its scope!) exits, even with a longjmp() or other non-local exit.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">alloca</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">);</span>
</code></pre></div></div>

<p>Besides its portability issues, the most dangerous property is the
<strong>complete lack of error detection</strong>. If <code class="language-plaintext highlighter-rouge">size</code> is too large, the
program simply crashes, <em>or worse</em>.</p>

<p>For example, suppose I have an intern() function that finds or creates
the canonical representation/storage for a particular string. My
program needs to intern a string composed of multiple values, and will
construct a temporary string to do so.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="nf">intern</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">);</span>

<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">alloca</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I expect the vast majority of these <code class="language-plaintext highlighter-rouge">prefix</code> strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
<code class="language-plaintext highlighter-rouge">prefix</code>, perhaps by a malicious actor, the program will misbehave
without warning.</p>

<p>A portable alternative to alloca() is the variable-length array (VLA),
introduced in C99. Arrays with automatic storage duration need not
have a fixed, compile-time size. A VLA is just like alloca(), with
exactly <strong>the same dangers</strong>, but at least it’s properly scoped. VLAs
were rejected for inclusion in C++11 due to this danger.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">buffer</span><span class="p">[</span><span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">];</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There’s a middle-ground to this, using neither VLAs nor alloca().
Suppose the function always allocates a small, fixed size buffer —
essentially a free operation — but only uses this buffer if it’s large
enough for the job. If it’s not, a normal heap allocation is made with
malloc().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">intern_identifier</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">prefix</span><span class="p">,</span> <span class="kt">long</span> <span class="n">id</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">temp</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
    <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">temp</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">strlen</span><span class="p">(</span><span class="n">prefix</span><span class="p">)</span> <span class="o">+</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">temp</span><span class="p">))</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">size</span><span class="p">)))</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">sprintf</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="s">"%s%ld"</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">result</span> <span class="o">=</span> <span class="n">intern</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">buffer</span> <span class="o">!=</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since the function can now detect allocation errors, this version has
an error condition. Though, intern() itself would presumably return
NULL for its own allocation errors, so this is probably transparent to
the caller.</p>

<p>We’ve now entered the realm of <em>small-size optimization</em>. The vast
majority of cases are small and will therefore be very fast, but we
haven’t given up on the odd large case either. In fact, it’s been made
a little bit <em>worse</em> (via the unnecessary small allocation), selling
it out to make the common case fast. That’s sound engineering.</p>

<p>Visual Studio has a pair of functions that <em>nearly</em> automate this
solution: _malloca() and _freea(). It’s like alloca(), but
allocations beyond a certain threshold go on the heap. This allocation
is freed with _freea(), which does nothing in the case of a stack
allocation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span><span class="nf">_malloca</span><span class="p">(</span><span class="kt">size_t</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">_freea</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>I said “nearly” because Microsoft screwed it up: instead of returning
NULL on failure, it generates a stack overflow structured exception
(for a heap allocation failure).</p>

<p>I haven’t tried it yet, but I bet something similar to _malloca() /
_freea() could be implemented using a couple of macros.</p>
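<p>As a sketch of what those macros might look like — everything here is
my own guesswork: the names, the 1024-byte threshold, and the tag prefix
are not Microsoft’s — each allocation could carry a one-word header
recording its origin so the free routine can tell the cases apart:</p>

```c
#include <stdlib.h>
#if defined(_WIN32)
#  include <malloc.h>   /* MSVC declares alloca() here */
#else
#  include <alloca.h>   /* glibc and friends */
#endif

enum { MALLOCA_LIMIT = 1024 };

/* Stamp the origin tag into the header word and return the usable
 * region just past it. Returns NULL if the allocation itself failed. */
static void *malloca_tag(void *p, size_t heap)
{
    if (!p)
        return 0;                  /* heap allocation failed */
    *(size_t *)p = heap;           /* 0 = stack, 1 = heap */
    return (size_t *)p + 1;
}

/* Must be a macro, not a function: alloca() memory lives in the
 * caller's stack frame. Note: n is evaluated twice. */
#define MALLOCA(n)                                          \
    ((n) <= MALLOCA_LIMIT                                   \
         ? malloca_tag(alloca((n) + sizeof(size_t)), 0)     \
         : malloca_tag(malloc((n) + sizeof(size_t)), 1))

/* Only the heap case needs a real free(); pass the header word back. */
static void FREEA(void *p)
{
    if (p) {
        size_t *base = (size_t *)p - 1;
        if (*base)
            free(base);
    }
}
```

<p>Unlike _malloca(), this returns NULL on a failed heap allocation
rather than raising a structured exception. One caveat: the one-word
prefix only guarantees alignment to sizeof(size_t), which may be weaker
than what malloc() promises.</p>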

<h3 id="toward-structured-small-size-optimization">Toward Structured Small-Size Optimization</h3>

<p>CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the
talks. I don’t like developing in C++, but I always learn new,
interesting concepts from this conference, many of which apply
directly to C. I look forward to Chandler Carruth’s talks the most,
having learned so much from his past talks. I recommend these all:</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">Efficiency with Algorithms, Performance with Data Structures</a> (2014)</li>
  <li><a href="https://www.youtube.com/watch?v=nXaxk27zwlk">Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=vElZc6zSIXM">High Performance Code 201: Hybrid Data Structures</a> (2016)</li>
  <li><a href="https://www.youtube.com/watch?v=FnGCDLhaxKU">Understanding Compiler Optimization</a> (2015)</li>
  <li><a href="https://www.youtube.com/watch?v=eR34r7HOU14">Optimizing the Emergent Structures of C++</a> (2013)</li>
</ul>

<p>After writing this article, I saw Nicholas Ormrod’s talk, <a href="https://www.youtube.com/watch?v=kPR8h4-qZdk">The strange
details of std::string at Facebook</a>, which is also highly
relevant.</p>

<p>Chandler’s talk this year was the one on hybrid data structures. I’d
already been mulling over small-size optimization for months, and the
first 5–10 minutes of his talk showed me I was on the right track. In
his talk he describes LLVM’s SmallVector class (among others), which
is basically a small-size-optimized version of std::vector, which, due
to constraints on iterators under std::move() semantics, can’t itself
be small-size optimized.</p>

<p>I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of <code class="language-plaintext highlighter-rouge">long</code>
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_simple</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, <code class="language-plaintext highlighter-rouge">hint</code>, to be a power of two. It returns 1 on success and 0 on
error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">hint</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">hint</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">hint</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">hint</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing is straightforward, using realloc() when the buffer fills,
returning 0 for integer overflow or allocation failure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_simple_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And finally, cleaning up. I hadn’t thought about this before, but if
the compiler manages to inline vec_simple_free(), that NULL pointer
assignment will probably get optimized out, possibly <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">even in the face
of a use-after-free bug</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_simple_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_simple</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// trap use-after-free bugs</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Lastly, an example of its use (without checking for errors):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_simple</span> <span class="n">v</span><span class="p">;</span>
    <span class="n">vec_simple_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_simple_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="c1">// ... process vector into result ...</span>
    <span class="n">vec_simple_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the common case is only a handful of <code class="language-plaintext highlighter-rouge">long</code> values, and this
function is called frequently, we’re doing a lot of heap allocation
that could be avoided. Wouldn’t it be nice to put all that on the
stack?</p>

<h3 id="applying-small-size-optimization">Applying Small-Size Optimization</h3>

<p>Modify the struct to add this <code class="language-plaintext highlighter-rouge">temp</code> field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_small</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">temp</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">values</code> field is initially pointed at the small buffer. Notice
that unlike the “simple” vector above, this initialization function
cannot fail. It’s one less thing for the caller to check. It also
doesn’t take a <code class="language-plaintext highlighter-rouge">hint</code> since the buffer size is fixed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Pushing gets a little more complicated. If it’s the first time the
buffer has grown, the realloc() has to be done “manually” with
malloc() and memcpy().</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_small_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// overflow</span>

        <span class="kt">void</span>  <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span> <span class="o">==</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">));</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// out of memory</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, only call free() if the buffer was actually allocated on the
heap.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_small_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_small</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">!=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">temp</span><span class="p">)</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If 99% of these vectors never exceed 16 elements, then 99% of the time
the heap isn’t touched. That’s <em>much</em> better than before. The 1% case
is still covered, too, at what is probably an insignificant cost.</p>

<p>An important difference from SmallVector is that it parameterizes
the small buffer’s size through a template argument. In C we’re stuck
with fixed sizes or macro hacks. Or are we?</p>

<h3 id="using-a-caller-provided-buffer">Using a Caller-Provided Buffer</h3>

<p>This time remove the temporary buffer, making it look like the simple
vector from before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vec_flex</span> <span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">count</span><span class="p">;</span>
    <span class="kt">long</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The user will provide the initial buffer, which will presumably be an
adjacent, stack-allocated array, but whose size is under the user’s
control.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="o">*</span><span class="n">init</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nmemb</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// we need that low bit!</span>
    <span class="n">assert</span><span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">nmemb</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// power of 2</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">nmemb</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">init</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The power-of-two size, greater than one, means the size will always
be an even number. Why is this important? There’s one piece of
information missing from the struct: Is the buffer currently heap
allocated or not? That’s just one bit of information, but adding even
one more bit to the struct will typically pad it out by another 31 or
63 bits. What a waste! Since the size is always even, its lowest bit
is unused, and I can smuggle the flag in there. Hence the
<code class="language-plaintext highlighter-rouge">nmemb | 1</code>, the 1 indicating that the buffer is not heap allocated.</p>

<p>When pushing, the <code class="language-plaintext highlighter-rouge">actual_size</code> is extracted by clearing the bottom
bit (<code class="language-plaintext highlighter-rouge">size &amp; ~1</code>) and the indicator bit is extracted with a 1-bit mask
(<code class="language-plaintext highlighter-rouge">size &amp; 1</code>). On reallocation, the flag is cleared simply by never
setting it again.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">vec_flex_push</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">,</span> <span class="kt">long</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">size_t</span> <span class="n">actual_size</span> <span class="o">=</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">1</span><span class="p">;</span> <span class="c1">// clear bottom bit</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span> <span class="o">==</span> <span class="n">actual_size</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">value_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
        <span class="kt">size_t</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="mi">2</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_size</span> <span class="o">||</span> <span class="n">value_size</span> <span class="o">&gt;</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span> <span class="o">/</span> <span class="n">new_size</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* overflow */</span>

        <span class="kt">void</span> <span class="o">*</span><span class="n">new_values</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
            <span class="cm">/* First time heap allocation. */</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">new_values</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">new_values</span><span class="p">,</span> <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">actual_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="n">new_values</span> <span class="o">=</span> <span class="n">realloc</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">,</span> <span class="n">new_size</span> <span class="o">*</span> <span class="n">value_size</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">new_values</span><span class="p">)</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* out of memory */</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">new_size</span><span class="p">;</span>
        <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">new_values</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">[</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">count</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Only call free() when the buffer was actually heap allocated, as before.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">vec_flex_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">vec_flex</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">free</span><span class="p">(</span><span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">);</span>
    <span class="n">v</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And here’s what it looks like in action.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">example</span><span class="p">(</span><span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">f</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">),</span> <span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">vec_flex</span> <span class="n">v</span><span class="p">;</span>
    <span class="kt">long</span> <span class="n">buffer</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span>
    <span class="n">vec_flex_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
    <span class="kt">long</span> <span class="n">n</span><span class="p">;</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">n</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">arg</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">vec_flex_push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">,</span> <span class="n">n</span><span class="p">);</span>
    <span class="kt">long</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="c1">// ... process vector, computing result ...</span>
    <span class="n">vec_flex_free</span><span class="p">(</span><span class="o">&amp;</span><span class="n">v</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If you were to log all vector sizes as part of profiling, and the
assumption that these vectors typically hold only a few elements
proved correct, you could easily tune the array size in each case to
remove the vast majority of vector heap allocations.</p>

<p>Now that I’ve learned this optimization trick, I’ll be looking out for
good places to apply it. It’s also a good reason for me to stop
abusing VLAs.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Const and Optimization in C</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/07/25/"/>
    <id>urn:uuid:f785bc3b-dd3d-3952-2696-91eafe6b019d</id>
    <updated>2016-07-25T02:06:04Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>Today there was a <a href="https://redd.it/4udqwj">question on /r/C_Programming</a> about the
effect of C’s <code class="language-plaintext highlighter-rouge">const</code> on optimization. Variations of this question
have been asked many times over the past two decades. Personally, I
blame naming of <code class="language-plaintext highlighter-rouge">const</code>.</p>

<p>Given this program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="p">);</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">x</span><span class="p">);</span>
        <span class="n">y</span> <span class="o">+=</span> <span class="n">x</span><span class="p">;</span>  <span class="c1">// this load not optimized out</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">foo</code> takes a pointer to const, which is a promise from
the author of <code class="language-plaintext highlighter-rouge">foo</code> that it won’t modify the value of <code class="language-plaintext highlighter-rouge">x</code>. Given this
information, it would seem the compiler may assume <code class="language-plaintext highlighter-rouge">x</code> is always zero,
and therefore <code class="language-plaintext highlighter-rouge">y</code> is always zero.</p>

<p>However, inspecting the assembly output of several different compilers
shows that <code class="language-plaintext highlighter-rouge">x</code> is loaded each time around the loop. Here’s gcc 4.9.2
at -O3, with annotations, for x86-64,</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbp</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">xor</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; y = 0</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>              <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>    <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>        <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">add</span>    <span class="nb">ebp</span><span class="p">,</span> <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>  <span class="c1">; y += x  (not optimized out)</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x18</span>             <span class="c1">; deallocate x</span>
     <span class="nf">mov</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">ebp</span>              <span class="c1">; return y</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">pop</span>    <span class="nb">rbp</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The output of clang 3.5 (with -fno-unroll-loops) is the same, except
ebp and ebx are swapped, and the computation of <code class="language-plaintext highlighter-rouge">&amp;x</code> is hoisted out of
the loop, into <code class="language-plaintext highlighter-rouge">r14</code>.</p>

<p>Are both compilers failing to take advantage of this useful
information? Wouldn’t it be undefined behavior for <code class="language-plaintext highlighter-rouge">foo</code> to modify
<code class="language-plaintext highlighter-rouge">x</code>? Surprisingly, the answer is no. <em>In this situation</em>, this would
be a perfectly legal definition of <code class="language-plaintext highlighter-rouge">foo</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">foo</span><span class="p">(</span><span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">readonly_x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="o">*</span><span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="p">)</span><span class="n">readonly_x</span><span class="p">;</span>  <span class="c1">// cast away const</span>
    <span class="p">(</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The key thing to remember is that <a href="http://yarchive.net/comp/const.html"><strong><code class="language-plaintext highlighter-rouge">const</code> doesn’t mean
constant</strong></a>. Chalk it up as a misnomer. It’s not an
optimization tool. It’s there to inform programmers — not the compiler
— as a tool to catch a certain class of mistakes at compile time. I
like it in APIs because it communicates how a function will use
certain arguments, or how the caller is expected to handle returned
pointers. It’s usually not strong enough for the compiler to change
its behavior.</p>

<p>Despite what I just said, occasionally the compiler <em>can</em> take
advantage of <code class="language-plaintext highlighter-rouge">const</code> for optimization. The C99 specification, in
§6.7.3¶5, has one sentence just for this:</p>

<blockquote>
  <p>If an attempt is made to modify an object defined with a
const-qualified type through use of an lvalue with
non-const-qualified type, the behavior is undefined.</p>
</blockquote>

<p>The original <code class="language-plaintext highlighter-rouge">x</code> wasn’t const-qualified, so this rule didn’t apply.
And there aren’t any rules against casting away <code class="language-plaintext highlighter-rouge">const</code> to modify an
object that isn’t itself <code class="language-plaintext highlighter-rouge">const</code>. This means the above (mis)behavior
of <code class="language-plaintext highlighter-rouge">foo</code> isn’t undefined behavior <em>for this call</em>. Notice how the
undefined-ness of <code class="language-plaintext highlighter-rouge">foo</code> depends on how it was called.</p>

<p>With one tiny tweak to <code class="language-plaintext highlighter-rouge">bar</code>, I can make this rule apply, allowing the
optimizer to do some work on it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">const</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>

<p>The compiler may now assume that <strong><code class="language-plaintext highlighter-rouge">foo</code> modifying <code class="language-plaintext highlighter-rouge">x</code> is undefined
behavior, therefore <em>it never happens</em></strong>. For better or worse, this is
a major part of how a C optimizer reasons about your programs. The
compiler is free to assume <code class="language-plaintext highlighter-rouge">x</code> never changes, allowing it to optimize
out both the per-iteration load and <code class="language-plaintext highlighter-rouge">y</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>            <span class="c1">; loop variable i</span>
     <span class="nf">sub</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; allocate x</span>
     <span class="nf">mov</span>    <span class="kt">dword</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">],</span> <span class="mi">0</span>  <span class="c1">; x = 0</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsp</span><span class="o">+</span><span class="mh">0xc</span><span class="p">]</span>      <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">add</span>    <span class="nb">rsp</span><span class="p">,</span> <span class="mh">0x10</span>           <span class="c1">; deallocate x</span>
     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>            <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>The load disappears, <code class="language-plaintext highlighter-rouge">y</code> is gone, and the function always returns
zero.</p>

<p>Curiously, the specification <em>almost</em> allows the compiler to go
further. Consider what would happen if <code class="language-plaintext highlighter-rouge">x</code> were allocated somewhere
off the stack in read-only memory. That transformation would look like
this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">__x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="kt">int</span>
<span class="nf">bar</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">foo</span><span class="p">(</span><span class="o">&amp;</span><span class="n">__x</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We would see a few more instructions shaved off (<a href="http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models">-fPIC, small code
model</a>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">section</span> <span class="nv">.rodata</span>
<span class="nl">x:</span>   <span class="kd">dd</span>     <span class="mi">0</span>

<span class="nf">section</span> <span class="nv">.text</span>
<span class="nl">bar:</span>
     <span class="nf">push</span>   <span class="nb">rbx</span>
     <span class="nf">mov</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mh">0xa</span>        <span class="c1">; loop variable i</span>

<span class="nl">.L0:</span> <span class="nf">lea</span>    <span class="nb">rdi</span><span class="p">,</span> <span class="p">[</span><span class="nv">rel</span> <span class="nv">x</span><span class="p">]</span>    <span class="c1">; compute &amp;x</span>
     <span class="nf">call</span>   <span class="nv">foo</span>
     <span class="nf">sub</span>    <span class="nb">ebx</span><span class="p">,</span> <span class="mi">1</span>
     <span class="nf">jne</span>    <span class="nv">.L0</span>

     <span class="nf">xor</span>    <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>        <span class="c1">; return 0</span>
     <span class="nf">pop</span>    <span class="nb">rbx</span>
     <span class="nf">ret</span>
</code></pre></div></div>

<p>Because the address of <code class="language-plaintext highlighter-rouge">x</code> is taken and “leaked,” this last transform
is not permitted. If <code class="language-plaintext highlighter-rouge">bar</code> is called recursively such that a second
address is taken for <code class="language-plaintext highlighter-rouge">x</code>, that second pointer would compare equal
(<code class="language-plaintext highlighter-rouge">==</code>) to the first pointer despite the two being semantically distinct
objects, which is forbidden (§6.5.9¶6).</p>

<p>Even with this special <code class="language-plaintext highlighter-rouge">const</code> rule, stick to using <code class="language-plaintext highlighter-rouge">const</code> for
yourself and for your fellow human programmers. Let the optimizer
reason for itself about what is constant and what is not.</p>

<p>Travis Downs nicely summed up this article in the comments:</p>

<blockquote>
  <p>In general, <code class="language-plaintext highlighter-rouge">const</code> <em>declarations</em> can’t help the optimizer, but
<code class="language-plaintext highlighter-rouge">const</code> <em>definitions</em> can.</p>
</blockquote>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>Hotpatching a C Function on x86</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/03/31/"/>
    <id>urn:uuid:49f6ea3c-d44a-3bed-1aad-70ad47e325c6</id>
    <updated>2016-03-31T23:59:59Z</updated>
    <category term="x86"/><category term="c"/><category term="linux"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>In this post I’m going to do a silly, but interesting, exercise that
should never be done in any program that actually matters. I’m going to
write a program that changes one of its function definitions while
it’s actively running and using that function. Unlike <a href="/blog/2014/12/23/">last
time</a>, this won’t involve shared libraries, but it will require
x86_64 and GCC. Most of the time it will work with Clang, too, but
it’s missing an important compiler option that makes it stable.</p>

<p>If you want to see it all up front, here’s the full source:
<a href="/download/hotpatch.c">hotpatch.c</a></p>

<p>Here’s the function that I’m going to change:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s dead simple, but that’s just for demonstration purposes. This
will work with any function of arbitrary complexity. The definition
will be changed to this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"goodbye %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">x</span><span class="o">++</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I was only going to change the string, but I figured I should make
it a little more interesting.</p>

<p>Here’s how it’s going to work: I’m going to overwrite the beginning of
the function with an unconditional jump that immediately moves control
to the new definition of the function. It’s vital that the function
prototype does not change, since that would be a <em>far</em> more complex
problem.</p>

<p><strong>But first there’s some preparation to be done.</strong> The target needs to
be augmented with some GCC function attributes to prepare it for its
redefinition. As is, there are three possible problems that need to be
dealt with:</p>

<ul>
  <li>I want to hotpatch this function <em>while it is being used</em> by another
thread <em>without</em> any synchronization. It may even be executing the
function at the same time I clobber its first instructions with my
jump. If it’s in between these instructions, disaster will strike.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">ms_hook_prologue</code> function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.</p>

<ul>
  <li>The prologue NOP needs to be updated atomically. I can’t let the
other thread see a half-written instruction or, again, disaster. On
x86 this means I have an alignment requirement. Since I’m
overwriting an 8-byte instruction, I’m specifically going to need
8-byte alignment to get an atomic write.</li>
</ul>

<p>The solution is the <code class="language-plaintext highlighter-rouge">aligned</code> function attribute, ensuring the
hotpatch prologue is properly aligned.</p>

<ul>
  <li>The final problem is that there must be exactly one copy of this
function in the compiled program. It must never be inlined or
cloned, since these won’t be hotpatched.</li>
</ul>

<p>As you might have guessed, this is primarily fixed with the <code class="language-plaintext highlighter-rouge">noinline</code>
function attribute. GCC may also clone the function and call that
instead, so it also needs the <code class="language-plaintext highlighter-rouge">noclone</code> attribute.</p>

<p>Even further, if GCC determines there are no side effects, it may
cache the return value and only ever call the function once. To
convince GCC that there’s a side effect, I added an empty inline
assembly string (<code class="language-plaintext highlighter-rouge">__asm("")</code>). Since <code class="language-plaintext highlighter-rouge">puts()</code> has a side effect
(output), this isn’t truly necessary for this particular example, but
I’m being thorough.</p>

<p>What does the function look like now?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span> <span class="p">((</span><span class="n">ms_hook_prologue</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noinline</span><span class="p">))</span>
<span class="n">__attribute__</span> <span class="p">((</span><span class="n">noclone</span><span class="p">))</span>
<span class="kt">void</span>
<span class="nf">hello</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kr">__asm</span><span class="p">(</span><span class="s">""</span><span class="p">);</span>
    <span class="n">puts</span><span class="p">(</span><span class="s">"hello"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And what does the assembly look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -Mintel -d hotpatch
0000000000400848 &lt;hello&gt;:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 &lt;puts@plt&gt;
</code></pre></div></div>

<p>It’s 8-byte aligned and it has the 8-byte NOP: that <code class="language-plaintext highlighter-rouge">lea</code> instruction
does nothing. It copies <code class="language-plaintext highlighter-rouge">rsp</code> into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.</p>

<h3 id="hotpatching">Hotpatching</h3>

<p>Next, let’s take a look at the function that will perform the
hotpatch. I’ve written a generic patching function for this purpose.
This part is entirely specific to x86.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">hotpatch</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">target</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">replacement</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">assert</span><span class="p">(((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mh">0x07</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="c1">// 8-byte aligned?</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)((</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">target</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mh">0xfff</span><span class="p">);</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_WRITE</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
    <span class="kt">uint32_t</span> <span class="n">rel</span> <span class="o">=</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">replacement</span> <span class="o">-</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">-</span> <span class="mi">5</span><span class="p">;</span>
    <span class="k">union</span> <span class="p">{</span>
        <span class="kt">uint8_t</span> <span class="n">bytes</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
        <span class="kt">uint64_t</span> <span class="n">value</span><span class="p">;</span>
    <span class="p">}</span> <span class="n">instruction</span> <span class="o">=</span> <span class="p">{</span> <span class="p">{</span><span class="mh">0xe9</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">0</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">8</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">,</span> <span class="n">rel</span> <span class="o">&gt;&gt;</span> <span class="mi">24</span><span class="p">}</span> <span class="p">};</span>
    <span class="o">*</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="p">)</span><span class="n">target</span> <span class="o">=</span> <span class="n">instruction</span><span class="p">.</span><span class="n">value</span><span class="p">;</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It takes the address of the function to be patched and the address of
the function to replace it. As mentioned, the target <em>must</em> be 8-byte
aligned (enforced by the assert). It’s also important this function is
only called by one thread at a time, even on different targets. If
that was a concern, I’d wrap it in a mutex to create a critical
section.</p>

<p>There are a number of things going on here, so let’s go through them
one at a time:</p>

<h4 id="make-the-function-writeable">Make the function writeable</h4>

<p>The .text segment will not be writeable by default. This is for both
security and safety. Before I can hotpatch the function I need to make
the function writeable. To make the function writeable, I need to make
its page writeable. To make its page writeable I need to call
<code class="language-plaintext highlighter-rouge">mprotect()</code>. If another thread were monkeying with the page
attributes of this page at the same time (another thread calling
<code class="language-plaintext highlighter-rouge">hotpatch()</code>) I’d be in trouble.</p>

<p>It finds the page by rounding the target address down to the nearest
4096, the assumed page size (sorry hugepages). <em>Warning</em>: I’m being a
bad programmer and not checking the result of <code class="language-plaintext highlighter-rouge">mprotect()</code>. If it
fails, the program will crash and burn. It will always fail on systems
with W^X enforcement, which will likely become the standard <a href="https://marc.info/?t=145942649500004">in the
future</a>. Under W^X (“write XOR execute”), memory can either
be writeable or executable, but never both at the same time.</p>

<p>What if the function straddles pages? Well, I’m only patching the
first 8 bytes, which, thanks to alignment, will sit entirely inside
the page I just found. It’s not an issue.</p>

<p>At the end of the function, I <code class="language-plaintext highlighter-rouge">mprotect()</code> the page back to
non-writeable.</p>

<h4 id="create-the-instruction">Create the instruction</h4>

<p>I’m assuming the replacement function is within 2GB of the original in
virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s
no 64-bit relative jump, and I only have 8 bytes to work within
anyway. Looking that up in <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">the Intel manual</a>, I see this:</p>

<p><img src="/img/misc/jmp-e9.png" alt="" /></p>

<p>Fortunately it’s a really simple instruction. It’s opcode 0xE9 and
it’s followed immediately by the 32-bit displacement. The instruction
is 5 bytes wide.</p>

<p>To compute the relative jump, I take the difference between the
functions, minus 5. Why the 5? The jump address is computed from the
position <em>after</em> the jump instruction and, as I said, it’s 5 bytes
wide.</p>

<p>I put 0xE9 in a byte array, followed by the little endian
displacement. The astute may notice that the displacement is signed
(it can go “up” or “down”) and I used an unsigned integer. That’s
because it will overflow nicely to the right value and make those
shifts clean.</p>

<p>Finally, the instruction byte array I just computed is written over
the hotpatch NOP as a single, atomic, 64-bit store.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    *(uint64_t *)target = instruction.value;
</code></pre></div></div>

<p>Other threads will see either the NOP or the jump, nothing in between.
There’s no synchronization, so other threads may continue to execute
the NOP for a brief moment even though I’ve clobbered it, but that’s
fine.</p>

<h3 id="trying-it-out">Trying it out</h3>

<p>Here’s what my test program looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">worker</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">(</span><span class="kt">void</span><span class="p">)</span><span class="n">arg</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
        <span class="n">hello</span><span class="p">();</span>
        <span class="n">usleep</span><span class="p">(</span><span class="mi">100000</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">pthread_t</span> <span class="kr">thread</span><span class="p">;</span>
    <span class="n">pthread_create</span><span class="p">(</span><span class="o">&amp;</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">worker</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="n">getchar</span><span class="p">();</span>
    <span class="n">hotpatch</span><span class="p">(</span><span class="n">hello</span><span class="p">,</span> <span class="n">new_hello</span><span class="p">);</span>
    <span class="n">pthread_join</span><span class="p">(</span><span class="kr">thread</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I fire off the other thread to keep it pinging at <code class="language-plaintext highlighter-rouge">hello()</code>. The
main thread waits until I hit enter to give the program input,
after which it calls <code class="language-plaintext highlighter-rouge">hotpatch()</code> and changes the function called by
the “worker” thread. I’ve now changed the behavior of the worker
thread without its knowledge. In a more practical situation, this
could be used to update parts of a running program without restarting
or even synchronizing.</p>

<h3 id="further-reading">Further Reading</h3>

<p>These related articles have been shared with me since publishing this
article:</p>

<ul>
  <li><a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583">Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?</a></li>
  <li><a href="http://jbremer.org/x86-api-hooking-demystified/">x86 API Hooking Demystified</a></li>
  <li><a href="http://conf.researchr.org/event/pldi-2016/pldi-2016-papers-living-on-the-edge-rapid-toggling-probes-with-cross-modification-on-x86">Living on the edge: Rapid-toggling probes with cross modification on x86</a></li>
  <li><a href="https://lwn.net/Articles/620640/">arm64: alternatives runtime patching</a></li>
</ul>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Small, Freestanding Windows Executables</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2016/01/31/"/>
    <id>urn:uuid:8eddc701-52d3-3b0c-a8a8-dd13da6ead2c</id>
    <updated>2016-01-31T22:53:03Z</updated>
    <category term="c"/><category term="x86"/><category term="linux"/><category term="win32"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><strong>Update</strong>: This is old and <a href="/blog/2023/02/15/">was <strong>updated in 2023</strong></a>!</p>

<p>Recently I’ve been experimenting with freestanding C programs on
Windows. <em>Freestanding</em> refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and <a href="/blog/2014/12/09/">similar, bare metal
situations</a>. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size <code class="language-plaintext highlighter-rouge">memmove()</code> with move instructions. Since a freestanding
program would supply its own, it may have different semantics.</p>

<p>My usual go-to for C/C++ on Windows is <a href="http://mingw-w64.org/">Mingw-w64</a>, which has
greatly suited my needs over the past couple of years. It’s <a href="https://packages.debian.org/search?keywords=mingw-w64">packaged on
Debian</a>, and, when combined with Wine, allows me to fully develop
Windows applications on Linux. Being GCC, it’s also great for
cross-platform development since it’s essentially the same compiler as
the other platforms. The primary difference is the interface to the
operating system (POSIX vs. Win32).</p>

<p>However, it has one glaring flaw inherited from MinGW: it links
against msvcrt.dll, an ancient version of the Microsoft C runtime
library that currently ships with Windows. Besides being dated and
quirky, <a href="https://web.archive.org/web/0/https://blogs.msdn.microsoft.com/oldnewthing/20140411-00/?p=1273">it’s not an official part of Windows</a> and never has
been, despite its inclusion with every release since Windows 95.
Mingw-w64 doesn’t have a C library of its own, instead patching over
some of the flaws of msvcrt.dll and linking against it.</p>

<p>Since so much depends on msvcrt.dll despite its unofficial nature,
it’s unlikely Microsoft will ever drop it from future releases of
Windows. However, if strict correctness is a concern, we must ask
Mingw-w64 not to link against it. An alternative would be
<a href="http://plibc.sourceforge.net/">PlibC</a>, though the LGPL licensing is unfortunate. Another is
Cygwin, which is a very complete POSIX environment, but is heavy and
GPL-encumbered.</p>

<p>Sometimes I’d prefer to be more direct: <a href="https://hero.handmade.network/forums/code-discussion/t/94-guide_-_how_to_avoid_c_c++_runtime_on_windows">skip the C standard library
altogether</a> and talk directly to the operating system. On Windows
that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only
links against system DLLs.</p>

<h3 id="linux-vs-windows">Linux vs. Windows</h3>

<p>The most important benefit of a standard library like libc is a
portable, uniform interface to the host system. So long as the
standard library suits its needs, the same program can run anywhere.
Without it, the program needs an implementation of each
host-specific interface.</p>

<p>On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (<code class="language-plaintext highlighter-rouge">int 0x80</code> on x86, <code class="language-plaintext highlighter-rouge">syscall</code> on
x86-64, <code class="language-plaintext highlighter-rouge">swi</code> on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.</p>

<p>For example, here’s a function for a 1-argument system call on x86-64.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span>
<span class="nf">syscall1</span><span class="p">(</span><span class="kt">long</span> <span class="n">n</span><span class="p">,</span> <span class="kt">long</span> <span class="n">arg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">long</span> <span class="n">result</span><span class="p">;</span>
    <span class="n">__asm__</span> <span class="k">volatile</span> <span class="p">(</span>
        <span class="s">"syscall"</span>
        <span class="o">:</span> <span class="s">"=a"</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
        <span class="o">:</span> <span class="s">"a"</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="s">"D"</span><span class="p">(</span><span class="n">arg</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then <code class="language-plaintext highlighter-rouge">exit()</code> is implemented on top. Note: A <em>real</em> libc would do
cleanup before exiting, like calling registered <code class="language-plaintext highlighter-rouge">atexit()</code> functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;syscall.h&gt;</span><span class="c1">  // defines SYS_exit</span><span class="cp">
</span>
<span class="kt">void</span>
<span class="nf">exit</span><span class="p">(</span><span class="kt">int</span> <span class="n">code</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">syscall1</span><span class="p">(</span><span class="n">SYS_exit</span><span class="p">,</span> <span class="n">code</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The situation is simpler on Windows. Its low-level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to <code class="language-plaintext highlighter-rouge">malloc()</code>). It’s not POSIX, but it has analogs to much of
the same functionality.</p>

<h3 id="program-entry">Program Entry</h3>

<p>The standard entry for a C program is <code class="language-plaintext highlighter-rouge">main()</code>. However, this is not
the application’s <em>true</em> entry. The entry is in the C library, which
does some initialization before calling your <code class="language-plaintext highlighter-rouge">main()</code>. When <code class="language-plaintext highlighter-rouge">main()</code>
returns, it performs cleanup and exits. Without a C library, programs
don’t start at <code class="language-plaintext highlighter-rouge">main()</code>.</p>

<p>On Linux the default entry is the symbol <code class="language-plaintext highlighter-rouge">_start</code>. Its prototype
would look like so:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">_start</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Returning from this function leads to a segmentation fault, so it’s up
to your application to perform the exit system call rather than
return.</p>

<p>On Windows, the entry depends on the type of application. The two
relevant subsystems today are the <em>console</em> and <em>windows</em> subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give <code class="language-plaintext highlighter-rouge">-mconsole</code> (default) or <code class="language-plaintext highlighter-rouge">-mwindows</code> to the linker to
choose the subsystem.</p>

<p>The default <a href="https://msdn.microsoft.com/en-us/library/f9t8842e.aspx">entry for each is slightly different</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">WINAPI</span> <span class="nf">WinMainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
</code></pre></div></div>

<p>Unlike Linux’s <code class="language-plaintext highlighter-rouge">_start</code>, Windows programs can safely return from these
functions, similar to <code class="language-plaintext highlighter-rouge">main()</code>, hence the <code class="language-plaintext highlighter-rouge">int</code> return. The <code class="language-plaintext highlighter-rouge">WINAPI</code>
macro means the function may have a special calling convention,
depending on the platform.</p>

<p>On any system, you can choose a different entry symbol or address
using the <code class="language-plaintext highlighter-rouge">--entry</code> option to the GNU linker.</p>

<h3 id="disabling-libgcc">Disabling libgcc</h3>

<p>One problem I’ve run into is Mingw-w64 generating code that calls
<code class="language-plaintext highlighter-rouge">__chkstk_ms()</code> from libgcc. I believe this is a long-standing bug,
since <code class="language-plaintext highlighter-rouge">-ffreestanding</code> should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable <a href="https://metricpanda.com/rival-fortress-update-45-dealing-with-__chkstk-__chkstk_ms-when-cross-compiling-for-windows/">the stack
probe</a> and pre-commit the whole stack.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
</code></pre></div></div>

<p>Alternatively you could link against libgcc (statically) with <code class="language-plaintext highlighter-rouge">-lgcc</code>,
but, again, I’m going for a tiny executable.</p>

<h3 id="a-freestanding-example">A freestanding example</h3>

<p>Here’s an example of a Windows “Hello, World” that doesn’t use a C
library.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;windows.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="n">WINAPI</span>
<span class="nf">mainCRTStartup</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span> <span class="n">msg</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="n">HANDLE</span> <span class="n">stdout</span> <span class="o">=</span> <span class="n">GetStdHandle</span><span class="p">(</span><span class="n">STD_OUTPUT_HANDLE</span><span class="p">);</span>
    <span class="n">WriteFile</span><span class="p">(</span><span class="n">stdout</span><span class="p">,</span> <span class="n">msg</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">msg</span><span class="p">),</span> <span class="p">(</span><span class="n">DWORD</span><span class="p">[]){</span><span class="mi">0</span><span class="p">},</span> <span class="nb">NULL</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To build it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32
</code></pre></div></div>

<p>Notice I manually linked against kernel32.dll. The stripped final
result is only 4kB, mostly PE padding. There are <a href="http://www.phreedom.org/research/tinype/">techniques to trim
this down even further</a>, but for a substantial program it
wouldn’t make a significant difference.</p>

<p>From here you could create a GUI by linking against <code class="language-plaintext highlighter-rouge">user32.dll</code> and
<code class="language-plaintext highlighter-rouge">gdi32.dll</code> (both also part of Win32) and calling the appropriate
functions. I already <a href="/blog/2015/06/06/">ported my OpenGL demo</a> to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).</p>

<p>I may go this route for <a href="http://7drl.org/2016/01/13/7drl-2016-announced-for-5-13-march/">the upcoming 7DRL 2016</a> in March.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>A Basic Just-In-Time Compiler</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2015/03/19/"/>
    <id>urn:uuid:95e0437f-61f0-3932-55b7-f828e171d9ca</id>
    <updated>2015-03-19T04:57:55Z</updated>
    <category term="c"/><category term="tutorial"/><category term="netsec"/><category term="x86"/><category term="posix"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=17747759">on Hacker News</a> and <a href="https://old.reddit.com/r/programming/comments/akxq8q/a_basic_justintime_compiler/">on reddit</a>.</em></p>

<p><a href="http://redd.it/2z68di">Monday’s /r/dailyprogrammer challenge</a> was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(<code class="language-plaintext highlighter-rouge">u(0)</code>) and a sequence of operations, <code class="language-plaintext highlighter-rouge">f</code>, to apply to the previous
term (<code class="language-plaintext highlighter-rouge">u(n + 1) = f(u(n))</code>) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.</p>

<!--more-->

<p>For example, the relation <code class="language-plaintext highlighter-rouge">u(n + 1) = (u(n) + 2) * 3 - 5</code> would be
input as <code class="language-plaintext highlighter-rouge">+2 *3 -5</code>. If <code class="language-plaintext highlighter-rouge">u(0) = 0</code> then,</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">u(1) = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(2) = 4</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(3) = 13</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(4) = 40</code></li>
  <li><code class="language-plaintext highlighter-rouge">u(5) = 121</code></li>
  <li>…</li>
</ul>

<p>Rather than write an interpreter to apply the sequence of operations,
for <a href="https://gist.github.com/skeeto/3a1aa3df31896c9956dc">my submission</a> (<a href="/download/jit.c">mirror</a>) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.</p>

<p><strong>Update</strong>: The <a href="http://redd.it/2zna5q">follow-up challenge</a> uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for <a href="https://gist.github.com/anonymous/f7e4a5086a2b0acc83aa">my submission</a> (<a href="/download/rpn-jit.c">mirror</a>).</p>

<h3 id="allocating-executable-memory">Allocating Executable Memory</h3>

<p>Modern operating systems have page-granularity protections for
different parts of <a href="http://marek.vavrusa.com/c/memory/2015/02/20/memory/">process memory</a>: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.</p>

<p>The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (<em>shellcode</em>) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(<a href="http://en.wikipedia.org/wiki/Return-oriented_programming"><em>return-oriented programming</em></a>).</p>

<p>Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, <code class="language-plaintext highlighter-rouge">NULL</code>
points to a special page with read, write, and execute disabled.</p>

<h4 id="an-instruction-buffer">An Instruction Buffer</h4>

<p>Memory returned by <code class="language-plaintext highlighter-rouge">malloc()</code> and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through <code class="language-plaintext highlighter-rouge">malloc()</code>, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an <code class="language-plaintext highlighter-rouge">asmbuf</code> struct.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define PAGE_SIZE 4096
</span>
<span class="k">struct</span> <span class="n">asmbuf</span> <span class="p">{</span>
    <span class="kt">uint8_t</span> <span class="n">code</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">uint64_t</span><span class="p">)];</span>
    <span class="kt">uint64_t</span> <span class="n">count</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use <code class="language-plaintext highlighter-rouge">sysconf(_SC_PAGESIZE)</code> to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.</p>

<p>Instead of <code class="language-plaintext highlighter-rouge">malloc()</code>, the compiler allocates memory as an anonymous
memory map (<code class="language-plaintext highlighter-rouge">mmap()</code>). It’s anonymous because it’s not backed by a
file.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">prot</span> <span class="o">=</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Windows doesn’t have POSIX <code class="language-plaintext highlighter-rouge">mmap()</code>, so on that platform we use
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> instead. Here’s the equivalent in Win32.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span>
<span class="nf">asmbuf_create</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">type</span> <span class="o">=</span> <span class="n">MEM_RESERVE</span> <span class="o">|</span> <span class="n">MEM_COMMIT</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">VirtualAlloc</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">PAGE_READWRITE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For <a href="http://www.tedunangst.com/flak/post/now-or-never-exec">well-behaved
JIT compilers</a> it means memory protections need to be adjusted
after code generation and before execution.</p>

<p>The POSIX <code class="language-plaintext highlighter-rouge">mprotect()</code> function is used to change memory protections.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mprotect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_EXEC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or on Win32 (that last parameter is not allowed to be <code class="language-plaintext highlighter-rouge">NULL</code>),</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_finalize</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">DWORD</span> <span class="n">old</span><span class="p">;</span>
    <span class="n">VirtualProtect</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">),</span> <span class="n">PAGE_EXECUTE_READ</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">old</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Finally, instead of <code class="language-plaintext highlighter-rouge">free()</code> it gets unmapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">munmap</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">PAGE_SIZE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And on Win32,</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span>
<span class="nf">asmbuf_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">VirtualFree</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MEM_RELEASE</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">asmbuf_ins</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">ins</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">asmbuf_immediate</span><span class="p">(</span><span class="k">struct</span> <span class="n">asmbuf</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
</code></pre></div></div>

<h3 id="calling-conventions">Calling Conventions</h3>

<p>We’re only going to be concerned with three of x86-64’s many
registers: <code class="language-plaintext highlighter-rouge">rdi</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rdx</code>. These are 64-bit (<code class="language-plaintext highlighter-rouge">r</code>) extensions
of <a href="/blog/2014/12/09/">the original 16-bit 8086 registers</a>. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here's what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="nf">recurrence</span><span class="p">(</span><span class="kt">long</span><span class="p">);</span>
</code></pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">The System V AMD64 ABI calling convention</a> says that the first
integer/pointer function argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in <code class="language-plaintext highlighter-rouge">rax</code> when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy <code class="language-plaintext highlighter-rouge">rdi</code> to <code class="language-plaintext highlighter-rouge">rax</code>.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rax</span><span class="p">,</span> <span class="nb">rdi</span>
</code></pre></div></div>

<p>There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in <code class="language-plaintext highlighter-rouge">asmbuf</code>. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in <code class="language-plaintext highlighter-rouge">rcx</code> rather than <code class="language-plaintext highlighter-rouge">rdi</code>. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.</p>

<p>The very last thing it will do, assuming the result is in <code class="language-plaintext highlighter-rouge">rax</code>, is
return to the caller.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">ret</span>
</code></pre></div></div>

<p>So we know the assembly, but what do we pass to <code class="language-plaintext highlighter-rouge">asmbuf_ins()</code>? This
is where we get our hands dirty.</p>

<h4 id="finding-the-code">Finding the Code</h4>

<p>If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into it. You know, like
they used to do <a href="/blog/2016/11/17/">out of necessity</a> back in the 60’s.</p>

<p>Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file <code class="language-plaintext highlighter-rouge">peek.s</code> and hand it to <code class="language-plaintext highlighter-rouge">nasm</code>. It will produce a raw binary
with the machine code, which we’ll disassemble with <code class="language-plaintext highlighter-rouge">ndisasm</code> (the
NASM disassembler).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
</code></pre></div></div>

<p>That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4889f8</span><span class="p">);</span>  <span class="c1">// mov   rax, rdi</span>
<span class="c1">// ... generate code ...</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mh">0xc3</span><span class="p">);</span>      <span class="c1">// ret</span>
</code></pre></div></div>

<p>For each operation, we’ll set it up so the operand will already be
loaded into <code class="language-plaintext highlighter-rouge">rdi</code> regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32-bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use <code class="language-plaintext highlighter-rouge">0x0123456789abcdef</code> as the
operand.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>   <span class="nb">rdi</span><span class="p">,</span> <span class="mh">0x0123456789abcdef</span>
</code></pre></div></div>

<p>Which disassembled with <code class="language-plaintext highlighter-rouge">ndisasm</code> is,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
</code></pre></div></div>

<p>Notice the operand listed little endian immediately after the
instruction. That’s also easy!</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">operand</span><span class="p">;</span>
<span class="n">scanf</span><span class="p">(</span><span class="s">"%ld"</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
<span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mh">0x48bf</span><span class="p">);</span>         <span class="c1">// mov   rdi, operand</span>
<span class="n">asmbuf_immediate</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">operand</span><span class="p">);</span>
</code></pre></div></div>

<p>Apply the same discovery process individually for each operator you
want to support, accumulating the result in <code class="language-plaintext highlighter-rouge">rax</code> for each.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span> <span class="p">(</span><span class="n">operator</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">case</span> <span class="sc">'+'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4801f8</span><span class="p">);</span>   <span class="c1">// add   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'-'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4829f8</span><span class="p">);</span>   <span class="c1">// sub   rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'*'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mh">0x480fafc7</span><span class="p">);</span> <span class="c1">// imul  rax, rdi</span>
        <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="sc">'/'</span><span class="p">:</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x4831d2</span><span class="p">);</span>   <span class="c1">// xor   rdx, rdx</span>
        <span class="n">asmbuf_ins</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x48f7ff</span><span class="p">);</span>   <span class="c1">// idiv  rdi</span>
        <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As an exercise, try adding support for modulus operator (<code class="language-plaintext highlighter-rouge">%</code>), XOR
(<code class="language-plaintext highlighter-rouge">^</code>), and bit shifts (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the <a href="https://old.reddit.com/r/dailyprogrammer/comments/2z68di/_/cpgkcx7">closed form solution</a> to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.</p>

<h3 id="calling-the-generated-code">Calling the Generated Code</h3>

<p>Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a <code class="language-plaintext highlighter-rouge">void *</code> just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmbuf_finalize</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="kt">long</span> <span class="p">(</span><span class="o">*</span><span class="n">recurrence</span><span class="p">)(</span><span class="kt">long</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">code</span><span class="p">;</span>
<span class="c1">// ...</span>
<span class="n">x</span><span class="p">[</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">recurrence</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">n</span><span class="p">]);</span>
</code></pre></div></div>

<p>That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.</p>

<p>I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  <entry>
    <title>C11 Lock-free Stack</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/09/02/"/>
    <id>urn:uuid:743811a4-aaf7-32e3-8a0c-62f1e8dbaf66</id>
    <updated>2014-09-02T03:10:01Z</updated>
    <category term="c"/><category term="tutorial"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>C11, the <a href="http://en.wikipedia.org/wiki/C11_(C_standard_revision)">latest C standard revision</a>, hasn’t received anywhere
near the same amount of fanfare as C++11. I’m not sure why this is.
Some of the updates to each language are very similar, such as formal
support for threading and atomic object access. Three years have
passed and some parts of C11 still haven’t been implemented by any
compilers or standard libraries yet. Since there’s not yet a lot of
discussion online about C11, I’m basing much of this article on my own
understanding of the <a href="https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 draft</a>. I <em>may</em> be under-using the
<code class="language-plaintext highlighter-rouge">_Atomic</code> type specifier and not paying enough attention to memory
ordering constraints.</p>

<p>Still, this is a good opportunity to break new ground with a
demonstration of C11. I’m going to use the new
<a href="http://en.cppreference.com/w/c/atomic"><code class="language-plaintext highlighter-rouge">stdatomic.h</code></a> portion of C11 to build a lock-free data
structure. To compile this code you’ll need a C compiler and C library
with support for both C11 and the optional <code class="language-plaintext highlighter-rouge">stdatomic.h</code> features. As
of this writing, as far as I know only <a href="https://gcc.gnu.org/gcc-4.9/changes.html">GCC 4.9</a>, released April
2014, supports this. It’s in Debian unstable but not in Wheezy.</p>

<p>If you want to take a look before going further, here’s the source.
The test code in the repository uses plain old pthreads because C11
threads haven’t been implemented by anyone yet.</p>

<ul>
  <li><a href="https://github.com/skeeto/lstack">https://github.com/skeeto/lstack</a></li>
</ul>

<p>I was originally going to write this article a couple weeks ago, but I
was having trouble getting it right. Lock-free data structures are
trickier and nastier than I expected, more so than traditional mutex
locks. Getting it right requires very specific help from the hardware,
too, so it won’t run just anywhere. I’ll discuss all this below. So
sorry for the long article. It’s just a lot more complex a topic than
I had anticipated!</p>

<h3 id="lock-free">Lock-free</h3>

<p>A lock-free data structure doesn’t require the use of mutex locks.
More generally, it’s a data structure that can be accessed from
multiple threads without blocking. This is accomplished through the
use of atomic operations — transformations that cannot be
interrupted. Lock-free data structures will generally provide better
throughput than mutex locks. And they’re usually safer, because there’s
no risk of getting stuck on a lock that will never be freed, such as a
deadlock situation. On the other hand there’s additional risk of
starvation (livelock), where a thread is unable to make progress.</p>

<p>As a demonstration, I’ll build up a lock-free stack, a sequence with
last-in, first-out (LIFO) behavior. Internally it’s going to be
implemented as a linked-list, so pushing and popping is O(1) time,
just a matter of consing a new element on the head of the list. It
also means there’s only one value to be updated when pushing and
popping: the pointer to the head of the list.</p>

<p>Here’s what the API will look like. I’ll define <code class="language-plaintext highlighter-rouge">lstack_t</code> shortly.
I’m making it an opaque type because its fields should never be
accessed directly. The goal is to completely hide the atomic
semantics from the users of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>     <span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">);</span>
<span class="kt">void</span>    <span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">size_t</span>  <span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
<span class="kt">int</span>     <span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">);</span>
<span class="kt">void</span>   <span class="o">*</span><span class="nf">lstack_pop</span> <span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">);</span>
</code></pre></div></div>

<p>Users can push void pointers onto the stack, check the size of the
stack, and pop void pointers back off the stack. Except for
initialization and destruction, these operations are all safe to use
from multiple threads. Two different threads will never receive the
same item when popping. No elements will ever be lost if two threads
attempt to push at the same time. Most importantly a thread will never
block on a lock when accessing the stack.</p>

<p>Notice there’s a maximum size declared at initialization time. While
<a href="http://www.research.ibm.com/people/m/michael/pldi-2004.pdf">lock-free allocation is possible</a> [PDF], C makes no
guarantees that <code class="language-plaintext highlighter-rouge">malloc()</code> is lock-free, so being truly lock-free
means not calling <code class="language-plaintext highlighter-rouge">malloc()</code>. An important secondary benefit to
pre-allocating the stack’s memory is that this implementation doesn’t
require the use of <a href="http://en.wikipedia.org/wiki/Hazard_pointer">hazard pointers</a>, which would be far more
complicated than the stack itself.</p>

<p>The declared maximum size should actually be the desired maximum size
plus the number of threads accessing the stack. This is because a
thread might remove a node from the stack and, before the node can be
freed for reuse, another thread attempts a push. This other thread
might not find any free nodes, causing it to give up without the stack
actually being “full.”</p>

<p>The <code class="language-plaintext highlighter-rouge">int</code> return value of <code class="language-plaintext highlighter-rouge">lstack_init()</code> and <code class="language-plaintext highlighter-rouge">lstack_push()</code> is for
error codes, returning 0 for success. The only way these can fail is
by running out of memory. This is an issue regardless of being
lock-free: systems can simply run out of memory. In the push case it
means the stack is full.</p>

<h3 id="structures">Structures</h3>

<p>Here’s the definition for a node in the stack. Neither field needs to
be accessed atomically, so they’re not special in any way. In fact,
the fields are <em>never</em> updated while on the stack and visible to
multiple threads, so it’s effectively immutable (outside of reuse).
Users never need to touch this structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_node</span> <span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Internally a <code class="language-plaintext highlighter-rouge">lstack_t</code> is composed of <em>two</em> stacks: the value stack
(<code class="language-plaintext highlighter-rouge">head</code>) and the free node stack (<code class="language-plaintext highlighter-rouge">free</code>). These will be handled
identically by the atomic functions, so it’s really a matter of
convention which stack is which. All nodes are initially placed on the
free stack and the value stack starts empty. Here’s what an internal
stack looks like.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">lstack_head</span> <span class="p">{</span>
    <span class="kt">uintptr_t</span> <span class="n">aba</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>There’s still no atomic declaration here because the struct is going
to be handled as an entire unit. The <code class="language-plaintext highlighter-rouge">aba</code> field is critically
important for correctness and I’ll go over it shortly. It’s declared
as a <code class="language-plaintext highlighter-rouge">uintptr_t</code> because it needs to be the same size as a pointer.
Now, this is not guaranteed by C11 — it’s only guaranteed to be large
enough to hold any valid <code class="language-plaintext highlighter-rouge">void *</code> pointer, so it could be even larger
— but this will be the case on any system that has the required
hardware support for this lock-free stack. This struct is therefore
the size of two pointers. If that’s not true for any reason, this code
will not link. Users will never directly access or handle this struct
either.</p>

<p>Finally, here’s the actual stack structure.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node_buffer</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head</span><span class="p">,</span> <span class="n">free</span><span class="p">;</span>
    <span class="k">_Atomic</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
<span class="p">}</span> <span class="n">lstack_t</span><span class="p">;</span>
</code></pre></div></div>

<p>Notice the use of the new <code class="language-plaintext highlighter-rouge">_Atomic</code> qualifier. Atomic values may have
different size, representation, and alignment requirements in order to
satisfy atomic access. These values should never be accessed directly,
even just for reading (use <code class="language-plaintext highlighter-rouge">atomic_load()</code>).</p>

<p>The <code class="language-plaintext highlighter-rouge">size</code> field is for convenience to check the number of elements on
the stack. It’s accessed separately from the stack nodes themselves,
so it’s not safe to read <code class="language-plaintext highlighter-rouge">size</code> and use the information to make
assumptions about future accesses (e.g. checking if the stack is empty
before popping off an element). Since there’s no way to lock the
lock-free stack, there’s otherwise no way to estimate the size of the
stack during concurrent access without completely disassembling it via
<code class="language-plaintext highlighter-rouge">lstack_pop()</code>.</p>

<p>There’s <a href="https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt">no reason to use <code class="language-plaintext highlighter-rouge">volatile</code> here</a>. That’s a
separate issue from atomic operations. The C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> macros
and functions will ensure atomic values are accessed appropriately.</p>

<h3 id="stack-functions">Stack Functions</h3>

<p>As stated before, all nodes are initially placed on the internal free
stack. During initialization they’re allocated in one solid chunk,
chained together, and pinned on the <code class="language-plaintext highlighter-rouge">free</code> pointer. The initial
assignments to atomic values are done through <code class="language-plaintext highlighter-rouge">ATOMIC_VAR_INIT</code>, which
deals with memory access ordering concerns. The <code class="language-plaintext highlighter-rouge">aba</code> counters don’t
<em>actually</em> need to be initialized. Garbage (indeterminate) values are
just fine, but not initializing them would probably look like a
mistake.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_init</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">max_size</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">head_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">};</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="n">head_init</span><span class="p">);</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

    <span class="cm">/* Pre-allocate all nodes. */</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">max_size</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">lstack_node</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
        <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">[</span><span class="n">max_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">].</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">free_init</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">};</span>
    <span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span> <span class="o">=</span> <span class="n">ATOMIC_VAR_INIT</span><span class="p">(</span><span class="n">free_init</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The free nodes will not necessarily be used in the same order that
they’re placed on the free stack. Several threads may pop off nodes
from the free stack and, as a separate operation, push them onto the
value stack in a different order. Over time with multiple threads
pushing and popping, the nodes are likely to get shuffled around quite
a bit. This is why a linked list is still necessary even though
allocation is contiguous.</p>

<p>The reverse of <code class="language-plaintext highlighter-rouge">lstack_init()</code> is simple, and it’s assumed concurrent
access has terminated. The stack is no longer valid, at least not
until <code class="language-plaintext highlighter-rouge">lstack_init()</code> is used again. This one is declared <code class="language-plaintext highlighter-rouge">inline</code> and
put in the header.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">lstack_free</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">free</span><span class="p">(</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">node_buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To read an atomic value we need to use <code class="language-plaintext highlighter-rouge">atomic_load()</code>. Given a
pointer to an atomic value, it dereferences the pointer and returns
the value. This is used in another inline function for reading the
size of the stack.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">size_t</span>
<span class="nf">lstack_size</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">atomic_load</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="push-and-pop">Push and Pop</h4>

<p>For operating on the two stacks there will be two internal, static
functions, <code class="language-plaintext highlighter-rouge">push</code> and <code class="language-plaintext highlighter-rouge">pop</code>. These deal directly in nodes, accepting
and returning them, so they’re not suitable to expose in the API
(users aren’t meant to be aware of nodes). This is the most complex
part of lock-free stacks. Here’s <code class="language-plaintext highlighter-rouge">pop()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span>
<span class="nf">pop</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">orig</span><span class="p">.</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
            <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>  <span class="c1">// empty stack</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
    <span class="k">return</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s centered around the new C11 <code class="language-plaintext highlighter-rouge">stdatomic.h</code> function
<code class="language-plaintext highlighter-rouge">atomic_compare_exchange_weak()</code>. This is an atomic operation more
generally called <a href="http://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a> (CAS). On x86 there’s an
instruction specifically for this, <code class="language-plaintext highlighter-rouge">cmpxchg</code>. Give it a pointer to the
atomic value to be updated (<code class="language-plaintext highlighter-rouge">head</code>), a pointer to the value it’s
expected to be (<code class="language-plaintext highlighter-rouge">orig</code>), and a desired new value (<code class="language-plaintext highlighter-rouge">next</code>). If the
expected and actual values match, it’s updated to the new value. If
not, it reports a failure and updates the expected value to the latest
value. In the event of a failure we start all over again, which
requires the <code class="language-plaintext highlighter-rouge">while</code> loop. This is an <em>optimistic</em> strategy.</p>

<p>The “weak” part means it will sometimes spuriously fail where the
“strong” version would otherwise succeed. In exchange for more
failures, calling the weak version is faster. Use the weak version
when the body of your <code class="language-plaintext highlighter-rouge">do ... while</code> loop is fast and the strong
version when it’s slow (when trying again is expensive), or if you
don’t need a loop at all. You usually want to use weak.</p>

<p>The alternative to CAS is <a href="http://en.wikipedia.org/wiki/Load-link/store-conditional">load-link/store-conditional</a>. It’s a
stronger primitive that doesn’t suffer from the ABA problem described
next, but it’s also not available on x86-64. On other platforms, one
or both of <code class="language-plaintext highlighter-rouge">atomic_compare_exchange_*()</code> will be implemented using
LL/SC, but we still have to code for the worst case (CAS).</p>

<h5 id="the-aba-problem">The ABA Problem</h5>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field is here to solve <a href="http://en.wikipedia.org/wiki/ABA_problem">the ABA problem</a> by counting
the number of changes that have been made to the stack. It will be
updated atomically alongside the pointer. Reasoning about the ABA
problem is where I got stuck last time writing this article.</p>

<p>Suppose <code class="language-plaintext highlighter-rouge">aba</code> didn’t exist and it was just a pointer being swapped.
Say we have two threads, A and B.</p>

<ul>
  <li>
    <p>Thread A copies the current <code class="language-plaintext highlighter-rouge">head</code> into <code class="language-plaintext highlighter-rouge">orig</code>, enters the loop body
to update <code class="language-plaintext highlighter-rouge">next.node</code> to <code class="language-plaintext highlighter-rouge">orig.node-&gt;next</code>, then gets preempted
before the CAS. The scheduler pauses the thread.</p>
  </li>
  <li>
<p>Thread B comes along and performs a <code class="language-plaintext highlighter-rouge">pop()</code>, changing the value pointed
to by <code class="language-plaintext highlighter-rouge">head</code>. At this point A’s CAS will fail, which is fine. It
would reconstruct a new updated value and try again. While A is
still asleep, B puts the popped node back on the free node stack.</p>
  </li>
  <li>
    <p>Some time passes with A still paused. The freed node gets re-used
and pushed back on top of the stack, which is likely given that
nodes are allocated LIFO. Now <code class="language-plaintext highlighter-rouge">head</code> has its original value again,
but the <code class="language-plaintext highlighter-rouge">head-&gt;node-&gt;next</code> pointer is pointing somewhere completely
new! <em>This is very bad</em> because A’s CAS will now succeed despite
<code class="language-plaintext highlighter-rouge">next.node</code> having the wrong value.</p>
  </li>
  <li>
<p>A wakes up and its CAS succeeds. At least one stack value has been
lost and at least one node struct was leaked (it will be on neither
stack, nor currently being held by a thread). This is the ABA
problem.</p>
  </li>
</ul>

<p>The core problem is that, unlike integral values, pointers have
meaning beyond their intrinsic numeric value. The meaning of a
particular pointer changes when the pointer is reused, making it
suspect when used in CAS. The unfortunate effect is that, <strong>by itself,
atomic pointer manipulation is nearly useless</strong>. It works with
append-only data structures, where pointers are never recycled, but
that’s it.</p>

<p>The <code class="language-plaintext highlighter-rouge">aba</code> field solves the problem because it’s incremented every time
the pointer is updated. Remember that this internal stack struct is
two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16
bytes is compared by CAS and they all have to match for it to succeed.
Since B, or other threads, will increment <code class="language-plaintext highlighter-rouge">aba</code> at least twice (once
to remove the node, and once to put it back in place), A will never
mistake the recycled pointer for the old one. There’s a special
double-width CAS instruction specifically for this purpose,
<code class="language-plaintext highlighter-rouge">cmpxchg16b</code>. This is generally called DWCAS. It’s available on most
x86-64 processors. On Linux you can check <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> for support.
It will be listed as <code class="language-plaintext highlighter-rouge">cx16</code>.</p>

<p>If it’s not available at compile-time this program won’t link. The
function that wraps <code class="language-plaintext highlighter-rouge">cmpxchg16b</code> won’t be there. You can tell GCC to
<em>assume</em> it’s there with the <code class="language-plaintext highlighter-rouge">-mcx16</code> flag. The same rule here applies
to C++11’s <code class="language-plaintext highlighter-rouge">std::atomic</code>.</p>

<p>There’s still a tiny possibility of the ABA problem
cropping up. On 32-bit systems A may get preempted for over 4 billion
(2^32) stack operations, such that the ABA counter wraps around to the
same value. There’s nothing we can do about this, but if you witness
this in the wild you need to immediately stop what you’re doing and go
buy a lottery ticket. Also avoid any lightning storms on the way to
the store.</p>

<h5 id="hazard-pointers-and-garbage-collection">Hazard Pointers and Garbage Collection</h5>

<p>Another problem in <code class="language-plaintext highlighter-rouge">pop()</code> is dereferencing <code class="language-plaintext highlighter-rouge">orig.node</code> to access its
<code class="language-plaintext highlighter-rouge">next</code> field. By the time we get to it, the node pointed to by
<code class="language-plaintext highlighter-rouge">orig.node</code> may have already been removed from the stack and freed. If
the stack was using <code class="language-plaintext highlighter-rouge">malloc()</code> and <code class="language-plaintext highlighter-rouge">free()</code> for allocations, it may
even have had <code class="language-plaintext highlighter-rouge">free()</code> called on it. If so, the dereference would be
undefined behavior — a segmentation fault, or worse.</p>

<p>There are three ways to deal with this.</p>

<ol>
  <li>
    <p>Garbage collection. If memory is automatically managed, the node
will never be freed as long as we can access it, so this won’t be a
problem. However, if we’re interacting with a garbage collector
we’re not really lock-free.</p>
  </li>
  <li>
    <p>Hazard pointers. Each thread keeps track of what nodes it’s
currently accessing and other threads aren’t allowed to free nodes
on this list. This is messy and complicated.</p>
  </li>
  <li>
    <p>Never free nodes. This implementation recycles nodes, but they’re
never truly freed until <code class="language-plaintext highlighter-rouge">lstack_free()</code>. It’s always safe to
dereference a node pointer because there’s always a node behind it.
It may point to a node that’s on the free list or one that was even
recycled since we got the pointer, but the <code class="language-plaintext highlighter-rouge">aba</code> field deals with
any of those issues.</p>
  </li>
</ol>

<p>Reference counting on the node won’t work here because we can’t get to
the counter fast enough (atomically). It too would require
dereferencing in order to increment. The reference counter could
potentially be packed alongside the pointer and accessed by a DWCAS,
but we’re already using those bytes for <code class="language-plaintext highlighter-rouge">aba</code>.</p>

<h5 id="push">Push</h5>

<p>Push is a lot like pop.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span>
<span class="nf">push</span><span class="p">(</span><span class="k">_Atomic</span> <span class="k">struct</span> <span class="n">lstack_head</span> <span class="o">*</span><span class="n">head</span><span class="p">,</span> <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_head</span> <span class="n">next</span><span class="p">,</span> <span class="n">orig</span> <span class="o">=</span> <span class="n">atomic_load</span><span class="p">(</span><span class="n">head</span><span class="p">);</span>
    <span class="k">do</span> <span class="p">{</span>
        <span class="n">node</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">node</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">aba</span> <span class="o">=</span> <span class="n">orig</span><span class="p">.</span><span class="n">aba</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">next</span><span class="p">.</span><span class="n">node</span> <span class="o">=</span> <span class="n">node</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">atomic_compare_exchange_weak</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">orig</span><span class="p">,</span> <span class="n">next</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s counter-intuitive, but adding a <a href="http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/">few microseconds of
sleep</a> after CAS failures would probably <em>increase</em>
throughput. Under high contention, threads wouldn’t take turns
clobbering each other as fast as possible. It would be a bit like
exponential backoff.</p>

<h4 id="api-push-and-pop">API Push and Pop</h4>

<p>The API push and pop functions are built on these internal atomic
functions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">lstack_push</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ENOMEM</span><span class="p">;</span>
    <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="n">atomic_fetch_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Push removes a node from the free stack. If the free stack is empty it
reports an out-of-memory error. It assigns the value and pushes it
onto the value stack where it will be visible to other threads.
Finally, the stack size is incremented atomically. This means there’s
an instant where the stack size is listed as one shorter than it
actually is. However, since there’s no way to access both the stack
size and the stack itself at the same instant, this is fine. The stack
size is really only an estimate.</p>

<p>Popping is the same thing in reverse.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">lstack_pop</span><span class="p">(</span><span class="n">lstack_t</span> <span class="o">*</span><span class="n">lstack</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">lstack_node</span> <span class="o">*</span><span class="n">node</span> <span class="o">=</span> <span class="n">pop</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">head</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="n">atomic_fetch_sub</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">value</span> <span class="o">=</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">value</span><span class="p">;</span>
    <span class="n">push</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lstack</span><span class="o">-&gt;</span><span class="n">free</span><span class="p">,</span> <span class="n">node</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Remove the top node, decrement the size estimate atomically, put the
node on the free list, and return the value. It’s really simple with
the primitive push and pop.</p>

<h3 id="sha1-demo">SHA1 Demo</h3>

<p>The lstack repository linked at the top of the article includes a demo
that searches for patterns in SHA-1 hashes (sort of like Bitcoin
mining). It fires off one worker thread for each core and the results
are all collected into the same lock-free stack. It’s not <em>really</em>
exercising the library thoroughly because there are no contended pops,
but I couldn’t think of a better example at the time.</p>

<p>The next thing to try would be implementing a C11, bounded, lock-free
queue. It would also be more generally useful than a stack,
particularly for common consumer-producer scenarios.</p>

]]>
    </content>
  </entry>

</feed>
