Brute Force Incognito Browsing

Both Firefox and Chrome have a feature for creating temporary private browsing sessions. Firefox calls it Private Browsing and Chrome calls it Incognito Mode. Both work essentially the same way. A temporary browsing session is started without carrying over most existing session state (cookies, etc.), and no state (cookies, browsing history, cached data, etc.) is preserved after ending the session. Depending on the configuration, some browser extensions will be enabled in the private session, and their own internal state may be preserved.

The most obvious use is for visiting websites that you don’t want listed in your browsing history. Another use for more savvy users is to visit websites with a fresh, empty cookie file. For example, some news websites use a cookie to track the number of visits and require a subscription after a certain number of “free” articles. Manually deleting cookies is a pain (especially without a specialized extension), but opening the same article in a private session is two clicks away.

For web development there’s yet another use. A private session is a way to view your website from the perspective of a first-time visitor. You’ll be logged out and will have little or no existing state.

However, sometimes it just doesn’t go far enough. Some of those news websites have adapted, and in addition to counting the number of visits, they’ve figured out how to detect private sessions and block them. I haven’t looked into how they do this — maybe something to do with local storage, or detecting previously cached content. Sometimes I want a private session that’s truly fully isolated. The existing private session features just aren’t isolated enough or they behave differently, which is how they’re being detected.

Some time ago I put together a couple of scripts to brute force my own private sessions when I need them, generally for testing websites in a guaranteed fresh, fully-functioning instance. It also lets me run multiple such sessions in parallel. My scripts don’t rely on any private session feature of the browser, so the behavior is identical to a real browser, making it undetectable.

The downside is that, for better or worse, no browser extensions are carried over. In some ways this can be considered a feature, but a lot of the time I would like my ad-blocker to carry over. Your ad-blocker is probably the most important security software on your computer, so you should hesitate to give it up.

Another downside is that both Firefox and Chrome have some irritating first-time behaviors that can’t be disabled. The intent is to be newbie-friendly but it just gets in my way. For example, both bug me about logging into their browser platforms. Firefox starts with two tabs. Chrome creates a popup to ask me to configure a printer. Both start with a junk URL in the location bar so I can’t just middle-click paste (i.e. the X11 selection clipboard) into it. It’s definitely not designed for my use case.

Firefox

Here’s my brute force private session script for Firefox:

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/firefox-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
firefox -profile "$TEMP" -no-remote "$@"

It creates a temporary directory under $XDG_CACHE_HOME and tells Firefox to use the profile in that directory. No such profile exists, of course, so Firefox creates a fresh profile.

In theory I could just create a new profile alongside the default within my existing ~/.mozilla directory. However, I’ve never liked Firefox’s profile feature, especially with the intentionally unpredictable way it stores the profile itself: behind a random path. I also don’t trust it to be fully isolated and to fully clean up when I’m done.

Before starting Firefox, I register a trap with the shell to clean up the profile directory regardless of what happens. It doesn’t matter if Firefox exits cleanly, if it crashes, or if I CTRL-C it to death.

The -no-remote option prevents the new Firefox instance from joining onto an existing Firefox instance, which it really prefers to do even though it’s technically supposed to be a different profile.

Note the "$@", which passes arguments through to Firefox — most often the URL of the site I want to test.

Chromium

I don’t actually use Chrome but rather the open source version, Chromium. I think this script will also work with Chrome.

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/chromium-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
chromium --user-data-dir="$TEMP" \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

It’s exactly the same as the Firefox script and only the browser arguments have changed. I tell it not to ask about being the default browser, and --no-first-run disables some of the irritating first-time behaviors.

Chromium is very noisy on the command line, so I also redirect all output to /dev/null.

If you’re on Debian like me, its version of Chromium comes with a --temp-profile option that handles the throwaway profile automatically. So the script can be simplified:

#!/bin/sh -e
chromium --temp-profile \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

In my own use case, these scripts have fully replaced the built-in private session features. In fact, since Chromium is not my primary browser, my brute force private session script is how I usually launch it. I only run it to test things, and I always want to test using a fresh profile.

Prospecting for Hash Functions

I recently got an itch to design my own non-cryptographic integer hash function. Firstly, I wanted to better understand how hash functions work, and the best way to learn is to do. For years I’d been treating them like magic, shoving input in one end and seeing random-looking, but deterministic, output come out the other. Just how is the avalanche effect achieved?

Secondly, could I apply my own particular strengths to craft a hash function better than the handful of functions I could find online? Especially the classic ones from Thomas Wang and Bob Jenkins. Instead of struggling with the mathematics, maybe I could software engineer my way to victory, taking advantage of the excessive computational power available today.

Suppose, for example, I wrote a tool to generate a random hash function definition, then JIT compile it to a native function in memory, then execute that function across various inputs to evaluate its properties. My tool could rapidly repeat this process in a loop until it stumbled upon an incredible hash function the world had never seen. That’s what I actually did. I call it the Hash Prospector:

https://github.com/skeeto/hash-prospector

It only works on x86-64 because it uses the same JIT compiling technique I’ve discussed before: allocate a page of memory, write some machine instructions into it, set the page to executable, cast the page pointer to a function pointer, then call the generated code through the function pointer.

Generating a hash function

My focus is on integer hash functions: a function that accepts an n-bit integer and returns an n-bit integer. One of the important properties of an integer hash function is that it maps its inputs to outputs 1:1. In other words, there are no collisions. If there’s a collision, then some outputs aren’t possible, and the function isn’t making efficient use of its entropy.

This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input.

For example, XOR with a constant is trivially reversible: XOR the output with the same constant to reverse it. Addition with a constant is reversed by subtraction with the same constant. Since the integer operations are modular arithmetic, modulo 2^n for n-bit integers, multiplication by an odd number is reversible. Odd numbers are coprime with the power-of-two modulus, so there is some modular multiplicative inverse that reverses the operation.
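As a concrete illustration of that last point (this helper isn’t part of the prospector), the modular inverse of an odd constant is easy to compute with a few Newton iterations, each of which doubles the number of correct low bits:

#include <stdint.h>

/* Sketch: compute the multiplicative inverse of an odd constant modulo
 * 2^32 using Newton's method. For odd c, x = c is already correct to
 * three bits, and each step doubles that, so five steps cover 32 bits. */
static uint32_t
modinv32(uint32_t c)
{
    uint32_t x = c;
    for (int i = 0; i < 5; i++)
        x *= 2 - c * x;
    return x;
}

/* Reversing the multiply: if y = x * c, then x = y * modinv32(c). */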

Bret Mulvey’s hash function article provides a convenient list of some reversible operations available for constructing integer hash functions. This list was the catalyst for my little project. Here are the ones used by the hash prospector:

x  = ~x;
x ^= constant;
x *= constant | 1; // e.g. only odd constants
x += constant;
x ^= x >> constant;
x ^= x << constant;
x += x << constant;
x -= x << constant;
x <<<= constant; // left rotation

I’ve come across a couple more useful operations while studying existing integer hash functions, but I didn’t put these in the prospector.

hash += ~(hash << constant);
hash -= ~(hash << constant);

The prospector picks some operations at random and fills in their constants randomly within their proper constraints. Here’s an awful hash function I had it generate as an example:

// do NOT use this!
uint32_t
badhash32(uint32_t x)
{
    x *= UINT32_C(0x1eca7d79);
    x ^= x >> 20;
    x  = (x << 8) | (x >> 24);
    x  = ~x;
    x ^= x << 5;
    x += UINT32_C(0x10afe4e7);
    return x;
}

That function is reversible, and it would be relatively straightforward to define its inverse. However, it has awful biases and poor avalanche. How do I know this?

The measure of a hash function

There are two key properties I’m looking for in randomly generated hash functions.

  1. High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.

  2. Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.

Initially I screwed up and only measured the first property. This led to some hash functions that seemed amazing until closer inspection, since, for a 32-bit hash function, they were flipping over 15 output bits on average. However, the particular bits being flipped were heavily biased, resulting in obvious patterns in the output.

For example, when hashing a counter starting from zero, the high bits would follow a regular pattern. 15 to 16 bits were being flipped each time, but it was always the same bits.

Conveniently it’s easy to measure both properties at the same time. For an n-bit integer hash function, create an n by n table initialized to zero. The rows are input bits and the columns are output bits. The element at the ith row and jth column tracks the correlation between the ith input bit and the jth output bit.

Then exhaustively iterate over all 2^n inputs, and flip each bit one at a time. Increment the appropriate element in the table if the output bit flips.

When you’re done, ideally each element in the table is exactly 2^(n-1). That is, each output bit was flipped exactly half the time by each input bit. Therefore the bias of the hash function is the distance (the error) of the computed table from the ideal table.

For example, the ideal bias table for an 8-bit hash function would be:

128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128

The hash prospector computes the standard deviation in order to turn this into a single, normalized measurement. Lower scores are better.

However, there’s still one problem: the input space for a 32-bit hash function is over 4 billion values. The full test takes my computer about an hour and a half. Evaluating a 64-bit hash function is right out.

Again, Monte Carlo to the rescue! Rather than sample the entire space, just sample a random subset. This provides a good estimate in less than a second, allowing lots of terrible hash functions to be discarded early. The full test can be saved only for the known good 32-bit candidates. 64-bit functions will only ever receive the estimate.
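As a sketch of how such an estimate might look (not the prospector’s exact code), assume the candidate function candidate() and a uniform 32-bit random source rand32() exist elsewhere:

#include <math.h>
#include <stdint.h>

/* Sketch: sampled bias estimate. For each random input, flip every input
 * bit and count which output bits flip. Ideally each counter ends up at
 * half the sample count; the result is the RMS deviation from that ideal
 * (lower is better). */
double
estimate_bias32(long nsamples)
{
    long long t[32][32] = {0};
    for (long n = 0; n < nsamples; n++) {
        uint32_t x = rand32();
        uint32_t h0 = candidate(x);
        for (int i = 0; i < 32; i++) {
            uint32_t flips = h0 ^ candidate(x ^ (UINT32_C(1) << i));
            for (int j = 0; j < 32; j++)
                t[i][j] += (flips >> j) & 1;
        }
    }
    double mean = 0;
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            double diff = (t[i][j] - nsamples / 2.0) / (nsamples / 2.0);
            mean += diff * diff / (32 * 32);
        }
    }
    return sqrt(mean);
}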

What did I find?

Once I got the bias issue sorted out, and after hours and hours of running, followed up with some manual tweaking on my part, the prospector stumbled across this little gem:

// DO use this one!
uint32_t
prospector32(uint32_t x)
{
    x ^= x >> 15;
    x *= UINT32_C(0x2c1b3c6d);
    x ^= x >> 12;
    x *= UINT32_C(0x297a2d39);
    x ^= x >> 15;
    return x;
}

According to a full (i.e. not estimated) bias evaluation, this function beats the snot out of most of the 32-bit hash functions I could find. It even comes out ahead of this well known hash function that I believe originates from the H2 SQL Database. (Update: Thomas Mueller has confirmed that, indeed, this is his hash function.)

uint32_t
hash32(uint32_t x)
{
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = (x >> 16) ^ x;
    return x;
}

It’s still an excellent hash function, just slightly more biased than mine.

Very briefly, prospector32() was the best 32-bit hash function I could find, and I thought I had a major breakthrough. Then I noticed the finalizer function for the 32-bit variant of MurmurHash3. It’s also a 32-bit hash function:

uint32_t
murmurhash32_mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x85ebca6b);
    x ^= x >> 13;
    x *= UINT32_C(0xc2b2ae35);
    x ^= x >> 16;
    return x;
}

This one is just barely less biased than mine. So I still haven’t discovered the best 32-bit hash function, only the second best one. :-)

A pattern emerges

If you’re paying close enough attention, you may have noticed that all three functions above have the same structure. The prospector had stumbled upon it all on its own without knowledge of the existing functions. It may not be so obvious for the second function, but here it is refactored:

uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x45d9f3b);
    x ^= x >> 16;
    x *= UINT32_C(0x45d9f3b);
    x ^= x >> 16;
    return x;
}

I hadn’t noticed this until after the prospector had come across it on its own. The pattern for all three is XOR-right-shift, multiply, XOR-right-shift, multiply, XOR-right-shift. There’s something particularly useful about this multiply-xorshift construction. The XOR-right-shift diffuses bits rightward and the multiply diffuses bits leftward. I like to think it’s “sloshing” the bits right, left, right, left.

It seems that multiplication is particularly good at diffusion, so it makes perfect sense to exploit it in non-cryptographic hash functions, especially since modern CPUs are so fast at it. Despite this, it’s not used much in cryptography due to issues with completing it in constant time.

I like to think of this construction in terms of a five-tuple. For the three functions it’s the following:

(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()

The prospector actually found lots of decent functions following this pattern, especially where the middle shift is smaller than the outer shift. Thinking of it in terms of this tuple, I specifically directed it to try different tuple constants. That’s what I meant by “tweaking.” Eventually my new function popped out with its really low bias.
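Written out as code, the construction parameterized by that tuple looks like this (a sketch for experimentation, not one of the final functions):

#include <stdint.h>

/* Sketch: the xorshift-multiply construction as a function of the
 * five-tuple (s1, m1, s2, m2, s3). The multipliers must be odd to keep
 * the function reversible. */
static uint32_t
xmxmx32(uint32_t x, int s1, uint32_t m1, int s2, uint32_t m2, int s3)
{
    x ^= x >> s1;
    x *= m1;
    x ^= x >> s2;
    x *= m2;
    x ^= x >> s3;
    return x;
}

/* xmxmx32(x, 15, 0x2c1b3c6d, 12, 0x297a2d39, 15) reproduces prospector32(). */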

The prospector has a template option (-p) if you want to try it yourself:

$ ./prospector -p xorr,mul,xorr,mul,xorr

If you really have your heart set on certain constants, such as my specific selection of shifts, you can lock those in while randomizing the other constants:

$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15

Or the other way around:

$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr

My function may seem a little strange, using shifts of 15 bits rather than a nice, round 16 bits. However, changing those constants to 16 increases the bias. Similarly, neither of the two 32-bit constants is a prime number, but nudging those constants to the nearest prime increases the bias. These parameters really do seem to be a local minimum in the bias, and using prime numbers isn’t important.

What about 64-bit integer hash functions?

So far I haven’t been able to improve on 64-bit hash functions. The main function to beat is SplittableRandom / SplitMix64:

uint64_t
splittable64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

I also came across this one:

uint64_t
hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= UINT64_C(0xd6e8feb86659fd93);
    x ^= x >> 32;
    x *= UINT64_C(0xd6e8feb86659fd93);
    x ^= x >> 32;
    return x;
}

Again, these follow the same construction as before. There really is something special about it, and many other people have noticed, too.

Both functions have about the same bias. (Remember, I can only estimate the bias for 64-bit hash functions.) The prospector has found lots of functions with about the same bias, but nothing provably better. Until it does, I have no new 64-bit integer hash functions to offer.

String hash

I’m also experimenting with using my hash function as a sort of primitive for a string hash function. Here I’m using my function in the loop to mix in one byte at a time, and finishing it with the same finalizer as MurmurHash3.

uint32_t
prospector32s(const void *buf, uint32_t len, uint32_t key)
{
    uint32_t hash = key;
    const unsigned char *p = buf;
    for (uint32_t i = 0; i < len; i++) {
        hash += p[i];
        hash ^= hash >> 15;
        hash *= UINT32_C(0x2c1b3c6d);
        hash ^= hash >> 12;
        hash *= UINT32_C(0x297a2d39);
        hash ^= hash >> 15;
    }
    hash ^= len;
    hash ^= hash >> 16;
    hash *= UINT32_C(0x85ebca6b);
    hash ^= hash >> 13;
    hash *= UINT32_C(0xc2b2ae35);
    hash ^= hash >> 16;
    return hash + key;
}

It has the typical number of collisions when run on a large dictionary, so it seems decent enough, but I don’t know if this hash function is worth much. More experimentation is needed.
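For illustration, here’s what a call looks like, with an arbitrary key:

char buf[] = "hello";
uint32_t hash = prospector32s(buf, sizeof(buf) - 1, UINT32_C(0xdeadbeef));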

Right now the prospector does a completely random, unstructured search hoping to stumble upon something good by chance. Perhaps it would be worth using a genetic algorithm to breed those 5-tuples towards optimum? Others have had success in this area with simulated annealing.

There’s probably more to exploit from the multiply-xorshift construction that keeps popping up. If anything, the prospector is searching too broadly, looking at constructions that could never really compete no matter what the constants. In addition to everything above, I’ve been looking for good 32-bit hash functions that don’t use any 32-bit constants, but I’m really not finding any with a competitively low bias.

Update after one week

About one week after publishing this article I found an even better hash function. I believe this is the least biased 32-bit integer hash function of this form ever devised. It’s even less biased than the MurmurHash3 finalizer.

// exact bias: 0.17353355999581582
uint32_t
lowbias32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x7feb352d);
    x ^= x >> 15;
    x *= UINT32_C(0x846ca68b);
    x ^= x >> 16;
    return x;
}

// inverse
uint32_t
lowbias32_r(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x43021123);
    x ^= x >> 15 ^ x >> 30;
    x *= UINT32_C(0x1d69e2a5);
    x ^= x >> 16;
    return x;
}

If you’re willing to use an additional round of multiply-xorshift, this next function actually reaches the theoretical bias limit (bias = ~0.021) as exhibited by a perfect integer hash function:

// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}

It’s statistically indistinguishable from a random permutation of all 32-bit integers.

The Value of Undefined Behavior

In several places, the C and C++ language specifications use a curious, and fairly controversial, phrase: undefined behavior. For certain program constructs, the specification prescribes no specific behavior, instead allowing anything to happen. Such constructs are considered erroneous, and so the result depends on the particulars of the platform and implementation. The original purpose of undefined behavior was for implementation flexibility. In other words, it’s slack that allows a compiler to produce appropriate and efficient code for its target platform.

Specifying a particular behavior would have put unnecessary burden on implementations — especially in the earlier days of computing — making for inefficient programs on some platforms. For example, if the result of dereferencing a null pointer was defined to trap — to cause the program to halt with an error — then platforms that do not have hardware trapping, such as those without virtual memory, would be required to instrument, in software, each pointer dereference.

In the 21st century, undefined behavior has taken on a somewhat different meaning. Optimizers use it — or abuse it depending on your point of view — to lift constraints that would otherwise inhibit more aggressive optimizations. It’s not so much a fundamentally different application of undefined behavior, but it does take the concept to an extreme.

The reasoning works like this: A program that evaluates a construct whose behavior is undefined cannot, by definition, have any meaningful behavior, and so that program would be useless. As a result, compilers assume programs never invoke undefined behavior and use those assumptions to justify their optimizations.

Under this newer interpretation, mistakes involving undefined behavior are more punishing and surprising than before. Programs that seem to make some sense when run on a particular architecture may actually compile into a binary with a security vulnerability due to conclusions reached from an analysis of its undefined behavior.

This can be frustrating if your programs are intended to run on a very specific platform. In this situation, all behavior really could be locked down and specified in a reasonable, predictable way. Such a language would be like an extended, less portable version of C or C++. But your toolchain still insists on running your program on the abstract machine rather than the hardware you actually care about. However, even in this situation undefined behavior can still be desirable. I will provide a couple of examples in this article.

Signed integer overflow

To start things off, let’s look at one of my all time favorite examples of useful undefined behavior, a situation involving signed integer overflow. The result of a signed integer overflow isn’t just unspecified, it’s undefined behavior. Full stop.

This goes beyond a simple matter of whether or not the underlying machine uses a two’s complement representation. From the perspective of the abstract machine, just the act of a signed integer overflowing is enough to throw everything out the window, even if the overflowed result is never actually used in the program.

On the other hand, unsigned integer overflow is defined — or, more accurately, defined to wrap, not overflow. Both the undefined signed overflow and defined unsigned overflow are useful in different situations.
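Defined wraparound is handy, for instance, when comparing samples of a free-running counter, a minimal sketch:

#include <stdint.h>

/* Defined unsigned wraparound at work: the subtraction is correct even if
 * the counter wrapped between the two samples, so long as fewer than 2^32
 * ticks elapsed. */
uint32_t
ticks_elapsed(uint32_t now, uint32_t then)
{
    return now - then;
}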

For example, here’s a fairly common situation, much like what actually happened in bzip2. Consider these two functions that do substring comparison, differing only in the signedness of their indices:

int
cmp_signed(int i1, int i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}

int
cmp_unsigned(unsigned i1, unsigned i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}

In these functions, the indices i1 and i2 will always be some small, non-negative value. Since it’s non-negative, it should be unsigned, right? Not necessarily. That puts an extra constraint on code generation and, at least on x86-64, makes for a less efficient function. Most of the time you actually don’t want overflow to be defined, and instead allow the compiler to assume it just doesn’t happen.

The constraint is that the behavior of i1 or i2 overflowing as an unsigned integer is defined, and the compiler is obligated to implement that behavior. On x86-64, where int is 32 bits, the result of the operation must be truncated to 32 bits one way or another, requiring extra instructions inside the loop.

In the signed case, incrementing the integers cannot overflow since that would be undefined behavior. This permits the compiler to perform the increment only in 64-bit precision without truncation if it would be more efficient, which, in this case, it is.

Here’s the output of Clang 6.0.0 with -Os on x86-64. Pay close attention to the main loop, which I named .loop:

cmp_signed:
        movsxd rdi, edi             ; use i1 as a 64-bit integer
        mov    al, [rdx + rdi]
        movsxd rsi, esi             ; use i2 as a 64-bit integer
        mov    cl, [rdx + rsi]
        jmp    .check

.loop:  mov    al, [rdx + rdi + 1]
        mov    cl, [rdx + rsi + 1]
        inc    rdx                  ; increment only the base pointer
.check: cmp    al, cl
        je     .loop

        movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx             ; return c1 - c2
        ret

cmp_unsigned:
        mov    eax, edi
        mov    al, [rdx + rax]
        mov    ecx, esi
        mov    cl, [rdx + rcx]
        cmp    al, cl
        jne    .ret
        inc    edi
        inc    esi

.loop:  mov    eax, edi             ; truncated i1 overflow
        mov    al, [rdx + rax]
        mov    ecx, esi             ; truncated i2 overflow
        mov    cl, [rdx + rcx]
        inc    edi                  ; increment i1
        inc    esi                  ; increment i2
        cmp    al, cl
        je     .loop

.ret:   movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx
        ret

As unsigned values, i1 and i2 can overflow independently, so they have to be handled as independent 32-bit unsigned integers. As signed values they can’t overflow, so they’re treated as if they were 64-bit integers and, instead, the pointer, buf, is incremented without concern for overflow. The signed loop is much more efficient (5 instructions versus 8).

The signed integer helps to communicate the narrow contract of the function — the limited range of i1 and i2 — to the compiler. In a variant of C where signed integer overflow is defined (i.e. -fwrapv), this capability is lost. In fact, using -fwrapv deoptimizes the signed version of this function.

Side note: Using size_t (an unsigned integer) is even better on x86-64 for this example since it’s already 64 bits and the function doesn’t need the initial sign/zero extension. However, this might simply move the sign extension out to the caller.

Strict aliasing

Another controversial undefined behavior is strict aliasing. This particular term doesn’t actually appear anywhere in the C specification, but it’s the popular name for C’s aliasing rules. In short, variables with types that aren’t compatible are not allowed to alias through pointers.

Here’s the classic example:

int
foo(int *a, int *b)
{
    *b = 0;    // store
    *a = 1;    // store
    return *b; // load
}

Naively one might assume the return *b could be optimized to a simple return 0. However, since a and b have the same type, the compiler must consider the possibility that they alias — that they point to the same place in memory — and must generate code that works correctly under these conditions.

If foo has a narrow contract that forbids a and b to alias, we have a couple of options for helping our compiler.

First, we could manually resolve the aliasing issue by returning 0 explicitly. In more complicated functions this might mean making local copies of values, working only with those local copies, then storing the results back before returning. Then aliasing would no longer matter.

int
foo(int *a, int *b)
{
    *b = 0;
    *a = 1;
    return 0;
}

Second, C99 introduced a restrict qualifier to communicate to the compiler that pointers passed to functions cannot alias. For example, the pointers to memcpy() are qualified with restrict as of C99. Passing aliasing pointers through restrict parameters is undefined behavior, e.g. this doesn’t ever happen as far as a compiler is concerned.

int foo(int *restrict a, int *restrict b);

The third option is to design an interface that uses incompatible types, exploiting strict aliasing. This happens all the time, usually by accident. For example, int and long are never compatible even when they have the same representation.

int foo(int *a, long *b);

If you use an extended or modified version of C without strict aliasing (-fno-strict-aliasing), then the compiler must assume everything aliases all the time, generating a lot more precautionary loads than necessary.

What irritates a lot of people is that compilers will still apply the strict aliasing rule even when it’s trivial for the compiler to prove that aliasing is occurring:

/* note: forbidden */
long a;
int *b = (int *)&a;

It’s not just a simple matter of making exceptions for these cases. The language specification would need to define all the rules about when and where incompatible types are permitted to alias, and developers would have to understand all these rules if they wanted to take advantage of the exceptions. It can’t just come down to trusting that the compiler is smart enough to see the aliasing when it’s sufficiently simple. It would need to be carefully defined.

Besides, there are probably conforming, portable solutions that, with contemporary compilers, will safely compile to the efficient code you actually want anyway.
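The best-known of these is type punning through memcpy(), which is fully conforming and which contemporary compilers reliably optimize down to a plain load or register move:

#include <stdint.h>
#include <string.h>

/* Conforming type punning: reinterpret the bits of a float as a uint32_t.
 * The memcpy() is typically compiled away entirely. */
uint32_t
float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}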

There is one special exception for strict aliasing: char * is allowed to alias with anything. This is important to keep in mind both when you intentionally want aliasing, but also when you want to avoid it. Writing through a char * pointer could force the compiler to generate additional, unnecessary loads.

In fact, there’s a whole dimension to strict aliasing that, even today, no compiler yet exploits: uint8_t is not necessarily unsigned char. That’s just one possible typedef definition for it. It could instead typedef to, say, some internal __byte type.

In other words, technically speaking, uint8_t does not have the strict aliasing exemption. If you wanted to write bytes to a buffer without worrying the compiler about aliasing issues with other pointers, this would be the tool to accomplish it. Unfortunately, so much existing code violates this part of strict aliasing that no toolchain is willing to exploit it for optimization purposes.

Other undefined behaviors

Some kinds of undefined behavior don’t have performance or portability benefits. They’re only there to make the compiler’s job a little simpler. Today, most of these are caught trivially at compile time as syntax or semantic issues (e.g. a pointer cast to a float).

Some others are obvious about their performance benefits and don’t require much explanation. For example, it’s undefined behavior to index out of bounds (with some special exceptions for one past the end), meaning compilers are not obligated to generate those checks, instead relying on the programmer to arrange, by whatever means, that it doesn’t happen.

Undefined behavior is like nitro, a dangerous, volatile substance that makes things go really, really fast. You could argue that it’s too dangerous to use in practice, but the aggressive use of undefined behavior is not without merit.

Intercepting and Emulating Linux System Calls with Ptrace

The ptrace(2) (“process trace”) system call is usually associated with debugging. It’s the primary mechanism through which native debuggers monitor debuggees on unix-like systems. It’s also the usual approach for implementing strace — system call trace. With Ptrace, tracers can pause tracees, inspect and set registers and memory, monitor system calls, or even intercept system calls.

By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.

The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.

For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. I’ll also be omitting error checks, but the full source code listings will have them.

You can find runnable code for the examples in this article here:

https://github.com/skeeto/ptrace-examples

strace

Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.

Ptrace has never been standardized. Its interface is similar across different operating systems, especially in its core functionality, but it’s still subtly different from system to system. The ptrace(2) prototype generally looks something like this, though the specific types may be different.

long ptrace(int request, pid_t pid, void *addr, void *data);

The pid is the tracee’s process ID. While a tracee can have only one tracer attached at a time, a tracer can be attached to many tracees.

The request field selects a specific Ptrace function, just like the ioctl(2) interface. For strace, only two are needed: PTRACE_TRACEME, which a process uses to declare that it will be traced by its parent, and PTRACE_SYSCALL, which resumes the tracee until it next enters or exits a system call.

The other two fields, addr and data, serve as generic arguments for the selected Ptrace function. One or both are often ignored, in which case I pass zero.

The strace interface is essentially a prefix to another command.

$ strace [strace options] program [arguments]

My minimal strace doesn’t have any options, so the first thing to do — assuming it has at least one argument — is fork(2) and exec(2) the tracee process on the tail of argv. But before loading the target program, the new process will inform the kernel that it’s going to be traced by its parent. The tracee will be paused by this Ptrace system call.

pid_t pid = fork();
switch (pid) {
    case -1: /* error */
        FATAL("%s", strerror(errno));
    case 0:  /* child */
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execvp(argv[1], argv + 1);
        FATAL("%s", strerror(errno));
}

The parent waits for the child’s PTRACE_TRACEME using wait(2). When wait(2) returns, the child will be paused.

waitpid(pid, 0, 0);

Before allowing the child to continue, we tell the operating system that the tracee should be terminated along with its parent. A real strace implementation may want to set other options, such as PTRACE_O_TRACEFORK.

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);

All that’s left is a simple, endless loop that catches system calls one at a time. The body of the loop has four steps:

  1. Wait for the process to enter the next system call.
  2. Print a representation of the system call.
  3. Allow the system call to execute and wait for the return.
  4. Print the system call return value.

The PTRACE_SYSCALL request is used in both waiting for the next system call to begin, and waiting for that system call to exit. As before, a wait(2) is needed to wait for the tracee to enter the desired state.

ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);

When wait(2) returns, the registers for the thread that made the system call are filled with the system call number and its arguments. However, the operating system has not yet serviced this system call. This detail will be important later.

The next step is to gather the system call information. This is where it gets architecture specific. On x86-64, the system call number is passed in rax, and the arguments (up to 6) are passed in rdi, rsi, rdx, r10, r8, and r9. Reading the registers is another Ptrace call, though there’s no need to wait(2) since the tracee isn’t changing state.

struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;

fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
        syscall,
        (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
        (long)regs.r10, (long)regs.r8,  (long)regs.r9);

There’s one caveat. For internal kernel purposes, the system call number is stored in orig_rax rather than rax. All the other system call arguments are straightforward.

Next it’s another PTRACE_SYSCALL and wait(2), then another PTRACE_GETREGS to fetch the result. The result is stored in rax.

ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);

The output from this simple program is very crude. There is no symbolic name for the system call and every argument is printed numerically, even if it’s a pointer to a buffer. A more complete strace would know which arguments are pointers and use process_vm_readv(2) to read those buffers from the tracee in order to print them appropriately.
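For instance, to dump a buffer argument, the tracer could copy it out of the tracee like this (a sketch with error checks omitted; regs.rsi and regs.rdx are assumed to hold a pointer and a length, as they do for write(2)):

#include <sys/uio.h>

/* Sketch: read a tracee buffer with process_vm_readv(2). Assumes regs was
 * just filled by PTRACE_GETREGS. Error checks omitted. */
char buf[1024];
size_t len = regs.rdx < sizeof(buf) ? regs.rdx : sizeof(buf);
struct iovec local  = {buf, len};
struct iovec remote = {(void *)regs.rsi, len};
process_vm_readv(pid, &local, 1, &remote, 1, 0);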

However, this does lay the groundwork for system call interception.

System call interception

Suppose we want to use Ptrace to implement something like OpenBSD’s pledge(2), in which a process pledges to use only a restricted set of system calls. The idea is that many programs typically have an initialization phase where they need lots of system access (opening files, binding sockets, etc.). After initialization they enter a main loop in which they process input and only a small set of system calls are needed.

Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.

Using the same strace model, rather than print out all system calls, we could either block certain system calls or simply terminate the tracee when it misbehaves. Termination is easy: just call exit(2) in the tracer, since it’s configured to also terminate the tracee. Blocking the system call and allowing the child to continue is a little trickier.

The tricky part is that there’s no way to abort a system call once it’s started. When the tracer returns from wait(2) on the entrance to the system call, the only way to stop a system call from happening is to terminate the tracee.

However, not only can we mess with the system call arguments, we can change the system call number itself, converting it to a system call that doesn’t exist. On return we can report a “friendly” EPERM error in errno via the normal in-band signaling.

for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}

This simple example only checks against a whitelist or blacklist of system calls. And there’s no nuance, such as allowing files to be opened (open(2)) read-only but not as writable, allowing anonymous memory maps but not non-anonymous mappings, etc. There’s also no way for the tracee to dynamically drop privileges.

How could the tracee communicate to the tracer? Use an artificial system call!

Creating an artificial system call

For my new pledge-like system call — which I call xpledge() to distinguish it from the real thing — I picked system call number 10000, a nice high number that’s unlikely to ever be used for a real system call.

#define SYS_xpledge 10000

Just for demonstration purposes, I put together a minuscule interface that’s not good for much in practice. It has little in common with OpenBSD’s pledge(2), which uses a string interface. Actually designing robust and secure sets of privileges is really complicated, as the pledge(2) manpage shows. Here’s the entire interface and implementation of the system call for the tracee:

#define _GNU_SOURCE
#include <unistd.h>

#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

#define xpledge(arg) syscall(SYS_xpledge, arg)

If the program passes zero for the argument, only a few basic system calls are allowed, including those used to allocate memory (e.g. brk(2)). The XPLEDGE_RDWR bit allows various read and write system calls (read(2), readv(2), pread(2), preadv(2), etc.). The XPLEDGE_OPEN bit allows open(2).

To prevent privileges from being escalated back, xpledge() blocks itself — though this also prevents dropping more privileges later down the line.
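From the tracee’s point of view it’s just an ordinary function call. For example, to keep only read and write access after initialization:

/* Drop to read/write only; under the tracer this succeeds, and without
 * a tracer it merely fails with ENOSYS. */
xpledge(XPLEDGE_RDWR);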

In the xpledge tracer, I just need to check for this system call:

/* Handle entrance */
switch (regs.orig_rax) {
    case SYS_xpledge:
        register_pledge(regs.rdi);
        break;
}

The operating system will return ENOSYS (Function not implemented) since this isn’t a real system call. So on the way out I overwrite this with a success (0).

/* Handle exit */
switch (regs.orig_rax) {
    case SYS_xpledge:
        ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
        break;
}

I wrote a little test program that opens /dev/urandom, makes a read, tries to pledge, then tries to open /dev/urandom a second time, then confirms it can read from the original /dev/urandom file descriptor. Running without a pledge tracer, the output looks like this:

$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604

Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:

$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4

The pledge succeeds but the second fopen(3) does not since the tracer blocked it with EPERM.

This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.

Foreign system emulation

Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.

You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.

The Linux version of Ptrace has had a more efficient operation for this technique since 2005: PTRACE_SYSEMU. Ptrace stops only once per system call, and it’s up to the tracer to service that system call before allowing the tracee to continue.

for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
        case OS_read:
            /* ... */

        case OS_write:
            /* ... */

        case OS_open:
            /* ... */

        case OS_exit:
            /* ... */

        /* ... and so on ... */
    }
}

To run binaries for the same architecture from any system with a stable (enough) system call ABI, you just need this PTRACE_SYSEMU tracer, a loader (to take the place of exec(2)), and whatever system libraries the binary needs (or only run static binaries).

In fact, this sounds like a fun weekend project.

Minimalist C Libraries

In the past year I’ve written a number of minimalist C libraries, particularly header libraries. The distinction for “minimalist” is, of course, completely arbitrary and subjective. My definition in this context isn’t about the library’s functionality being stupidly trivial or even necessarily simple. I’m talking about interface (API) complexity and the library’s run time requirements. Complex functionality can, in some cases, be tucked behind a simple interface.

In this article I’ll give my definition of a minimalist C API, then take you through some of my own recent examples.

Minimalist properties

A minimalist C library would generally have these properties.

(1) Small number of functions, perhaps even as few as one.

This one’s pretty obvious. More functions means more surface area in the interface. Since these functions typically interact, the relationship between complexity and number of functions will be superlinear.

(2) No dynamic memory allocations.

The library mustn’t call malloc() internally. It’s up to the caller to allocate memory for the library. What’s nice about this is that it’s completely up to the application exactly how memory is allocated. Maybe it’s using a custom allocator, or it’s not linked against the standard library.

A common approach is for the application to provide allocation functions to the library — e.g. function pointers at run time, or define functions with specific, expected names. The library would call these instead of malloc() and free(). While that’s perfectly reasonable, it’s not really minimalist, so I’m not including this technique.

Instead a minimalist API is designed such that it’s natural for the application to make the allocations itself. Perhaps the library only needs a single, fixed allocation for all its operations. Or maybe the application specifies its requirements and the library communicates how much memory is needed to meet those requirements. I’ll give specific examples shortly.

One nice result of this property is that it eliminates one of the common failure conditions: the out of memory error. If the library doesn’t allocate memory, then it can’t run out of it!

Another convenient, minor outcome is the lack of casts from void * to the appropriate type (e.g. on the return from malloc()). These casts are implicit in C but must be made explicit in C++. Often, completely by accident, my minimalist C libraries can be compiled as C++ without any changes. This is only a minor benefit since these casts could be made explicit in C, too, if C++ compatibility was desired. It’s just ugly.

(3) No input or output.

In simple terms, the library mustn’t use functions from stdio.h — with the exception of the sprintf() family. Like with memory allocation, it leaves input and output to the application, letting it decide exactly how, where, and when information comes and goes.

Like with memory allocation, maybe the application prefers not to use the C standard library’s buffered IO. Perhaps the application is using cooperative or green threads, and it would be bad for the library to block internally on IO.

Also like avoiding memory allocation, a library that doesn’t perform IO can’t have IO errors. Combined, this means it’s quite possible that a minimalist library may have no error cases at all. Eliminating those error handling paths makes the library a lot simpler. The one major error condition that’s difficult to eliminate is the pesky integer overflow check.

Communicating IO preferences to libraries can be a real problem with C, since the standard library lacks generic input and output. Putting FILE * pointers directly into an API mingles it with the C standard library in potentially bad ways. Passing file names as strings is an option, but this limits IO to files — versus, say, sockets. On POSIX systems, at least it could talk about IO in terms of file descriptors, but even that’s not entirely flexible — e.g. output to a memory buffer, or anything not sufficiently file-like.

Again, a common way to deal with this is for the application to provide IO function pointers to the library. But a minimalist library’s API would be designed such that not even this is needed, instead operating strictly on buffers. I’ll also have a couple examples of this shortly.

With IO and memory allocation out of the picture, another frequent, accidental result is no dependency on the C standard library. The only significant functionality left in the standard library are the mathematical functions (math.h), float parsing, and a few of the string functions (string.h), like memset() and memmove(). These are valuable since they’re handled specially by the compiler.

(4) Define at most one structure, and perhaps even none.

More types means more complexity, perhaps even more so than having lots of functions. Some minimalist libraries can be so straightforward that they can operate solely on simple, homogeneous buffers. I’ll show some examples of this, too.

As I said initially, minimalism is about interface, not implementation. The library is free to define as many structures internally as it needs since the application won’t be concerned with them.

One common way to avoid complicated types in an API is to make them opaque. The structures aren’t defined in the API, and instead the application only touches pointers, making them like handles.

struct foo;

struct foo *foo_create(...);
int         foo_method(struct foo *, ...);
void        foo_destroy(struct foo *);

However, this is difficult to pull off when the library doesn’t allocate its own memory.

Bitmap library

The first example is a library for creating bitmap (BMP) images. As you may already know, I strongly prefer Netpbm, which is so simple that it doesn’t even need a library. But nothing is quite so universally supported as BMP.

24-bit BMP (Bitmap) ANSI C header library

This library is a perfect example of minimalist properties 2, 3, and 4. It also doesn’t use any of the C standard library, though only by accident.

It’s not a general purpose BMP library. It only supports 24-bit true color, ignoring most BMP features such as palettes. Color is represented as a 24-bit integer, packed 0xRRGGBB.

unsigned long bmp_size(long width, long height);
void          bmp_init(void *, long width, long height);
void          bmp_set(void *, long x, long y, unsigned long color);
unsigned long bmp_get(const void *, long x, long y);

Strictly speaking, even the bmp_get() function could be tossed since the library is not intended to load external bitmap images. The application really shouldn’t need to read back previously set pixels.

There is no allocation, no IO, and no data structures. The application indicates the dimensions of image it wants to create, and the library says how large of a buffer it needs. The remaining functions all operate on this opaque buffer. To write the image out, the application only needs to dump the buffer to a file.

Here’s a complete, strict error checking example of its usage:

#define RED   0xff0000UL
#define BLUE  0x0000ffUL

unsigned long size = bmp_size(width, height);
if (!size || size > SIZE_MAX) die("invalid dimensions");

void *bmp = calloc(size, 1);
if (!bmp) die("out of memory");
bmp_init(bmp, width, height);

/* Checkerboard pattern */
for (long y = 0; y < height; y++)
    for (long x = 0; x < width; x++)
        bmp_set(bmp, x, y, x % 2 == y % 2 ? RED : BLUE);

if (!fwrite(bmp, size, 1, out))
    die("output error");

free(bmp);

The only library function that can fail is bmp_size(). When the given image dimensions would overflow one of the BMP header fields, it returns zero to indicate as such.

In bmp_set(), how does it know the dimensions of the image so that it can find the pixel? It reads that from the buffer just like a BMP reader would — and in an endian-agnostic manner. There are no bounds checks — that’s the caller’s job — so it only needs to read the image’s width in order to find the pixel’s location.
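For example, assuming the conventional BMP layout where the width is a little-endian 32-bit field at byte offset 18, an endian-agnostic read looks something like this (a sketch, not necessarily the library’s exact code):

/* Sketch: recover the image width from the header bytes regardless of
 * host byte order. Assumes the conventional BITMAPINFOHEADER layout with
 * the width at byte offset 18. */
static long
bmp_width(const void *bmp)
{
    const unsigned char *p = bmp;
    return (long)p[18] | (long)p[19] << 8 | (long)p[20] << 16 | (long)p[21] << 24;
}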

Since IO is under control of the application, it can always choose to load the original buffer contents back from a file, allowing a minimal sort of BMP loading. However, this only works for trusted input as there are no validation checks on the buffer.

32-bit integer hash set library

The second example is an integer hash set library. It uses closed hashing. I initially wrote this for an r/dailyprogrammer solution and then formalized it into a little reusable library.

C99 32-bit integer hash set header library

Here’s the entire API:

int  set32_z(uint32_t max);
void set32_insert(uint32_t *table, int z, uint32_t v);
void set32_remove(uint32_t *table, int z, uint32_t v);
int  set32_contains(uint32_t *table, int z, uint32_t v);

Again, it’s a good example of properties 2, 3, and 4. Like the BMP library, the application indicates the maximum number of integers it will store in the hash set, and the library returns the power of two number of uint32_t it needs to allocate (and zero-initialize).

In this API I’m just barely skirting not defining a data structure. The caller must pass both the table pointer and the power of two size, and these two values would normally be bundled together into a structure.

int z = set32_z(max);
unsigned long long n = 1ULL << z;
if (n > SIZE_MAX) die("table too large");
uint32_t *table = calloc(sizeof(*table), n);
if (!table) die("out of memory");

set32_insert(table, z, value);

if (set32_contains(table, z, value))
    /* ... */;

set32_remove(table, z, value);

free(table);

Iteration is straightforward, which is why it’s not in the API: visit each element in the allocated buffer. Zeroes are empty slots.

If a different maximum number of elements is needed, the application initializes a new, separate table, then iterates over the old table inserting each integer in turn.
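A sketch of that growth operation, using nothing but the functions above (allocation checks omitted; new_max is the new maximum element count):

/* Sketch: grow the set by building a new table and re-inserting every
 * non-zero slot from the old one. Zero marks an empty slot. */
int newz = set32_z(new_max);
uint32_t *newtable = calloc((size_t)1 << newz, sizeof(*newtable));
for (size_t i = 0; i < (size_t)1 << z; i++)
    if (table[i])
        set32_insert(newtable, newz, table[i]);
free(table);
table = newtable;
z = newz;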

Perhaps the most interesting part of the API is that it has no errors. No function can fail.

Also, like the BMP library, it accidentally doesn’t use the standard library, except for a typedef from stdint.h.

Fantasy name generator

Nearly a decade ago I cloned in Perl the RinkWorks Fantasy Name Generator. This version was slow and terrible, and I’m sometimes tempted to just delete it.

A few years later I rewrote it in JavaScript using an entirely different approach. In order to improve performance, it has a template compilation step. The compiled template is a hierarchical composition of simple generator objects. It’s much faster, and easily enabled some extensions to the syntax.

Germán Méndez Bravo ported the JavaScript version to C++. This C++ implementation was recently adopted into IVAN, a roguelike game.

This recent commotion made me realize something: I hadn’t yet implemented it in C! So I did.

Fantasy name generator ANSI C header library

The entire API is just a single function with four possible return values. It’s a perfect example of minimalist property 1.

#define NAMEGEN_SUCCESS    0
#define NAMEGEN_TRUNCATED  1  /* Output was truncated */
#define NAMEGEN_INVALID    2  /* Pattern is invalid */
#define NAMEGEN_TOO_DEEP   3  /* Exceeds maximum nesting depth */

int namegen(char *dest,
            size_t len,
            const char *pattern,
            unsigned long *seed);

There’s no template compilation step, and it generates names straight from the template.

There are three kinds of errors.

  1. If the output buffer wasn’t large enough, it warns about the name being truncated.

  2. The template could be invalid — e.g. incorrectly paired brackets.

  3. The template could have too much nesting. I decided to hard code the maximum nesting depth to a generous 32 levels. This limitation makes the generator a lot simpler without any practical impact. It also protects against unbounded memory usage — particularly stack overflows — by arbitrarily complex patterns. This means it’s perfectly safe to generate names from untrusted, arbitrarily long input patterns.

Here’s a usage example:

char name[64];
unsigned long seed = 0xb9584b61UL;
namegen(name, sizeof(name), "!sV'i (the |)!id", &seed);
/* name = "Engia'pin the Doltolph" */

The generator supports UTF-8, almost by accident. (I’d have to go out of my way not to support it.)

Despite the lack of a compilation step, which means the template is parsed anew for each generated name, it’s an order of magnitude faster than the C++ version, which caught me by surprise. The high performance is due to name generation being a single pass over the template using reservoir sampling.

Internally it maintains a stack of “reset” pointers, each pointing into the output buffer where the current nesting level began its output. Each time it hits an alternation (|), it generates a random number and decides whether or not to use the new option. The first time it’s a 1/2 chance it chooses the new option. The second time, a 1/3 chance. The third time a 1/4 chance, and so on. When the new option is selected, the reset pointer is used to “undo” any previous output for the current nesting level.
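
To make the selection rule concrete, here’s a standalone sketch of reservoir sampling over a fixed set of options. It uses rand() purely for brevity; as noted below, the library uses its own PRNG, and rather than picking from an array it rewinds the output buffer to the reset pointer:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *options[] = {"foo", "bar", "baz", "quux"};
    int noptions = sizeof(options) / sizeof(*options);

    const char *pick = options[0];    /* the first option starts as the choice */
    for (int n = 2; n <= noptions; n++)
        if (rand() % n == 0)          /* 1/n chance: switch to the nth option */
            pick = options[n - 1];    /* the generator would rewind the output here */
    puts(pick);                       /* each option ends up chosen with probability 1/4 */
    return 0;
}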

The reservoir sampling means it needs to generate more random numbers (once per option) than the JavaScript and C++ versions (once per nesting level). However, it uses its own, fast internal PRNG rather than rand(). Generating these random numbers is basically free.

Not using rand() means that, like the previous libraries, it doesn’t need anything from the standard library. It also produces better results, since the typical standard library rand() is total rubbish in terms of both speed and quality (and typically incurs a PLT penalty). Finally it means the results are identical across all platforms for the same template and seed, which is one reason the seed is part of the API.

Another slight performance boost comes from the representation of pattern substitutions, e.g. the i substitution selects a random “idiot” name from a fixed selection of strings. The obvious representation is an array of string pointers, as seen in the C++ version. However, there are a lot of these little strings, which makes for a lot of pointers cluttering up the relocation table. Instead, I packed it all into a few small pointerless tables, which on x86-64 are accessed efficiently via RIP-relative addressing. It’s efficient, though not friendly to modification.

I’m very happy with how this library turned out.

UTF-7 encoder and decoder

The last example is a UTF-7 encoder and decoder. UTF-7 is a method for encoding arbitrary Unicode text within ASCII text, created as a nasty hack to allow Unicode messages to be sent over ASCII-limited email infrastructure. The gist of it is that the Unicode parts of a message are encoded as UTF-16, then base64 encoded, then interpolated into the ASCII stream between delimiters.

Einstein (allegedly) said “If you can’t explain it to a six year old, you don’t understand it yourself.” The analog for programming is to replace the six year old with a computer, and explaining an idea to a computer is done by writing a program. I wanted to understand UTF-7, so I implemented it.

A UTF-7 stream encoder and decoder in ANSI C

Here’s the entire API. It’s modeled a little after the zlib API.

/* utf7_encode() special code points */
#define UTF7_FLUSH       -1L

/* return codes */
#define UTF7_OK          -1
#define UTF7_FULL        -2
#define UTF7_INCOMPLETE  -3
#define UTF7_INVALID     -4

struct utf7 {
    char *buf;
    size_t len;
    /* then some "private" internal fields */
};

void utf7_init(struct utf7 *, const char *indirect);
int  utf7_encode(struct utf7 *, long codepoint);
long utf7_decode(struct utf7 *);

Finally a library that defines a structure! The other fields (not shown) hold important state information, but the application is only concerned with buf and len: an input or output buffer. The same structure is used for encoding and decoding, though only for one task at a time.

Following the minimalist library principle, there is no memory allocation. When encoding a UTF-7 stream, the application’s job is to point buf to an output buffer, indicating its length with len. Then it feeds code points one at a time into the encoder. When the output is full, it returns UTF7_FULL. The application must provide a new buffer and try again.

This example usage is more complicated than I anticipated it would be. Properly pumping code points through the encoder requires a loop (or at least a second attempt).

char buffer[1024];
struct utf7 ctx;

utf7_init(&ctx, 0);
ctx.buf = buffer;
ctx.len = sizeof(buffer);

/* Assumes "wide character" input is Unicode */
for (;;) {
    wint_t c = fgetwc(stdin);
    if (c == WEOF)
        break;

    while (utf7_encode(&ctx, c) != UTF7_OK) {
        /* Flush output and reset buffer */
        fwrite(buffer, sizeof(buffer), 1, stdout);
        ctx.buf = buffer;
        ctx.len = sizeof(buffer);
    }
}

/* Flush all pending output */
while (utf7_encode(&ctx, UTF7_FLUSH) != UTF7_OK) {
    fwrite(buffer, sizeof(buffer), 1, stdout);
    ctx.buf = buffer;
    ctx.len = sizeof(buffer);
}

/* Write remaining output */
fwrite(buffer, sizeof(buffer) - ctx.len, 1, stdout);

/* Check for errors */
if (fflush(stdout))
    die("output error");
if (ferror(stdin))
    die("input error");

Flushing (UTF7_FLUSH) is necessary since, due to base64 encoding, adjacent Unicode characters usually share a base64 character. Just because a code point was absorbed into the encoder doesn’t mean it was written into the output buffer. The encoding for that character may depend on the next character to come. The special “flush” input forces this out. It’s valid to flush in the middle of a stream, though this may penalize encoding efficiency (e.g. the output may be larger than necessary).

It’s not possible for the encoder to fail, so there are no error conditions to worry about from the library.

Decoding is a different matter. It works almost in reverse from the encoder: buf points to the input and the decoder is pumped to return one code point at a time. It returns one of:

  1. A non-negative value: a valid code point (including ASCII).

  2. UTF7_OK: Input was exhausted. Stopping here would be valid. This is what you should get when there’s no more input.

  3. UTF7_INVALID: The input was invalid. buf points at the invalid byte.

  4. UTF7_INCOMPLETE: Input was exhausted, but more is expected. If there is no more input, then the input must have been truncated, which is an error.

So there are two possible errors for two kinds of invalid input. Parsing errors are unavoidable when parsing input.
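
Here’s a decoding sketch built on the same structure. It assumes input and inputlen hold the raw UTF-7 bytes, and consume() and die() are hypothetical application functions:

struct utf7 ctx;
utf7_init(&ctx, 0);
ctx.buf = input;
ctx.len = inputlen;

for (;;) {
    long c = utf7_decode(&ctx);
    if (c >= 0) {
        consume(c);                /* hypothetical: handle one code point */
    } else if (c == UTF7_OK) {
        break;                     /* clean end of input */
    } else if (c == UTF7_INCOMPLETE) {
        die("truncated input");    /* no more input is coming */
    } else {                       /* UTF7_INVALID */
        die("invalid input");      /* ctx.buf points at the bad byte */
    }
}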

Again, this library accidentally doesn’t require the standard library. It doesn’t even depend on the compiler’s locale being compatible with ASCII since none of its internal tables use string or character literals. It behaves exactly the same across all conforming platforms.

More examples

I had a few more examples in mind, but this article has gone on long enough.

Instead I’ll save these for other articles!

Emacs 26 Brings Generators and Threads

Emacs 26.1 was recently released. As you would expect from a major release, it comes with lots of new goodies. Being a bit of an Emacs Lisp enthusiast, the two most interesting new features are generators (iter) and native threads (thread).

Correction: Generators were actually introduced in Emacs 25.1 (Sept. 2016), not Emacs 26.1. Doh!

Generators

Generators are one of those cool language features that provide a lot of power at a small implementation cost. They’re like a constrained form of coroutines, but, unlike coroutines, they’re typically built entirely on top of first-class functions (e.g. closures). This means no additional run-time support is needed in order to add generators to a language. The only complication is the changes to the compiler. Generators are not compiled the same way as normal functions, despite looking so similar.

What’s perhaps coolest of all about lisp-family generators, including Emacs Lisp, is that the compiler component can be implemented entirely with macros. The compiler need not be modified at all, making generators no more than a library, and not actually part of the language. That’s exactly how they’ve been implemented in Emacs Lisp (emacs-lisp/generator.el).

So what’s a generator? It’s a function that returns an iterator object. When an iterator object is invoked (e.g. iter-next) it evaluates the body of the generator. Each iterator is independent. What makes them unusual (and useful) is that the evaluation is paused in the middle of the body to return a value, saving all the internal state in the iterator. Normally pausing in the middle of functions isn’t possible, which is what requires the special compiler support.

Emacs Lisp generators appear to be most closely modeled after Python generators, though they also share some similarities with JavaScript generators. What makes them most like Python is the use of signals for flow control, something I’m not personally enthused about (though see also). When a Python generator completes, it throws a StopIteration exception. In Emacs Lisp, it’s an iter-end-of-sequence signal. A signal is out-of-band and avoids the issue of relying on some special in-band value to communicate the end of iteration.

In contrast, JavaScript’s solution is to return a “rich” object wrapping the actual yield value. This object has a done field that communicates whether iteration has completed. This avoids the use of exceptions for flow control, but the caller has to unpack the rich object.

Fortunately the flow control issue isn’t normally exposed to Emacs Lisp code. Most of the time you’ll use the iter-do macro or (my preference) the new cl-loop keyword iter-by.

To illustrate how a generator works, here’s a really simple iterator that iterates over a list:

(iter-defun walk (list)
  (while list
    (iter-yield (pop list))))

Here’s how it might be used:

(setf i (walk '(:a :b :c)))

(iter-next i)  ; => :a
(iter-next i)  ; => :b
(iter-next i)  ; => :c
(iter-next i)  ; error: iter-end-of-sequence
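
Using the convenience macros mentioned earlier, neither the signal nor the raw iterator protocol needs to be handled by hand. A quick sketch (both iter-do and the iter-by cl-loop keyword ship with Emacs):

(iter-do (x (walk '(:a :b :c)))
  (print x))

(cl-loop for x iter-by (walk '(:a :b :c))
         collect x)  ; => (:a :b :c)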

The iterator object itself is opaque and you shouldn’t rely on any part of its structure. That being said, I’m a firm believer that we should understand how things work under the hood so that we can make the most effective use of them. No program should rely on the particulars of the iterator object internals for correctness, but a well-written program should employ them in a way that best exploits their expected implementation.

Currently iterator objects are closures, and iter-next invokes the closure with its own internal protocol. It asks the closure to return the next value (:next operation), and iter-close asks it to clean itself up (:close operation).

Since they’re just closures, another really cool thing about Emacs Lisp generators is that iterator objects are generally readable. That is, you can serialize them out with print and bring them back to life with read, even in another instance of Emacs. They exist independently of the original generator function. This will not work if one of the values captured in the iterator object is not readable (e.g. buffers).

How does pausing work? Well, one of the other exciting new features of Emacs 26 is the introduction of a jump table opcode, switch. I’d lamented in the past that large cond and cl-case expressions could be a lot more efficient if Emacs’ byte code supported jump tables. It turns an O(n) sequence of comparisons into an O(1) lookup and jump. It’s essentially the perfect foundation for a generator since it can be used to jump straight back to the position where evaluation was paused.

Buuut, generators do not currently use jump tables. The generator library predates the new switch opcode, and, being independent of it, its author, Daniel Colascione, went with the best option at the time. Chunks of code between yields are packaged as individual closures. These closures are linked together a bit like nodes in a graph, creating a sort of state machine. To get the next value, the iterator object invokes the closure representing the next state.

I’ve manually macro expanded the walk generator above into a form that roughly resembles the expansion of iter-defun:

(defun walk (list)
  (let (state)
    (cl-flet* ((state-2 ()
                 (signal 'iter-end-of-sequence nil))
               (state-1 ()
                 (prog1 (pop list)
                   (when (null list)
                     (setf state #'state-2))))
               (state-0 ()
                 (if (null list)
                     (state-2)
                   (setf state #'state-1)
                   (state-1))))
      (setf state #'state-0)
      (lambda ()
        (funcall state)))))

This omits the protocol I mentioned, and it doesn’t have yield results (values passed to the iterator). The actual expansion is a whole lot messier and less optimal than this, but hopefully my hand-rolled generator is illustrative enough. Without the protocol, this iterator is stepped using funcall rather than iter-next.

The state variable keeps track of where in the body of the generator this iterator is currently “paused.” Continuing the iterator is therefore just a matter of invoking the closure that represents this state. Each state closure may update state to point to a new part of the generator body. The terminal state is obviously state-2. Notice how state transitions occur around branches.

I had said generators can be implemented as a library in Emacs Lisp. Unfortunately there’s a hole in this: unwind-protect. It’s not valid to yield inside an unwind-protect form. Unlike, say, a throw-catch, there’s no mechanism to trap an unwinding stack so that it can be restarted later. The state closure needs to return and fall through the unwind-protect.

A jump table version of the generator might look like the following. I’ve used cl-labels since it allows for recursion.

(defun walk (list)
  (let ((state 0))
    (cl-labels
        ((closure ()
           (cl-case state
             (0 (if (null list)
                    (setf state 2)
                  (setf state 1))
                (closure))
             (1 (prog1 (pop list)
                  (when (null list)
                    (setf state 2))))
             (2 (signal 'iter-end-of-sequence nil)))))
      #'closure)))

When byte compiled on Emacs 26, that cl-case is turned into a jump table. This “switch” form is closer to how generators are implemented in other languages.

Iterator objects can share state between themselves if they close over a common environment (or, of course, use the same global variables).

(setf foo
      (let ((list '(:a :b :c)))
        (list
         (funcall
          (iter-lambda ()
            (while list
              (iter-yield (pop list)))))
         (funcall
          (iter-lambda ()
            (while list
              (iter-yield (pop list))))))))

(iter-next (nth 0 foo))  ; => :a
(iter-next (nth 1 foo))  ; => :b
(iter-next (nth 0 foo))  ; => :c

For years there has been a very crude way to “pause” a function and allow other functions to run: accept-process-output. It only works in the context of processes, but five years ago this was sufficient for me to build primitives on top of it. Unlike this old process function, generators do not block threads, including the user interface, which is really important.

Threads

Emacs 26 also brings us threads, which have been added in a decidedly bolted-on fashion. It’s not much more than a subset of pthreads: shared memory threads, recursive mutexes, and condition variables. The interfaces look just like they do in pthreads, and not much has been done to integrate them more naturally into the Emacs Lisp ecosystem.
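
Here’s a tiny sketch of those primitives in action (my own illustration, not from the manual). It assumes lexical binding so the worker’s closure shares the let-bound variables; the mutex isn’t strictly necessary with cooperative threads and is only there to show the primitive:

;;; -*- lexical-binding: t; -*-

(let* ((lock (make-mutex "demo"))
       (result nil)
       (worker (make-thread
                (lambda ()
                  (with-mutex lock
                    (setq result (* 6 7))))
                "demo-worker")))
  (thread-join worker)  ; cooperatively block until the worker finishes
  result)               ; => 42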

This is also only the first step in bringing threading to Emacs Lisp. Right now there’s effectively a global interpreter lock (GIL), and threads only run one at a time cooperatively. Like with generators, the Python influence is obvious. In theory, sometime in the future this interpreter lock will be removed, making way for actual concurrency.

This is, again, where I think it’s useful to contrast with JavaScript, which was also initially designed to be single-threaded. Low-level threading primitives weren’t exposed — though mostly because JavaScript typically runs sandboxed and there’s no safe way to expose those primitives. Instead it got a web worker API that exposes concurrency at a much higher level, along with an efficient interface for thread coordination.

For Emacs Lisp, I’d prefer something safer, more like the JavaScript approach. Low-level pthreads are now a great way to wreck Emacs with deadlocks (with no C-g escape). Playing around with the new threading API for just a few days, I’ve already had to restart Emacs a bunch of times. Bugs in Emacs Lisp are normally a lot more forgiving.

One important detail that has been designed well is that dynamic bindings are thread-local. This is really essential for correct behavior. This is also an easy way to create thread-local storage (TLS): dynamically bind variables in the thread’s entrance function.

;;; -*- lexical-binding: t; -*-

(defvar foo-counter-tls)
(defvar foo-path-tls)

(defun foo-make-thread (path)
  (make-thread
   (lambda ()
     (let ((foo-counter-tls 0)
           (foo-path-tls path))
       ...))))

However, cl-letf “bindings” are not thread-local, which makes this otherwise incredibly useful macro quite dangerous in the presence of threads. This is one way that the new threading API feels bolted on.

Building generators on threads

In my stack clashing article I showed a few different ways to add coroutine support to C. One method spawned per-coroutine threads, and coordinated using semaphores. With the new threads API in Emacs, it’s possible to do exactly the same thing.

Since generators are just a limited form of coroutines, this means threads offer another, very different way to implement them. The threads API doesn’t provide semaphores, but condition variables can fill in for them. To “pause” in the middle of the generator, just wait on a condition variable.

So, naturally, I just had to see if I could make it work. I call it a “thread iterator” or “thriter.” The API is very similar to iter:

https://github.com/skeeto/thriter

This is merely a proof of concept so don’t actually use this library for anything. These thread-based generators are about 5x slower than iter generators, and they’re a lot more heavy-weight, needing an entire thread per iterator object. This makes thriter-close all the more important. On the other hand, these generators have no problem yielding inside unwind-protect.

Originally this article was going to dive into the details of how these thread-iterators worked, but thriter turned out to be quite a bit more complicated than I anticipated, especially as I worked towards feature matching iter.

The gist of it is that each side of a next/yield transaction gets its own condition variable, but both share a common mutex. Values are passed between the threads using slots on the iterator object. The side that isn’t currently running waits on a condition variable until the other side frees it, after which the releaser waits on its own condition variable for the result. This is similar to asynchronous requests in Emacs dynamic modules.

Rather than use signals to indicate completion, I modeled it after JavaScript generators. Iterators return a cons cell. The car indicates continuation and the cdr holds the yield result. To terminate an iterator early (thriter-close or garbage collection), thread-signal is used to essentially “cancel” the thread and knock it off the condition variable.

Since threads aren’t (and shouldn’t be) garbage collected, failing to run a thread-iterator to completion would normally cause a memory leak, as the thread sits there forever waiting on a “next” that will never come. To deal with this, a finalizer is attached to the iterator object in such a way that it’s not visible to the thread. A lost iterator is eventually cleaned up by the garbage collector, but, as usual with finalizers, this is only a last resort.

The future of threads

This thread-iterator project was my initial, little experiment with Emacs Lisp threads, similar to why I connected a joystick to Emacs using a dynamic module. While I don’t expect the current thread API to go away, it’s not really suitable for general use in its raw form. Bugs in Emacs Lisp programs should virtually never bring down Emacs and require a restart. Outside of threads, the few situations that break this rule are very easy to avoid (and very obvious that something dangerous is happening). Dynamic modules are dangerous by necessity, but concurrency doesn’t have to be.

There really needs to be a safe, high-level API with clean thread isolation. Perhaps this higher-level API will eventually build on top of the low-level threading API.

When FFI Function Calls Beat Native C

Update: There’s a good discussion on Hacker News.

Over on GitHub, David Yu has an interesting performance benchmark for function calls of various Foreign Function Interfaces (FFI):

https://github.com/dyu/ffi-overhead

He created a shared object (.so) file containing a single, simple C function. Then for each FFI he wrote a bit of code to call this function many times, measuring how long it took.

For the C “FFI” he used standard dynamic linking, not dlopen(). This distinction is important, since it really makes a difference in the benchmark. There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure.

The most surprising result of the benchmark is that LuaJIT’s FFI is substantially faster than C. It’s about 25% faster than a native C function call to a shared object function. How could a weakly and dynamically typed scripting language come out ahead on a benchmark? Is this accurate?

It’s actually quite reasonable. The benchmark was run on Linux, so the performance penalty we’re seeing comes from the Procedure Linkage Table (PLT). I’ve put together a really simple experiment to demonstrate the same effect in plain old C:

https://github.com/skeeto/dynamic-function-benchmark

Here are the results on an Intel i7-6700 (Skylake):

plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call

These are three different types of function calls:

  1. Through the PLT
  2. An indirect function call (via dlsym(3))
  3. A direct function call (via a JIT-compiled function)

As shown, the last one is the fastest. It’s typically not an option for C programs, but it’s natural in the presence of a JIT compiler, including, apparently, LuaJIT.

In my benchmark, the function being called is named empty():

void empty(void) { }

And to compile it into a shared object:

$ cc -shared -fPIC -Os -o empty.so empty.c

Just as in my PRNG shootout, the benchmark calls this function repeatedly as many times as possible before an alarm goes off.

Procedure Linkage Tables

When a program or library calls a function in another shared object, the compiler cannot know where that function will be located in memory. That information isn’t known until run time, after the program and its dependencies are loaded into memory. These are usually at randomized locations — e.g. Address Space Layout Randomization (ASLR).

How is this resolved? Well, there are a couple of options.

One option is to make a note about each such call in the binary’s metadata. The run-time dynamic linker can then patch in the correct address at each call site. How exactly this would work depends on the particular code model used when compiling the binary.

The downside to this approach is slower loading, larger binaries, and less sharing of code pages between different processes. It’s slower loading because every dynamic call site needs to be patched before the program can begin execution. The binary is larger because each of these call sites needs an entry in the relocation table. And the lack of sharing is due to the code pages being modified.

On the other hand, the overhead for dynamic function calls would be eliminated, giving JIT-like performance as seen in the benchmark.

The second option is to route all dynamic calls through a table. The original call site calls into a stub in this table, which jumps to the actual dynamic function. With this approach the code does not need to be patched, meaning it’s trivially shared between processes. Only one place needs to be patched per dynamic function: the entries in the table. Even more, these patches can be performed lazily, on the first function call, making the load time even faster.

On systems using ELF binaries, this table is called the Procedure Linkage Table (PLT). The PLT itself doesn’t actually get patched — it’s mapped read-only along with the rest of the code. Instead the Global Offset Table (GOT) gets patched. The PLT stub fetches the dynamic function address from the GOT and indirectly jumps to that address. To lazily load function addresses, these GOT entries are initialized with an address of a function that locates the target symbol, updates the GOT with that address, and then jumps to that function. Subsequent calls use the lazily discovered address.

The downside of a PLT is extra overhead per dynamic function call, which is what shows up in the benchmark. Since the benchmark only measures function calls, this appears to be pretty significant, but in practice it’s usually drowned out in noise.

Here’s the benchmark:

/* Cleared by an alarm signal. */
volatile sig_atomic_t running;

static long
plt_benchmark(void)
{
    long count;
    for (count = 0; running; count++)
        empty();
    return count;
}

Since empty() is in the shared object, that call goes through the PLT.
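
The harness around it looks roughly like this. It’s a sketch of the idea rather than the exact code from the repository, with running and plt_benchmark() as defined above:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void
alarm_handler(int signum)
{
    (void)signum;
    running = 0;                /* ends the benchmark loop */
}

int
main(void)
{
    signal(SIGALRM, alarm_handler);
    running = 1;
    alarm(1);                   /* one-second measurement window */
    long n = plt_benchmark();
    printf("plt: %f ns/call\n", 1e9 / n);
    return 0;
}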

Indirect dynamic calls

Another way to dynamically call functions is to bypass the PLT and fetch the target function address within the program, e.g. via dlsym(3).

void *h = dlopen("path/to/lib.so", RTLD_NOW);
void (*f)(void) = dlsym(h, "f");
f();

Once the function address is obtained, the overhead is smaller than function calls routed through the PLT. There’s no intermediate stub function and no GOT access. (Caveat: If the program has a PLT entry for the given function then dlsym(3) may actually return the address of the PLT stub.)

However, this is still an indirect function call. On conventional architectures, direct function calls have an immediate relative address. That is, the target of the call is some hard-coded offset from the call site. The CPU can see well ahead of time where the call is going.

An indirect function call has more overhead. First, the address has to be stored somewhere. Even if that somewhere is just a register, it increases register pressure by using up a register. Second, it provokes the CPU’s branch predictor since the call target isn’t static, making for extra bookkeeping in the CPU. In the worst case the function call may even cause a pipeline stall.

Here’s the benchmark:

volatile sig_atomic_t running;

static long
indirect_benchmark(void (*f)(void))
{
    long count;
    for (count = 0; running; count++)
        f();
    return count;
}

The function passed to this benchmark is fetched with dlsym(3) so the compiler can’t do something tricky like convert that indirect call back into a direct call.

If the body of the loop was complicated enough that there was register pressure, thereby requiring the address to be spilled onto the stack, this benchmark might not fare as well against the PLT benchmark.

Direct function calls

The first two types of dynamic function calls are simple and easy to use. Direct calls to dynamic functions is trickier business since it requires modifying code at run time. In my benchmark I put together a little JIT compiler to generate the direct call.

There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB range due to a signed 32-bit immediate. This means the JIT code has to be placed virtually nearby the target function, empty(). If the JIT code needed to call two different dynamic functions separated by more than 2GB, then it’s not possible for both to be direct.

To keep things simple, my benchmark isn’t precise or very careful about picking the JIT code address. After being given the target function address, it blindly subtracts 4MB, rounds down to the nearest page, allocates some memory, and writes code into it. To do this correctly would mean inspecting the program’s own memory mappings to find space, and there’s no clean, portable way to do this. On Linux this requires parsing virtual files under /proc.

Here’s what my JIT’s memory allocation looks like. It assumes reasonable behavior for uintptr_t casts:

static void
jit_compile(struct jit_func *f, void (*empty)(void))
{
    uintptr_t addr = (uintptr_t)empty;
    void *desired = (void *)((addr - SAFETY_MARGIN) & PAGEMASK);
    /* ... */
    unsigned char *p = mmap(desired, len, prot, flags, fd, 0);
    /* ... */
}

It allocates two pages, one writable and the other containing non-writable code. Similar to my closure library, the lower page is writable and holds the running variable that gets cleared by the alarm. It needs to be near the JIT code so that it can be accessed with an efficient RIP-relative address, just as in the other two benchmark functions. The upper page contains this assembly:

jit_benchmark:
        push  rbx
        xor   ebx, ebx
.loop:  mov   eax, [rel running]
        test  eax, eax
        je    .done
        call  empty
        inc   ebx
        jmp   .loop
.done:  mov   eax, ebx
        pop   rbx
        ret

The call empty is the only instruction that is dynamically generated — necessary to fill out the relative address appropriately (the minus 5 is because it’s relative to the end of the instruction):

    // call empty
    uintptr_t rel = (uintptr_t)empty - (uintptr_t)p - 5;
    *p++ = 0xe8;
    *p++ = rel >>  0;
    *p++ = rel >>  8;
    *p++ = rel >> 16;
    *p++ = rel >> 24;

If empty() wasn’t in a shared object and instead located in the same binary, this is essentially the direct call that the compiler would have generated for plt_benchmark(), assuming somehow it didn’t inline empty().

Ironically, calling the JIT-compiled code requires an indirect call (e.g. via a function pointer), and there’s no way around this. What are you going to do, JIT compile another function that makes the direct call? Fortunately this doesn’t matter since the part being measured in the loop is only a direct call.

It’s no mystery

Given these results, it’s really no mystery that LuaJIT can generate more efficient dynamic function calls than a PLT, even if they still end up being indirect calls. In my benchmark, the non-PLT indirect calls were 28% faster than the PLT, and the direct calls 43% faster than the PLT. That’s a small edge that JIT-enabled programs have over plain old native programs, though it comes at the cost of absolutely no code sharing between processes.

When the Compiler Bites

Update: There are discussions on Reddit and on Hacker News.

So far this year I’ve been bitten three times by compiler edge cases in GCC and Clang, each time catching me totally by surprise. Two were caused by historical artifacts, where an ambiguous specification led to diverging implementations. The third was a compiler optimization being far more clever than I expected, behaving almost like an artificial intelligence.

In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.

x86-64 ABI ambiguity

The first time I was bit — or, well, narrowly avoided being bit — was when I examined a missed floating point optimization in both Clang and GCC. Consider this function:

double
zero_multiply(double x)
{
    return x * 0.0;
}

The function multiplies its argument by zero and returns the result. Any number multiplied by zero is zero, so this should always return zero, right? Unfortunately, no. IEEE 754 floating point arithmetic supports NaN, infinities, and signed zeros. This function can return NaN, positive zero, or negative zero. (In some cases, the operation could also potentially produce a hardware exception.)
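
A quick sketch of those cases, assuming it’s linked against the zero_multiply() above:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    printf("%f\n", zero_multiply(NAN));       /* nan */
    printf("%f\n", zero_multiply(INFINITY));  /* nan: 0.0 * infinity is NaN */
    printf("%f\n", zero_multiply(-1.0));      /* -0.000000: negative zero */
    return 0;
}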

As a result, both GCC and Clang perform the multiply:

zero_multiply:
    xorpd  xmm1, xmm1
    mulsd  xmm0, xmm1
    ret

The -ffast-math option relaxes the C standard floating point rules, permitting an optimization at the cost of conformance and consistency:

zero_multiply:
    xorps  xmm0, xmm0
    ret

Side note: -ffast-math doesn’t necessarily mean “less precise.” Sometimes it will actually improve precision.

Here’s a modified version of the function that’s a little more interesting. I’ve changed the argument to a short:

double
zero_multiply_short(short x)
{
    return x * 0.0;
}

It’s no longer possible for the argument to be one of those special values. The short will be promoted to one of 65,536 possible double values, each of which results in 0.0 when multiplied by 0.0. GCC misses this optimization (-Os):

zero_multiply_short:
    movsx     edi, di       ; sign-extend 16-bit argument
    xorps     xmm1, xmm1    ; xmm1 = 0.0
    cvtsi2sd  xmm0, edi     ; convert int to double
    mulsd     xmm0, xmm1
    ret

Clang also misses this optimization:

zero_multiply_short:
    cvtsi2sd xmm1, edi
    xorpd    xmm0, xmm0
    mulsd    xmm0, xmm1
    ret

But hang on a minute. This is shorter by one instruction. What happened to the sign-extension (movsx)? Clang is treating that short argument as if it were a 32-bit value. Why do GCC and Clang differ? Is GCC doing something unnecessary?

It turns out that the x86-64 ABI didn’t specify what happens with the upper bits in argument registers. Are they garbage? Are they zeroed? GCC takes the conservative position of assuming the upper bits are arbitrary garbage. Clang takes the bolder position of assuming arguments smaller than 32 bits have been promoted to 32 bits by the caller. This is what the ABI specification should have said, but currently it does not.

Fortunately GCC is also conservative when passing arguments. It promotes arguments to 32 bits as necessary, so there are no conflicts when linking against Clang-compiled code. However, this is not true for Intel’s ICC compiler: Clang and ICC are not ABI-compatible on x86-64.

I don’t use ICC, so that particular issue wouldn’t bite me, but if I was ever writing assembly routines that called Clang-compiled code, I’d eventually get bit by this.

Floating point precision

Without looking it up or trying it, what does this function return? Think carefully.

int
float_compare(void)
{
    float x = 1.3f;
    return x == 1.3f;
}

Confident in your answer? This is a trick question, because it can return either 0 or 1 depending on the compiler. Boy was I confused when this comparison returned 0 in my real world code.

$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1

So what’s going on here? The original ANSI C specification wasn’t clear about how intermediate floating point values get rounded, and implementations all did it differently. The C99 specification cleaned this all up and introduced FLT_EVAL_METHOD. Implementations can still differ, but at least you can now determine at compile-time what the compiler would do by inspecting that macro.
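
For example, a quick check of the compiler at hand (FLT_EVAL_METHOD comes from float.h in C99):

#include <float.h>
#include <stdio.h>

int
main(void)
{
    /* 0: evaluate in the operands' types
     * 1: evaluate float and double operations in double
     * 2: evaluate everything in long double (classic x87 behavior) */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}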

Back in the late 1980’s or early 1990’s when the GCC developers were deciding how GCC should implement floating point arithmetic, the trend at the time was to use as much precision as possible. On the x86 this meant using its support for 80-bit extended precision floating point arithmetic. Floating point operations are performed in long double precision and truncated afterward (FLT_EVAL_METHOD == 2).

In float_compare() the left-hand side is truncated to a float by the assignment, but the right-hand side, despite being a float literal, is actually “1.3” at 80 bits of precision as far as GCC is concerned. That’s pretty unintuitive!
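
One way to get the more intuitive behavior from GCC on the old x86, assuming the hardware has SSE2, is to force SSE arithmetic, which sets FLT_EVAL_METHOD to 0:

$ gcc -std=c99 -m32 -msse2 -mfpmath=sse cmp.c  # float_compare() == 1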

The remnants of this high-precision trend are still in JavaScript, where all arithmetic is double precision (even if simulated using integers), and great pains have been taken to work around the performance consequences of this. Until recently, Mono had similar issues.

The trend reversed once SIMD hardware became widely available and there were huge performance gains to be had. Multiple values could be computed at once, side by side, at lower precision. So on x86-64, this became the default (FLT_EVAL_METHOD == 0). The young Clang compiler wasn’t around until well after this trend reversed, so it behaves differently than the backwards compatible GCC on the old x86.

I’m a little ashamed that I’m only finding out about this now. However, by the time I was competent enough to notice and understand this issue, I was already doing nearly all my programming on the x86-64.

Built-in Function Elimination

I’ve saved this one for last since it’s my favorite. Suppose we have this little function, new_image(), that allocates a greyscale image for, say, some multimedia library.

static unsigned char *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

It’s a static function because this would be part of some slick header library (and, secretly, because it’s necessary for illustrating the issue). Being a responsible citizen, the function even checks for integer overflow before allocating anything.

I write a unit test to make sure it detects overflow. This function should return 0.

/* expected return == 0 */
int
test_new_image_overflow(void)
{
    void *p = new_image(2, SIZE_MAX, 0);
    return !!p;
}

So far my test passes. Good.

I’d also like to make sure it correctly returns NULL (or, more specifically, that it doesn’t crash) if the allocation fails. But how can I make malloc() fail? As a hack I can pass image dimensions that I know cannot ever practically be allocated. Essentially I want to force a malloc(SIZE_MAX), i.e. allocate every available byte in my virtual address space. For a conventional 64-bit machine, that’s 16 exbibytes of memory, and it leaves space for nothing else, including the program itself.

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image(1, SIZE_MAX, 0xff);
    return !!p;
}

I compile with GCC, test passes. I compile with Clang and the test fails. That is, the test somehow managed to allocate 16 exbibytes of memory and initialize it. Wat?

Disassembling the test reveals what’s going on:

test_new_image_overflow:
    xor  eax, eax
    ret

test_new_image_oom:
    mov  eax, 1
    ret

The first test is actually being evaluated at compile time by the compiler. The function being tested was inlined into the unit test itself. This permits the compiler to collapse the whole thing down to a single instruction. The path with malloc() became dead code and was trivially eliminated.

In the second test, Clang correctly determined that the image buffer is not actually being used, despite the memset(), so it eliminated the allocation altogether and then simulated a successful allocation despite it being absurdly large. Allocating memory is not an observable side effect as far as the language specification is concerned, so it’s allowed to do this. My thinking was wrong, and the compiler outsmarted me.

I soon realized I can take this further and trick Clang into performing an invalid optimization, revealing a bug. Consider this slightly-optimized version that uses calloc() when the shade is zero (black). The calloc() function does its own overflow check, so new_image() doesn’t need to do it.

static void *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (shade == 0) { // shortcut
        p = calloc(w, h);
    } else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

With this change, my overflow unit test is now also failing. The situation is even worse than before. The calloc() is being eliminated despite the overflow, and replaced with a simulated success. This time it’s actually a bug in Clang. While failing a unit test is mostly harmless, this could introduce a vulnerability in a real program. The OpenBSD folks are so worried about this sort of thing that they’ve disabled this optimization.

Here’s a slightly-contrived example of this. Imagine a program that maintains a table of unsigned integers, and we want to keep track of how many times the program has accessed each table entry. The “access counter” table is initialized to zero, but the table of values need not be initialized, since they’ll be written before first access (or something like that).

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init(struct table *t, size_t n)
{
    t->counter = calloc(n, sizeof(*t->counter));
    if (t->counter) {
        /* Overflow already tested above */
        t->values = malloc(n * sizeof(*t->values));
        if (!t->values) {
            free(t->counter);
            return 0; // fail
        }
        return 1; // success
    }
    return 0; // fail
}

This function relies on the overflow test in calloc() for the second malloc() allocation. However, this is a static function that’s likely to get inlined, as we saw before. If the program doesn’t actually make use of the counter table, and Clang is able to statically determine this fact, it may eliminate the calloc(). This would also eliminate the overflow test, introducing a vulnerability. If an attacker can control n, then they can overwrite arbitrary memory through that values pointer.

The takeaway

Besides this surprising little bug, the main lesson for me is that I should probably isolate unit tests from the code being tested. The easiest solution is to put them in separate translation units and don’t use link-time optimization (LTO). Allowing tested functions to be inlined into the unit tests is probably a bad idea.
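
Another low-tech option, sketched here against the first new_image() definition, is to launder the tested function through a volatile function pointer. The compiler must load the pointer on every call and can no longer assume what it points to, so it can’t inline new_image() into the test and optimize the allocation away:

static unsigned char *(*volatile new_image_v)(size_t, size_t, int) = new_image;

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image_v(1, SIZE_MAX, 0xff);
    return !!p;
}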

The unit test issues in my real program, which was a bit more sophisticated than what was presented here, gave me artificial intelligence vibes. It’s that situation where a computer algorithm did something really clever and I felt it outsmarted me. It’s creepy to consider how far that can go. I’ve gotten that even from observing AI I’ve written myself, and I know for sure no human taught it some particularly clever trick.

My favorite AI story along these lines is about an AI that learned how to play games on the Nintendo Entertainment System. It didn’t understand the games it was playing. Its optimization task was simply to choose controller inputs that maximized memory values, because that’s generally associated with doing well: higher scores, more progress, etc. The most unexpected part came when playing Tetris. Eventually the screen would fill up with blocks, and the AI would face the inevitable situation of losing the game, with all that memory being reinitialized to low values. So what did it do?

Just before the end it would pause the game and wait… forever.
