Manual Control Flow Guard in C

Recent versions of Windows have a new exploit mitigation feature called Control Flow Guard (CFG). Before an indirect function call — e.g. a call through a function pointer or a virtual function — the target address is checked against a table of valid call addresses. If the address isn’t the entry point of a known function, then the program is aborted.

If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.

The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.

For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.

CFG works similarly — performing a check prior to passing control to the address in a pointer — except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed of all valid call targets, whether from the main program or from shared libraries.

While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.

An exploit example

Consider this trivial C program, demo.c:

#include <stdio.h>

int
main(void)
{
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    return 0;
}

It reads a name into a buffer and prints it back out with a greeting. While trivial, it’s far from innocent. That naive call to gets() doesn’t check the bounds of the buffer, introducing an exploitable buffer overflow. It’s so obvious that both the compiler and linker will yell about it.

For simplicity, suppose the program also contains a dangerous function.

void
self_destruct(void)
{
    puts("**** GO BOOM! ****");
}

The attacker can use the buffer overflow to call this dangerous function.

To make this attack simpler for the sake of the article, assume the program isn’t using ASLR (e.g. compiled without -fpie and -pie on GCC/Clang). First, find the address of self_destruct().

$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct

This is on x86-64, so it’s a 64-bit address. The size of the name buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes allocated above, so there’s 16 bytes to fill, then 8 bytes to overwrite the return pointer with the address of self_destruct.

$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault

With this input I’ve successfully exploited the buffer overflow to divert control to self_destruct(). When main tries to return into libc, it instead jumps to the dangerous function, and then crashes when that function tries to return — though, presumably, the system would have self-destructed already. Turning on the stack protector stops this exploit.

$ gcc -Os -fstack-protector -o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxaaaaaaaa?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...

The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.

The stack protector transformed the program into something that looks like the following:

int
main(void)
{
    long __canary = __get_thread_canary();
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    if (__canary != __get_thread_canary())
        abort();
    return 0;
}

However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, and a canary is only affected by a buffer overflow, allowing the compiler to optimize it away.

Function pointers and virtual functions

After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:

void
self_destruct(char *password)
{
    if (strcmp(password, "12345") == 0)
        puts("**** GO BOOM! ****");
}

The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.

Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.

struct greeter {
    char name[8];
    void (*greet)(struct greeter *);
};

void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

void
greet_aloha(struct greeter *g)
{
    printf("Aloha, %s.\n", g->name);
}

There’s now a greeter object and the function pointer makes its behavior polymorphic. Think of it as a hand-coded virtual function for C. Here’s the new (contrived) main:

int
main(void)
{
    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    greeter.greet(&greeter);
    return 0;
}

(In a real program, something else provides greeter and picks its own function pointer for greet.)

Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.

$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct

We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).

$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****

Neither the stack protector nor the password was of any help. The stack protector only protects the return pointer, not the function pointer on the struct.

This is where Control Flow Guard comes into play. With CFG enabled, the compiler inserts a check before calling the greet() function pointer. The pointer must point to the beginning of a known function, otherwise the program aborts just as with the stack protector. Since the middle of self_destruct() isn’t the beginning of a function, the check would abort the program if this exploit were attempted.

However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.

Function address bitmap

As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contain the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.

For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit bit index means the bitmap will only be 2MB.

These 24 bits are perfectly sufficient for 32-bit systems, but on 64-bit systems there may be false positives: some addresses will not represent the start of a function, yet will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the table, reducing the false positive rate.

Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.

Here are the CFG parameters. I’ve made them macros so that they can easily be tuned at compile time. cfg_bits is the integer type backing the bitmap array. CFG_RESOLUTION is the number of bits dropped, so “3” is a granularity of 8 bytes.

typedef unsigned long cfg_bits;
#define CFG_RESOLUTION  3
#define CFG_BITS        24

Given a function pointer f, this macro extracts the bitmap index.

#define CFG_INDEX(f) \
    (((uintptr_t)f >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))

The CFG bitmap is just an array of integers. Zero it to initialize.

struct cfg {
    cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};
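
As a quick compile-time sanity check on the 2MB figure — optional, and purely illustrative rather than part of the mechanism — a C11 static assertion can pin down the size:

/* 2^24 bits = 2^21 bytes = 2MB, whatever integer type backs cfg_bits */
_Static_assert(sizeof(struct cfg) == 1UL << (CFG_BITS - 3),
               "CFG bitmap should be 2MB");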

Functions are manually registered in the bitmap using cfg_register().

void
cfg_register(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    cfg->bitmap[i / z] |= 1UL << (i % z);
}

Because functions are registered at run-time, it’s fully compatible with ASLR. If ASLR is enabled, the bitmap will be a little different each run. On the same note, it may be worth XORing each bitmap element with a random, run-time value — along the same lines as the stack canary value — to make it harder for an attacker to manipulate the bitmap should he get the ability to overwrite it by a vulnerability. Alternatively the bitmap could be switched to read-only (e.g. mprotect()) once everything is registered.
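
Here’s a minimal sketch of that read-only option. It assumes the bitmap has been placed in its own page-aligned mapping (say, allocated with an anonymous mmap() rather than declared as a plain global), since mprotect() operates on whole pages:

#include <sys/mman.h>

/* Call once all legitimate targets have been registered. */
int
cfg_seal(struct cfg *cfg)
{
    return mprotect(cfg, sizeof(*cfg), PROT_READ);
}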

And finally, the check function, used immediately before indirect calls. It ensures f was previously passed to cfg_register() (except for false positives, as discussed). Since it will be invoked often, it needs to be fast and simple.

void
cfg_check(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg->bitmap[i / z] >> (i % z)) & 1))
        abort();
}

And that’s it! Now augment main to make use of it:

struct cfg cfg;

int
main(void)
{
    cfg_register(&cfg, self_destruct);  // to prove this works
    cfg_register(&cfg, greet_hello);
    cfg_register(&cfg, greet_aloha);

    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    cfg_check(&cfg, greeter.greet);
    greeter.greet(&greeter);
    return 0;
}

And now attempting the exploit:

$ ./demo < boom
Aborted

Normally self_destruct() wouldn’t be registered since it’s not a legitimate target of an indirect call, but the exploit still didn’t work because it called into the middle of self_destruct(), which isn’t a valid address in the bitmap. The check aborts the program before it can be exploited.

In a real application I would have a global cfg bitmap for the whole program, and define cfg_check() in a header as an inline function.
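
Such a header might look something like the following sketch, where cfg_global and cfg_check_global are illustrative names rather than part of the demo, and the single bitmap is defined in exactly one translation unit:

/* cfg.h — assumes cfg_bits, struct cfg, and CFG_INDEX are visible here */
extern struct cfg cfg_global;

static inline void
cfg_check_global(void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg_global.bitmap[i / z] >> (i % z)) & 1))
        abort();
}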

Despite being possible to implement in straight C without the help of the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.

C Closures as a Library

A common idiom in C is the callback function pointer, either to deliver information (i.e. a visitor or handler) or to customize the function’s behavior (e.g. a comparator). Examples of the latter in the C standard library are qsort() and bsearch(), each requiring a comparator function in order to operate on arbitrary types.

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));

void *bsearch(const void *key, const void *base,
              size_t nmemb, size_t size,
              int (*compar)(const void *, const void *));

A problem with these functions is that there’s no way to pass context to the callback. The callback may need information beyond the two element pointers when making its decision, or to update a result. For example, suppose I have a structure representing a two-dimensional coordinate, and a coordinate distance function.

struct coord {
    float x;
    float y;
};

static inline float
distance(const struct coord *a, const struct coord *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return sqrtf(dx * dx + dy * dy);
}

If I have an array of coordinates and I want to sort them based on their distance from some target, the comparator needs to know the target. However, the qsort() interface has no way to directly pass this information. Instead it has to be passed by another means, such as a global variable.

struct coord *target;

int
coord_cmp(const void *a, const void *b)
{
    float dist_a = distance(a, target);
    float dist_b = distance(b, target);
    if (dist_a < dist_b)
        return -1;
    else if (dist_a > dist_b)
        return 1;
    else
        return 0;
}

And its usage:

    size_t ncoords = /* ... */;
    struct coord *coords = /* ... */;
    struct coord current_target = { /* ... */ };
    // ...
    target = &current_target;
    qsort(coords, ncoords, sizeof(coords[0]), coord_cmp);

Potential problems are that it’s neither thread-safe nor re-entrant. Two different threads cannot use this comparator at the same time. Also, on some platforms and configurations, repeatedly accessing a global variable in a comparator may have a significant cost. A common workaround for thread safety is to make the global variable thread-local by allocating it in thread-local storage (TLS):

_Thread_local struct coord *target;       // C11
__thread struct coord *target;            // GCC and Clang
__declspec(thread) struct coord *target;  // Visual Studio

This makes the comparator thread-safe. However, it’s still not re-entrant (usually unimportant) and accessing thread-local variables on some platforms is even more expensive — which is the situation for Pthreads TLS, though not a problem for native x86-64 TLS.

Modern libraries usually provide some sort of “user data” pointer — a generic pointer that is passed to the callback function as an additional argument. For example, the GNU C Library has long had qsort_r(): re-entrant qsort.

void qsort_r(void *base, size_t nmemb, size_t size,
             int (*compar)(const void *, const void *, void *),
             void *arg);

The new comparator looks like this:

int
coord_cmp_r(const void *a, const void *b, void *target)
{
    float dist_a = distance(a, target);
    float dist_b = distance(b, target);
    if (dist_a < dist_b)
        return -1;
    else if (dist_a > dist_b)
        return 1;
    else
        return 0;
}

And its usage:

    void *arg = &current_target;
    qsort_r(coords, ncoords, sizeof(coords[0]), coord_cmp_r, arg);

User data arguments are thread-safe, re-entrant, performant, and perfectly portable. They completely and cleanly solve the entire problem with virtually no drawbacks. If every library did this, there would be nothing left to discuss and this article would be boring.

The closure solution

In order to make things more interesting, suppose you’re stuck calling a function in some old library that takes a callback but doesn’t support a user data argument. A global variable is insufficient, and the thread-local storage solution isn’t viable for one reason or another. What do you do?

The core problem is that a function pointer is just an address, and it’s the same address no matter the context for any particular callback. On any particular call, the callback has three ways to distinguish this call from other calls. These align with the three solutions above:

  1. Inspect some global state: the global variable solution. The caller will change this state for some other calls.
  2. Query its unique thread ID: the thread-local storage solution. Calls on different threads will have different thread IDs.
  3. Examine a context argument: the user pointer solution.

A wholly different approach is to use a unique function pointer for each callback. The callback could then inspect its own address to differentiate itself from other callbacks. Imagine defining multiple instances of coord_cmp each getting their context from a different global variable. Using a unique copy of coord_cmp on each thread for each usage would be both re-entrant and thread-safe, and wouldn’t require TLS.
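
To get a feel for the hand-rolled version of that idea, here’s a sketch with two fixed “slots” — the slot names are mine — each comparator reading its own file-scope variable and delegating to coord_cmp_r from above:

static struct coord *target_slot0;
static struct coord *target_slot1;

static int
coord_cmp_slot0(const void *a, const void *b)
{
    /* context comes from this slot's variable, not from the caller */
    return coord_cmp_r(a, b, target_slot0);
}

static int
coord_cmp_slot1(const void *a, const void *b)
{
    return coord_cmp_r(a, b, target_slot1);
}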

Taking this idea further, I’d like to generate these new functions on demand at run time akin to a JIT compiler. This can be done as a library, mostly agnostic to the implementation of the callback. Here’s an example of what its usage will be like:

void *closure_create(void *f, int nargs, void *userdata);
void  closure_destroy(void *);

The callback to be converted into a closure is f and the number of arguments it takes is nargs. A new closure is allocated and returned as a function pointer. This closure takes nargs - 1 arguments, and it will call the original callback with the additional argument userdata.

So, for example, this code uses a closure to convert coord_cmp_r into a function suitable for qsort():

int (*closure)(const void *, const void *);
closure = closure_create(coord_cmp_r, 3, &current_target);

qsort(coords, ncoords, sizeof(coords[0]), closure);

closure_destroy(closure);

Caveat: This API is utterly insufficient for any sort of portability. The number of arguments isn’t nearly enough information for the library to generate a closure. For practically every architecture and ABI, it’s going to depend on the types of each of those arguments. On x86-64 with the System V ABI — where I’ll be implementing this — this argument will only count integer/pointer arguments. To find out what it takes to do this properly, see the libjit documentation.

Memory design

This implementation will be for x86-64 Linux, though the high-level details will be the same for any program running in virtual memory. My closures will span exactly two consecutive pages (typically 8kB), though it’s possible to use exactly one page depending on the desired trade-offs. The reason I need two pages is that each page will have different protections.

Native code — the thunk — lives in the upper page. The user data pointer and callback function pointer live at the high end of the lower page. The two pointers could really be anywhere in the lower page, and they’re only at the end for aesthetic reasons. The thunk code will be identical for all closures with the same number of arguments.

The upper page will be executable and the lower page will be writable. This allows new pointers to be set without writing to executable thunk memory. In the future I expect operating systems to enforce W^X (“write xor execute”), and this code will already be compliant. Alternatively, the pointers could be “baked in” with the thunk page and immutable, but since creating a closure requires two system calls, I figure it’s better that the pointers be mutable and the closure object reusable.

The address for the closure itself will be the upper page, being what other functions will call. The thunk will load the user data pointer from the lower page as an additional argument, then jump to the actual callback function also given by the lower page.

Thunk assembly

The x86-64 thunk assembly for a 2-argument closure calling a 3-argument callback looks like this:

user:  dq 0
func:  dq 0
;; --- page boundary here ---
thunk2:
        mov  rdx, [rel user]
        jmp  [rel func]

As a reminder, the integer/pointer argument register order for the System V ABI calling convention is: rdi, rsi, rdx, rcx, r8, r9. The third argument is passed through rdx, so the user pointer is loaded into this register. Then it jumps to the callback address with the original arguments still in place, plus the new argument. The user and func values are loaded RIP-relative (rel) to the address of the code. The thunk is using the callback address (its own address) to determine the context.

The assembled machine code for the thunk is just 13 bytes:

unsigned char thunk2[13] = {
    // mov  rdx, [rel user]
    0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
    // jmp  [rel func]
    0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
};

All closure_create() has to do is allocate two pages, copy this buffer into the upper page, adjust the protections, and return the address of the thunk. Since closure_create() will work for nargs number of arguments, there will actually be 6 slightly different thunks, one for each of the possible register arguments (rdi through r9).

static unsigned char thunk[6][13] = {
    {
        0x48, 0x8b, 0x3d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x35, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x48, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x4C, 0x8b, 0x05, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    }, {
        0x4C, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
        0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
    },
};

Given a closure pointer returned from closure_create(), here are the setter functions for setting the closure’s two pointers.

void
closure_set_data(void *closure, void *data)
{
    void **p = closure;
    p[-2] = data;
}

void
closure_set_function(void *closure, void *f)
{
    void **p = closure;
    p[-1] = f;
}

In closure_create(), allocation is done with an anonymous mmap(), just like in my JIT compiler. It’s initially mapped writable in order to copy the thunk, then the thunk page is set to executable.

void *
closure_create(void *f, int nargs, void *userdata)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    char *p = mmap(0, page_size * 2, prot, flags, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    void *closure = p + page_size;
    memcpy(closure, thunk[nargs - 1], sizeof(thunk[0]));
    mprotect(closure, page_size, PROT_READ | PROT_EXEC);

    closure_set_function(closure, f);
    closure_set_data(closure, userdata);
    return closure;
}

Destroying a closure is done by computing the lower page address and calling munmap() on it:

void
closure_destroy(void *closure)
{
    long page_size = sysconf(_SC_PAGESIZE);
    munmap((char *)closure - page_size, page_size * 2);
}
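
Since the pointers live in the writable lower page, a single closure object can also be retargeted between uses with the setters above, avoiding another round of mmap() and mprotect() calls. A usage sketch, where target_a and target_b stand in for two hypothetical struct coord values:

int (*cmp)(const void *, const void *);
cmp = closure_create(coord_cmp_r, 3, &target_a);

qsort(coords, ncoords, sizeof(coords[0]), cmp);

closure_set_data(cmp, &target_b);  /* reuse the same thunk and pages */
qsort(coords, ncoords, sizeof(coords[0]), cmp);

closure_destroy(cmp);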

And that’s it! You can see the entire demo here:

It’s a lot simpler for x86-64 than it is for x86, where there’s no RIP-relative addressing and arguments are passed on the stack. The arguments must all be copied back onto the stack, above the new argument, and it cannot be a tail call since the stack has to be fixed before returning. Here’s what the thunk looks like for a 2-argument closure:

data:	dd 0
func:	dd 0
;; --- page boundary here ---
thunk2:
        call .rip2eax
.rip2eax:
        pop eax
        push dword [eax - 13]
        push dword [esp + 12]
        push dword [esp + 12]
        call [eax - 9]
        add esp, 12
        ret

Exercise for the reader: Port the closure demo to a different architecture or to the Windows x64 ABI.

Domain-Specific Language Compilation in Elfeed

Last night I pushed another performance enhancement for Elfeed, this time reducing the time spent parsing feeds. It’s accomplished by compiling, during macro expansion, a jQuery-like domain-specific language within Elfeed.

Heuristic parsing

Given the nature of the domain — an under-specified standard and a lack of robust adherence — feed parsing is much more heuristic than strict. Sure, everyone’s feed XML is strictly conforming since virtually no feed reader tolerates invalid XML (thank you, XML libraries), but, for the schema, the situation resembles the de facto looseness of HTML. Sometimes important or required information is missing, or is only available in a different namespace. Sometimes, especially in the case of timestamps, it’s in the wrong format, or encoded incorrectly, or ambiguous. It’s real world data.

To get a particular piece of information, Elfeed looks in a number of different places within the feed, starting with the preferred source and stopping when the information is found. For example, to find the date of an Atom entry, Elfeed first searches for elements in this order:

  1. <published>
  2. <updated>
  3. <date>
  4. <modified>
  5. <issued>

Failing to find any of these elements, or if no parsable date is found, it settles on the current time. Only the updated element is required, but published usually has the desired information, so it goes first. The last three are only valid for another namespace, but are useful fallbacks.

Before Elfeed even starts this search, the XML text is parsed into an s-expression using xml-parse-region — a pure Elisp XML parser included in Emacs. The search is made over the resulting s-expression.

For example, here’s a sample from the Atom specification.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link rel="alternate" href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>

</feed>

Which is parsed into this s-expression.

((feed ((xmlns . "http://www.w3.org/2005/Atom"))
       (title () "Example Feed")
       (link ((href . "http://example.org/")))
       (updated () "2003-12-13T18:30:02Z")
       (author () (name () "John Doe"))
       (id () "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6")
       (entry ()
              (title () "Atom-Powered Robots Run Amok")
              (link ((rel . "alternate")
                     (href . "http://example.org/2003/12/13/atom03")))
              (id () "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a")
              (updated () "2003-12-13T18:30:02Z")
              (summary () "Some text."))))

Each XML element is converted to a list. The first item is a symbol that is the element’s name. The second item is an alist of attributes — cons pairs of symbols and strings. And the rest are its children, both string nodes and other elements. I’ve trimmed the extraneous string nodes from the sample s-expression.

A subtle detail is that xml-parse-region doesn’t just return the root element. It returns a list of elements, which in practice is always a one-element list containing the root element. I don’t know why this is, but I’ve built everything to assume this structure as input.

For Elfeed, all namespaces are stripped from both elements and attributes to make parsing simpler. As I said, it’s heuristic rather than strict, so namespaces are treated as noise.

A domain-specific language

Coding up Elfeed’s s-expression searches in straight Emacs Lisp would be tedious, error-prone, and difficult to understand. It’s a lot of loops, assoc, etc. So instead I invented a jQuery-like, CSS selector-like, domain-specific language (DSL) to express these searches concisely and clearly.

For example, all of the entry links are “selected” using this expression:

(feed entry link [rel "alternate"] :href)

Reading right-to-left, this matches every href attribute under every link element with the rel="alternate" attribute, under every entry element, under the feed root element. Symbols match element names, two-element vectors match elements with a particular attribute pair, and keywords (which must come last) narrow the selection to a specific attribute value.

Imagine hand-writing the code to navigate all these conditions for each piece of information that Elfeed requires. The RSS parser makes up to 16 such queries, and the Atom parser makes as many as 24. That would add up to a lot of tedious code.

The package (included with Elfeed) that executes this query is called “xml-query.” It comes in two flavors: xml-query and xml-query-all. The former returns just the first match, and the latter returns all matches. The naming parallels the querySelector() and querySelectorAll() DOM methods in JavaScript.

(let ((xml (elfeed-xml-parse-region)))
  (xml-query-all '(feed entry link [rel "alternate"] :href) xml))

;; => ("http://example.org/2003/12/13/atom03")

That date search I mentioned before looks roughly like this. The * matches text nodes within the selected element. It must come last just like the keyword matcher.

(or (xml-query '(feed entry published *))
    (xml-query '(feed entry updated *))
    (xml-query '(feed entry date *))
    (xml-query '(feed entry modified *))
    (xml-query '(feed entry issued *))
    (current-time))

Over the past three years, Elfeed has gained more and more of these selectors as it collects more and more information from feeds. Most recently, Elfeed collects author and category information provided by feeds. Each new query slows feed parsing a little bit, and it’s a perfect example of a program slowing down as it gains more features and capabilities.

But I don’t want Elfeed to slow down. I want it to get faster!

Optimizing the domain-specific language

Just like the primary jQuery function ($), both xml-query and xml-query-all are functions. The xml-query engine processes the selector from scratch on each invocation. It examines the first element, dispatches on its type/value to apply it to the input, and then recurses on the rest of selector with the narrowed input, stopping when it hits the end of the list. That’s the way it’s worked from the start.

However, every selector argument in Elfeed is a static, quoted list. Unlike user-supplied filters, I know exactly what I want to execute ahead of time. It would be much better if the engine didn’t have to waste time reparsing the DSL for each query.

This is the classic split between interpreters and compilers. An interpreter reads input and immediately executes it, doing what the input tells it to do. A compiler reads input and, rather than execute it, produces output, usually in a simpler language, that, when evaluated, has the same effect as executing the input.

Rather than interpret the selector, it would be better to compile it into Elisp code, compile that into byte-code, and then have the Emacs byte-code virtual machine (VM) execute the query each time it’s needed. The extra work of parsing the DSL is performed ahead of time, the dispatch is entirely static, and the selector ultimately executes on a much faster engine (byte-code VM). This should be a lot faster!

So I wrote a function that accepts a selector expression and emits Elisp source that implements that selector: a compiler for my DSL. Having a readily-available syntax tree is one of the big advantages of homoiconicity, and this sort of function makes perfect sense in a lisp. For the external interface, this compiler function is called by a new pair of macros, xml-query* and xml-query-all*. These macros consume a static selector and expand into the compiled Elisp form of the selector.

To demonstrate, remember that link query from before? Here’s the macro version of that selection, but only returning the first match. Notice the selector is no longer quoted. This is because it’s consumed by the macro, not evaluated.

(xml-query* (feed entry link [rel "alternate"] :href) xml)

This will expand into the following code.

(catch 'done
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      (dolist (v (cddr v))
        (when (and (consp v) (eq (car v) 'entry))
          (dolist (v (cddr v))
            (when (and (consp v) (eq (car v) 'link))
              (let ((value (cdr (assq 'rel (cadr v)))))
                (when (equal value "alternate")
                  (let ((v (cdr (assq 'href (cadr v)))))
                    (when v
                      (throw 'done v))))))))))))

As soon as it finds a match, it’s thrown to the top level and returned. Without the DSL, the expansion is essentially what would have to be written by hand. This is exactly the sort of leverage you should be getting from a compiler. It compiles to around 130 byte-code instructions.

The xml-query-all* form is nearly the same, but instead of a throw, it pushes the result into the return list. Only the prologue (the outermost part) and the epilogue (the innermost part) are different.

Parsing feeds is a hot spot for Elfeed, so I wanted the compiler’s output to be as efficient as possible.

The end result is at least as optimal as hand-written code, but without the chance of human error (typos, fat fingering) and sourced from an easy-to-read DSL.

Performance

In my tests, the xml-query macros are a full order of magnitude faster than the functions. Yes, ten times faster! It’s an even bigger gain than I expected.

In the full picture, xml-query is only one part of parsing a feed. Measuring the time starting from raw XML text (as delivered by cURL) to a list of database entry objects, I’m seeing an overall 25% speedup with the macros. The remaining time is dominated by xml-parse-region, which is mostly out of my control.

With xml-query so computationally cheap, I don’t need to worry about using it more often. Compared to parsing XML text, it’s virtually free.

When it came time to validate my DSL compiler, I was really happy that Elfeed had a test suite. I essentially rewrote a core component from scratch, and passing all of the unit tests was a strong sign that it was correct. Many times that test suite has provided confidence in changes made both by me and by others.

I’ll end by describing another possible application: Apply this technique to regular expressions, such that static strings containing regular expressions are compiled into Elisp/byte-code via macro expansion. I wonder if situationally this would be faster than Emacs’ own regular expression engine.

Relocatable Global Data on x86

Relocatable code — program code that executes correctly from any properly-aligned address — is an essential feature for shared libraries. Otherwise all of a system’s shared libraries would need to coordinate their virtual load addresses. Loading programs and libraries to random addresses is also a valuable security feature: Address Space Layout Randomization (ASLR). But how does a compiler generate code for a function that accesses a global variable if that variable’s address isn’t known at compile time?

Consider this simple C code sample.

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    return x < 4 ? values[x] : 0.0f;
}

This function needs the base address of values in order to dereference it for values[x]. The easiest way to find out how this works, especially without knowing where to start, is to compile the code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian Jessie).

$ gcc -c -Os -fPIC get_value.c

I optimized for size (-Os) to make the disassembly easier to follow. Next, disassemble this pre-linked code with objdump. Alternatively I could have asked for the compiler’s assembly output with -S, but this will be good reverse engineering practice.

$ objdump -d -Mintel get_value.o
0000000000000000 <get_value>:
   0:   83 ff 03                cmp    edi,0x3
   3:   0f 57 c0                xorps  xmm0,xmm0
   6:   77 0e                   ja     16 <get_value+0x16>
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]
   f:   89 ff                   mov    edi,edi
  11:   f3 0f 10 04 b8          movss  xmm0,DWORD PTR [rax+rdi*4]
  16:   c3                      ret

There are a couple of interesting things going on, but let’s start from the beginning.

  1. The ABI specifies that the first integer/pointer argument (the 32-bit integer x) is passed through the edi register. The function compares x to 3, to satisfy x < 4.

  2. The ABI specifies that floating point values are returned through the SSE2 SIMD register xmm0. It’s cleared by XORing the register with itself — the conventional way to clear registers on x86 — setting up for a return value of 0.0f.

  3. It then uses the result of the previous comparison to perform a jump, ja (“jump if after”). That is, jump to the relative address specified by the jump’s operand if the first operand to cmp (edi) comes after the second operand (0x3) as unsigned values. Its cousin, jg (“jump if greater”), is for signed values. If x is outside the array bounds, it jumps straight to ret, returning 0.0f.

  4. If x was in bounds, it uses a lea (“load effective address”) to load something into the 64-bit rax register. This is the complicated bit, and I’ll start by giving the answer: The value loaded into rax is the address of the values array. More on this in a moment.

  5. Finally it uses x as an index into the array at the address in rax. The movss (“move scalar single-precision”) instruction loads a 32-bit float into the first lane of xmm0, where the caller expects to find the return value. This is all preceded by a mov edi, edi which looks like a hotpatch nop, but it isn’t. x86-64 always uses 64-bit registers for addressing, meaning it uses rdi not edi. All 32-bit register assignments clear the upper 32 bits, and so this mov zero-extends edi into rdi. This handles the unlikely event that the caller left garbage in those upper bits.

Clearing xmm0

The first interesting part: xmm0 is cleared unconditionally, even though its first lane is about to be loaded with a value in the in-bounds case. There are two reasons to do this.

The obvious reason is that the alternative requires additional instructions, and I told GCC to optimize for size. It would need either an extra ret or an unconditional jmp over the “else” branch.

The less obvious reason is that it breaks a data dependency. For over 20 years now, x86 micro-architectures have employed an optimization technique called register renaming. Architectural registers (rax, edi, etc.) are just temporary names for underlying physical registers. This disconnect allows for more aggressive out-of-order execution. Two instructions sharing an architectural register can be executed independently so long as there are no data dependencies between these instructions.

For example, take this assembly sample. It assembles to 9 bytes of machine code.

    mov  edi, [rcx]
    mov  ecx, 7
    shl  eax, cl

This reads a 32-bit value from the address stored in rcx, then assigns ecx and uses cl (the lowest byte of rcx) in a shift operation. Without register renaming, the shift couldn’t be performed until the load in the first instruction completed. However, the second instruction is a 32-bit assignment, which, as I mentioned before, also clears the upper 32 bits of rcx, wiping the unused parts of the register.

So after the second instruction, it’s guaranteed that the value in rcx has no dependencies on code that comes before it. Because of this, it’s likely a different physical register will be used for the second and third instructions, allowing these instructions to be executed out of order, before the load. Ingenious!

Compare it to this example, where the second instruction assigns to cl instead of ecx. This assembles to just 6 bytes.

    mov  edi, [rcx]
    mov  cl, 7
    shl  eax, cl

The result is 3 bytes smaller, but since it’s not a 32-bit assignment, the upper bits of rcx still hold the original register contents. This creates a false dependency and may prevent out-of-order execution, reducing performance.

By clearing xmm0, the instructions in get_value involving xmm0 don’t have to wait on earlier instructions in the caller that used xmm0, giving them the opportunity to be executed out of order.

RIP-relative addressing

Going back to the instruction that computes the address of values.

   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]

Normally load/store addresses are absolute, based off an address either in a general purpose register, or at some hard-coded base address. The latter is not an option in relocatable code. With RIP-relative addressing that’s still the case, but the register with the absolute address is rip, the instruction pointer. This addressing mode was introduced in x86-64 to make relocatable code more efficient.

That means this instruction copies the instruction pointer (pointing to the next instruction) into rax, plus a 32-bit displacement, currently zero. This isn’t the right way to encode a displacement of zero (unless you want a larger instruction). That’s because the displacement will be filled in later by the linker. The compiler adds a relocation entry to the object file so that the linker knows how to do this.

On platforms that use ELF we can inspect these relocations with readelf.

$ readelf -r get_value.o

Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type       Sym. Value       Sym. Name + Addend
00000000000b  000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4

The relocation type is R_X86_64_PC32. In the AMD64 Architecture Processor Supplement, this is defined as “S + A - P”.

The symbol, S, is .rodata — the final address for this object file’s portion of .rodata (where values resides). The addend, A, is -4 since the instruction pointer points at the next instruction. That is, the displacement will be relative to four bytes after the relocation offset. Finally, the address of the relocation, P, is the address of the last four bytes of the lea instruction. These values are all known at link time, so no run-time support is necessary.

Being “S - P” (overall), this will be the displacement between these two addresses: the 32-bit value is relative. It’s relocatable so long as these two parts of the binary (code and data) maintain a fixed distance from each other. The binary is relocated as a whole, so this assumption holds.
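
To make the arithmetic concrete, here’s the same computation with made-up numbers — they’re purely illustrative, not taken from the readelf output above:

unsigned long S = 0x601000;      /* final address of .rodata (values)  */
long          A = -4;            /* addend from the relocation entry   */
unsigned long P = 0x4005b3;      /* address of the lea's 4-byte field  */
long disp = (long)(S + A - P);   /* value the linker patches in        */
/* At run time rip == P + 4, so rip + disp == S, the address of values. */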

32-bit relocation

Since RIP-relative addressing wasn’t introduced until x86-64, how did this all work on x86? Again, let’s just see what the compiler does. Add the -m32 flag for a 32-bit target, and -fomit-frame-pointer to make it simpler for explanatory purposes.

$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 <get_value>:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   d9 ee                   fldz
   6:   e8 fc ff ff ff          call   7 <get_value+0x7>
   b:   81 c1 02 00 00 00       add    ecx,0x2
  11:   83 f8 03                cmp    eax,0x3
  14:   77 09                   ja     1f <get_value+0x1f>
  16:   dd d8                   fstp   st(0)
  18:   d9 84 81 00 00 00 00    fld    DWORD PTR [ecx+eax*4+0x0]
  1f:   c3                      ret

Disassembly of section .text.__x86.get_pc_thunk.cx:

00000000 <__x86.get_pc_thunk.cx>:
   0:   8b 0c 24                mov    ecx,DWORD PTR [esp]
   3:   c3                      ret

Hmm, this one includes an extra function.

  1. In this calling convention, arguments are passed on the stack. The first instruction loads the argument, x, into eax.

  2. The fldz instruction clears the x87 floating point return register, just like clearing xmm0 in the x86-64 version.

  3. Next it calls __x86.get_pc_thunk.cx. The call pushes the instruction pointer, eip, onto the stack. This function reads that value off the stack into ecx and returns. In other words, calling this function copies eip into ecx. It’s setting up to load data at an address relative to the code. Notice the function name starts with two underscores — a name which is reserved exactly for these sorts of implementation purposes.

  4. Next a 32-bit displacement is added to ecx. In this case it’s 2, but, like before, this is actually going to be filled in later by the linker.

  5. Then it’s just like before: a branch to optionally load a value. The floating point load (fld) is another relocation.

Let’s look at the relocations. There are three this time:

$ readelf -r get_value.o

Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
 Offset     Info    Type        Sym.Value  Sym. Name
00000007  00000e02 R_386_PC32    00000000   __x86.get_pc_thunk.cx
0000000d  00000f0a R_386_GOTPC   00000000   _GLOBAL_OFFSET_TABLE_
0000001b  00000709 R_386_GOTOFF  00000000   .rodata

The first relocation is the call-site for the thunk. The thunk has external linkage and may be merged with a matching thunk in another object file, and so may be relocated. (Clang inlines its thunk.) Calls are relative, so its type is R_386_PC32: a code-relative displacement just like on x86-64.

The next is of type R_386_GOTPC and sets the second operand in that add ecx. It’s defined as “GOT + A - P” where “GOT” is the address of the Global Offset Table — a table of addresses of the binary’s relocated objects. Since values is static, the GOT won’t actually hold an address for it, but the relative address of the GOT itself will be useful.

The final relocation is of type R_386_GOTOFF. This is defined as “S + A - GOT”. Another displacement between two addresses. This is the displacement in the load, fld. Ultimately the load adds these last two relocations together, canceling the GOT:

  (GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P

So the GOT isn’t relevant in this case. It’s just a mechanism for constructing a custom relocation type.

Branch optimization

Notice in the x86 version the thunk is called before checking the argument. What if it’s most likely that x will be out of bounds of the array, and the function usually returns zero? That means it’s usually wasting its time calling the thunk. Without profile-guided optimization the compiler probably won’t know this.

The typical way to provide such a compiler hint is with a pair of macros, likely() and unlikely(). With GCC and Clang, these would be defined to use __builtin_expect. Compilers without this sort of feature would have macros that do nothing instead. So I gave it a shot:

#define likely(x)    __builtin_expect((x),1)
#define unlikely(x)  __builtin_expect((x),0)

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    return unlikely(x < 4) ? values[x] : 0.0f;
}

Unfortunately this makes no difference even in the latest version of GCC. In Clang it changes branch fall-through (for static branch prediction), but still always calls the thunk. It seems compilers have difficulty with optimizing relocatable code on x86.

x86-64 isn’t just about more memory

It’s commonly understood that the advantage of 64-bit versus 32-bit systems is processes having access to more than 4GB of memory. But as this shows, there’s more to it than that. Even programs that don’t need that much memory can really benefit from newer features like RIP-relative addressing.

Some Performance Advantages of Lexical Scope

I recently had a discussion with Xah Lee about lexical scope in Emacs Lisp. The topic was why lexical-binding exists at a file-level when there was already lexical-let (from cl-lib), prompted by my previous article on JIT byte-code compilation. The specific context is Emacs Lisp, but these concepts apply to language design in general.

Until Emacs 24.1 (June 2012), Elisp only had dynamically scoped variables — a feature, mostly by accident, common to old lisp dialects. While dynamic scope has some selective uses, it’s widely regarded as a mistake for local variables, and virtually no other languages have adopted it.

Way back in 1993, Dave Gillespie’s deviously clever lexical-let macro was committed to the cl package, providing a rudimentary form of opt-in lexical scope. The macro walks its body replacing local variable names with guaranteed-unique gensym names: the exact same technique used in macros to create “hygienic” bindings that aren’t visible to the macro body. It essentially “fakes” lexical scope within Elisp’s dynamic scope by preventing variable name collisions.

For example, here’s one of the consequences of dynamic scope.

(defun inner ()
  (setq v :inner))

(defun outer ()
  (let ((v :outer))
    (inner)
    v))

(outer)
;; => :inner

The “local” variable v in outer is visible to its callee, inner, which can access and manipulate it. The meaning of the free variable v in inner depends entirely on the run-time call stack. It might be a global variable, or it might be a local variable for a caller, direct or indirect.

Using lexical-let deconflicts these names, giving the effect of lexical scope.

(defvar v)

(defun lexical-outer ()
  (lexical-let ((v :outer))
    (inner)
    v))

(lexical-outer)
;; => :outer

But there’s more to lexical scope than this. Closures only make sense in the context of lexical scope, and the most useful feature of lexical-let is that lambda expressions evaluate to closures. The macro implements this using a technique called closure conversion. Additional parameters are added to the original lambda function, one for each lexical variable (and not just each closed-over variable), and the whole thing is wrapped in another lambda function that invokes the original lambda function with the additional parameters filled with the closed-over variables — yes, the variables (i.e. symbols) themselves, not just their values (i.e. pass-by-reference). The last point means different closures can properly close over the same variables, and they can bind new values.

To roughly illustrate how this works, the first lambda expression below, which closes over the lexical variables x and y, would be converted into the latter by lexical-let. The #: is Elisp’s syntax for uninterned symbols. So #:x is a symbol named x, but not the symbol x (see print-gensym).

;; Before conversion:
(lambda ()
  (+ x y))

;; After conversion:
(lambda (&rest args)
  (apply (lambda (x y)
           (+ (symbol-value x)
              (symbol-value y)))
         '#:x '#:y args))

I’ve said on multiple occasions that lexical-binding: t has significant advantages, both in performance and static analysis, and so it should be used for all future Elisp code. The only reason it’s not the default is because it breaks some old (badly written) code. However, lexical-let doesn’t realize any of these advantages! In fact, it has worse performance than straightforward dynamic scope with let.

  1. New symbol objects are allocated and initialized (make-symbol) on each run-time evaluation, one per lexical variable.

  2. Since it’s just faking it, lexical-let still uses dynamic bindings, which are more expensive than lexical bindings. It varies depending on the C compiler that built Emacs, but dynamic variable accesses (opcode varref) take around 30% longer than lexical variable accesses (opcode stack-ref). Assignment is far worse, where dynamic variable assignment (varset) takes 650% longer than lexical variable assignment (stack-set). How I measured all this is a topic for another article.

  3. The “lexical” variables are accessed using symbol-value, a full function call, so they’re even slower than normal dynamic variables.

  4. Because converted lambda expressions are constructed dynamically at run-time within the body of lexical-let, the resulting closure is only partially byte-compiled even if the code as a whole has been byte-compiled. In contrast, lexical-binding: t closures are fully compiled. How this works is worth its own article.

  5. Converted lambda expressions include the additional internal function invocation, making them slower.

While lexical-let is clever, and occasionally useful prior to Emacs 24, it may come at a hefty performance cost if evaluated frequently. There’s no reason to use it anymore.

Constraints on code generation

Another reason to be wary of dynamic scope is that it puts needless constraints on the compiler, preventing a number of important optimization opportunities. For example, consider the following function, bar:

(defun bar ()
  (let ((x 1)
        (y 2))
    (foo)
    (+ x y)))

Byte-compile this function under dynamic scope (lexical-binding: nil) and disassemble it to see what it looks like.

(byte-compile #'bar)
(disassemble #'bar)

That pops up a buffer with the disassembly listing:

0       constant  1
1       constant  2
2       varbind   y
3       varbind   x
4       constant  foo
5       call      0
6       discard
7       varref    x
8       varref    y
9       plus
10      unbind    2
11      return

It’s 12 instructions, 5 of which deal with dynamic bindings. The byte-compiler doesn’t always produce optimal byte-code, but this just so happens to be nearly optimal byte-code. The discard (a very fast instruction) isn’t necessary, but otherwise no more compiler smarts can improve on this. Since the variables x and y are visible to foo, they must be bound before the call and loaded after the call. While generally this function will return 3, the compiler cannot assume so since it ultimately depends on the behavior of foo. Its hands are tied.

Compare this to the lexical scope version (lexical-binding: t):

0       constant  1
1       constant  2
2       constant  foo
3       call      0
4       discard
5       stack-ref 1
6       stack-ref 1
7       plus
8       return

It’s only 9 instructions, none of which are expensive dynamic variable instructions. And this isn’t even close to the optimal byte-code. In fact, as of Emacs 25.1 the byte-compiler often doesn’t produce the optimal byte-code for lexical scope code and still needs some work. Despite not firing on all cylinders, lexical scope still manages to beat dynamic scope in performance benchmarks.

Here’s the optimal byte-code, should the byte-compiler become smarter someday:

0       constant  foo
1       call      0
2       constant  3
3       return

It’s down to 4 instructions due to computing the math operation at compile time. Emacs’ byte-compiler only has rudimentary constant folding, so it doesn’t notice that x and y are constants and misses this optimization. I speculate this is due to its roots compiling under dynamic scope. Since x and y are no longer exposed to foo, the compiler has the opportunity to optimize them out of existence. I haven’t measured it, but I would expect this to be significantly faster than the dynamic scope version of this function.

Optional dynamic scope

You might be thinking, “What if I really do want x and y to be dynamically bound for foo?” This is often useful. Many of Emacs’ own functions are designed to have certain variables dynamically bound around them. For example, the print family of functions use the global variable standard-output to determine where to send output by default.

(let ((standard-output (current-buffer)))
  (princ "value = ")
  (prin1 value))

Have no fear: With lexical-binding: t you can have your cake and eat it too. Variables declared with defvar, defconst, or defvaralias are marked as “special” with an internal bit flag (declared_special in C). When the compiler detects one of these variables (special-variable-p), it uses a classical dynamic binding.

Declaring both x and y as special restores the original semantics, reverting bar back to its old byte-code definition (next time it’s compiled, that is). But it would be poor form to mark x or y as special: You’d de-optimize all code (compiled after the declaration) anywhere in Emacs that uses these names. As a package author, only do this with the namespace-prefixed variables that belong to you.

The only way to unmark a special variable is with the undocumented function internal-make-var-non-special. I expected makunbound to do this, but as of Emacs 25.1 it does not. This could possibly be considered a bug.

Accidental closures

I’ve said there are absolutely no advantages to lexical-binding: nil. It’s only the default for the sake of backwards-compatibility. However, there is one case where lexical-binding: t introduces a subtle issue that would otherwise not exist. Take this code for example (and never mind prin1-to-string for a moment):

;; -*- lexical-binding: t; -*-

(defun function-as-string ()
  (with-temp-buffer
    (prin1 (lambda () :example) (current-buffer))
    (buffer-string)))

This creates and serializes a closure, which is one of Elisp’s unique features. It doesn’t close over any variables, so it should be pretty simple. However, this function will only work correctly under lexical-binding: t when byte-compiled.

(function-as-string)
;; => "(closure ((temp-buffer . #<buffer  *temp*>) t) nil :example)"

The interpreter doesn’t analyze the closure, so it just closes over everything. This includes the hidden variable temp-buffer created by the with-temp-buffer macro, resulting in an abstraction leak. Buffers aren’t readable, so reading this function back into an s-expression will signal an error. The byte-compiler fixes this by noticing that temp-buffer isn’t actually closed over and leaving it out of the closure, making the function work correctly.

Under lexical-binding: nil it works correctly either way:

(function-as-string)
;; -> "(lambda nil :example)"

This may seem contrived — it’s certainly unlikely — but it has come up in practice. Still, it’s no reason to avoid lexical-binding: t.

Use lexical scope in all new code

As I’ve said again and again, always use lexical-binding: t. Use dynamic variables judiciously. And lexical-let is no replacement. It has virtually none of the benefits, performs worse, and it only applies to let, not any of the other places bindings are created: function parameters, dotimes, dolist, and condition-case.

Faster Elfeed Search Through JIT Byte-code Compilation

Today I pushed an update for Elfeed that doubles the speed of the search filter in the worst case. This is the user-entered expression that dynamically narrows the entry listing to a subset that meets certain criteria: published after a particular date, with/without particular tags, and matching/non-matching zero or more regular expressions. The filter is live, applied to the database as the expression is edited, so it’s important for usability that this search completes under a threshold that the user might notice.

The typical workaround for these kinds of interfaces is to make filtering/searching asynchronous. It’s possible to do this well, but it’s usually a terrible, broken design. If the user acts upon the asynchronous results — say, by typing the query and hitting enter to choose the current or expected top result — then the final behavior is non-deterministic, a race between the user’s typing speed and the asynchronous search. Elfeed will keep its synchronous live search.

For anyone not familiar with Elfeed, here’s a filter that finds all entries from within the past year tagged “youtube” (+youtube) that mention Linux or Linus (linu[xs]), but aren’t tagged “bsd” (-bsd), limited to the most recent 15 entries (#15):

@1-year-old +youtube linu[xs] -bsd #15

The database is primarily indexed over publication date, so filters on publication dates are the most efficient filters. Entries are visited in order starting with the most recently published, and the search can bail out early once it crosses the filter threshold. Time-oriented filters have been encouraged as the solution to keep the live search feeling lively.
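
As a rough sketch of that early bailout (not Elfeed’s actual code; assume entries arrive newest first, each carrying a :date timestamp in epoch seconds):

(defun entries-since (entries after)
  ;; Entries are newest first, so the first one older than AFTER ends
  ;; the scan; everything behind it is older still.
  (cl-loop for entry in entries
           while (>= (plist-get entry :date) after)
           collect entry))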

Filtering Overview

The first step in filtering is parsing the filter text entered by the user. This string is broken into its components using the elfeed-search-parse-filter function. Date filter components are converted into a unix epoch interval, tags are interned into symbols, regular expressions are gathered up as strings, and the entry limit is parsed into a plain integer. Absence of a filter component is indicated by nil.

(elfeed-search-parse-filter "@1-year-old +youtube linu[xs] -bsd #15")
;; => (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)

Previously, the next step was to apply the elfeed-search-filter function with this structured filter representation to the database. Except for special early-bailout situations, it works left-to-right across the filter, checking each condition against each entry. This is analogous to an interpreter, with the filter being a program.

Thinking about it that way, what if the filter was instead compiled into an Emacs byte-code function and executed directly by the Emacs virtual machine? That’s what this latest update does.

Benchmarks

With six different filter components, the actual filtering routine is a bit too complicated for an article, so I’ll set up a simpler, but roughly equivalent, scenario. With a reasonable cut-off date, the filter was already sufficiently fast, so for benchmarking I’ll focus on the worst case: no early bailout opportunities. An entry will be just a list of tags (symbols), and the filter will have to test every entry.

My real-world Elfeed database currently has 46,772 entries with 36 distinct tags. For my benchmark I’ll round this up to a nice 100,000 entries, and use 26 distinct tags (A–Z), which has the nice alphabet property and more closely reflects the number of tags I still care about.

First, here’s make-random-entry to generate a random list of 1–5 tags (i.e. an entry). The state parameter is the random state, allowing for deterministic benchmarks on a randomly-generated database.

(cl-defun make-random-entry (&key state (min 1) (max 5))
  (cl-loop repeat (+ min (cl-random (1+ (- max min)) state))
           for letter = (+ ?A (cl-random 26 state))
           collect (intern (format "%c" letter))))

The database is just a big list of entries. In Elfeed this is actually an AVL tree. Without dates, the order doesn’t matter.

(cl-defun make-random-database (&key state (count 100000))
  (cl-loop repeat count collect (make-random-entry :state state)))

Here’s my old time macro. An important change I’ve made since years ago is to call garbage-collect before starting the clock, eliminating bad samples from unlucky garbage collection events. Depending on what you want to measure, it may even be worth disabling garbage collection during the measurement by setting gc-cons-threshold to a high value.

(defmacro measure-time (&rest body)
  "Return the number of seconds it takes to evaluate BODY.
Runs `garbage-collect' first so that an unlucky collection
triggered by earlier allocations doesn't pollute the sample."
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(progn
       (garbage-collect)
       (let ((,start (float-time)))
         ,@body
         (- (float-time) ,start)))))

Finally, the benchmark harness. It uses a hard-coded seed so that the same pseudo-random database is generated on every run. Each filter function, f, is run 100 times searching for the same 6 tags, and the timing results are averaged.

(cl-defun benchmark (f &optional (n 100) (tags '(A B C D E F)))
  (let* ((state (copy-sequence [cl-random-state-tag -1 30 267466518]))
         (db (make-random-database :state state)))
    (cl-loop repeat n
             sum (measure-time
                   (funcall f db tags))
             into total
             finally return (/ total (float n)))))

The baseline will be memq (test for membership using identity, eq). There are two lists of tags to compare: the list that is the entry, and the list from the filter. This requires a nested loop for each entry, one explicit (cl-loop) and one implicit (memq), both with early bailout.

(defun memq-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (memq tag entry)
                    return t)))

Byte-code compiling everything and running the benchmark on my laptop, I get:

(benchmark #'memq-count)
;; => 0.041 seconds

That’s actually not too bad. One of the advantages of this definition is that there are no function calls. The memq built-in function has its own opcode (62), and the rest of the definition is special forms and macros expanding to special forms (cl-loop). It’s exactly the thing I need to exploit to make filters faster.

As a sanity check, what would happen if I used member instead of memq? In theory it should be slower because it uses equal for tests instead of eq.

(defun member-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (member tag entry)
                    return t)))

It’s only slightly slower because member, like many other built-ins, also has an opcode (157). It’s just a tiny bit more overhead.

(benchmark #'member-count)
;; => 0.047 seconds

To test function call overhead while still using the built-in (e.g. written in C) memq, I’ll alias it so that the byte-code compiler is forced to emit a function call.

(defalias 'memq-alias 'memq)

(defun memq-alias-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (memq-alias tag entry)
                    return t)))

To verify that this is doing what I expect, I M-x disassemble the function and inspect the byte-code disassembly. Here’s a simple example.

(disassemble
 (byte-compile (lambda (list) (memq :foo list))))

When compiled under lexical scope (lexical-binding is true), here’s the disassembly. To understand what this means, see Emacs Byte-code Internals.

0       constant  :foo
1       stack-ref 1
2       memq
3       return

Notice the memq instruction. Try using memq-alias instead:

(disassemble
 (byte-compile (lambda (list) (memq-alias :foo list))))

Resulting in a function call:

0       constant  memq-alias
1       constant  :foo
2       stack-ref 2
3       call      2
4       return

And the benchmark:

(benchmark #'memq-alias-count)
;; => 0.052 seconds

So the function call adds about 27% overhead. This means it would be a good idea to avoid calling functions in the filter if I can help it. I should rely on these special opcodes.

Suppose memq was written in Emacs Lisp rather than C. How much would that hurt performance? My version of my-memq below isn’t quite the same since it returns t rather than the sublist, but it’s good enough for this purpose. (I’m using cl-loop because writing early bailout in plain Elisp without recursion is, in my opinion, ugly.)

(defun my-memq (needle haystack)
  (cl-loop for element in haystack
           when (eq needle element)
           return t))

(defun my-memq-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (my-memq tag entry)
                    return t)))

And the benchmark:

(benchmark #'my-memq-count)
;; => 0.137 seconds

Oof! It’s more than 3 times slower than the opcode. This means I should use built-ins as much as possible in the filter.

Dynamic vs. lexical scope

There’s one last thing to watch out for. Everything so far has been compiled with lexical scope. You should really turn this on by default for all new code that you write. It has three important advantages:

  1. It allows the compiler to catch more mistakes.
  2. It eliminates a class of bugs specific to dynamic scope, where local variables are exposed to accidental manipulation by callees (see the sketch after this list).
  3. Lexical scope has better performance.
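
Here’s a sketch of that second kind of bug, with names made up for illustration:

;; -*- lexical-binding: nil; -*-
(defun reset-count ()
  (setq count 0))          ; means to reset a global counter

(defun how-many (items)
  (let ((count (length items)))
    (reset-count)          ; under dynamic scope, this clobbers the local above
    count))                ; => always 0

Under lexical scope the let binding is invisible to reset-count, so how-many returns the length as intended, and the stray setq merely sets a global value.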

Here are all the benchmarks with the default dynamic scope:

(benchmark #'memq-count)
;; => 0.065 seconds

(benchmark #'member-count)
;; => 0.070 seconds

(benchmark #'memq-alias-count)
;; => 0.074 seconds

(benchmark #'my-memq-count)
;; => 0.256 seconds

In these benchmarks, dynamic scope costs between roughly 40% and 90% more time, for no benefit in return. Under dynamic scope, local variables use the varref opcode — a global variable lookup by name — instead of the stack-ref opcode — a simple array index.

(defun norm (a b)
  (* (- a b) (- a b)))

Under dynamic scope, this compiles to:

0       varref    a
1       varref    b
2       diff
3       varref    a
4       varref    b
5       diff
6       mult
7       return

And under lexical scope (notice the variable names disappear):

0       stack-ref 1
1       stack-ref 1
2       diff
3       stack-ref 2
4       stack-ref 2
5       diff
6       mult
7       return

JIT-compiled filters

So far I’ve been moving in the wrong direction, making things slower rather than faster. How can I make it faster than the straight memq version? By compiling the filter into byte-code.

I won’t write the byte-code directly, but instead generate Elisp code and use the byte-code compiler on it. This is safer, will work correctly in future versions of Emacs, and leverages the optimizations performed by the byte-compiler. This sort of thing recently got a bad rap on Emacs Horrors, but I was happy to see that this technique is already established.

(defun jit-count (db tags)
  (let* ((memq-list (cl-loop for tag in tags
                             collect `(memq ',tag entry)))
         (function `(lambda (db)
                      (cl-loop for entry in db
                               count (or ,@memq-list))))
         (compiled (byte-compile function)))
    (funcall compiled db)))

It dynamically builds the code as an s-expression, runs that through the byte-code compiler, executes it, and throws it away. It’s “just-in-time,” though compiling to byte-code and not native code. For the benchmark tags of (A B C D E F), this builds the following:

(lambda (db)
  (cl-loop for entry in db
           count (or (memq 'A entry)
                     (memq 'B entry)
                     (memq 'C entry)
                     (memq 'D entry)
                     (memq 'E entry)
                     (memq 'F entry))))

Due to its short-circuiting behavior, or is a special form, so this function is just special forms and memq in its opcode form. It’s as fast as Elisp can get.

Having s-expressions is a real strength for lisp, since the alternative (in, say, JavaScript) would be to assemble the function by concatenating code strings. By contrast, this looks a lot like a regular lisp macro. Invoking the byte-code compiler does add some overhead compared to the interpreted filter, but it’s insignificant.
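
Curious about that overhead? The measure-time macro from earlier gives a rough one-off estimate (the exact figure will vary with hardware and Emacs version):

;; Rough cost, in seconds, of compiling one small generated filter.
(measure-time
  (byte-compile '(lambda (entry)
                   (or (memq 'A entry)
                       (memq 'B entry)))))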

How much faster is this?

(benchmark #'jit-count)
;; => 0.017s

It’s more than twice as fast! The big gain here is through loop unrolling. The outer loop has been unrolled into the or expression. That section of byte-code looks like this:

0       constant  A
1       stack-ref 1
2       memq
3       goto-if-not-nil-else-pop 1
6       constant  B
7       stack-ref 1
8       memq
9       goto-if-not-nil-else-pop 1
12      constant  C
13      stack-ref 1
14      memq
15      goto-if-not-nil-else-pop 1
18      constant  D
19      stack-ref 1
20      memq
21      goto-if-not-nil-else-pop 1
24      constant  E
25      stack-ref 1
26      memq
27      goto-if-not-nil-else-pop 1
30      constant  F
31      stack-ref 1
32      memq
33:1    return

In Elfeed, not only does it unroll these loops, it completely eliminates the overhead for unused filter components. Comparing to this benchmark, I’m seeing roughly matching gains in Elfeed’s worst case. In Elfeed, I also bind lexical-binding around the byte-compile call to force lexical scope, since otherwise it just uses the buffer-local value (usually nil).
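
That part looks roughly like this (a sketch, not Elfeed’s exact code), where function is the generated lambda expression as in jit-count:

;; Compile the generated filter under lexical scope regardless of the
;; current buffer's local value of `lexical-binding'.
(let ((lexical-binding t))
  (byte-compile function))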

Filter compilation can be toggled on and off by setting elfeed-search-compile-filter. If you’re up to date, try out live filters with it both enabled and disabled. See if you can notice the difference.

Result summary

Here are the results in a table, all run with Emacs 24.4 on x86-64.

(ms)      memq      member    memq-alias my-memq   jit
lexical   41        47        52         137       17
dynamic   65        70        74         256       21

And here are the same benchmarks on AArch64 (Emacs 24.5, ARM Cortex-A53), where I also occasionally use Elfeed and where I have been especially interested in improving performance.

(ms)      memq      member    memq-alias my-memq   jit
lexical   170       235       242        614       79
dynamic   274       340       345        1130      92

You can run the benchmarks for yourself, perhaps with different parameters, using the benchmark source, jit-bench.el. Its header explains how to run it in batch mode:

$ emacs -Q -batch -f batch-byte-compile jit-bench.el
$ emacs -Q -batch -l jit-bench.elc -f benchmark-batch

A Showerthoughts Fortune File

I have created a fortune file for the all-time top 10,000 /r/Showerthoughts posts, as of October 2016. As a word of warning: Many of these entries are adult humor and may not be appropriate for your work computer. These fortunes would be categorized as “offensive” (fortune -o).

Download: showerthoughts (1.3 MB)

The copyright status of this file is subject to each of its thousands of authors. Since it’s not possible to contact many of these authors — some may no longer be living — it’s obviously never going to be under an open source license (Creative Commons, etc.). What’s more, some quotes are probably originally from comedians and the like, rather than from the redditor who made the post. I distribute it only for fun.

Installation

To install this into your fortune database, first process it with strfile to create a random-access index, showerthoughts.dat, then copy both files into the directory with the rest.

$ strfile showerthoughts
"showerthoughts.dat" created
There were 10000 strings
Longest string: 343 bytes
Shortest string: 39 bytes

$ cp showerthoughts* /usr/share/games/fortunes/

Alternatively, fortune can be told to use this file directly:

$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
        ―AndItsDeepToo, Aug 2016

If you didn’t already know, fortune is an old unix utility that displays a random quotation from a quotation database — a digital fortune cookie. I use it as an interactive login shell greeting on my ODROID-C2 server:

if shopt -q login_shell; then
    fortune ~/.fortunes
fi

How was it made?

Fortunately I didn’t have to do something crazy like scrape reddit for weeks on end. Instead, I downloaded the pushshift.io submission archives, which is currently around 70 GB compressed. Each file contains one month’s worth of JSON data, one object per submission, one submission per line, all compressed with bzip2.

Unlike so many other datasets, especially those made up of arbitrary input from millions of people, the /r/Showerthoughts posts are surprisingly clean and require virtually no touching up. It’s some really fantastic data.

A nice feature of bzip2 is that concatenating compressed files also concatenates their uncompressed contents. Additionally, it’s easy to parallelize bzip2 compression and decompression, which gives it an edge over xz. I strongly recommend using lbzip2 to decompress this data, should you want to process it yourself.

cat RS_*.bz2 | lbunzip2 > everything.json

jq is my favorite command line tool for processing JSON (and rendering fractals). To filter all the /r/Showerthoughts posts, it’s a simple select expression. Just mind the capitalization of the subreddit’s name. The -c option tells jq to keep the output one object per line.

cat RS_*.bz2 | \
    lbunzip2 | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

However, you’ll quickly find that jq becomes the bottleneck, parsing all that JSON, and lbzip2 won’t exploit your cores as it should. So I throw grep in front to dramatically decrease the workload for jq.

cat RS_*.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

This will let some extra things through, but it’s a superset. The -a option is necessary because the data contains some null bytes. Without it, grep switches into binary mode and breaks everything. This is incredibly frustrating when you’ve already waited half an hour for results.

To reduce the workload further down the pipeline, I take advantage of the fact that only four fields will be needed: title, score, author, and created_utc. The rest can — and should, for efficiency’s sake — be thrown away where it’s cheap to do so.

cat RS_*.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts") |
               {title, score, author, created_utc}' \
    > showerthoughts.json

This gathers all 1,199,499 submissions into a 185 MB JSON file (as of this writing). Most of these submissions are terrible, so the next step is narrowing it to the small set of good submissions and putting them into the fortune database format.

It turns out reddit already has a method for finding the best submissions: a voting system. Just pick the highest scoring posts. Through experimentation I arrived at 10,000 as the magic cut-off number. After this the quality really starts to drop off. Over time this should probably be scaled up with the total number of submissions.

I did both steps at the same time using a bit of Emacs Lisp, which is particularly well-suited to the task.

The program reads one JSON object at a time and sticks each into an AVL tree sorted by score (descending), then timestamp (ascending), then title (ascending). The AVL tree is limited to 10,000 items, with the lowest items being dropped. This was a lot faster than the more obvious approach: collecting everything into a big list, sorting it, and keeping the top 10,000 items.
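
The original program isn’t reproduced here, but here’s a minimal sketch of the idea using Emacs’ built-in avl-tree and json libraries, simplified to sort only by score and title (all names are illustrative):

(require 'avl-tree)
(require 'json)

(defun entry-lessp (a b)
  "Non-nil if submission A sorts before B (higher score first)."
  (let ((score-a (cdr (assq 'score a)))
        (score-b (cdr (assq 'score b))))
    (or (> score-a score-b)
        (and (= score-a score-b)
             (string< (cdr (assq 'title a))
                      (cdr (assq 'title b)))))))

(defun top-submissions (file limit)
  "Return the LIMIT highest-scoring submissions in FILE, one JSON object per line."
  (let ((tree (avl-tree-create #'entry-lessp)))
    (with-temp-buffer
      (insert-file-contents file)
      (while (not (eobp))
        (avl-tree-enter tree (json-read))
        (when (> (avl-tree-size tree) limit)
          (avl-tree-delete tree (avl-tree-last tree)))  ; drop the lowest
        (forward-line)))
    (avl-tree-flatten tree)))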

Formatting

The most complicated part is actually paragraph wrapping the submissions. Most are too long for a single line, and letting the terminal hard wrap them is visually unpleasing. The submissions are encoded in UTF-8, some with characters beyond simple ASCII. Proper wrapping requires not just Unicode awareness, but also some degree of Unicode rendering. The algorithm needs to recognize grapheme clusters and know the size of the rendered text. This is not so trivial! Most paragraph wrapping tools and libraries get this wrong, some counting width by bytes, others counting width by codepoints.

Emacs’ M-x fill-paragraph knows how to do all these things — only for a monospace font, which is all I needed — and I decided to leverage it when generating the fortune file. Here’s an example that paragraph-wraps a string:

(defun string-fill-paragraph (s)
  (with-temp-buffer
    (insert s)
    (fill-paragraph)
    (buffer-string)))

For the file format, items are delimited by a % on a line by itself. I put the wrapped content, followed by a quotation dash, the author, and the date. A surprising number of these submissions have date-sensitive content (“on this day X years ago”), so I found it was important to include a date.

April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
        ―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
        ―MKLV, Aug 2015
%

There’s the potential that a submission itself could end with a lone % and, with a bit of bad luck, it happens to wrap that onto its own line. Fortunately this hasn’t happened yet. But, now that I’ve advertised it, someone could make such a submission, popular enough for the top 10,000, with the intent to personally trip me up in a future update. I accept this, though it’s unlikely, and it would be fairly easy to work around if it happened.

The strfile program looks for the % delimiters and fills out a table of file offsets. The header of the .dat file indicates the number of strings along with some other metadata. What follows the header is a table of 32-bit file offsets.

struct {
    uint32_t str_version;  /* version number */
    uint32_t str_numstr;   /* # of strings in the file */
    uint32_t str_longlen;  /* length of longest string */
    uint32_t str_shortlen; /* shortest string length */
    uint32_t str_flags;    /* bit field for flags */
    char str_delim;        /* delimiting character */
}

Note that the table doesn’t necessarily need to list the strings in the same order as they appear in the original file. In fact, recent versions of strfile can sort the strings simply by sorting the table, all without touching the original file. Though none of this is important to fortune.

Now that you know how it all works, you can build your own fortune file from your own inputs!

Portable Structure Access with Member Offset Constants

Suppose you need to write a C program to access a long sequence of structures in a binary file with a specified format. These structures have different lengths and contents, but they share a common header identifying each structure’s type and size. Here’s the definition of that header (no padding):

struct event {
    uint64_t time;   // unix epoch (microseconds)
    uint32_t size;   // including this header (bytes)
    uint16_t source;
    uint16_t type;
};

The size member is used to find the offset of the next structure in the file without knowing anything else about the current structure. Just add size to the offset of the current structure.

The type member indicates what kind of data follows this structure. The program is likely to switch on this value.

The actual structures might look something like this (in the spirit of X-COM). Note how each structure begins with struct event as header. All angles are expressed using binary scaling.

#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30

struct observer {
    struct event event;
    uint32_t latitude;   // binary scaled angle
    uint32_t longitude;  //
    uint16_t source_id;  // later used for event source
    uint16_t name_size;  // not including null terminator
    char name[];
};

struct ufo_sighting {
    struct event event;
    uint32_t azimuth;    // binary scaled angle
    uint32_t elevation;  //
};

struct suspicious_signal {
    struct event event;
    uint16_t num_channels;
    uint16_t sample_rate;  // Hz
    uint32_t num_samples;  // per channel
    int16_t samples[];
};

If all integers are stored in little endian byte order (least significant byte first), there’s a strong temptation to lay the structures directly over the data. After all, this will work correctly on most computers.

struct event header;
fread(&header, sizeof(header), 1, file);
switch (header.type) {
    // ...
}

This code will not work correctly when:

  1. The host machine doesn’t use little endian byte order, though this is now uncommon. Sometimes developers will attempt to detect the byte order at compile time and use the preprocessor to byte-swap if needed. This is a mistake.

  2. The host machine has different alignment requirements and so introduces additional padding to the structure. Sometimes this can be resolved with a non-standard #pragma pack.

Integer extraction functions

Fortunately it’s easy to write fast, correct, portable code for this situation. First, define some functions to extract little endian integers from an octet buffer (uint8_t). These will work correctly regardless of the host’s alignment and byte order.

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)buf[1] << 8 |
           (uint16_t)buf[0] << 0;
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)buf[7] << 56 |
           (uint64_t)buf[6] << 48 |
           (uint64_t)buf[5] << 40 |
           (uint64_t)buf[4] << 32 |
           (uint64_t)buf[3] << 24 |
           (uint64_t)buf[2] << 16 |
           (uint64_t)buf[1] <<  8 |
           (uint64_t)buf[0] <<  0;
}

The big endian version is identical, but with shifts in reverse order.

A common concern is that these functions are a lot less efficient than they could be. On x86 where alignment is very relaxed, each could be implemented as a single load instruction. However, on GCC 4.x and earlier, extract_u32le compiles to something like this:

extract_u32le:
        movzx   eax, [rdi+3]
        sal     eax, 24
        mov     edx, eax
        movzx   eax, [rdi+2]
        sal     eax, 16
        or      eax, edx
        movzx   edx, [rdi]
        or      eax, edx
        movzx   edx, [rdi+1]
        sal     edx, 8
        or      eax, edx
        ret

It’s tempting to fix the problem with the following definition:

// Note: Don't do this.
static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return *(uint32_t *)buf;
}

It’s unportable, it’s undefined behavior, and worst of all, it might not work correctly even on x86. Fortunately I have some great news. On GCC 5.x and above, the correct definition compiles to the desired, fast version. It’s the best of both worlds.

extract_u32le:
        mov     eax, [rdi]
        ret

It’s even smart about the big endian version:

static inline uint32_t
extract_u32be(const uint8_t *buf)
{
    return (uint32_t)buf[0] << 24 |
           (uint32_t)buf[1] << 16 |
           (uint32_t)buf[2] <<  8 |
           (uint32_t)buf[3] <<  0;
}

Is compiled to exactly what you’d want:

extract_u32be:
        mov     eax, [rdi]
        bswap   eax
        ret

Or, even better, if your system supports movbe (gcc -mmovbe):

extract_u32be:
        movbe   eax, [rdi]
        ret

Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m betting it will eventually learn how to do this, too.

Member offset constants

For this next technique, that struct event from above need not actually be in the source. It’s purely documentation. Instead, let’s define the structure in terms of member offset constants — a term I just made up for this article. I’ve included the integer types as part of the name to aid in their correct use.

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

Given a buffer, the integer extraction functions, and these offsets, structure members can be plucked out on demand.

uint8_t *buf;
// ...
uint64_t time   = extract_u64le(buf + EVENT_U64LE_TIME);
uint32_t size   = extract_u32le(buf + EVENT_U32LE_SIZE);
uint16_t source = extract_u16le(buf + EVENT_U16LE_SOURCE);
uint16_t type   = extract_u16le(buf + EVENT_U16LE_TYPE);

On x86 with GCC 5.x, each member access will be inlined and compiled to a one-instruction extraction. As far as performance is concerned, it’s identical to using a structure overlay, but this time the C code is clean and portable. A slight downside is the lack of type checking on member access: it’s easy to mismatch the types and accidentally read garbage.
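
One cheap way to mitigate that, at the cost of a little boilerplate, is a thin accessor per member so that the integer type lives next to its offset (a sketch; these wrapper names are illustrative):

// Optional convenience wrappers around the offset constants.
static inline uint64_t
event_time(const uint8_t *buf)
{
    return extract_u64le(buf + EVENT_U64LE_TIME);
}

static inline uint16_t
event_type(const uint8_t *buf)
{
    return extract_u16le(buf + EVENT_U16LE_TYPE);
}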

Memory mapping and iteration

There’s a real advantage to memory mapping the input file and using its contents directly. On a system with a huge virtual address space, such as x86-64 or AArch64, this memory is almost “free.” Already being backed by a file, paging out this memory costs nothing (i.e. it’s discarded). The input file can comfortably be much larger than physical memory without straining the system.

Unportable structure overlay can take advantage of memory mapping this way, but has the previously-described issues. An approach with member offset constants will take advantage of it just as well, all while remaining clean and portable.

I like to wrap the memory mapping code in a simple interface, which makes porting to non-POSIX platforms, such as Windows, easier. Caveat: This won’t work with files whose size exceeds the available contiguous virtual memory of the system — a real problem for 32-bit systems.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

uint8_t *
map_file(const char *path, size_t *length)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    struct stat stat;
    if (fstat(fd, &stat) == -1) {
        close(fd);
        return 0;
    }

    *length = stat.st_size;  // TODO: possible overflow
    uint8_t *p = mmap(0, *length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

void
unmap_file(uint8_t *p, size_t length)
{
    munmap(p, length);
}

Next, here’s an example that iterates over all the structures in input_file, in this case counting each. The size member is extracted in order to stride to the next structure.

size_t length;
uint8_t *data = map_file(input_file, &length);
if (!data)
    FATAL();

size_t event_count = 0;
uint8_t *p = data;
while (p < data + length) {
    if ((size_t)(data + length - p) < 16)
        FATAL();  // truncated header
    event_count++;
    uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
    if (size < 16 || size > length - (p - data))
        FATAL();  // invalid size (the header alone is 16 bytes)
    p += size;
}
printf("I see %zu events.\n", event_count);

unmap_file(data, length);

This is the basic structure for navigating this kind of data. A deeper dive would involve a switch inside the loop, extracting the relevant members for whatever use is needed.
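
For example, handling a UFO sighting inside that loop might look like the following sketch. The UFO_* offsets aren’t defined anywhere above; they follow from the structure layout (a 16-byte event header, then the two angles):

#define UFO_U32LE_AZIMUTH    16
#define UFO_U32LE_ELEVATION  20

uint16_t type = extract_u16le(p + EVENT_U16LE_TYPE);
switch (type) {
case EVENT_TYPE_UFO_SIGHTING: {
    uint32_t azimuth   = extract_u32le(p + UFO_U32LE_AZIMUTH);
    uint32_t elevation = extract_u32le(p + UFO_U32LE_ELEVATION);
    // ... use azimuth and elevation ...
    break;
}
default:
    break;  // unknown event types are skipped via their size field
}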

Fast, correct, simple. Pick three.
