When the Compiler Bites

Update: There are discussions on Reddit and on Hacker News.

So far this year I’ve been bitten three times by compiler edge cases in GCC and Clang, each time catching me totally by surprise. Two were caused by historical artifacts, where an ambiguous specification lead to diverging implementations. The third was a compiler optimization being far more clever than I expected, behaving almost like an artificial intelligence.

In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.

x86-64 ABI ambiguity

The first time I was bit — or, well, narrowly avoided being bit — was when I examined a missed floating point optimization in both Clang and GCC. Consider this function:

double
zero_multiply(double x)
{
    return x * 0.0;
}

The function multiplies its argument by zero and returns the result. Any number multiplied by zero is zero, so this should always return zero, right? Unfortunately, no. IEEE 754 floating point arithmetic supports NaN, infinities, and signed zeros. This function can return NaN, positive zero, or negative zero. (In some cases, the operation could also potentially produce a hardware exception.)

As a result, both GCC and Clang perform the multiply:

zero_multiply:
    xorpd  xmm1, xmm1
    mulsd  xmm0, xmm1
    ret

The -ffast-math option relaxes the C standard floating point rules, permitting an optimization at the cost of conformance and consistency:

zero_multiply:
    xorps  xmm0, xmm0
    ret

Side note: -ffast-math doesn’t necessarily mean “less precise.” Sometimes it will actually improve precision.

Here’s a modified version of the function that’s a little more interesting. I’ve changed the argument to a short:

double
zero_multiply_short(short x)
{
    return x * 0.0;
}

It’s no longer possible for the argument to be one of those special values. The short will be promoted to one of 65,535 possible double values, each of which results in 0.0 when multiplied by 0.0. GCC misses this optimization (-Os):

zero_multiply_short:
    movsx     edi, di       ; sign-extend 16-bit argument
    xorps     xmm1, xmm1    ; xmm1 = 0.0
    cvtsi2sd  xmm0, edi     ; convert int to double
    mulsd     xmm0, xmm1
    ret

Clang also misses this optimization:

zero_multiply_short:
    cvtsi2sd xmm1, edi
    xorpd    xmm0, xmm0
    mulsd    xmm0, xmm1
    ret

But hang on a minute. This is shorter by one instruction. What happened to the sign-extension (movsx)? Clang is treating that short argument as if it were a 32-bit value. Why do GCC and Clang differ? Is GCC doing something unnecessary?

It turns out that the x86-64 ABI didn’t specify what happens with the upper bits in argument registers. Are they garbage? Are they zeroed? GCC takes the conservative position of assuming the upper bits are arbitrary garbage. Clang takes the boldest position of assuming arguments smaller than 32 bits have been promoted to 32 bits by the caller. This is what the ABI specification should have said, but currently it does not.

Fortunately GCC also conservative when passing arguments. It promotes arguments to 32 bits as necessary, so there are no conflicts when linking against Clang-compiled code. However, this is not true for Intel’s ICC compiler: Clang and ICC are not ABI-compatible on x86-64.

I don’t use ICC, so that particular issue wouldn’t bite me, but if I was ever writing assembly routines that called Clang-compiled code, I’d eventually get bit by this.

Floating point precision

Without looking it up or trying it, what does this function return? Think carefully.

int
float_compare(void)
{
    float x = 1.3f;
    return x == 1.3f;
}

Confident in your answer? This is a trick question, because it can return either 0 or 1 depending on the compiler. Boy was I confused when this comparison returned 0 in my real world code.

$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1

So what’s going on here? The original ANSI C specification wasn’t clear about how intermediate floating point values get rounded, and implementations all did it differently. The C99 specification cleaned this all up and introduced FLT_EVAL_METHOD. Implementations can still differ, but at least you can now determine at compile-time what the compiler would do by inspecting that macro.

Back in the late 1980’s or early 1990’s when the GCC developers were deciding how GCC should implement floating point arithmetic, the trend at the time was to use as much precision as possible. On the x86 this meant using its support for 80-bit extended precision floating point arithmetic. Floating point operations are performed in long double precision and truncated afterward (FLT_EVAL_METHOD == 2).

In float_compare() the left-hand side is truncated to a float by the assignment, but the right-hand side, despite being a float literal, is actually “1.3” at 80 bits of precision as far as GCC is concerned. That’s pretty unintuitive!

The remnants of this high precision trend are still in JavaScript, where all arithmetic is double precision (even if simulated using integers), and great pains have been made to work around the performance consequences of this. Until recently, Mono had similar issues.

The trend reversed once SIMD hardware became widely available and there were huge performance gains to be had. Multiple values could be computed at once, side by side, at lower precision. So on x86-64, this became the default (FLT_EVAL_METHOD == 0). The young Clang compiler wasn’t around until well after this trend reversed, so it behaves differently than the backwards compatible GCC on the old x86.

I’m a little ashamed that I’m only finding out about this now. However, by the time I was competent enough to notice and understand this issue, I was already doing nearly all my programming on the x86-64.

Built-in Function Elimination

I’ve saved this one for last since it’s my favorite. Suppose we have this little function, new_image(), that allocates a greyscale image for, say, some multimedia library.

static unsigned char *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

It’s a static function because this would be part of some slick header library (and, secretly, because it’s necessary for illustrating the issue). Being a responsible citizen, the function even checks for integer overflow before allocating anything.

I write a unit test to make sure it detects overflow. This function should return 0.

/* expected return == 0 */
int
test_new_image_overflow(void)
{
    void *p = new_image(2, SIZE_MAX, 0);
    return !!p;
}

So far my test passes. Good.

I’d also like to make sure it correctly returns NULL — or, more specifically, that it doesn’t crash — if the allocation fails. But how can I make malloc() fail? As a hack I can pass image dimensions that I know cannot ever practically be allocated. Essentially I want to force a malloc(SIZE_MAX), e.g. allocate every available byte in my virtual address space. For a conventional 64-bit machine, that’s 16 exibytes of memory, and it leaves space for nothing else, including the program itself.

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image(1, SIZE_MAX, 0xff);
    return !!p;
}

I compile with GCC, test passes. I compile with Clang and the test fails. That is, the test somehow managed to allocate 16 exibytes of memory, and initialize it. Wat?

Disassembling the test reveals what’s going on:

test_new_image_overflow:
    xor  eax, eax
    ret

test_new_image_oom:
    mov  eax, 1
    ret

The first test is actually being evaluated at compile time by the compiler. The function being tested was inlined into the unit test itself. This permits the compiler to collapse the whole thing down to a single instruction. The path with malloc() became dead code and was trivially eliminated.

In the second test, Clang correctly determined that the image buffer is not actually being used, despite the memset(), so it eliminated the allocation altogether and then simulated a successful allocation despite it being absurdly large. Allocating memory is not an observable side effect as far as the language specification is concerned, so it’s allowed to do this. My thinking was wrong, and the compiler outsmarted me.

I soon realized I can take this further and trick Clang into performing an invalid optimization, revealing a bug. Consider this slightly-optimized version that uses calloc() when the shade is zero (black). The calloc() function does its own overflow check, so new_image() doesn’t need to do it.

static void *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (shade == 0) { // shortcut
        p = calloc(w, h);
    } else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, color, w * h);
        }
    }
    return p;
}

With this change, my overflow unit test is now also failing. The situation is even worse than before. The calloc() is being eliminated despite the overflow, and replaced with a simulated success. This time it’s actually a bug in Clang. While failing a unit test is mostly harmless, this could introduce a vulnerability in a real program. The OpenBSD folks are so worried about this sort of thing that they’ve disabled this optimization.

Here’s a slightly-contrived example of this. Imagine a program that maintains a table of unsigned integers, and we want to keep track of how many times the program has accessed each table entry. The “access counter” table is initialized to zero, but the table of values need not be initialized, since they’ll be written before first access (or something like that).

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init(struct table *t, size_t n)
{
    t->counter = calloc(n, sizeof(*t->counter));
    if (t->counter) {
        /* Overflow already tested above */
        t->values = malloc(n * sizeof(*t->values));
        if (!t->values) {
            free(t->counter);
            return 0; // fail
        }
        return 1; // success
    }
    return 0; // fail
}

This function relies on the overflow test in calloc() for the second malloc() allocation. However, this is a static function that’s likely to get inlined, as we saw before. If the program doesn’t actually make use of the counter table, and Clang is able to statically determine this fact, it may eliminate the calloc(). This would also eliminate the overflow test, introducing a vulnerability. If an attacker can control n, then they can overwrite arbitrary memory through that values pointer.

The takeaway

Besides this surprising little bug, the main lesson for me is that I should probably isolate unit tests from the code being tested. The easiest solution is to put them in separate translation units and don’t use link-time optimization (LTO). Allowing tested functions to be inlined into the unit tests is probably a bad idea.

The unit test issues in my real program, which was a bit more sophisticated than what was presented here, gave me artificial intelligence vibes. It’s that situation where a computer algorithm did something really clever and I felt it outsmarted me. It’s creepy to consider how far that can go. I’ve gotten that even from observing AI I’ve written myself, and I know for sure no human taught it some particularly clever trick.

My favorite AI story along these lines is about an AI that learned how to play games on the Nintendo Entertainment System. It didn’t understand the games it was playing. It’s optimization task was simply to choose controller inputs that maximized memory values, because that’s generally associated with doing well — higher scores, more progress, etc. The most unexpected part came when playing Tetris. Eventually the screen would fill up with blocks, and the AI would face the inevitable situation of losing the game, with all that memory being reinitialized to low values. So what did it do?

Just before the end it would pause the game and wait… forever.

Blast from the Past: Borland C++ on Windows 98

My first exposure to C and C++ was a little over 20 years ago. I remember it being some version of Borland C++, either 4.x or 5.x, running on Windows 95. I didn’t have a mentor, so I did the best I could slowly working through what was probably a poorly written beginner C++ book, typing out the examples and exercises with little understanding. Since I didn’t learn much from the experience, there was a 7 or 8 year gap before I’d revisit C and C++ in college.

I thought it would be interesting to revisit this software, to reevaluate it from a far more experienced perspective. Keep in mind that C++ wasn’t even standardized yet, and the most recent C standard was from 1989. Given this, what was it like to be a professional software developer using a Borland toolchain on Windows 20 years ago? Was it miserable, made bearable only by ignorance of how much better the tooling could be? Or maybe it actually wasn’t so bad, and these tools are better than I expect?

Ultimately my conclusion is that it’s a little bit of both. There are some significant capability gaps compared to today, but the core toolchain itself is actually quite reasonable, especially for the mid 1990s.

The setup

Before getting into the evaluation, let’s discuss how I got it all up and running. While it’s technically possible to run Windows 95 on a modern x86-64 machine thanks to the architecture’s extreme backwards compatibility, it’s more compatible, simpler, and safer to virtualize it. Most importantly, I can emulate older hardware that will have better driver support.

Despite that early start in Windows all those years ago, I’m primarily a Linux user. The premier virtualization solution on Linux these days is KVM, a kernel module that turns Linux into a hypervisor and makes efficient use of hardware virtualization extensions. Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m using QEmu (with KVM disabled), a hardware emulator closely associated with KVM. Since it doesn’t take advantage of hardware virtualization extensions, it will be slower. This is fine since my goal is to emulate slow, 20+ year old hardware anyway.

There’s very little practical difference between Windows 95 and Windows 98. Since Windows 98 runs a lot smoother virtualized, I decided to go with that instead. This will be perfectly sufficient for my toolchain evaluation.

Software

To get started, I’ll need an installer for Windows 98. I thought this would be difficult to find, but there’s a copy available on the Internet Archive. I don’t know how “legitimate” this is, but it works. Since it’s running in a virtual machine without network access, I also don’t really care if this copy is somehow infected with malware.

Internet Archive: Windows 98 Second Edition

Also on the Internet Archive is a complete copy of Borland C++ 5.02, with the same caveats of legitimacy. It works, which is good enough for my purposes.

Internet Archive: Borland C++ 5.02

Thank you Internet Archive!

Hardware

I’ve got my software, now to set up the virtualized hardware. First I create a drive image:

$ qemu-image create -fqcow2 win98.img 8G

I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a virtual hard drive with modern sizes would probably break the installer. This sort of issue is a common theme among old software, where there may be complaints about negative available disk space due to signed integer overflow.

I decided to give the machine 256MB of memory (-m 256). This is also a little excessive, but I wanted to be sure memory didn’t limit Borland’s capabilities. This amount of memory is close to the upper bound, and going much beyond will likely cause problems with Windows 98.

For the CPU I settled on a Pentium (-cpu pentium). My original goal was to go a little simpler with a 486 (-cpu 486), but the Windows 98 installer kept crashing when I tried this.

I experimented with different configurations for the network card, but I couldn’t get anything to work. So I’ve disabled networking (-net none). The only reason I’d want this is that it would be easier to move files in and out of the virtual machine.

Finally, here’s how I ran QEmu. The last two lines are only needed when installing.

$ qemu-system-x86_64 \
    -localtime \
    -cpu pentium \
    -no-acpi \
    -no-hpet \
    -m 256 \
    -hda win98.img \
    -soundhw sb16 \
    -vga cirrus \
    -net none \
    -cdrom "Windows 98 Second Edition.iso" \
    -boot d

Installation

Installation is just a matter of following the instructions. You’ll need that product key listed on the Internet Archive site.

That copy of Borland is just a big .zip file. This presents two problems.

  1. Without network access, I’ll need to figure out how to get this inside the virtual machine.

  2. This version of Windows doesn’t come with software to unzip this file. I’d need to find and install an unzip tool first.

Fortunately I can kill two birds with one stone by converting that .zip archive into a .iso and mounting it in the virtual machine.

unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"

Then in the QEmu console (C-A-2) I attach it:

change ide1-cd0 borland.iso

This little trick of generating .iso files and mounting them is how I will be moving all the other files into the virtual machine.

Borland C++

The first thing I did was play around with with Borland IDE. This is what I would have been using 20 years ago.

Despite being Borland C++, I’m personally most interested in its ANSI C compiler. As I already pointed out, this software pre-dates C++’s standardization, and a lot has changed over the past two decades. On the other hand, C hasn’t really changed all that much. The 1999 update to the C standard (e.g. “C99”) was big and important, but otherwise little has changed. The biggest drawback is the lack of “declare anywhere” variables, including in for-loop initializers. Otherwise it’s the same as writing C today.

To test drive the IDE, I made a couple of test projects, built and ran them with different options, and poked around with the debugger. The debugger is actually pretty decent, especially for the 1990s. It can be operated via the IDE or standalone, so I could use it without firing up the IDE and making a project.

The toolchain includes an assembler, and I can inspect the compiler’s assembly output. To nobody’s surprise this is Intel-flavored assembly, which is very welcome. Imagining myself as a software developer in the mid 1990s, this means I can see exactly what the compiler’s doing as well as write some of the performance sensitive parts in assembly if necessary.

The built-in editor is the worst part of the IDE, which is unfortunate since it really spoils the whole experience. It’s easy to jump between warnings and errors, it has incremental search, and it has good syntax highlighting. But these are the only positive things I can say about it. If I had to work with this editor full-time, I’d spend my days pretty irritated.

Switch to command line tools

Like with the debugger, the Borland people did a good job modularizing their development tools. As part of the installation process, all of the Borland command line tools are added to the system PATH (reminder: this is a single-user system). This includes compiler, linker, assembler, debugger, and even an incomplete implementation of make.

With this, I can essentially pretend the IDE doesn’t exist and replace that crummy editor with something better: Vim.

The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3, released in 2010. I download those binaries, trim a few things from my .vimrc, and smuggle it all into my virtual machine via a virtual CD. I’ve now got a powerful text editor in Windows 98 and my situation has drastically improved.

Since I hardly use features added since Vim 7.3, this feels right at home to me. I can invoke the build from Vim, and it can populate the quickfix list from Borland’s output, so I could actually be fairly productive in these circumstances! I’m honestly really impressed with how well this all works together.

At this point I only have two significant annoyances:

  1. Borland’s command line tools belong to that category of irritating programs that print their version banner on every invocation. There’s not even a command line switch to turn this off. All this noise is quickly tiresome. The Visual Studio toolchain does the same thing by default, though it can be turned off (-nologo). I dislike that some GNU tools also commit this sin, but at least GNU limits this to interactive programs.

  2. The Windows/DOS command shell and console is even worse than it is today. I didn’t think that was possible. This is back when it was still genuinely DOS and not just pretending to be (e.g. in NT). The worst part by far is the lack of command history. There’s no using the up-arrow to get previous commands. There’s no tab completion. Forward slash is not a substitute for backslash in paths. If I wanted to improve my productivity, replacing this console and shell would be the first priority.

Update: In an email, Aristotle Pagaltzis informed me that Windows 98 comes with DOSKEY.COM, which provides command history for COMMAND.EXE. Alternatively there’s Enhanced DOSKEY.com, an open source, alternative implementation that also provides tab completion for commands and filesnames. This makes the console a lot more usable (and, honestly, in some ways better than the modern defaults).

Building Enchive with Borland

Last year I wrote a backup encryption tool called Enchive, and I still use it regularly. One of my design goals was high portability since it may be needed to decrypt something important in the distant future. It should be as bit-rot-proof as possible. In software, the best way to future-proof is to past-proof.

If I had a time machine that could send source code back in time, and I sent Enchive to a competant developer 20 years ago, would they be able to compile it and run it? If the answer is yes, then that means Enchive already has 20 years of future-proofing built into it.

To accomplish this, Enchive is 3,300 lines of strict ANSI C, 1989-style, with no dependencies other than the C standard library and a handful of operating system functions — e.g. functionality not in the C standard library. In practice, any ANSI C compiler targeting either POSIX, or Windows 95 or later, should be able to compile it.

My Windows 98 virtual machine includes an ANSI C compiler, and can be used to simulate this time machine. I generated an “amalgamation” build (make amalgamation) — essentially a concatenation of all the source files — and sent this into the virtual machine. Before Borland was able to compile it, I needed to make three small changes.

First, Enchive includes stdint.h to get fixed-width integers needed for the encryption routines. This header comes from C99, and C89 has no equivalent. I anticipated this problem from the beginning and made it easy for the person performing the build to correct it. This header is included exactly once, in config.h, and this is placed at the top of the amalgamation build. The include only needs to be replaced with a handful of manual typedefs. For Borland that looks like this:

typedef unsigned char    uint8_t;
typedef unsigned short   uint16_t;
typedef unsigned long    uint32_t;
typedef unsigned __int64 uint64_t;

typedef long             int32_t;
typedef __int64          int64_t;

#define INT8_C(n)   (n)
#define INT16_C(n)  (n)
#define INT32_C(n)  (n##U)

Second, in more recent versions of Windows, GetFileAttributes() can return the value INVALID_FILE_ATTRIBUTES. Checking for an error that cannot happen is harmless, but this value isn’t defined in Borland’s SDK. I only had to eliminate that check.

Third, the CryptGenRandom() interface isn’t defined in Borland’s SDK. This is used by Enchive to generate keys. MSDN reports this function wasn’t available until Windows XP, but it’s definitely there in Windows 98, exported by ADVAPI32.dll. I’m able to call it, though it always reports an error. Perhaps it’s been disabled in this version due to cryptographic export restrictions?

Regardless of what’s wrong, I ripped this out and replaced it with a fatal error. This version of Enchive can’t generate new keys — unless derived from a passphrase — nor encrypt files, including the use of a protection key to encrypt the secret key. However, it can decrypt files, which is the important part that needs to be future-proofed.

With this three changes — which took me about 10 minutes to sort out — Enchive builds and runs, and it correctly decrypts files I encrypted on Linux. So Enchive has at least 20 years of past-proofing! The screenshot at the top of this article shows it running successfully in an MS-DOS console window.

What’s wrong? What’s missing?

I mentioned that there were some gaps. The most obvious is the lack of the standard POSIX utilities, especially a decent shell. I don’t know if any had been ported to Windows in the mid 1990s. But that could be solved one way or another without too much trouble, even if it meant doing some of that myself.

No, the biggest capability I’d miss, and which wouldn’t be easily obtained, is Git, or a least a decent source control system. I really don’t want to work without proper source control. Git’s support for Windows is second tier, and the port to modern Windows is already a bit of a hack. Getting it to run in Windows 98 would probably be a challenge, especially if I had to compile it with Borland.

The other major issue is the lack of stability. In this experiment, I’ve been seeing this screen a lot:

I remember Windows crashing a lot back in those days, and it certainly had a bad reputation for being unstable, but this is far worse than I remembered. While the hardware emulator may be somewhat at fault here, keep in mind that I never installed third party drivers. Most of these crashes are Windows’ fault. I found I can reliably bring the whole system down with a single GetProcAddress() call on a system DLL. The only way I can imagine this instability was so tolerated back then was general ignorance that computing could be so much better.

I was tempted to write this article in Vim on Windows 98, but all this crashing made me too nervous. I didn’t want some stupid filesystem corruption to wipe out my work. Too risky.

A better alternative

If I was stuck working in Windows 98 — or was at least targeting it as a platform — but had access to a modern tooling ecosystem, could I do better than Borland? Yes! Programs built by Mingw-w64 can be run even as far back as Windows 95.

Now, there’s a catch. I thought it would be this simple:

$ i686-w64-mingw32-gcc -Os hello.c

But when I brought the resulting binary into the virtual machine it crashed when ran it: illegal instruction. Turns out it contained a conditional move (cmov) which is an instruction not available until the Pentium Pro (686). The “pentium” emulation is just a 586.

I tried to disable cmov by picking the specific architecture:

$ i686-w64-mingw32-gcc -march=pentium -Os hello.c

This still didn’t work because the statically-linked part of the CRT contained the cmov. I’d have to recompile that as well.

I could have switched the QEmu options to “upgrade” to a Pentium Pro, but remember that my goal was really the 486. Fortunately this was easy to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number of times before, so I knew it wouldn’t be difficult.

I could go step by step, but it’s all fairly well documented in the Mingw-64 “howto-build” document. I used GCC 7.3 (the latest version), and for the target I picked “i486-w64-mingw32”. When it was done I could compile binaries on Linux to run in my Windows 98 virtual machine:

$ i486-w64-mingw32-gcc -Os hello.c

This should enable quite a bit of modern software to run inside my virtual machine if I so wanted. I didn’t actually try this (yet?), but, to take this concept all the way, I could use this cross-compiler to cross-compile Mingw-w64 itself to run inside the virtual machine, directly replacing Borland C++.

And the only thing I’d miss about Borland is its debugger.

A Crude Personal Package Manager

For the past couple of months I’ve been using a custom package manager to manage a handful of software packages within various unix-like environments. Packages are installed in my home directory under ~/.local/bin, and the package manager itself is just a 110 line Bourne shell script. It’s is not intended to replace the system’s package manager but, instead, compliment it in some cases where I need more flexibility. I use it to run custom versions of specific pieces of software — newer or older than the system-installed versions, or with my own patches and modifications — without interfering with the rest of system, and without a need for root access. It’s worked out really well so far and I expect to continue making heavy use of it in the future.

It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):

Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.

I say it’s crude because its “install” (-I) procedure is little more than a wrapper around tar. It doesn’t invoke libtool after installing a library, and there’s no post-install script — or postinst as Debian calls it. It doesn’t check for conflicts between packages, though there’s a command for doing so manually ahead of time. It doesn’t manage dependencies, nor even have them as a concept. That’s all on the user to screw up.

In other words, it doesn’t attempt solve most of the hard problems tackled by package managers… except for three important issues:

  1. It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.

  2. Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.

  3. It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.

The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.

The second caveat is that some builds tailor themselves by default to the host (e.g. -march=native). If care isn’t taken, those packages may not be very portable. This is more common than I had expected and has mildly annoyed me.

Birth of a package manager

While the package manager is new, I’ve been building and installing software in my home directory for years. I’d follow the normal process of setting the install prefix to $HOME/.local, running the build, and then letting the “install” target do its thing.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.

A common difficulty has been handling the release of new versions of software. I’d like to upgrade to the new version, but lack a way to cleanly uninstall the previous version. Simply clobbering the old version by installing it on top usually works. Occasionally it wouldn’t, and I’d have to blow away ~/.local and start all over again. With more and more software installed in my home directory, restarting has become more and more of a chore that I’d like to avoid.

What I needed was a way to track exactly which files were installed so that I could remove them later when I needed to uninstall. Fortunately there’s a widely-used convention for exactly this purpose: DESTDIR.

It’s expected that when a Makefile provides an “install” target, it prefixes the installation path with the DESTDIR macro, which is assigned to the empty string by default. This allows the user to install the software to a temporary location for the purposes of packaging. Unlike the installation prefix (--prefix) configured before the build began, the software is not expected to function properly when run in the DESTDIR location.

$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=destdir install

A different tool will used to copy these files into place and actually install it. This tool can track what files were installed, allowing them to be removed later when uninstalling. My package manager uses the tar program for both purposes. First it creates a package by packing up the DESTDIR (at the root of the actual install prefix):

$ tar czf package.tgz -C $DESTDIR$HOME/.local .

So a package is nothing more than a gzipped tarball. To install, it unpacks the tarball in ~/.local.

$ cd $HOME/.local
$ tar xzf ~/package.tgz

But how does it uninstall a package? It didn’t keep track of what was installed. Easy! The tarball itself contains the package list, and it’s printed with tar’s t mode.

cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done

I’m using grep to skip directories, which are conveniently listed with a trailing slash. Note that in the example above, there are a couple of issues with file names containing whitespace. If the file contains a space character, it will word split incorrectly in the for loop. A Makefile couldn’t handle such a file in the first place, but, in case it’s still necessary, my package manager sets IFS to just a newline.

If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.

Commands

There are five commands, each assigned to a capital letter: -B, -C, -I, -V, and -U. It’s an interface pattern inspired by Ted Unangst’s signify (see signify(1)). I also used this pattern with Blowpipe and, in retrospect, wish I had also used with Enchive.

Build (-B)

Unlike the other three commands, the “build” command isn’t essential, and is just for convenience. It assumes the build uses an Autoconfg-like configure script and runs it automatically, followed by make with the appropriate -j (jobs) option. It automatically sets the --prefix argument when running the configure script.

If the build uses something other and an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.

Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.

$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/

In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.

Once qpkg returns, the program will be fully built — or stuck on a build error if you’re unlucky. If you need to pass custom configure options, just tack them on the qpkg command:

$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses

Since the second and third steps — creating the build directory and moving into it — is so common, there’s an optional switch for it: -d. This option’s argument is the build directory. qpkg creates that directory and runs the build inside it. In practice I just use “x” for the build directory since it’s so quick to add “dx” to the command.

$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/

With the software compiled, the next step is creating the package.

Create (-C)

The “create” command creates the DESTDIR (_destdir in the working directory) and runs the “install” Makefile target to fill it with files. Continuing with the example above and its x/ build directory:

$ qpkg -Cdx name

Where “name” is the name of the package, without any file name extension. Like with “build”, extra arguments after the package name are passed to make in case there needs to be any additional tweaking.

When the “create” command finishes, there will be new package named name.tgz in the working directory. At this point the source and build directories are no longer needed, assuming everything went fine.

$ rm -rf name-version/
$ rm -rf x/

This package is ready to install, though you may want to verify it first.

Verify (-V)

The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.

$ qpkg -V name.tgz

The most common conflict I’ve seen is in the info index (info/dir) file, which is safe to ignore since I don’t care about it.

If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.

Install (-I)

The “install” command is just the dumb tar xzf explained above. It will clobber anything in its way without warning, which is why, if that matters, “verify” should be used first.

$ qpkg -I name.tgz

When qpkg returns, the package has been installed and is probably ready to go. A lot of packages complain that you need to run libtool to finalize an installation, but I’ve never had a problem skipping it. This dumb unpacking generally works fine.

Uninstall (-U)

Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.

$ qpkg -U name.tgz

Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.

I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.

Package by accumulation

Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages, a couple of which will conflict (one being a subset of another):

  1. binutils
  2. C headers
  3. core GCC
  4. C runtime
  5. rest of GCC

Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)

To deal with this, I added a “keep” (-k) option that leaves the DESTDIR around after creating the package. To keep things tidy, the intermediate packages exist and are installed, but the final, big cross-compiler package accumulates into the DESTDIR. The final package at the end is actually the whole cross compiler in one package, a superset of them all.

Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.

My use case, and an alternative

The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.

All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.

Emacs Lisp Lambda Expressions Are Not Self-Evaluating

This week I made a mistake that ultimately enlightened me about the nature of function objects in Emacs Lisp. There are three kinds of function objects, but they each behave very differently when evaluated as objects.

But before we get to that, let’s talk about one of Emacs’ embarrassing, old missteps: eval-after-load.

Taming an old dragon

One of the long-standing issues with Emacs is that loading Emacs Lisp files (.el and .elc) is a slow process, even when those files have been byte compiled. There are a number of dirty hacks in place to deal with this issue, and the biggest and nastiest of them all is the dumper, also known as unexec.

The Emacs you routinely use throughout the day is actually a previous instance of Emacs that’s been resurrected from the dead. Your undead Emacs was probably created months, if not years, earlier, back when it was originally compiled. The first stage of compiling Emacs is to compile a minimal C core called temacs. The second stage is loading a bunch of Emacs Lisp files, then dumping a memory image in an unportable, platform-dependent way. On Linux, this actually requires special hooks in glibc. The Emacs you know and love is this dumped image loaded back into memory, continuing from where it left off just after it was compiled. Regardless of your own feelings on the matter, you have to admit this is a very lispy thing to do.

There are two notable costs to Emacs’ dumper:

  1. The dumped image contains hard-coded memory addresses. This means Emacs can’t be a Position Independent Executable (PIE). It can’t take advantage of a security feature called Address Space Layout Randomization (ASLR), which would increase the difficulty of exploiting some classes of bugs. This might be important to you if Emacs processes untrusted data, such as when it’s used as a mail client, a web server or generally parses data downloaded across the network.

  2. It’s not possible to cross-compile Emacs since it can only be dumped by running temacs on its target platform. As an experiment I’ve attempted to dump the Windows version of Emacs on Linux using Wine, but was unsuccessful.

The good news is that there’s a portable dumper in the works that makes this a lot less nasty. If you’re adventurous, you can already disable dumping and run temacs directly by setting CANNOT_DUMP=yes at compile time. Be warned, though, that a non-dumped Emacs takes several seconds, or worse, to initialize before it even begins loading your own configuration. It’s also somewhat buggy since it seems nobody ever runs it this way productively.

The other major way Emacs users have worked around slow loading is aggressive use of lazy loading, generally via autoloads. The major package interactive entry points are defined ahead of time as stub functions. These stubs, when invoked, load the full package, which overrides the stub definition, then finally the stub re-invokes the new definition with the same arguments.

To further assist with lazy loading, an evaluated defvar form will not override an existing global variable binding. This means you can, to a certain extent, configure a package before it’s loaded. The package will not clobber any existing configuration when it loads. This also explains the bizarre interfaces for the various hook functions, like add-hook and run-hooks. These accept symbols — the names of the variables — rather than values of those variables as would normally be the case. The add-to-list function does the same thing. It’s all intended to cooperate with lazy loading, where the variable may not have been defined yet.

eval-after-load

Sometimes this isn’t enough and you need some some configuration to take place after the package has been loaded, but without forcing it to load early. That is, you need to tell Emacs “evaluate this code after this particular package loads.” That’s where eval-after-load comes into play, except for its fatal flaw: it takes the word “eval” completely literally.

The first argument to eval-after-load is the name of a package. Fair enough. The second argument is a form that will be passed to eval after that package is loaded. Now hold on a minute. The general rule of thumb is that if you’re calling eval, you’re probably doing something seriously wrong, and this function is no exception. This is completely the wrong mechanism for the task.

The second argument should have been a function — either a (sharp quoted) symbol or a function object. And then instead of eval it would be something more sensible, like funcall. Perhaps this improved version would be named call-after-load or run-after-load.

The big problem with passing an s-expression is that it will be left uncompiled due to being quoted. I’ve talked before about the importance of evaluating your lambdas. eval-after-load not only encourages badly written Emacs Lisp, it demands it.

;;; BAD!
(eval-after-load 'simple-httpd
                 '(push '("c" . "text/plain") httpd-mime-types))

This was all corrected in Emacs 25. If the second argument to eval-after-load is a function — the result of applying functionp is non-nil — then it uses funcall. There’s also a new macro, with-eval-after-load, to package it all up nicely.

;;; Better (Emacs >= 25 only)
(eval-after-load 'simple-httpd
  (lambda ()
    (push '("c" . "text/plain") httpd-mime-types)))

;;; Best (Emacs >= 25 only)
(with-eval-after-load 'simple-httpd
  (push '("c" . "text/plain") httpd-mime-types))

Though in both of these examples the compiler will likely warn about httpd-mime-types not being defined. That’s a problem for another day.

A workaround

But what if you need to use Emacs 24, as was the situation that sparked this article? What can we do with the bad version of eval-after-load? We could situate a lambda such that it’s evaluated, but then smuggle the resulting function object into the form passed to eval-after-load, all using a backquote.

;;; Note: this is subtly broken
(eval-after-load 'simple-httpd
  `(funcall
    ,(lambda ()
       (push '("c" . "text/plain") httpd-mime-types)))

When everything is compiled, the backquoted form evalutes to this:

(funcall #[0 <bytecode> [httpd-mime-types ("c" . "text/plain")] 2])

Where the second value (#[...]) is a byte-code object. However, as the comment notes, this is subtly broken. A cleaner and correct way to solve all this is with a named function. The damage caused by eval-after-load will have been (mostly) minimized.

(defun my-simple-httpd-hook ()
  (push '("c" . "text/plain") httpd-mime-types))

(eval-after-load 'simple-httpd
  '(funcall #'my-simple-httpd-hook))

But, let’s go back to the anonymous function solution. What was broken about it? It all has to do with evaluating function objects.

Evaluating function objects

So what happens when we evaluate an expression like the one above with eval? Here’s what it looks like again.

(funcall #[...])

First, eval notices it’s been given a non-empty list, so it’s probably a function call. The first argument is the name of the function to be called (funcall) and the remaining elements are its arguments. But each of these elements must be evaluated first, and the result of that evaluation becomes the arguments.

Any value that isn’t a list or a symbol is self-evaluating. That is, it evaluates to its own value:

(eval 10)
;; => 10

If the value is a symbol, it’s treated as a variable. If the value is a list, it goes through the function call process I’m describing (or one of a number of other special cases, such as macro expansion, lambda expressions, and special forms).

So, conceptually eval recurses on the function object #[...]. A function object is not a list or a symbol, so it’s self-evaluating. No problem.

;; Byte-code objects are self-evaluating

(let ((x (byte-compile (lambda ()))))
  (eq x (eval x)))
;; => t

What if this code wasn’t compiled? Rather than a byte-code object, we’d have some other kind of function object for the interpreter. Let’s examine the dynamic scope (shudder) case. Here, a lambda appears to evaluate to itself, but appearances can be deceiving:

(eval (lambda ())
;; => (lambda ())

However, this is not self-evaluation. Lambda expressions are not self-evaluating. It’s merely coincidence that the result of evaluating a lambda expression looks like the original expression. This is just how the Emacs Lisp interpreter is currently implemented and, strictly speaking, it’s an implementation detail that just so happens to be mostly compatible with byte-code objects being self-evaluating. It would be a mistake to rely on this.

Instead, dynamic scope lambda expression evaluation is idempotent. Applying eval to the result will return an equal, but not identical (eq), expression. In contrast, a self-evaluating value is also idempotent under evaluation, but with eq results.

;; Not self-evaluating:

(let ((x '(lambda ())))
  (eq x (eval x)))
;; => nil

;; Evaluation is idempotent:

(let ((x '(lambda ())))
  (equal x (eval x)))
;; => t

(let ((x '(lambda ())))
  (equal x (eval (eval x))))
;; => t

So, with dynamic scope, the subtly broken backquote example will still work, but only by sheer luck. Under lexical scope, the situation isn’t so lucky:

;;; -*- lexical-scope: t; -*-

(lambda ())
;; => (closure (t) nil)

These interpreted lambda functions are neither self-evaluating nor idempotent. Passing t as the second argument to eval tells it to use lexical scope, as shown below:

;; Not self-evaluating:

(let ((x '(lambda ())))
  (eq x (eval x t)))
;; => nil

;; Not idempotent:

(let ((x '(lambda ())))
  (equal x (eval x t)))
;; => nil

(let ((x '(lambda ())))
  (equal x (eval (eval x t) t)))
;; error: (void-function closure)

I can imagine an implementation of Emacs Lisp where dynamic scope lambda expressions are in the same boat, where they’re not even idempotent. For example:

;;; -*- lexical-binding: nil; -*-

(lambda ())
;; => (totally-not-a-closure ())

Most Emacs Lisp would work just fine under this change, and only code that makes some kind of logical mistake — where there’s nested evaluation of lambda expressions — would break. This essentially already happened when lots of code was quietly switched over to lexical scope after Emacs 24. Lambda idempotency was lost and well-written code didn’t notice.

There’s a temptation here for Emacs to define a closure function or special form that would allow interpreter closure objects to be either self-evaluating or idempotent. This would be a mistake. It would only serve as a hack that covers up logical mistakes that lead to nested evaluation. Much better to catch those problems early.

Solving the problem with one character

So how do we fix the subtly broken example? With a strategically placed quote right before the comma.

(eval-after-load 'simple-httpd
  `(funcall
    ',(lambda ()
        (push '("c" . "text/plain") httpd-mime-types)))

So the form passed to eval-after-load becomes:

;; Compiled:
(funcall (quote #[...]))

;; Dynamic scope:
(funcall (quote (lambda () ...)))

;; Lexical scope:
(funcall (quote (closure (t) () ...)))

The quote prevents eval from evaluating the function object, which would be either needless or harmful. There’s also an argument to be made that this is a perfect situation for a sharp-quote (#'), which exists to quote functions.

Two Chaotic Motion Demos

I’ve put together two online, interactive, demonstrations of chaotic motion. One is 2D and the other is 3D, but both are rendered using WebGL — which, for me, is the most interesting part. Both are governed by ordinary differential equations. Both are integrated using the Runge–Kutta method, specifically RK4.

Far more knowledgeable people have already written introductions for chaos theory, so here’s just a quick summary. A chaotic system is deterministic but highly sensitive to initial conditions. Tweaking a single bit of the starting state of either of my demos will quickly lead to two arbitrarily different results. Both demonstrations have features that aim to show this in action.

This ain’t my first chaotic system rodeo. About eight years ago I made water wheel Java applet, and that was based on some Matlab code I collaborated on some eleven years ago. I really hope you’re not equipped to run a crusty old Java applet in 2018, though. (Update: now upgraded to HTML5 Canvas.)

If you want to find either of these demos again in the future, you don’t need to find this article first. They’re both listed in my Showcase page, linked from the header of this site.

Double pendulum

First up is the classic double pendulum. This one’s more intuitive than my other demo since it’s modeling a physical system you could actually build and observe in the real world.

Source: https://github.com/skeeto/double-pendulum

I lifted the differential equations straight from the Wikipedia article (derivative() in my code). Same for the Runge–Kutta method (rk4() in my code). It’s all pretty straightforward. RK4 may not have been the best choice for this system since it seems to bleed off energy over time. If you let my demo run over night, by morning there will obviously be a lot less activity.

I’m not a fan of buttons and other fancy GUI widgets — neither designing them nor using them myself — prefering more cryptic, but easy-to-use keyboard-driven interfaces. (Hey, it works well for mpv and MPlayer.) I haven’t bothered with a mobile interface, so sorry if you’re reading on your phone. You’ll just have to enjoy watching a single pendulum.

Here are the controls:

To witness chaos theory in action:

  1. Start with a single pendulum (the default).
  2. Pause the simulation (SPACE).
  3. Make a dozen or so clones (press c for awhile).
  4. Unpause.

At first it will appear as single pendulum, but they’re actually all stacked up, each starting from slightly randomized initial conditions. Within a minute you’ll witness the pendulums diverge, and after a minute they’ll all be completely different. It’s pretty to watch them come apart at first.

It might appear that the m key doesn’t actually do anything. That’s because the HTML5 Canvas rendering — which is what I actually implemented first — is really close to the WebGL rendering. I’m really proud of this. There are just three noticeable differences. First, there’s a rounded line cap in the Canvas rendering where the pendulum is “attached.” Second, the tail line segments aren’t properly connected in the Canvas rendering. The segments are stroked separately in order to get that gradient effect along its path. Third, it’s a lot slower, particularly when there are many pendulums to render.

In WebGL the two “masses” are rendered using that handy old circle rasterization technique on a quad. Either a triangle fan or pre-rendering the circle as a texture would probably have been a better choices. The two bars are the same quad buffers, just squeezed and rotated into place. Both were really simple to create. It’s the tail that was tricky to render.

When I wrote the original Canvas renderer, I set the super convenient lineWidth property to get a nice, thick tail. In my first cut at rendering the tail I used GL_LINE_STRIP to draw a line primitive. The problem with the line primitive is that an OpenGL implementation is only required to support single pixel wide lines. If I wanted wider, I’d have to generate the geometry myself. So I did.

Like before, I wasn’t about to dirty my hands manipulating a graphite-filled wooden stick on a piece of paper to solve this problem. No, I lifted the math from something I found on the internet again. In this case it was a forum post by paul.houx, which provides a few vector equations to compute a triangle strip from a line strip. My own modification was to add a miter limit, which keeps sharp turns under control. You can find my implementation in polyline() in my code. Here’s a close-up with the skeleton rendered on top in black:

For the first time I’m also using ECMAScript’s new template literals to store the shaders inside the JavaScript source. These string literals can contain newlines, but, even cooler, I it does string interpolation, meaning I can embed JavaScript variables directly into the shader code:

let massRadius = 0.12;
let vertexShader = `
attribute vec2 a_point;
uniform   vec2 u_center;
varying   vec2 v_point;

void main() {
    v_point = a_point;
    gl_Position = vec4(a_point * ${massRadius} + u_center, 0, 1);
}`;

Allocation avoidance

If you’ve looked at my code you might have noticed something curious. I’m using a lot of destructuring assignments, which is another relatively new addition to ECMAScript. This was part of a little experiment.

function normalize(v0, v1) {
    let d = Math.sqrt(v0 * v0 + v1 * v1);
    return [v0 / d, v1 / d];
}

/* ... */

let [nx, ny] = normalize(-ly, lx);

One of my goals for this project was zero heap allocations in the main WebGL rendering loop. There are no garbage collector hiccups if there’s no garbage to collect. This sort of thing is trivial in a language with manual memory management, such as C and C++. Just having value semantics for aggregates would be sufficient.

But with JavaScript I don’t get to choose how my objects are allocated. I either have to pre-allocate everything, including space for all the intermediate values (e.g. an object pool). This would be clunky and unconventional. Or I can structure and access my allocations in such a way that the JIT compiler can eliminate them (via escape analysis, scalar replacement, etc.).

In this case, I’m trusting that JavaScript implementations will flatten these destructuring assignments so that the intermediate array never actually exists. It’s like pretending the array has value semantics. This seems to work as I expect with V8, but not so well with SpiderMonkey (yet?), at least in Firefox 52 ESR.

Single precision

I briefly considered using Math.fround() to convince JavaScript to compute all the tail geometry in single precision. The double pendulum system would remain double precision, but the geometry doesn’t need all that precision. It’s all rounded to single precision going out to the GPU anyway.

Normally when pulling values from a Float32Array, they’re cast to double precision — JavaScript’s only numeric type — and all operations are performed in double precision, even if the result is stored back in a Float32Array. This is because the JIT compiler is required to correctly perform all the intermediate rounding. To relax this requirement, surround each operation with a call to Math.fround(). Since the result of doing each operation in double precision with this rounding step in between is equivalent to doing each operation in single precision, the JIT compiler can choose to do the latter.

let x = new Float32Array(n);
let y = new Float32Array(n);
let d = new Float32Array(n);
// ...
for (let i = 0; i < n; i++) {
    let xprod = Math.fround(x[i] * x[i]);
    let yprod = Math.fround(y[i] * y[i]);
    d[i] = Math.sqrt(Math.fround(xprod + yprod));
}

I ultimately decided not to bother with this since it would significantly obscures my code for what is probably a minuscule performance gain (in this case). It’s also really difficult to tell if I did it all correctly. So I figure this is better suited for compilers that target JavaScript rather than something to do by hand.

Lorenz system

The other demo is a Lorenz system with its famous butterfly pattern. I actually wrote this one a year and a half ago but never got around to writing about it. You can tell it’s older because I’m still using var.

Source: https://github.com/skeeto/lorenz-webgl

Like before, the equations came straight from the Wikipedia article (Lorenz.lorenz() in my code). They math is a lot simpler this time, too.

This one’s a bit more user friendly with a side menu displaying all your options. The keys are basically the same. This was completely by accident, I swear. Here are the important ones:

Witnessing chaos theory in action is the same process as before: clear it down to a single solution (C then a), then add a bunch of randomized clones (c).

There is no Canvas renderer for this one. It’s pure WebGL. The tails are drawn using GL_LINE_STRIP, but in this case it works fine that they’re a single pixel wide. If heads are turned on, those are just GL_POINT. The geometry is threadbare for this one.

There is one notable feature: The tails are stored exclusively in GPU memory. Only the “head” is stored CPU-side. After it computes the next step, it updates a single spot of the tail with glBufferSubData(), and the VBO is actually a circular buffer. OpenGL doesn’t directly support rendering from circular buffers, but it does have element arrays. An element array is an additional buffer of indices that tells OpenGL what order to use the elements in the other buffers.

Naively would mean for a tail of 4 segments, I need 4 different element arrays, one for each possible rotation:

array 0: 0 1 2 3
array 1: 1 2 3 0
array 2: 2 3 0 1
array 3: 3 0 1 2

With the knowledge that element arrays can start at an offset, and with a little cleverness, you might notice these can all overlap in a single, 7-element array:

0 1 2 3 0 1 2

Array 0 is at offset 0, array 1 is at offset 1, array 2 is at offset 2, and array 3 is at offset 3. The tails in the Lorenz system are drawn using drawElements() with exactly this sort of array.

Like before, I was very careful to produce zero heap allocations in the main loop. The FPS counter generates some garbage in the DOM due to reflow, but this goes away if you hide the help menu (?). This was long enough ago that destructuring assignment wasn’t available, but Lorenz system and rendering it were so simple that using pre-allocated objects worked fine.

Beyond just the programming, I’ve gotten hours of entertainment playing with each of these systems. This was also the first time I’ve used WebGL in over a year, and this project was a reminder of just how working with it is so pleasurable. The specification is superbly written and serves perfectly as its own reference.

Options for Structured Data in Emacs Lisp

Russian translation by ClipArtMag.
Ukrainian translation by Open Source Initiative.

So your Emacs package has grown beyond a dozen or so lines of code, and the data it manages is now structured and heterogeneous. Informal plain old lists, the bread and butter of any lisp, are not longer cutting it. You really need to cleanly abstract this structure, both for your own organizational sake any for anyone reading your code.

With informal lists as structures, you might regularly ask questions like, “Was the ‘name’ slot stored in the third list element, or was it the fourth element?” A plist or alist helps with this problem, but those are better suited for informal, externally-supplied data, not for internal structures with fixed slots. Occasionally someone suggests using hash tables as structures, but Emacs Lisp’s hash tables are much too heavy for this. Hash tables are more appropriate when keys themselves are data.

Defining a data structure from scratch

Imagine a refrigerator package that manages a collection of food in a refrigerator. A food item could be structured as a plain old list, with slots at specific positions.

(defun fridge-item-create (name expiry weight)
  (list name expiry weight))

A function that computes the mean weight of a list of food items might look like this:

(defun fridge-mean-weight (items)
  (if (null items)
      0.0
    (let ((sum 0.0)
          (count 0))
      (dolist (item items (/ sum count))
        (setf count (1+ count)
              sum (+ sum (nth 2 item)))))))

Note the use of (nth 2 item) at the end, used to get the item’s weight. That magic number 2 is easy to mess up. Even worse, if lots of code accesses “weight” this way, then future extensions will be inhibited. Defining some accessor functions solves this problem.

(defsubst fridge-item-name (item)
  (nth 0 item))

(defsubst fridge-item-expiry (item)
  (nth 1 item))

(defsubst fridge-item-weight (item)
  (nth 2 item))

The defsubst defines an inline function, so there’s effectively no additional run-time costs for these accessors compared to a bare nth. Since these only cover getting slots, we should also define some setters using the built-in gv (generalized variable) package.

(require 'gv)

(gv-define-setter fridge-item-name (value item)
  `(setf (nth 0 ,item) ,value))

(gv-define-setter fridge-item-expiry (value item)
  `(setf (nth 1 ,item) ,value))

(gv-define-setter fridge-item-weight (value item)
  `(setf (nth 2 ,item) ,value))

This makes each slot setf-able. Generalized variables are great for simplifying APIs, since otherwise there would need to be an equal number of setter functions (fridge-item-set-name, etc.). With generalized variables, both are at the same entrypoint:

(setf (fridge-item-name item) "Eggs")

There are still two more significant improvements.

  1. As far as Emacs Lisp is concerned, this isn’t a real type. The type-ness of it is just a fiction created by the conventions of the package. It would be easy to make the mistake of passing an arbitrary list to these fridge-item functions, and the mistake wouldn’t be caught so long as that list has at least three items. An common solution is to add a type tag: a symbol at the beginning of the structure that identifies it.

  2. It’s still a linked list, and nth has to walk the list (i.e. O(n)) to retrieve items. It would be much more efficient to use a vector, turning this into an efficient O(1) operation.

Addressing both of these at once:

(defun fridge-item-create (name expiry weight)
  (vector 'fridge-item name expiry weight))

(defsubst fridge-item-p (object)
  (and (vectorp object)
       (= (length object) 4)
       (eq 'fridge-item (aref object 0))))

(defsubst fridge-item-name (item)
  (unless (fridge-item-p item)
    (signal 'wrong-type-argument (list 'fridge-item item)))
  (aref item 1))

(defsubst fridge-item-name--set (item value)
  (unless (fridge-item-p item)
    (signal 'wrong-type-argument (list 'fridge-item item)))
  (setf (aref item 1) value))

(gv-define-setter fridge-item-name (value item)
  `(fridge-item-name--set ,item ,value))

;; And so on for expiry and weight...

As long as fridge-mean-weight uses the fridge-item-weight accessor, it continues to work unmodified across all these changes. But, whew, that’s quite a lot of boilerplate to write and maintain for each data structure in our package! Boilerplate code generation is a perfect candidate for a macro definition. Luckily for us, Emacs already defines a macro to generate all this code: cl-defstruct.

(require 'cl)

(cl-defstruct fridge-item
  name expiry weight)

In Emacs 25 and earlier, this innocent looking definition expands into essentially all the above code. The code it generates is expressed in the most optimal form for its version of Emacs, and it exploits many of the available optimizations by using function declarations such as side-effect-free and error-free. It’s configurable, too, allowing for the exclusion of a type tag (:named) — discarding all the type checks — or using a list rather than a vector as the underlying structure (:type). As a crude form of structural inheritance, it even allows for directly embedding other structures (:include).

Two pitfalls

There a couple pitfalls, though. First, for historical reasons, the macro will define two namespace-unfriendly functions: make-NAME and copy-NAME. I always override these, preferring the -create convention for the constructor, and tossing the copier since it’s either useless or, worse, semantically wrong.

(cl-defstruct (fridge-item (:constructor fridge-item-create)
                           (:copier nil))
  name expiry weight)

If the constructor needs to be more sophisticated than just setting slots, it’s common to define a “private” constructor (double dash in the name) and wrap it with a “public” constructor that has some behavior.

(cl-defstruct (fridge-item (:constructor fridge-item--create)
                           (:copier nil))
  name expiry weight entry-time)

(cl-defun fridge-item-create (&rest args)
  (apply #'fridge-item--create :entry-time (float-time) args))

The other pitfall is related to printing. In Emacs 25 and earlier, types defined by cl-defstruct are still only types by convention. They’re really just vectors as far as Emacs Lisp is concerned. One benefit from this is that printing and reading these structures is “free” because vectors are printable. It’s trivial to serialize cl-defstruct structures out to a file. This is exactly how the Elfeed database works.

The pitfall is that once a structure has been serialized, there’s no more changing the cl-defstruct definition. It’s now a file format definition, so the slots are locked in place. Forever.

Emacs 26 throws a wrench in all this, though it’s worth it in the long run. There’s a new primitive type in Emacs 26 with its own reader syntax: records. This is similar to hash tables becoming first class in the reader in Emacs 23.2. In Emacs 26, cl-defstruct uses records instead of vectors.

;; Emacs 25:
(fridge-item-create :name "Eggs" :weight 11.1)
;; => [cl-struct-fridge-item "Eggs" nil 11.1]

;; Emacs 26:
(fridge-item-create :name "Eggs" :weight 11.1)
;; => #s(fridge-item "Eggs" nil 11.1)

So far slots are still accessed using aref, and all the type checking still happens in Emacs Lisp. The only practical change is the record function is used in place of the vector function when allocating a structure. But it does pave the way for more interesting things in the future.

The major short-term downside is that this breaks printed compatibility across the Emacs 25/26 boundary. The cl-old-struct-compat-mode function can be used for some degree of backwards, but not forwards, compatibility. Emacs 26 can read and use some structures printed by Emacs 25 and earlier, but the reverse will never be true. This issue initially tripped up Emacs’ built-in packages, and when Emacs 26 is released we’ll see more of these issues arise in external packages.

Dynamic dispatch

Prior to Emacs 25, the major built-in package for dynamic dispatch — functions that specialize on the run-time type of their arguments — was EIEIO, though it only supported single dispatch (specializing on a single argument). EIEIO brought much of the Common Lisp Object System (CLOS) to Emacs Lisp, including classes and methods.

Emacs 25 introduced a more sophisticated dynamic dispatch package called cl-generic. It focuses only on dynamic dispatch and supports multiple dispatch, completely replacing the dynamic dispatch portion of EIEIO. Since cl-defstruct does inheritance and cl-generic does dynamic dispatch, there’s not really much left for EIEIO — besides bad ideas like multiple inheritance and method combination.

Without either of these packages, the most direct way to build single dispatch on top of cl-defstruct would be to shove a function in one of the slots. Then the “method” is just a wrapper that call this function.

;; Base "class"

(cl-defstruct greeter
  greeting)

(defun greet (thing)
  (funcall (greeter-greeting thing) thing))

;; Cow "class"

(cl-defstruct (cow (:include greeter)
                   (:constructor cow--create)))

(defun cow-create ()
  (cow--create :greeting (lambda (_) "Moo!")))

;; Bird "class"

(cl-defstruct (bird (:include greeter)
                    (:constructor bird--create)))

(defun bird-create ()
  (bird--create :greeting (lambda (_) "Chirp!")))

;; Usage:

(greet (cow-create))
;; => "Moo!"

(greet (bird-create))
;; => "Chirp!"

Since cl-generic is aware of the types created by cl-defstruct, functions can specialize on them as if they were native types. It’s a lot simpler to let cl-generic do all the hard work. The people reading your code will appreciate it, too:

(require 'cl-generic)

(cl-defgeneric greet (greeter))

(cl-defstruct cow)

(cl-defmethod greet ((_ cow))
  "Moo!")

(cl-defstruct bird)

(cl-defmethod greet ((_ bird))
  "Chirp!")

(greet (make-cow))
;; => "Moo!"

(greet (make-bird))
;; => "Chirp!"

The majority of the time a simple cl-defstruct will fulfill your needs, keeping in mind the gotcha with the constructor and copier names. Its use should feel almost as natural as defining functions.

Inspiration from Data-dependent Rotations

This article is an expanded email I wrote in response to a question from Frank Muller. He asked how I arrived at my solution to a branchless UTF-8 decoder:

I mean, when you started, I’m pretty the initial solution was using branches, right? Then, you’ve applied some techniques to eliminate them.

A bottom-up approach that begins with branches and then proceeds to eliminate them one at a time sounds like a plausible story. However, this story is the inverse of how it actually played out. It began when I noticed a branchless decoder could probably be done, then I put together the pieces one at a time without introducing any branches. But what sparked that initial idea?

The two prior posts reveal my train of thought at the time: a look at the Blowfish cipher and a 64-bit PRNG shootout. My layman’s study of Blowfish was actually part of an examination of a number of different block ciphers. For example, I also read the NSA’s Speck and Simon paper and then implemented the 128/128 variant of Speck — a 128-bit key and 128-bit block. I didn’t take the time to write an article about it, but note how the entire cipher — key schedule, encryption, and decryption — is just 40 lines of code:

struct speck {
    uint64_t k[32];
};

void
speck_init(struct speck *ctx, uint64_t x, uint64_t y)
{
    ctx->k[0] = y;
    for (uint64_t i = 0; i < 31; i++) {
        x = (x >> 8) | (x << 56);
        x += y;
        x ^= i;
        y = (y << 3) | (y >> 61);
        y ^= x;
        ctx->k[i + 1] = y;
    }
}

void
speck_encrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
    for (int i = 0; i < 32; i++) {
        *x = (*x >> 8) | (*x << 56);
        *x += *y;
        *x ^= ctx->k[i];
        *y = (*y << 3) | (*y >> 61);
        *y ^= *x;
    }
}

static void
speck_decrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
    for (int i = 0; i < 32; i++) {
        *y ^= *x;
        *y = (*y >> 3) | (*y << 61);
        *x ^= ctx->k[31 - i];
        *x -= *y;
        *x = (*x << 8) | (*x >> 56);
    }
}

Isn’t that just beautiful? It’s so tiny and fast. Other than the not-very-arbitrary selection of 32 rounds, and the use of 3-bit and 8-bit rotations, there are no magic values. One could fairly reasonably commit this cipher to memory if necessary, similar to the late RC4. Speck is probably my favorite block cipher right now, except that I couldn’t figure out the key schedule for any of the other key/block size variants.

Another cipher I studied, though in less depth, was RC5 (1994), a block cipher by (obviously) Ron Rivest. The most novel part of RC5 is its use of data dependent rotations. This was a very deliberate decision, and the paper makes this clear:

RC5 should highlight the use of data-dependent rotations, and encourage the assessment of the cryptographic strength data-dependent rotations can provide.

What’s a data-dependent rotation. In the Speck cipher shown above, notice how the right-hand side of all the rotation operations is a constant (3, 8, 56, and 61). Suppose that these operands were not constant, instead they were based on some part of the value of the block:

    int r = *y & 0x0f;
    *x = (*x >> r) | (*x << (64 - r));

Such “random” rotations “frustrate differential cryptanalysis” according to the paper, increasing the strength of the cipher.

Another algorithm that uses data-dependent shift is the PCG family of PRNGs. Honestly, the data-dependent “permutation” shift is the defining characteristic of PCG. As a reminder, here’s the simplified PCG from my shootout:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

Notice how the final shift depends on the high order bits of the PRNG state. (This one weird trick from Melissa O’Neil will significantly improve your PRNG. Xorshift experts hate her.)

I think this raises a really interesting question: Why did it take until 2014 for someone to apply a data-dependent shift to a PRNG? Similarly, why are data-dependent rotations not used in many ciphers?

My own theory is that this is because many older instruction set architectures can’t perform data-dependent shift operations efficiently.

Many instruction sets only have a fixed shift (e.g. 1-bit), or can only shift by an immediate (e.g. a constant). In these cases, a data-dependent shift would require a loop. These loops would be a ripe source of side channel attacks in ciphers due to the difficultly of making them operate in constant time. It would also be relatively slow for video game PRNGs, which often needed to run on constrained hardware with limited instruction sets. For example, the 6502 (Atari, Apple II, NES, Commodore 64) and the Z80 (too many to list) can only shift/rotate one bit at a time.

Even on an architecture with an instruction for data-dependent shifts, such as the x86, those shifts will be slower than constant shifts, at least in part due to the additional data dependency.

It turns out there are also some patent issues (ex. 1, 2). Fortunately most of these patents have now expired, and one in particular is set to expire this June. I still like my theory better.

To branchless decoding

So I was thinking about data-dependent shifts, and I had also noticed I could trivially check the length of a UTF-8 code point using a small lookup table — the first step in my decoder. What if I combined these: a data-dependent shift based on a table lookup. This would become the last step in my decoder. The idea for a branchless UTF-8 decoder was essentially borne out of connecting these two thoughts, and then filling in the middle.

Debugging Emacs or: How I Learned to Stop Worrying and Love DTrace

Update: This article was featured on BSD Now 233 (starting at 21:38).

For some time Elfeed was experiencing a strange, spurious failure. Every so often users were seeing an error (spoiler warning) when updating feeds: “error in process sentinel: Search failed.” If you use Elfeed, you might have even seen this yourself. From the surface it appeared that curl, tasked with the responsibility for downloading feed data, was producing incomplete output despite reporting a successful run. Since the run was successful, Elfeed assumed certain data was in curl’s output buffer, but, since it wasn’t, it failed hard.

Unfortunately this issue was not reproducible. Manually running curl outside of Emacs never revealed any issues. Asking Elfeed to retry fetching the feeds would work fine. The issue would only randomly rear its head when Elfeed was fetching many feeds in parallel, under stress. By the time the error was discovered, the curl process had exited and vital debugging information was lost. Considering that this was likely to be a bug in Emacs itself, there really wasn’t a reliable way to capture the necessary debugging information from within Emacs Lisp. And, indeed, this later proved to be the case.

A quick-and-dirty work around is to use condition-case to catch and swallow the error. When the bizarre issue shows up, rather than fail badly in front of the user, Elfeed could attempt to swallow the error — assuming it can be reliably detected — and treat the fetch as simply a failure. That didn’t sit comfortably with me. Elfeed had done its due diligence checking for errors already. Someone was lying to Elfeed, and I intended to catch them with their pants on fire. Someday.

I’d just need to witness the bug on one of my own machines. Elfeed is part of my daily routine, so surely I’d have to experience this issue myself someday. My plan was, should that day come, to run a modified Elfeed, instrumented to capture extra data. I would have also routinely run Emacs under GDB so that I could inspect the failure more deeply.

For now I just had to wait to hunt that zebra.

Bryan Cantrill, DTrace, and FreeBSD

Over the holidays I re-discovered Bryan Cantrill, a systems software engineer who worked for Sun between 1996 and 2010, and is most well known for DTrace. My first exposure to him was in a BSD Now interview in 2015. I had re-watched that interview and decided there was a lot more I had to learn from him. He’s become a personal hero to me. So I scoured the internet for more of his writing and talks. Besides what I’ve already linked in this article, here are a couple more great presentations:

You can also find some of his writing scattered around the DTrace blog.

Some interesting operating system technology came out of Sun during its final 15 or so years — most notably DTrace and ZFS — and Bryan speaks about it passionately. Almost as a matter of luck, most of it survived the Oracle acquisition thanks to Sun releasing it as open source in just the nick of time. Otherwise it would have been lost forever. The scattered ex-Sun employees, still passionate about their prior work at Sun, along with some of their old customers have since picked up the pieces and kept going as a community under the name illumos. It’s like an open source flotilla.

Naturally I wanted to get my hands on this stuff to try it out for myself. Is it really as good as they say? Normally I stick to Linux, but it (generally) doesn’t have these Sun technologies. The main reason is license incompatibility. Sun released its code under the CDDL, which is incompatible with the GPL. Ubuntu does infamously include ZFS, but other distributions are unwilling to take that risk. Porting DTrace is a serious undertaking since it’s got its fingers throughout the kernel, which also makes the licensing issues even more complicated.

(Update Feburary 2018: DTrace has been released under the GPLv2, allowing it to be legally integrated with Linux.)

Linux has a reputation for Not Invented Here (NIH) syndrome, and these licensing issues certainly contribute to that. Rather than adopt ZFS and DTrace, they’ve been reinvented from scratch: btrfs instead of ZFS, and a slew of partial options instead of DTrace. Normally I’m most interested in system call tracing, and my go to is strace, though it certainly has its limitations — including this situation of debugging curl under Emacs. Another famous example of NIH is Linux’s epoll(2), which is a broken version of BSD kqueue(2).

So, if I want to try these for myself, I’ll need to install a different operating system. I’ve dabbled with OmniOS, an OS built on illumos, in virtual machines, using it as an alien environment to test some of my software (e.g. enchive). OmniOS has a philosophy called Keep Your Software To Yourself (KYSTY), which is really just code for “we don’t do packaging.” Honestly, you can’t blame them since they’re a tiny community. The best solution to this is probably pkgsrc, which is essentially a universal packaging system. Otherwise you’re on your own.

There’s also openindiana, which is a more friendly desktop-oriented illumos distribution. Still, the short of it is that you’re very much on your own when things don’t work. The situation is like running Linux a couple decades ago, when it was still difficult to do.

If you’re interested in trying DTrace, the easiest option these days is probably FreeBSD. It’s got a big, active community, thorough documentation, and a huge selection of packages. Its license (the BSD license, duh) is compatible with the CDDL, so both ZFS and DTrace have been ported to FreeBSD.

What is DTrace?

I’ve done all this talking but haven’t yet described what DTrace really is. I won’t pretend to write my own tutorial, but I’ll provide enough information to follow along. DTrace is a tracing framework for debugging production systems in real time, both for the kernel and for applications. The “production systems” part means it’s stable and safe — using DTrace won’t put your system at risk of crashing or damaging data. The “real time” part means it has little impact on performance. You can use DTrace on live, active systems with little impact. Both of these core design principles are vital for troubleshooting those really tricky bugs that only show up in production.

There are DTrace probes scattered all throughout the system: on system calls, scheduler events, networking events, process events, signals, virtual memory events, etc. Using a specialized language called D (unrelated to the general purpose programming language D), you can dynamically add behavior at these instrumentation points. Generally the behavior is to capture information, but it can also manipulate the event being traced.

Each probe is fully identified by a 4-tuple delimited by colons: provider, module, function, and probe name. An empty element denotes a sort of wildcard. For example, syscall::open:entry is a probe at the beginning (i.e. “entry”) of open(2). syscall:::entry matches all system call entry probes.

Unlike strace on Linux which monitors a specific process, DTrace applies to the entire system when active. To run curl under strace from Emacs, I’d have to modify Emacs’ behavior to do so. With DTrace I can instrument every curl process without making a single change to Emacs, and with negligible impact to Emacs. That’s a big deal.

So, when it comes to this Elfeed issue, FreeBSD is much better poised for debugging the problem. All I have to do is catch it in the act. However, it’s been months since that bug report and I’m not really making this connection yet. I’m just hoping I eventually find an interesting problem where I can apply DTrace.

FreeBSD on a Raspberry Pi 2

So I’ve settled in FreeBSD as the playground for these technologies, I just have to decide where. I could always run it in a virtual machine, but it’s always more interesting to try things out on real hardware. FreeBSD supports the Raspberry Pi 2 as a Tier 2 system, and I had a Raspberry Pi 2 sitting around collecting dust, so I put it to use.

I wrote the image to an SD card, and for a few days I stretched my legs on this new system. I cloned a couple dozen of my own git repositories, ran the builds and the tests, and just got a feel for things. I tried out the ports system for the first time, mainly to discover that the low-powered Raspberry Pi 2 takes days to build some of the packages I want to try.

I mostly program in Vim these days, so it’s some days before I even set up Emacs. Eventually I do build Emacs, clone my configuration, fire it up, and give Elfeed a spin.

And that’s when the “search failed” bug strikes! Not just once, but dozens of times. Perfect! This low-powered platform is the jackpot for this particular bug, triggering it left and right. Given that I’ve got DTrace at my disposal, it’s the perfect place to debug this. Something is lying to Elfeed and DTrace will play the judge.

Before I dive in I see three possibilities:

  1. curl is reporting success but truncating its output.
  2. Emacs is quietly truncating curl’s output.
  3. Emacs is misinterpreting curl’s exit status.

With Dtrace I can observe what every curl process writes to Emacs, and I can also double check curl’s exit status. I come up with the following (newbie) DTrace script:

syscall::write:entry
/execname == "curl"/
{
    printf("%d WRITE %d \"%s\"\n",
           pid, arg2, stringof(copyin(arg1, arg2)));
}

syscall::exit:entry
/execname == "curl"/
{
    printf("%d EXIT  %d\n", pid, arg0);
}

The /execname == "curl"/ is a predicate that (obviously) causes the behavior to only fire for curl processes. The first probe has DTrace print a line for every write(2) from curl. arg0, arg1, and arg2 correspond to the arguments of write(2): fd, buf, count. It logs the process ID (pid) of the write, the length of the write, and the actual contents written. Remember that these curl processes are run in parallel by Emacs, so the pid allows me to associate the separate writes and the exit status.

The second probe prints the pid and the exit status (the first argument to exit(2)).

I also want to compare this to exactly what is delivered to Elfeed when curl exits, so I modify the process sentinel — the callback that handles a subprocess exiting — to call write-file before any action is taken. I can compare these buffer dumps to the logs produced by DTrace.

There are two important findings.

First, when the “search failed” bug occurs, the buffer was completely empty (95% of the time) or truncated at the end of the HTTP headers (5% of the time), right at the blank line. DTrace indicates that curl did its job to the full, so it’s Emacs who’s the liar. It’s not delivering all of curl’s data to Elfeed. That’s pretty annoying.

Second, curl was line-buffered. Each line was a separate, independent write(2). I was certainly not expecting this. Normally the C library only does line buffering when the output is a terminal. That’s because it’s guessing a user may be watching, expecting the output to arrive a line at a time.

Here’s a sample of what it looked like in the log:

88188 WRITE 32 "Server: Apache/2.4.18 (Ubuntu)
"
88188 WRITE 46 "Location: https://blog.plover.com/index.atom
"
88188 WRITE 21 "Content-Length: 299
"
88188 WRITE 45 "Content-Type: text/html; charset=iso-8859-1
"
88188 WRITE 2 "
"

Why would curl think Emacs is a terminal?

Oh. That’s right. This is the same problem I ran into four years ago when writing EmacSQL. By default Emacs connects to subprocesses through a psuedo-terminal (pty). I called this a mistake in Emacs back then, and I still stand by that claim. The pty causes weird, annoying problems for little benefit:

Just from eyeballing the DTrace log I knew what to do: dump the pty and switch to a pipe. This is controlled with the process-connection-type variable, and fixing it is a one-liner.

Not only did this completely resolve the truncation issue, Elfeed is noticeably faster at fetching feeds on all machines. It’s no longer receiving mountains of XML one line at a time, like sucking pudding through a straw. It’s now quite zippy even on my Raspberry Pi 2, which had never been the case before (without the “search failed” bug). Even if you were never affected by this bug, you will benefit from the fix.

I haven’t officially reported this as an Emacs bug yet because reproducibility is still an issue. It needs something better than “fire off a bunch of HTTP requests across the internet in parallel from a Raspberry Pi.”

The fix reminds me of that old boilermaker story about charging a lot of money just to swing a hammer. Once the problem arose, DTrace quickly helped to identify the place to hit Emacs with the hammer.

Finally, a big thanks to alphapapa for originally taking the time to report this bug months ago.

null program

Chris Wellons