Render Multimedia in Pure C

In a previous article I demonstrated video filtering with C and a unix pipeline. Thanks to the ubiquitous support for the ridiculously simple Netpbm formats — specifically the “Portable PixMap” (.ppm, P6) binary format — it’s trivial to parse and produce image data in any language without image libraries. Video decoders and encoders at the ends of the pipeline do the heavy lifting of processing the complicated video formats actually used to store and transmit video.

Naturally this same technique can be used to produce new video in a simple program. All that’s needed are a few functions to render artifacts — lines, shapes, etc. — to an RGB buffer. With a bit of basic sound synthesis, the same concept can be applied to create audio in a separate audio stream — in this case using the simple (but not as simple as Netpbm) WAV format. Put them together and a small, standalone program can create multimedia.

Here’s the demonstration video I’ll be going through in this article. It animates and visualizes various in-place sorting algorithms (see also). The elements are rendered as colored dots, ordered by hue, with red at 12 o’clock. A dot’s distance from the center reflects how far its element is from its correct position: correctly placed elements sit on the outer ring, and displaced elements fall toward the center. Each dot emits a sinusoidal tone with a unique frequency when it swaps places in a particular frame.

Original credit for this visualization concept goes to w0rthy.

All of the source code (less than 600 lines of C), ready to run, can be found here:

On any modern computer, rendering is real-time, even at 60 FPS, so you may be able to pipe the program’s output directly into your media player of choice. (If not, consider getting a better media player!)

$ ./sort | mpv --no-correct-pts --fps=60 -

VLC requires some help from ppmtoy4m:

$ ./sort | ppmtoy4m -F60:1 | vlc -

Or you can just encode it to another format. Recent versions of libavformat can input PPM images directly, which means x264 can read the program’s output directly:

$ ./sort | x264 --fps 60 -o video.mp4 /dev/stdin

By default there is no audio output. I wish there was a nice way to embed audio with the video stream, but this requires a container and that would destroy all the simplicity of this project. So instead, the -a option captures the audio in a separate file. Use ffmpeg to combine the audio and video into a single media file:

$ ./sort -a audio.wav | x264 --fps 60 -o video.mp4 /dev/stdin
$ ffmpeg -i video.mp4 -i audio.wav -vcodec copy -acodec mp3 \
         combined.mp4

You might think you’ll be clever by using mkfifo (i.e. a named pipe) to pipe both audio and video into ffmpeg at the same time. This will only result in a deadlock since neither program is prepared for this. One will be blocked writing one stream while the other is blocked reading on the other stream.

Several years ago my intern and I used the exact same pure C rendering technique to produce these raytracer videos:

I also used this technique to illustrate gap buffers.

Pixel format and rendering

This program really only has one purpose: rendering a sorting video at a fixed, square resolution. So rather than write generic image rendering functions, some assumptions are hard coded to keep things simple and fast. For example, the image is always square, with the size fixed at compile time. I chose 800x800 as the default:

#define S     800

Rather than define some sort of color struct with red, green, and blue fields, color will be represented by a 24-bit integer (stored in an unsigned long). I arbitrarily chose red to be the most significant 8 bits. This has nothing to do with the order of the individual channels in Netpbm since these integers are never dumped out. (This would have stupid byte-order issues anyway.) “Color literals” are particularly convenient and familiar in this format. For example, the constant for pink: 0xff7f7fUL.

In practice the color channels will be operated upon separately, so here are a couple of helper functions to convert the channels between this format and normalized floats (0.0–1.0).

static void
rgb_split(unsigned long c, float *r, float *g, float *b)
{
    *r = ((c >> 16) / 255.0f);
    *g = (((c >> 8) & 0xff) / 255.0f);
    *b = ((c & 0xff) / 255.0f);
}

static unsigned long
rgb_join(float r, float g, float b)
{
    unsigned long ir = roundf(r * 255.0f);
    unsigned long ig = roundf(g * 255.0f);
    unsigned long ib = roundf(b * 255.0f);
    return (ir << 16) | (ig << 8) | ib;
}

Originally I decided the integer form would be sRGB, and these functions handled the conversion to and from sRGB. Since it had no noticeable effect on the output video, I discarded it. In more sophisticated rendering you may want to take this into account.
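
If I had kept it, the conversion itself is short. Here’s a sketch (not part of the original program) of the standard sRGB transfer function and its inverse, applied per channel to the normalized floats above:

static float
srgb_to_linear(float s)
{
    /* powf() comes from math.h, already needed for sqrtf()/roundf() */
    return s <= 0.04045f ? s / 12.92f : powf((s + 0.055f) / 1.055f, 2.4f);
}

static float
linear_to_srgb(float l)
{
    return l <= 0.0031308f ? l * 12.92f : 1.055f * powf(l, 1.0f / 2.4f) - 0.055f;
}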

The RGB buffer where images are rendered is just a plain old byte buffer with the same pixel format as PPM. The ppm_set() function writes a color to a particular pixel in the buffer, assumed to be S by S pixels. The complement to this function is ppm_get(), which will be needed for blending.

static void
ppm_set(unsigned char *buf, int x, int y, unsigned long color)
{
    buf[y * S * 3 + x * 3 + 0] = color >> 16;
    buf[y * S * 3 + x * 3 + 1] = color >>  8;
    buf[y * S * 3 + x * 3 + 2] = color >>  0;
}

static unsigned long
ppm_get(unsigned char *buf, int x, int y)
{
    unsigned long r = buf[y * S * 3 + x * 3 + 0];
    unsigned long g = buf[y * S * 3 + x * 3 + 1];
    unsigned long b = buf[y * S * 3 + x * 3 + 2];
    return (r << 16) | (g << 8) | b;
}

Since the buffer is already in the right format, writing an image is dead simple. I like to flush after each frame so that observers generally see clean, complete frames. It helps in debugging.

static void
ppm_write(const unsigned char *buf, FILE *f)
{
    fprintf(f, "P6\n%d %d\n255\n", S, S);
    fwrite(buf, S * 3, S, f);
    fflush(f);
}

Dot rendering

If you zoom into one of those dots, you may notice it has a nice smooth edge. Here’s one rendered at 30x the normal resolution. I did not render this image and then scale it up in another piece of software. This is straight out of the C program.

In an early version of this program I used a dumb dot rendering routine. It took a color and a hard, integer pixel coordinate. All the pixels within a certain distance of this coordinate were set to the color, everything else was left alone. This had two bad effects: dots jumped from one pixel to the next as they moved, since their positions were snapped to integer coordinates, and the hard edge looked jagged.

Instead the dot’s position is computed in floating point and is actually rendered as if it were between pixels. This is done with a shader-like routine that uses smoothstep — just as found in shader languages — to give the dot a smooth edge. That edge is blended into the image, whether that’s the background or a previously-rendered dot. The input to the smoothstep is the distance from the floating point coordinate to the center (or corner?) of the pixel being rendered, maintaining that between-pixel smoothness.
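
The smoothstep() function itself isn’t shown in this article. A sketch of the conventional definition from shader languages would look like the following. Note that it’s called below with its edges swapped, so it evaluates to 1 inside the inner radius and 0 outside the outer radius.

static float
smoothstep(float edge0, float edge1, float x)
{
    float t = (x - edge0) / (edge1 - edge0);
    t = t < 0.0f ? 0.0f : t > 1.0f ? 1.0f : t;  /* clamp to [0, 1] */
    return t * t * (3.0f - 2.0f * t);           /* cubic Hermite curve */
}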

Rather than dump the whole function here, let’s look at it piece by piece. I have two new constants to define the inner dot radius and the outer dot radius. It’s smooth between these radii.

#define R0    (S / 400.0f)  // dot inner radius
#define R1    (S / 200.0f)  // dot outer radius

The dot-drawing function takes the image buffer, the dot’s coordinates, and its foreground color.

static void
ppm_dot(unsigned char *buf, float x, float y, unsigned long fgc);

The first thing to do is extract the color components.

    float fr, fg, fb;
    rgb_split(fgc, &fr, &fg, &fb);

Next determine the range of pixels over which the dot will be drawn. These are based on the two radii and will be used for looping.

    int miny = floorf(y - R1 - 1);
    int maxy = ceilf(y + R1 + 1);
    int minx = floorf(x - R1 - 1);
    int maxx = ceilf(x + R1 + 1);

Here’s the loop structure. Everything else will be inside the innermost loop. The dx and dy are the floating point distances from the center of the dot.

    for (int py = miny; py <= maxy; py++) {
        float dy = py - y;
        for (int px = minx; px <= maxx; px++) {
            float dx = px - x;
            /* ... */
        }
    }

Use the x and y distances to compute the distance to the dot’s center and the smoothstep value, which will be the alpha. Within the inner radius the color is at 100%. Outside the outer radius it’s 0%. Elsewhere it’s something in between.

            float d = sqrtf(dy * dy + dx * dx);
            float a = smoothstep(R1, R0, d);

Get the background color, extract its components, and blend the foreground and background according to the computed alpha value. Finally write the pixel back into the buffer.

            unsigned long bgc = ppm_get(buf, px, py);
            float br, bg, bb;
            rgb_split(bgc, &br, &bg, &bb);

            float r = a * fr + (1 - a) * br;
            float g = a * fg + (1 - a) * bg;
            float b = a * fb + (1 - a) * bb;
            ppm_set(buf, px, py, rgb_join(r, g, b));

That’s all it takes to render a smooth dot anywhere in the image.

Rendering the array

The array being sorted is just a global variable. This simplifies some of the sorting functions since a few are implemented recursively. They can call for a frame to be rendered without needing to pass the full array. With the dot-drawing routine done, rendering a frame is easy:

#define N     360           // number of dots

static int array[N];

static void
frame(void)
{
    static unsigned char buf[S * S * 3];
    memset(buf, 0, sizeof(buf));
    for (int i = 0; i < N; i++) {
        float delta = abs(i - array[i]) / (N / 2.0f);
        float x = -sinf(i * 2.0f * PI / N);
        float y = -cosf(i * 2.0f * PI / N);
        float r = S * 15.0f / 32.0f * (1.0f - delta);
        float px = r * x + S / 2.0f;
        float py = r * y + S / 2.0f;
        ppm_dot(buf, px, py, hue(array[i]));
    }
    ppm_write(buf, stdout);
}

The buffer is static since it will be rather large, especially if S is cranked up. Otherwise it’s likely to overflow the stack. The memset() fills it with black. If you wanted a different background color, here’s where you change it.

For each element, compute its delta from the proper array position, which determines its distance from the center of the image: the larger the delta, the closer the dot is drawn to the center. The angle is based on its actual position. The hue() function (not shown in this article) returns the color for the given element.
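
As a hypothetical sketch (the real hue() may differ), mapping the value to a fully saturated HSV hue and converting to the 24-bit format above could look like this:

static unsigned long
hue(int v)
{
    float h = v * 6.0f / N;  /* hue sector in [0, 6), red at 0 */
    float f = h - (int)h;    /* position within the sector */
    float r, g, b;
    switch ((int)h) {
        case 0:  r = 1;     g = f;     b = 0;     break;
        case 1:  r = 1 - f; g = 1;     b = 0;     break;
        case 2:  r = 0;     g = 1;     b = f;     break;
        case 3:  r = 0;     g = 1 - f; b = 1;     break;
        case 4:  r = f;     g = 0;     b = 1;     break;
        default: r = 1;     g = 0;     b = 1 - f; break;
    }
    return rgb_join(r, g, b);
}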

With the frame() function complete, all I need is a sorting function that calls frame() at appropriate times. Here are a couple of examples:

static void
shuffle(int array[N], uint64_t *rng)
{
    for (int i = N - 1; i > 0; i--) {
        uint32_t r = pcg32(rng) % (i + 1);
        swap(array, i, r);
        frame();
    }
}

static void
sort_bubble(int array[N])
{
    int c;
    do {
        c = 0;
        for (int i = 1; i < N; i++) {
            if (array[i - 1] > array[i]) {
                swap(array, i - 1, i);
                c = 1;
            }
        }
        frame();
    } while (c);
}
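
The pcg32() generator used by shuffle() isn’t listed here, and any small PRNG would do. As a stand-in, here’s the minimal PCG32 step from the PCG paper, using its reference multiplier and increment (the program’s actual generator may differ):

static uint32_t
pcg32(uint64_t *s)
{
    uint64_t old = *s;
    *s = old * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
    uint32_t xorshifted = ((old >> 18) ^ old) >> 27;
    uint32_t rot = old >> 59;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}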

Synthesizing audio

To add audio I need to keep track of which elements were swapped in this frame. When producing a frame I need to generate and mix tones for each element that was swapped.

Notice the swap() function above? That’s not just for convenience. That’s also how things are tracked for the audio.

static int swaps[N];

static void
swap(int a[N], int i, int j)
{
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
    swaps[(a - array) + i]++;
    swaps[(a - array) + j]++;
}

Before we get ahead of ourselves I need to write a WAV header. Without getting into the purpose of each field, just note that the header has 13 fields, followed immediately by 16-bit little endian PCM samples. There will be only one channel (mono).

#define HZ    44100         // audio sample rate

static void
wav_init(FILE *f)
{
    emit_u32be(0x52494646UL, f); // "RIFF"
    emit_u32le(0xffffffffUL, f); // file length
    emit_u32be(0x57415645UL, f); // "WAVE"
    emit_u32be(0x666d7420UL, f); // "fmt "
    emit_u32le(16,           f); // struct size
    emit_u16le(1,            f); // PCM
    emit_u16le(1,            f); // mono
    emit_u32le(HZ,           f); // sample rate (i.e. 44.1 kHz)
    emit_u32le(HZ * 2,       f); // byte rate
    emit_u16le(2,            f); // block size
    emit_u16le(16,           f); // bits per sample
    emit_u32be(0x64617461UL, f); // "data"
    emit_u32le(0xffffffffUL, f); // byte length
}
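
The emit_u16le(), emit_u32le(), and emit_u32be() helpers aren’t shown in this article. They just write an integer one byte at a time in the given byte order, so a sketch might look like:

static void
emit_u16le(unsigned v, FILE *f)
{
    fputc(v >> 0 & 0xff, f);
    fputc(v >> 8 & 0xff, f);
}

static void
emit_u32le(unsigned long v, FILE *f)
{
    fputc(v >>  0 & 0xff, f);
    fputc(v >>  8 & 0xff, f);
    fputc(v >> 16 & 0xff, f);
    fputc(v >> 24 & 0xff, f);
}

static void
emit_u32be(unsigned long v, FILE *f)
{
    fputc(v >> 24 & 0xff, f);
    fputc(v >> 16 & 0xff, f);
    fputc(v >>  8 & 0xff, f);
    fputc(v >>  0 & 0xff, f);
}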

Rather than tackle the annoying problem of figuring out the total length of the audio ahead of time, I just wave my hands and write the maximum possible number of bytes (0xffffffff). Most software that can read WAV files will understand this to mean the entire rest of the file contains samples.

With the header out of the way all I have to do is write 1/60th of a second worth of samples to this file each time a frame is produced. That’s 735 samples (1,470 bytes) at 44.1kHz.

The simplest place to do audio synthesis is in frame() right after rendering the image.

#define FPS   60            // output framerate
#define MINHZ 20            // lowest tone
#define MAXHZ 1000          // highest tone

static void
frame(void)
{
    /* ... rendering ... */

    /* ... synthesis ... */
}

With the largest tone frequency at 1kHz, Nyquist says we only need to sample at 2kHz. 8kHz is a very common sample rate and gives some overhead space, making it a good choice. However, I found that audio encoding software was a lot happier to accept the standard CD sample rate of 44.1kHz, so I stuck with that.

The first thing to do is to allocate and zero a buffer for this frame’s samples.

    int nsamples = HZ / FPS;
    static float samples[HZ / FPS];
    memset(samples, 0, sizeof(samples));

Next determine how many “voices” there are in this frame. This is used to mix the samples by averaging them. If an element was swapped more than once this frame, it’s a little louder than the others — i.e. it’s played twice at the same time, in phase.

    int voices = 0;
    for (int i = 0; i < N; i++)
        voices += swaps[i];

Here’s the most complicated part. I use sinf() to produce the sinusoidal wave based on the element’s frequency. I also use a parabola as an envelope to shape the beginning and ending of this tone so that it fades in and fades out. Otherwise you get the nasty, high-frequency “pop” sound as the wave is given a hard cut off.

    for (int i = 0; i < N; i++) {
        if (swaps[i]) {
            float hz = i * (MAXHZ - MINHZ) / (float)N + MINHZ;
            for (int j = 0; j < nsamples; j++) {
                float u = 1.0f - j / (float)(nsamples - 1);
                float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);
                float envelope = parabola * parabola * parabola;
                float v = sinf(j * 2.0f * PI / HZ * hz) * envelope;
                samples[j] += swaps[i] * v / voices;
            }
        }
    }

Finally I write out each sample as a signed 16-bit value. I flush the frame audio just like I flushed the frame image, keeping them somewhat in sync from an outsider’s perspective.

    for (int i = 0; i < nsamples; i++) {
        int s = samples[i] * 0x7fff;
        emit_u16le(s, wav);
    }
    fflush(wav);

Before returning, reset the swap counter for the next frame.

    memset(swaps, 0, sizeof(swaps));

Font rendering

You may have noticed there was text rendered in the corner of the video announcing the sort function. There’s font bitmap data in font.h which gets sampled to render that text. It’s not terribly complicated, but you’ll have to study the code on your own to see how that works.

Learning more

This simple video rendering technique has served me well for some years now. All it takes is a bit of knowledge about rendering. I learned quite a bit just from watching Handmade Hero, where Casey writes a software renderer from scratch, then implements a nearly identical renderer with OpenGL. The more I learn about rendering, the better this technique works.

Before writing this post I spent some time experimenting with using a media player as an interface to a game. For example, rather than render the game using OpenGL or similar, render it as PPM frames and send it to the media player to be displayed, just as game consoles drive television sets. Unfortunately the latency is horrible — multiple seconds — so that idea just doesn’t work. So while this technique is fast enough for real-time rendering, it’s no good for interaction.

Make Flet Great Again

Do you long for the days before Emacs 24.3 when flet was dynamically scoped? Well, you probably shouldn’t, since there are some very good reasons for lexical scope. But, still, a dynamically scoped flet is situationally really useful, particularly in unit testing. The good news is that it’s trivial to get this original behavior back without relying on deprecated functions or third-party packages.

But first, what is flet and what does it mean for it to be dynamically scoped? The name stands for “function let” (or something to that effect). It’s a macro to bind named functions within a local scope, just as let binds variables within some local scope. It’s provided by the now-deprecated cl package.

(require 'cl)  ; deprecated!

(defun norm (x y)
  (flet ((square (v) (* v v)))
    (sqrt (+ (square x) (square y)))))

However, a gotcha here is that square is visible not just to the body of norm but also to any function called directly or indirectly from the flet body. That’s dynamic scope.

(flet ((sqrt (v) (/ v 2)))  ; close enough
  (norm 2 2))
;; -> 4

Note: This works because sqrt hasn’t (yet?) been assigned a bytecode opcode. One weakness with flet is that, due to being dynamically scoped, it is unable to define or override functions whose calls evaporate under byte compilation. For example, addition:

(defun add-with-flet ()
  (flet ((+ (&rest _) :override))
    (+ 1 2 3)))

(add-with-flet)
;; -> :override

(funcall (byte-compile #'add-with-flet))
;; -> 6

Since + has its own opcode, the function call is eliminated under byte-compilation and flet can’t do its job. This is similar to the reason these same functions cannot be advised.

cl-lib and cl-flet

The cl-lib package introduced in Emacs 24.3, replacing cl, adds a namespace prefix, cl-, to all of these Common Lisp style functions. In most cases this was the only change. One exception is cl-flet, which has different semantics: It’s lexically scoped, just like in Common Lisp. Its bindings aren’t visible outside of the cl-flet body.

(require 'cl-lib)

(cl-flet ((sqrt (v) (/ v 2)))
  (norm 2 2))
;; -> 2.8284271247461903

In most cases this is what you actually want. The old flet subtly changes the environment for all functions called directly or indirectly from its body.

Besides being cleaner and less error prone, cl-flet also doesn’t have special exceptions for functions with assigned opcodes. At macro-expansion time it walks the body, taking its action before the byte-compiler can interfere.

(defun add-with-cl-flet ()
  (cl-flet ((+ (&rest _) :override))
    (+ 1 2 3)))

(add-with-cl-flet)
;; -> :override

(funcall (byte-compile #'add-with-cl-flet))
;; -> :override

In order for it to work properly, it’s essential that functions are quoted with sharp-quotes (#') so that the macro can tell the difference between functions and symbols. Just make a general habit of sharp-quoting functions.

In unit testing, temporarily overriding functions for all of Emacs is useful, so flet still has some uses. But it’s deprecated!

Unit testing with flet

Since Emacs can do anything, suppose there is an Emacs package that makes sandwiches. In this package there’s an interactive function to set the default sandwich cheese.

(defvar default-cheese 'cheddar)

(defun set-default-cheese (type)
  (interactive
   (let* ((options '("cheddar" "swiss" "american"))
          (input (completing-read "Cheese: " options nil t)))
     (when input
       (list (intern input)))))
  (setf default-cheese type))

Since it’s interactive, it uses completing-read to prompt the user for input. A unit test could call this function non-interactively, but perhaps we’d also like to test the interactive path. The code inside interactive occasionally gets messy and may warrant testing. It would obviously be inconvenient to prompt the user for input during testing, and it wouldn’t work at all in batch mode (-batch).

With flet we can stub out completing-read just for the unit test:

;;; -*- lexical-binding: t; -*-

(ert-deftest test-set-default-cheese ()
  ;; protect original with dynamic binding
  (let (default-cheese)
    ;; simulate user entering "american"
    (flet ((completing-read (&rest _) "american"))
      (call-interactively #'set-default-cheese)
      (should (eq 'american default-cheese)))))

Since default-cheese was defined with defvar, it will be dynamically bound by let even though lexical-binding is enabled in this example. Both of the side effects of the tested function — setting a global variable and prompting the user — are captured using a combination of let and flet.

Since cl-flet is lexically scoped, it cannot serve this purpose. If flet is deprecated and cl-flet can’t do the job, what’s the right way to fix it? The answer lies in generalized variables.

cl-letf

What’s really happening inside flet is it’s globally binding a function name to a different function, evaluating the body, and rebinding it back to the original definition when the body completes. It macro-expands to something like this:

(let ((original (symbol-function 'completing-read)))
  (setf (symbol-function 'completing-read)
        (lambda (&rest _) "american"))
  (unwind-protect
      (call-interactively #'set-default-cheese)
    (setf (symbol-function 'completing-read) original)))

The unwind-protect ensures the original function is rebound even if the body of the call were to fail. This is very much a let-like pattern, and I’m using symbol-function as a generalized variable via setf. Is there a generalized variable version of let?

Yes! It’s called cl-letf! In this case the f suffix is analogous to the f suffix in setf. That form above can be reduced to a more general form:

(cl-letf (((symbol-function 'completing-read)
           (lambda (&rest _) "american")))
  (call-interactively #'set-default-cheese))

And that’s the way to reproduce the dynamically scoped behavior of flet since Emacs 24.3. There’s nothing complicated about it.

(ert-deftest test-set-default-cheese ()
  (let (default-cheese)
    (cl-letf (((symbol-function 'completing-read)
               (lambda (&rest _) "american")))
      (call-interactively #'set-default-cheese)
      (should (eq 'american default-cheese)))))

Keep in mind that this suffers the exact same problem with bytecode-assigned functions as flet, and for exactly the same reasons. If completing-read were to ever be assigned its own opcode then cl-letf would no longer work for this particular example.

A Branchless UTF-8 Decoder

This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. You can find the source code here along with a test suite and benchmark:

In addition to decoding the next code point, it detects any errors and returns a pointer to the next code point. It’s the complete package.

Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.

The usual analogy is laundry. You can have more than one load of laundry in process at a time because laundry is typically a pipelined process. There’s a washing machine stage, dryer stage, and folding stage. One load can be in the washer, a second in the drier, and a third being folded, all at once. This greatly increases throughput because, under ideal circumstances with a full pipeline, an instruction is completed each clock cycle despite any individual instruction taking many clock cycles to complete.

Branches are the enemy of pipelines. The CPU can’t begin work on the next instruction if it doesn’t know which instruction will be executed next. It must finish computing the branch condition before it can know. To deal with this, pipelined CPUs are also equipped with branch predictors. The predictor makes a guess at which branch will be taken, and the CPU begins executing instructions on that branch. The prediction is initially made using static heuristics, and later those predictions are improved by learning from previous behavior. This even includes predicting the number of iterations of a loop so that the final iteration isn’t mispredicted.

A mispredicted branch has two dire consequences. First, all the progress on the incorrect branch will need to be discarded. Second, the pipeline will be flushed, and the CPU will be inefficient until the pipeline fills back up with instructions on the correct branch. With a sufficiently deep pipeline, it can easily be more efficient to compute and discard an unneeded result than to avoid computing it in the first place. Eliminating branches means eliminating the hazards of misprediction.

Another hazard for pipelines is dependencies. If an instruction depends on the result of a previous instruction, it may have to wait for the previous instruction to make sufficient progress before it can complete one of its stages. This is known as a pipeline stall, and it is an important consideration in instruction set architecture (ISA) design.

For example, on the x86-64 architecture, storing a 32-bit result in a 64-bit register will automatically clear the upper 32 bits of that register. Any further use of that destination register cannot depend on prior instructions since all bits have been set. This particular optimization was missed in the design of the i386: Writing a 16-bit result to a 32-bit register leaves the upper 16 bits intact, creating false dependencies.

Dependency hazards are mitigated using out-of-order execution. Rather than execute two dependent instructions back to back, which would result in a stall, the CPU may instead execute an independent instruction from further away in between. A good compiler will also try to spread out dependent instructions in its own instruction scheduling.

The effects of out-of-order execution are typically not visible to a single thread, where everything will appear to have executed in order. However, when multiple processes or threads can access the same memory, out-of-order execution can be observed. It’s one of the many challenges of writing multi-threaded software.

The focus of my UTF-8 decoder was to be branchless, but there was one interesting dependency hazard that neither GCC nor Clang were able to resolve themselves. More on that later.

What is UTF-8?

Without getting into the history of it, you can generally think of UTF-8 as a method for encoding a series of 21-bit integers (code points) into a stream of bytes.

Keeping in mind two rules (each code point must use the shortest possible encoding, and the surrogate halves U+D800 through U+DFFF are not valid code points), the entire format is summarized by this table:

length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The x placeholders are the bits of the encoded code point.

UTF-8 has some really useful properties: it’s backward compatible with ASCII, the leading byte unambiguously encodes the sequence length, and continuation bytes can never be mistaken for leading bytes, making the stream self-synchronizing.

A straightforward approach to decoding might look something like this:

unsigned char *
utf8_simple(unsigned char *s, long *c)
{
    unsigned char *next;
    if (s[0] < 0x80) {
        *c = s[0];
        next = s + 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        *c = ((long)(s[0] & 0x1f) <<  6) |
             ((long)(s[1] & 0x3f) <<  0);
        next = s + 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        *c = ((long)(s[0] & 0x0f) << 12) |
             ((long)(s[1] & 0x3f) <<  6) |
             ((long)(s[2] & 0x3f) <<  0);
        next = s + 3;
    } else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4)) {
        *c = ((long)(s[0] & 0x07) << 18) |
             ((long)(s[1] & 0x3f) << 12) |
             ((long)(s[2] & 0x3f) <<  6) |
             ((long)(s[3] & 0x3f) <<  0);
        next = s + 4;
    } else {
        *c = -1; // invalid
        next = s + 1; // skip this byte
    }
    if (*c >= 0xd800 && *c <= 0xdfff)
        *c = -1; // surrogate half
    return next;
}

It branches off on the highest bits of the leading byte, extracts all of those x bits from each byte, concatenates those bits, checks if it’s a surrogate half, and returns a pointer to the next character. (This implementation does not check that the highest two bits of each continuation byte are correct.)

The CPU must correctly predict the length of the code point or else it will suffer a hazard. An incorrect guess will stall the pipeline and slow down decoding.

In real world text this is probably not a serious issue. For the English language, the encoded length is nearly always a single byte. However, even for non-English languages, text is usually accompanied by markup from the ASCII range of characters, and, overall, the encoded lengths will still have consistency. As I said, the CPU predicts branches based on the program’s previous behavior, so this means it will temporarily learn some of the statistical properties of the language being actively decoded. Pretty cool, eh?

Eliminating branches from the decoder side-steps any issues with mispredicting encoded lengths. Only errors in the stream will cause stalls. Since that’s probably the unusual case, the branch predictor will be very successful by continually predicting success. That’s one optimistic CPU.

The branchless decoder

Here’s the interface to my branchless decoder:

void *utf8_decode(void *buf, uint32_t *c, int *e);

I chose void * for the buffer so that it doesn’t care what type was actually chosen to represent the buffer. It could be a uint8_t, char, unsigned char, etc. Doesn’t matter. The decoder accesses it only as bytes.

On the other hand, with this interface you’re forced to use uint32_t to represent code points. You could always change the function to suit your own needs, though.

Errors are returned in e. It’s zero for success and non-zero when an error was detected, without any particular meaning for different values. Error conditions are mixed into this integer, so a zero simply means the absence of error.

This is where you could accuse me of “cheating” a little bit. The caller probably wants to check for errors, and so they will have to branch on e. It seems I’ve just smuggled the branches outside of the decoder.

However, as I pointed out, unless you’re expecting lots of errors, the real cost is branching on encoded lengths. Furthermore, the caller could instead accumulate the errors: count them, or make the error “sticky” by ORing all e values together. Neither of these require a branch. The caller could decode a huge stream and only check for errors at the very end. The only branch would be the main loop (“are we done yet?”), which is trivial to predict with high accuracy.
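
For example, a caller might defer the error check to the very end of a stream, something like this sketch (count_codepoints is a made-up helper; the zero-padding requirement discussed below still applies):

static long
count_codepoints(unsigned char *buf, unsigned char *end, int *err)
{
    long count = 0;
    int e, sticky = 0;
    uint32_t c;
    unsigned char *p = buf;
    while (p < end) {              /* the only branch: "are we done yet?" */
        p = utf8_decode(p, &c, &e);
        sticky |= e;               /* accumulate errors instead of branching */
        count++;
    }
    *err = sticky;
    return count;
}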

The first thing the function does is extract the encoded length of the next code point:

    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };

    unsigned char *s = buf;
    int len = lengths[s[0] >> 3];

Looking back to the UTF-8 table above, only the highest 5 bits determine the length. That’s 32 possible values. The zeros are for invalid prefixes. This will later cause a bit to be set in e.

With the length in hand, it can compute the position of the next code point in the buffer.

    unsigned char *next = s + len + !len;

Originally this expression was the return value, computed at the very end of the function. However, after inspecting the compiler’s assembly output, I decided to move it up, and the result was a solid performance boost. That’s because it spreads out dependent instructions. With the address of the next code point known so early, the instructions that decode the next code point can get started early.

The reason for the !len is so that the pointer is advanced one byte even in the face of an error (length of zero). Adding that !len is actually somewhat costly, though I couldn’t figure out why.

    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};  // leading-byte payload mask, by length
    static const int shiftc[] = {0, 18, 12, 6, 0};

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

This reads four bytes regardless of the actual length. Skipping the unneeded reads would require branching, so it can’t be helped. The unneeded bits are shifted out based on the length. That’s all it takes to decode UTF-8 without branching.

One important consequence of always reading four bytes is that the caller must zero-pad the buffer to at least four bytes. In practice, this means padding the entire buffer with three bytes in case the last character is a single byte.

The padding must be zero in order to detect errors. Otherwise the padding might look like legal continuation bytes.
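
For instance, a caller loading an entire file for decoding might allocate a few extra zeroed bytes up front, as in this sketch (load_padded is a made-up helper; error handling omitted):

#include <stdio.h>
#include <stdlib.h>

static unsigned char *
load_padded(FILE *f, long *len)
{
    fseek(f, 0, SEEK_END);
    *len = ftell(f);
    rewind(f);
    unsigned char *buf = calloc(*len + 4, 1);  /* calloc() zeroes the padding */
    fread(buf, 1, *len, f);
    return buf;
}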

    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shifte[] = {0, 6, 4, 2, 0};

    *e  = (*c < mins[len]) << 6;
    *e |= ((*c >> 11) == 0x1b) << 7;  // surrogate half?
    *e |= (s[1] & 0xc0) >> 2;
    *e |= (s[2] & 0xc0) >> 4;
    *e |= (s[3]       ) >> 6;
    *e ^= 0x2a;
    *e >>= shifte[len];

The first line checks if the shortest encoding was used, setting a bit in e if it wasn’t. For a length of 0, this always fails.

The second line checks for a surrogate half by checking for a certain prefix.

The next three lines accumulate the highest two bits of each continuation byte into e. Each should be the bits 10. These bits are “compared” to 101010 (0x2a) using XOR. The XOR clears these bits as long as they exactly match.

Finally the continuation prefix bits that don’t matter are shifted out.

The goal

My primary — and totally arbitrary — goal was to beat the performance of Björn Höhrmann’s DFA-based decoder. Under favorable (and artificial) benchmark conditions I had moderate success. You can try it out on your own system by cloning the repository and running make bench.

With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

The benchmark operates very similarly to my PRNG shootout (e.g. alarm(2)). First a buffer is filled with random UTF-8 data, then the decoder decodes it again and again until the alarm fires. The measurement is the number of bytes decoded.

The number of errors is printed at the end (always 0) in order to force errors to actually get checked for each code point. Otherwise the sneaky compiler omits the error checking from the branchless decoder, making it appear much faster than it really is — a serious letdown once I noticed my error. Since the other decoder is a DFA and error checking is built into its graph, the compiler can’t really omit its error checking.

I called this “favorable” because the buffer being decoded isn’t anything natural. Each time a code point is generated, first a length is chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that length is generated. The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction. Real text has a far more favorable distribution.

uint32_t
randchar(uint64_t *s)
{
    uint32_t r = rand32(s);
    int len = 1 + (r & 0x3);
    r >>= 2;
    switch (len) {
        case 1:
            return r % 128;
        case 2:
            return 128 + r % (2048 - 128);
        case 3:
            return 2048 + r % (65536 - 2048);
        case 4:
            return 65536 + r % (131072 - 65536);
    }
    abort();
}

Given the odd input zero-padding requirement and the artificial parameters of the benchmark, despite the supposed 20% speed boost under GCC, my branchless decoder is not really any better than the DFA decoder in practice. It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.

Update: Bryan Donlan has followed up with a SIMD UTF-8 decoder.

Finding the Best 64-bit Simulation PRNG

I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes.

For simulation uses, there are two primary factors that drive the selection of a PRNG. These factors can be at odds with each other:

  1. The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.

  2. PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Just as important is how the output is used, but this article will focus only on generating bits.

In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.

Depending on the programming language, the PRNGs found in various standard libraries may be of dubious quality. They’re slower than they need to be, or have poorer quality than required. In some cases, such as rand() in C, the algorithm isn’t specified, and you can’t rely on it for anything outside of trivial examples. In other cases the algorithm and behavior is specified, but you could easily do better yourself.

My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”

For the past few years my main go-to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:

Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.

Shootout results

Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.

                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060

And the actual dieharder outputs:

The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s the fastest, and its output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.

These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.

Speed benchmark

In the speed benchmark, the PRNG is initialized, a 1-second alarm(1) is set, then the PRNG fills a large volatile buffer of 64-bit unsigned integers again and again as quickly as possible until the alarm fires. The amount of memory written is measured as the PRNG’s speed.

The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.

The purpose for making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.

On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.

The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect in my tests.

Due to the PRNGs being inlined, this unrolling makes the benchmark less realistic, and it shows in the results. Using volatile for the buffer helped to counter this effect and reground the results. This is a fuzzy problem, and there’s not really any way to avoid it, but I will also discuss this below.
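
Putting those pieces together, the harness looks roughly like this sketch. It is simplified: the real shootout inlines and unrolls each PRNG rather than calling through a function pointer, and the names and buffer size here are made up.

#include <signal.h>
#include <stdint.h>
#include <unistd.h>

#define BUFLEN (256 * 1024)

static volatile sig_atomic_t running = 1;

static void
stop(int signum)
{
    (void)signum;
    running = 0;
}

static uint64_t
bench(uint64_t (*prng)(uint64_t *), uint64_t *state)
{
    static volatile uint64_t buf[BUFLEN];
    uint64_t total = 0;
    signal(SIGALRM, stop);
    alarm(1);
    while (running) {
        for (long i = 0; i < BUFLEN; i += 8) {
            buf[i + 0] = prng(state);  /* unrolled 8x so the "running" */
            buf[i + 1] = prng(state);  /* check is amortized           */
            buf[i + 2] = prng(state);
            buf[i + 3] = prng(state);
            buf[i + 4] = prng(state);
            buf[i + 5] = prng(state);
            buf[i + 6] = prng(state);
            buf[i + 7] = prng(state);
        }
        total += sizeof(buf);
    }
    return total;  /* bytes written before the alarm fired */
}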

Statistical benchmark

To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:

prng | dieharder -g200 -a -m4

This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.

There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in the actual shootout) got 7 WEAK results in my tests.

The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.

Individual PRNGs

Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.

Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.

Blowfish

One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.

Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.

The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.

In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.

In CBC mode, each output depends on the previous, but in CTR mode all blocks are independent. Work can begin on the next output before the previous output is complete. The x86 architecture uses out-of-order execution to achieve many of its performance gains: Instructions may be executed in a different order than they appear in the program, though their observable effects must generally be ordered correctly. Breaking dependencies between instructions allows out-of-order execution to be fully exercised. It also gives the compiler more freedom in instruction scheduling, though the volatile accesses cannot be reordered with respect to each other (hence it helping to reground the benchmark).
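
To make the difference concrete, here’s a sketch of the two constructions around a hypothetical 64-bit block cipher (encrypt64() and struct key are made-up stand-ins, not the actual Blowpipe interface):

#include <stdint.h>

struct key;                                        /* hypothetical cipher state */
uint64_t encrypt64(const struct key *, uint64_t);  /* hypothetical block cipher */

/* CBC-style: each output is the encryption of the previous output,
 * so every block depends on the one before it. */
uint64_t
prng_cbc(const struct key *k, uint64_t *prev)
{
    *prev = encrypt64(k, *prev);
    return *prev;
}

/* CTR: each output is the encryption of an incrementing counter,
 * so blocks are independent and can overlap in the pipeline. */
uint64_t
prng_ctr(const struct key *k, uint64_t *counter)
{
    return encrypt64(k, (*counter)++);
}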

Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.

On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?

In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.

Mersenne Twister (MT19937-64)

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:

  1. It’s only around a quarter to a fifth of the speed of xoroshiro128+ (see the table above).

  2. Its state is enormous by comparison: roughly 2.5 kB versus 16 bytes, making it a poor candidate for embedding.

  3. Seeding is fussy: a poorly mixed seed takes a long time to work its way out of the state.

  4. Even so, it still registered one FAIL and seven WEAK results in dieharder.

Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

Permuted Congruential Generator (PCG)

The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.

PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).

Here’s essentially what the simplified version looks like:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

The third line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.

The two constants are entirely my own devising. It’s two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.

Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.

However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.

Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.

Here’s my doubled-up version:

uint64_t
spcg64(uint64_t s[2])
{
    uint64_t m  = 0x9b60933458e17d7d;
    uint64_t a0 = 0xd737232eeccdf7ed;
    uint64_t a1 = 0x8b260b70b8e98891;
    uint64_t p0 = s[0];
    uint64_t p1 = s[1];
    s[0] = p0 * m + a0;
    s[1] = p1 * m + a1;
    int r0 = 29 - (p0 >> 61);
    int r1 = 29 - (p1 >> 61);
    uint64_t high = p0 >> r0;
    uint32_t low  = p1 >> r1;
    return (high << 32) | low;
}

The “full” PCG has some extra shifts that makes it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.

In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.

RC4

This is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.

xoroshiro128+

xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:

uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}

There’s one important caveat: That 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.
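
For example, here’s a sketch of that seeding advice. The two constants are just arbitrary static random bytes, nothing official:

#include <stdint.h>
#include <time.h>

static void
xoroshiro128plus_seed(uint64_t s[2])
{
    uint64_t t = (uint64_t)time(0);
    s[0] = t ^ UINT64_C(0x9e3779b97f4a7c15);  /* arbitrary static random bytes */
    s[1] = t ^ UINT64_C(0xd1b54a32d192ed03);
}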

xorshift128+ and xorshift64*

These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.

uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.

If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.

uint64_t
xorshift128plus(uint64_t s[2])
{
    uint64_t x = s[0];
    uint64_t y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}

It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.

Conclusion

My own takeaway (until I re-evaluate some years in the future):

  1. xoroshiro128+ is my first choice for a drop-in simulation PRNG, provided it’s well-seeded.

  2. When only 32 bits are needed at a time, PCG is the better fit.

  3. In a pinch, sPCG is simple enough to generate a couple of primes and code from memory.

Things can change significantly between platforms, though. Here’s the shootout on an ARM Cortex-A53:

                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080

LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.

So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.

Blowpipe: a Blowfish-encrypted, Authenticated Pipe

Blowpipe is a toy crypto tool that creates a Blowfish-encrypted pipe. It doesn’t open any files and instead encrypts and decrypts from standard input to standard output. This pipe can encrypt individual files or even encrypt a network connection (à la netcat).

Most importantly, since Blowpipe is intended to be used as a pipe (duh), it will never output decrypted plaintext that hasn’t been authenticated. That is, it will detect tampering of the encrypted stream and truncate its output, reporting an error, without producing the manipulated data. Some very similar tools that aren’t considered toys lack this important feature, such as aespipe.

Purpose

Blowpipe came about because I wanted to study Blowfish, a 64-bit block cipher designed by Bruce Schneier in 1993. It’s played an important role in the history of cryptography and has withstood cryptanalysis for 24 years. Its major weakness is its small block size, leaving it vulnerable to birthday attacks regardless of any other property of the cipher. Even in 1993 the 64-bit block size was a bit on the small side, but Blowfish was intended as a drop-in replacement for the Data Encryption Standard (DES) and the International Data Encryption Algorithm (IDEA), other 64-bit block ciphers.

The main reason I’m calling this program a toy is that, outside of legacy interfaces, it’s simply not appropriate to deploy a 64-bit block cipher in 2017. Blowpipe shouldn’t be used to encrypt more than a few tens of GBs of data at a time. Otherwise I’m fairly confident in both my message construction and my implementation. One detail is a little uncertain, and I’ll discuss it later when describing message format.

A tool that I am confident about is Enchive, though since it’s intended for file encryption, it’s not appropriate for use as a pipe. It doesn’t authenticate until after it has produced most of its output. Enchive does try its best to delete files containing unauthenticated output when authentication fails, but this doesn’t prevent you from consuming this output before it can be deleted, particularly if you pipe the output into another program.

Usage

As you might expect, there are two modes of operation: encryption (-E) and decryption (-D). The simplest usage is encrypting and decrypting a file:

$ blowpipe -E < data.gz > data.gz.enc
$ blowpipe -D < data.gz.enc | gunzip > data.txt

In both cases you will be prompted for a passphrase which can be up to 72 bytes in length. The only verification for the key is the first Message Authentication Code (MAC) in the datastream, so Blowpipe cannot tell the difference between damaged ciphertext and an incorrect key.

In a script it would be smart to check Blowpipe’s exit code after decrypting. The output will be truncated should authentication fail somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t clean up for you.

Another use case is securely transmitting files over a network with netcat. In this example I’ll use a pre-shared key file, keyfile. Rather than prompt for a key, Blowpipe will use the raw bytes of a given file. Here’s how I would create a key file:

$ head -c 32 /dev/urandom > keyfile

First the receiver listens on a socket (bind(2)):

$ nc -lp 2000 | blowpipe -D -k keyfile > data.zip

Then the sender connects (connect(2)) and pipes Blowpipe through:

$ blowpipe -E -k keyfile < data.zip | nc -N hostname 2000

If all went well, Blowpipe will exit with 0 on the receiver side.

Blowpipe doesn’t buffer its output (but see -w). It performs one read(2), encrypts whatever it got, prepends a MAC, and calls write(2) on the result. This means it can comfortably transmit live sensitive data across the network:

$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000

Kernel messages will appear on the other end as they’re produced by dmesg. Though keep in mind that the size of each line will be known to eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to disguise the length. Those lengths may leak useful information.

Blowfish

This whole project started when I wanted to play with Blowfish as a small drop-in library. I wasn’t satisfied with the selection, so I figured it would be a good exercise to write my own. Besides, the specification is both an enjoyable and easy read (and recommended). It justifies the need for a new cipher and explains the various design decisions.

I coded from the specification, including writing a script to generate the subkey initialization tables. Subkeys are initialized to the binary representation of pi (the first ~10,000 decimal digits). After a couple of hours of work I hooked up the official test vectors to see how I did, and all the tests passed on the first run. That didn’t seem reasonable, so I spent a while longer figuring out how I had screwed up my tests. Turns out I absolutely nailed it on my first shot. It’s a really great sign for Blowfish that it’s so easy to implement correctly.

Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of storage. The key schedule is unusually complex: Subkeys are repeatedly encrypted with themselves as they are being computed. This complexity inspired the bcrypt password hashing scheme, which essentially works by iterating the key schedule many times in a loop, then encrypting a constant 24-byte string. My bcrypt implementation wasn’t nearly as successful on my first attempt, and it took hours of debugging in order to match OpenBSD’s outputs.

The encryption and decryption algorithms are nearly identical, as is typical for, and a feature of, Feistel ciphers. There are no branches (preventing some side-channel attacks), and the only operations are 32-bit XOR and 32-bit addition. This makes it ideal for implementation on 32-bit computers.

One tricky point is that encryption and decryption operate on a pair of 32-bit integers (another giveaway that it’s a Feistel cipher). To put the cipher to practical use, these integers have to be serialized into a byte stream. The specification doesn’t choose a byte order, even for mixing the key into the subkeys. The official test vectors are also 32-bit integers, not byte arrays. An implementer could choose little endian, big endian, or even something else.

However, there’s one place in which this decision is formally made: the official test vectors mix the key into the first subkey in big endian byte order. By luck I happened to choose big endian as well, which is why my tests passed on the first try. OpenBSD’s version of bcrypt also uses big endian for all integer encoding steps, further cementing big endian as the standard way to encode Blowfish integers.

Blowfish library

The Blowpipe repository contains a ready-to-use, public domain Blowfish library written in strictly conforming C99. The interface is just three functions:

void blowfish_init(struct blowfish *, const void *key, int len);
void blowfish_encrypt(struct blowfish *, uint32_t *, uint32_t *);
void blowfish_decrypt(struct blowfish *, uint32_t *, uint32_t *);

Technically the key can be up to 72 bytes long, but the last 16 bytes have an incomplete effect on the subkeys, so only the first 56 bytes should matter. Since bcrypt runs the key schedule multiple times, all 72 bytes have full effect.
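For illustration, here’s roughly how those three functions fit together. This is just a sketch, not part of the library; the header name and key are my own placeholders:

#include <stdint.h>
#include "blowfish.h"  /* placeholder name for the library's header */

int roundtrip(void)
{
    struct blowfish ctx;
    const char key[] = "example key";           /* up to 72 bytes */
    uint32_t xl = 0x01234567, xr = 0x89abcdef;  /* one 64-bit block as two halves */

    blowfish_init(&ctx, key, sizeof(key) - 1);
    blowfish_encrypt(&ctx, &xl, &xr);           /* encrypts the block in place */
    blowfish_decrypt(&ctx, &xl, &xr);           /* restores the original block */
    return xl == 0x01234567 && xr == 0x89abcdef;  /* should be 1 */
}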

The library also includes a bcrypt implementation, though it will only produce the raw password hash, not the base-64 encoded form. The main reason for including bcrypt is to support Blowpipe.

Message format

The main goal of Blowpipe was to build a robust, authenticated encryption tool using only Blowfish as a cryptographic primitive. Blowfish ends up being used in three different ways:

  1. It uses bcrypt with a moderately high cost as a key derivation function (KDF). That’s not terrible, but bcrypt is not a memory-hard KDF, which matters for resisting brute force attacks on cheap, specialized hardware.

  2. Encryption is Blowfish in counter (CTR) mode: a 64-bit counter is incremented and encrypted, producing a keystream, and the plaintext is XORed with this keystream like a stream cipher (see the sketch just after this list). This allows the last block to be truncated in the output and eliminates padding issues. Since CTR mode is trivially malleable, the MAC becomes even more important. In CTR mode, blowfish_decrypt() is never called. In fact, Blowpipe never uses it.

  3. The authentication scheme is Blowfish-CBC-MAC with a unique key and encrypt-then-authenticate (something I harmlessly got wrong with Enchive). It essentially encrypts the ciphertext again with a different key, but in Cipher Block Chaining (CBC) mode, keeping only the final block. That final block is prepended to the ciphertext as the MAC. On decryption the same block is computed again to ensure that it matches. Only someone who knows the MAC key can compute it.
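Here’s the sketch promised above: one way to turn blowfish_encrypt() into a CTR keystream and XOR it over a buffer. The counter is serialized in big endian to match the convention discussed earlier. It’s an illustration of the mode, not Blowpipe’s actual code, and it glosses over details such as where the counter starts:

#include <stddef.h>
#include <stdint.h>

void ctr_xor(struct blowfish *bf, uint64_t counter, uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 8, counter++) {
        uint32_t xl = counter >> 32;             /* big endian: high half first */
        uint32_t xr = (uint32_t)counter;
        blowfish_encrypt(bf, &xl, &xr);          /* encrypt the counter block */
        uint8_t pad[8] = {
            xl >> 24, xl >> 16, xl >> 8, xl,
            xr >> 24, xr >> 16, xr >> 8, xr,
        };
        for (size_t j = 0; j < 8 && i + j < len; j++)
            buf[i + j] ^= pad[j];                /* last block truncates cleanly */
    }
}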

Of all three Blowfish uses, I’m least confident about authentication. CBC-MAC is tricky to get right, though I am following the rules: fixed length messages using a different key than encryption.
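For reference, the core of a Blowfish CBC-MAC looks something like this. It assumes a zero IV, a message already padded to a multiple of 8 bytes, and a context keyed separately from encryption; this is a sketch of the construction, not Blowpipe’s exact chunk handling:

#include <stddef.h>
#include <stdint.h>

void cbc_mac(struct blowfish *mac_key, const uint8_t *msg, size_t len,
             uint8_t mac[8])
{
    uint32_t xl = 0, xr = 0;                     /* zero IV */
    for (size_t i = 0; i < len; i += 8) {
        xl ^= (uint32_t)msg[i + 0] << 24 | (uint32_t)msg[i + 1] << 16 |
              (uint32_t)msg[i + 2] <<  8 | (uint32_t)msg[i + 3];
        xr ^= (uint32_t)msg[i + 4] << 24 | (uint32_t)msg[i + 5] << 16 |
              (uint32_t)msg[i + 6] <<  8 | (uint32_t)msg[i + 7];
        blowfish_encrypt(mac_key, &xl, &xr);     /* CBC: chain into the next block */
    }
    mac[0] = xl >> 24; mac[1] = xl >> 16; mac[2] = xl >> 8; mac[3] = xl;
    mac[4] = xr >> 24; mac[5] = xr >> 16; mac[6] = xr >> 8; mac[7] = xr;
}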

Wait a minute. Blowpipe is pipe-oriented and can output data without buffering the entire pipe. How can there be fixed-length messages?

The pipe datastream is broken into 64kB chunks. Each chunk is authenticated with its own MAC. Both the MAC and the chunk length are written in the chunk header, and the length is authenticated by the MAC. Furthermore, just like the keystream, the MAC is chained from the previous chunk, preventing chunks from being reordered. Blowpipe can output the content of a chunk and discard it once it’s been authenticated. If any chunk fails to authenticate, it aborts.

This also leads to another useful trick: The pipe is terminated with a zero length chunk, preventing an attacker from appending to the datastream. Everything after the zero-length chunk is discarded. Since the length is authenticated by the MAC, the attacker also cannot truncate the pipe since that would require knowledge of the MAC key.

The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1 byte for the bcrypt cost. The salt is like an initialization vector (IV) that allows keys to be safely reused in different Blowpipe instances. The cost byte is the only distinguishing byte in the stream. Since even the chunk lengths are encrypted, everything else in the datastream should be indistinguishable from random data.
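Putting it all together, the datastream looks roughly like this. The 17-byte header is as described above; beyond that, the field widths drawn here are my own illustration rather than the authoritative format:

[16-byte salt][1-byte cost]               pipe header (only the cost is distinguishable)
[MAC][chunk length][ciphertext ...]       chunk 1 (up to 64kB)
[MAC][chunk length][ciphertext ...]       chunk 2
...
[MAC][zero length]                        terminating chunk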

Portability

Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I initially wrote it for POSIX (on Linux) of course, but I took an unusual approach when it came time to port it to Windows. Normally I’d invent a generic OS interface that makes the appropriate host system calls. This time I kept the POSIX interface (read(2), write(2), open(2), etc.) and implemented the tiny subset of POSIX that I needed in terms of Win32. That implementation can be found under w32-compat/. I even dropped in a copy of my own getopt().

One really cool feature of this technique is that, on Windows, Blowpipe will still “open” /dev/urandom. It’s intercepted by my own open(2), which in response to that filename actually calls CryptAcquireContext() and pretends like it’s a file. It’s all hidden behind the file descriptor. That’s the unix way.

I’m considering giving Enchive the same treatment since it would simplify and reduce much of the interface code. In fact, this project has taught me a number of ways that Enchive could be improved. That’s the value of writing “toys” such as Blowpipe.

Gap Buffers Are Not Optimized for Multiple Cursors

Gap buffers are a common data structure for representing a text buffer in a text editor. Emacs famously uses gap buffers — long-standing proof that gap buffers are a perfectly sufficient way to represent a text buffer.

A gap buffer is really a pair of buffers where one buffer holds all of the content before the cursor (or point for Emacs), and the other buffer holds the content after the cursor. When the cursor is moved through the buffer, characters are copied from one buffer to the other. Inserts and deletes close to the gap are very efficient.

Typically it’s implemented as a single large buffer, with the pre-cursor content at the beginning, the post-cursor content at the end, and the gap spanning the middle. Here’s an illustration:

The top of the animation is the display of the text content and cursor as the user would see it. The bottom is the gap buffer state, where each character is represented as a gray block, and a literal gap for the cursor.

Ignoring for a moment more complicated concerns such as undo and Unicode, a gap buffer could be represented by something as simple as the following:

struct gapbuf {
    char *buf;
    size_t total;  /* total size of buf */
    size_t front;  /* size of content before cursor */
    size_t gap;    /* size of the gap */
};

This is close to how Emacs represents it. In the structure above, the size of the content after the cursor isn’t tracked directly, but can be computed on the fly from the other three quantities. That is to say, this data structure is normalized.

As an optimization, the cursor could be tracked separately from the gap such that non-destructive cursor movement is essentially free. The difference between cursor and gap would only need to be reconciled for a destructive change — an insert or delete.
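To make the copying concrete, here’s a minimal sketch of cursor movement and insertion against the struct above. It’s not the article’s implementation, and for brevity it assumes the gap never runs empty (a real buffer would reallocate and grow the gap):

#include <string.h>

void gapbuf_move(struct gapbuf *b, size_t cursor)
{
    if (cursor < b->front) {
        /* shift [cursor, front) to the far side of the gap */
        memmove(b->buf + cursor + b->gap, b->buf + cursor, b->front - cursor);
    } else if (cursor > b->front) {
        /* shift [front + gap, cursor + gap) to the near side of the gap */
        memmove(b->buf + b->front, b->buf + b->front + b->gap, cursor - b->front);
    }
    b->front = cursor;
}

void gapbuf_insert(struct gapbuf *b, char c)
{
    b->buf[b->front++] = c;  /* fill one byte of the gap */
    b->gap--;
}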

A gap buffer certainly isn’t the only way to do it. For example, the original vi used an array of lines, which sort of explains some of its quirky line-oriented idioms. The BSD clone of vi, nvi, uses an entire database to represent buffers. Vim uses a fairly complex rope-like data structure with page-oriented blocks, which may be stored out-of-order in its swap file.

Multiple cursors

Multiple cursors is a fairly recent text editor invention that has gained a lot of popularity in recent years. It seems every major editor either has the feature built in or has a readily-available extension. I myself used Magnar Sveen’s well-polished package for several years. Obviously the concept didn’t originate in Emacs, or else it would have been called multiple points, which doesn’t roll off the tongue quite the same way.

The concept is simple: if the same operation needs to be done in many different places in a buffer, you place a cursor at each position, then drive them all in parallel using the same commands. It’s super flashy and great for impressing all your friends.

However, as a result of improving my typing skills, I’ve come to the conclusion that multiple cursors is all hat and no cattle. It doesn’t compose well with other editing commands, it doesn’t scale up to large operations, and it’s got all sorts of flaky edge cases (off-screen cursors). Nearly anything you can do with multiple cursors, you can do better with old, well-established editing paradigms.

Somewhere around 99% of my multiple cursors usage was adding a common prefix to a contiguous series of lines. As similar brute force options, Emacs already has rectangular editing, and Vim already has visual block mode.

The most sophisticated, flexible, and robust alternative is a good old macro. You can play it back anywhere it’s needed. You can zip it across a huge buffer. The only downside is that it’s less flashy and so you’ll get invited to a slightly smaller number of parties.

But if you don’t buy my arguments about multiple cursors being tasteless, there’s still a good technical argument: Gap buffers are not designed to work well in the face of multiple cursors!

For example, suppose we have a series of function calls and we’d like to add the same set of arguments to each. It’s a classic situation for a macro or for multiple cursors. Here’s the original code:

foo();
bar();
baz();

The example is tiny so that it will fit in the animations to come. Here’s the desired code:

foo(x, y);
bar(x, y);
baz(x, y);

With multiple cursors you would place a cursor inside each set of parentheses, then type x, y. Visually it looks something like this:

Text is magically inserted in parallel in multiple places at a time. However, if this is a text editor that uses a gap buffer, the situation underneath isn’t quite so magical. The entire edit doesn’t happen at once. First the x is inserted in each location, then the comma, and so on. The edits are not clustered so nicely.

From the gap buffer’s point of view, here’s what it looks like:

For every individual character insertion the buffer has to visit each cursor in turn, performing lots of copying back and forth. The more cursors there are, the worse it gets. For an edit of length n with m cursors, that’s O(n * m) calls to memmove(3). Multiple cursors scales badly.

Compare that to the old school hacker who can’t be bothered with something as tacky and modern (eww!) as multiple cursors, instead choosing to record a macro, then play it back:

The entire edit is done locally before moving on to the next location. It’s perfectly in tune with the gap buffer’s expectations, only needing O(m) calls to memmove(3). Most of the work flows neatly into the gap.

So, don’t waste your time with multiple cursors, especially if you’re using a gap buffer text editor. Instead get more comfortable with your editor’s macro feature. If your editor doesn’t have a good macro feature, get a new editor.

If you want to make your own gap buffer animations, here’s the source code. It includes a tiny gap buffer implementation:

Ten Years of Blogging

As of today, I’ve been blogging for 10 years. In this time I’ve written 302,000 words across 343 articles — a rate of one article every week and a half. These articles form a record of my professional progress, touching on both “hard” technical skills and “soft” communication skills. My older articles are a personal reminder of how far I’ve come. They are proof that I’m not as stagnant as I sometimes feel, and it helps me to sympathize with others who are currently in those earlier stages of their own career.

That index where you can find these 343 articles is sorted newest-first, because it correlates with best-first. It’s a trend I hope to continue.

History

Before blogging, I had a simple Penn State student page showcasing a few small — and, in retrospect, trivial — side projects (1, 2, 3), none of which went anywhere. Around the beginning of my final semester of college, I was inspired by Mark Dominus’ The Universe of Discourse and Andy Owen’s Friendly Robot Overlord (gone since 2011) to start my own blosxom blog. It would be an outlet to actively discuss my projects. Some time later GitHub was founded, and I switched to a static blog hosted by GitHub Pages, which is where it lives to this day.

It’s been more challenging to manage all this content than I ever anticipated. It’s like maintaining a large piece of software, except it’s naturally more fragile. Any time I make a non-trivial change to the CSS, I have to inspect the archives to check if I broke older articles. If I did, sometimes it’s a matter of further adjusting the CSS. Other times I’ll mass edit a couple hundred articles in order to normalize some particular aspect, such as heading consistency or image usage. (Running a macro over an Emacs Dired buffer is great for this.)

I decided in those early days to Capitalize Every Word of the Title, and I’ve stuck with this convention purely out of consistency even though it’s looked weird to me for years. I don’t want to edit the old titles, and any hard changeover date would be even weirder (in the index listing).

With more foresight and experience, I could have laid down better conventions for myself from the beginning. Besides the universal impossibility of already having experience before it’s needed, there’s also the issue that the internet’s conventions have changed, too. This blog is currently built with HTML5, but this wasn’t always the case — especially considering that the blog predates HTML5. When I switched to HTML5, I also adjusted some of my own conventions to match since, at the time, I was still writing articles in raw HTML.

The mobile revolution has also arrived since I started this blog. Today, about one third of visitors read the blog from a mobile device, and I’ve adjusted the CSS to work well on those devices, too. To that third of you: I hope you’re enjoying the experience!

Just in case you haven’t tried it, the blog also works really well with terminal-based browsers, such as Lynx and ELinks. Go ahead and give it a shot. The header that normally appears at the top of the page is actually at the bottom of the HTML document structure. It’s out of the way for browsers that ignore CSS.

If that’s not enough, last year I also spent effort making the printed style of my articles look nice. Take a look at the printed version of this article (i.e. print preview, print to PDF), and make sure to turn off the little headers added by the browser. A media selector provides a separate print stylesheet. Chrome / Chromium has consistently had the best results, followed by Firefox. Someday I’d like for browsers to be useful as typesetting engines for static documents — as an alternative to LaTeX and Groff — but they’ve still got a ways to go with paged media. (Trust me, I’ve tried.)

With more foresight, I could have done something better with my permanent URLs. Notice how they’re just dates and don’t include the title. URLs work better when they include human-meaningful context. Ideally I should be able to look at any one of my URLs and know what it’s about. Again, this decision goes all the way back to those early days when I first configured blosxom, not knowing any better.

URLs are forever, and I don’t want to break all my old links. Consistency is better than any sort of correction. I’m also practically limited to one article per day, though this has never been a real problem.

Motivation

For me, an important motivation for writing is to say something unique about a topic. For example, I’m not interested in writing a tutorial unless either no such tutorial already exists, or there’s some vital aspect all the existing tutorials miss. Each article should add new information to the internet, either raw information or through assembling existing information in a new way or with a unique perspective.

I also almost constantly feel like I’m behind the curve, like I don’t know enough or I don’t have enough skill. As many of you know, the internet is really effective at causing these feelings. Every day lots of talented people are sharing interesting and complicated things across all sorts of topics. For topics that overlap my own desired expertise, those projects have me thinking, “Man, I have no idea how to even start doing that. There’s so much I don’t know, and so much more I need to learn.” Writing articles as I learn is a great way to keep on top of new subjects.

This is tied to another problem: I have a tendency to assume the things I’ve known for awhile are common knowledge. This shows up in a few ways. First, if everyone knows what I know, plus they know a bunch of things that I don’t know, then I’ve got a lot of catching up to do. That’s another source of feeling behind the curve.

Second, when writing an article on a topic where I’ve got years of experience, I leave out way too many important details, assuming the reader already knows them. When an article I regard as valuable gets a weak response, it’s probably related to this issue.

Third, after three years of teaching, it seems like it’s becoming more difficult to put myself in the student’s shoes. I’m increasingly further away from my own days learning those early topics, and it’s harder to remember the limitations of my knowledge at that time. Having this blog really helps, and I’ve re-read some of my older articles to recall my mindset at the time.

Another way the blog helps is that it’s like having my own textbook. When teaching a topic to someone — and not necessarily a formal mentee — or even when just having a discussion, I will reference my articles when appropriate. Since they’re designed to say something unique, my article may be the only place to find certain information in a conveniently packaged form.

Finally, the last important motivating factor is that I want to spread my preferred memes. Obviously the way I do things is the Right Way, and the people who do things differently (the Wrong Way) are stupid and wrong. By writing about the sorts of topics and technologies I enjoy — C, low-level design, Emacs, Vim, programming on unix-like systems, my personal programming style — I’m encouraging others to follow my lead. Surely I’m responsible for at least one Emacs convert out there!

“Formal” professional writing

Despite having three or four novels worth of (mostly) technical writing here, my formal workplace writing leaves much to be desired. I’m no standout in this area. In the same period of time I’ve written just a handful of formal memos, each about the same length as a long blog post.

Why so few? These memos are painful to write. In order for the writing to be officially recognized, it must go through the formal memo process. What this means is that, compared to a blog post, these memos take at least an order of magnitude longer to write.

The process involves more people, dragged out over a long period of time. The writing is a lot less personal, which, in my opinion, makes it drier and less enjoyable. After the initial draft is complete, I have to switch to vastly inferior tools: emailing the same Microsoft Office documents back and forth between various people, without any proper source control. The official, mandated memo template was created by someone who didn’t know how to effectively operate a word processor, who had a poor sense of taste (ALL CAPS HEADINGS), and who obviously had no training in typesetting or style.

At the end of this long process, it’s filed into a system with no practical search capability, and where it will be quietly purged after five years, never to be seen again. Outside of the reviewers who were directly involved in the memo process, somewhere between zero and two people will have actually read it. Literally.

Arguably the memo might be more polished than a blog post. I’m skeptical of this, but suppose it’s true. I’d still much rather have written ten less-polished blog posts than one more-polished memo. That’s also ten shots at producing, by chance, a more valuable article than the single, precious memo.

Let’s broaden the scope to academic papers. Thanks to some great co-workers — all three of whom are smarter and handsomer than me — a year ago I finally got a published academic paper under my belt (and more to come): ROP Gadget Prevalence and Survival under Compiler-based Binary Diversification Schemes (and I said memos were dry!). A ton of work went into this paper, and it’s far more substantial than any memo or single blog post. The process was a lot more pleasant (LaTeX instead of Word), and the results are definitely much more polished than a typical blog post. It reads well and has interesting information to present.

This all sounds great until you consider the impact. According to ACM’s statistics, the paper has been accessed 130 times as of this writing. (Yes, providing an unofficial link to the paper like I just did above doesn’t help, but I ran out of those crappy “free” links. Sue me.) Sure, the PDF might have been passed around in untrackable ways, but I also bet a lot of those accesses were just downloads that were never read. So let’s split the difference and estimate around 130 people read it.

What kind of impact does a blog post have? Talking about these numbers feels a bit taboo, like discussing salaries, but it’s important for the point I’m about to make.

Note that all but the last have been published for less time than our paper. The average time on these pages is between 5 and 6 minutes, so these are actual readers, not just visitors who take one glance and leave. Thanks to the information age, a technical article on an established blog can reach an audience 100 times larger than a journal’s for a fraction of the effort and cost. There are other benefits, too:

  1. I get immediate feedback in the form of comments and email (open peer review).

  2. The content is available for free (open access). It’s trivial to link and share blog articles.

  3. Even more, this entire blog is in the public domain. If you don’t believe me, check out the public domain dedication in the footer of this page. It’s been there for years, and you can verify that yourself. Every single change to this blog in the past 6 years has been publicly documented (transparency).

  4. When I write about a topic, I make it a goal to provide the code and data to try it for yourself (open data and materials). This code and data is also either in the public domain, or as close to it as possible.

  5. Link aggregators and social media are great at finding the best stuff and making it more visible (censorship resistance). When I have a big hit, it’s often Reddit or Hacker News driving people to my article. Sometimes it’s other blogs.

In 2017, a blog is by far the most effective way to publish and share written information, and, as the list above shows, it can embody more of the qualities of good science (open review, open access, transparency) than a traditional journal. More people should be publishing through blog articles rather than through traditional, far less effective forms of publication.

Since this has proven to be such an effective medium, I’m going to make a promise right here, right now. And as I explained above with transparency, there are no take-backs on this one. If you’re reading this, it’s on the public record forever. I promise to deliver another 10 years of free, interesting, useful content. This stuff is going to be even better, too. On this blog’s 20th anniversary, September 1st, 2027, we’ll see how I did.

Vim vs. Emacs: The Working Directory

Vim and Emacs have different internal models of the current working directory, and these models influence the overall workflow for each editor. They decide how files are opened, how shell commands are executed, and how the build system is operated. These effects even reach outside the editor to influence the overall structure of the project being edited.

In the traditional unix model, which was eventually adopted everywhere else, each process has a particular working directory tracked by the operating system. When a process makes a request to the operating system using a relative path — a path that doesn’t begin with a slash — the operating system uses the process’ working directory to convert the path into an absolute path. When a process forks, its child starts in the same directory. A process can change its working directory at any time using chdir(2), though most programs never need to do it. The most obvious way this system call is exposed to regular users is through the shell’s built-in cd command.
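As a small illustration of that model (a sketch, with error checking mostly omitted): every relative path a process uses is resolved against its single working directory until the process calls chdir(2):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char cwd[4096];
    if (getcwd(cwd, sizeof(cwd)))
        puts(cwd);                     /* working directory inherited from the parent (the shell) */

    FILE *f = fopen("Makefile", "r");  /* "Makefile" is resolved against that directory */
    if (f)
        fclose(f);

    chdir("..");                       /* from now on, relative paths resolve one level up */
    if (getcwd(cwd, sizeof(cwd)))
        puts(cwd);
    return 0;
}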

Vim’s spiritual heritage is obviously rooted in vi, one of the classic unix text editors, and the most elaborate text editor standardized by POSIX. Like vi, Vim closely follows the unix model for working directories. At any given time Vim has exactly one working directory. Shell commands that are run within Vim will start in Vim’s working directory. Like a shell, the cd ex command changes and queries Vim’s working directory.

Emacs eschews this model: instead, each buffer has its own working directory, tracked using a buffer-local variable, default-directory. Emacs internally simulates working directories for its buffers like an operating system, resolving relative paths itself, which gives credence to the idea that Emacs is an operating system (“lacking only a decent editor”). Perhaps this model comes from ye olde lisp machines?

In contrast, Emacs’ M-x cd command manipulates the local variable and has no effect on the Emacs process’ working directory. In fact, Emacs completely hides its operating system working directory from Emacs Lisp. This can cause some trouble if that hidden working directory happens to be sitting on filesystem you’d like to unmount.

Vim can be configured to simulate Emacs’ model with its autochdir option. When set, Vim will literally chdir(2) each time the user changes buffers, switches windows, etc. To the user, this feels just like Emacs’ model, but this is just a convenience, and the core working directory model is still the same.

Single instance editors

For most of my Emacs career, I’ve stuck to running a single, long-lived Emacs instance no matter how many different tasks I’m touching simultaneously. I start the Emacs daemon shortly after logging in, and it continues running until I log out — typically only when the machine is shut down. It’s common to have multiple Emacs windows (frames) for different tasks, but they’re all bound to the same daemon process.

While with care it’s possible to have a complex, rich Emacs configuration that doesn’t significantly impact Emacs’ startup time, the general consensus is that Emacs is slow to start. But since it has a really solid daemon, this doesn’t matter: hardcore Emacs users only ever start Emacs occasionally. The rest of the time they’re launching emacsclient and connecting to the daemon. Outside of system administration, it’s the most natural way to use Emacs.

The case isn’t so clear for Vim. Vim is so fast that many users fire it up on demand and exit when they’ve finished the immediate task. At the other end of the spectrum, others advocate using a single instance of Vim like running a single Emacs daemon. In my initial dive into Vim, I tried the single-instance, Emacs way of doing things. I set autochdir out of necessity and pretended each buffer had its own working directory.

At least for me, this isn’t the right way to use Vim, and it all comes down to working directories. I want Vim to be anchored at the project root with one Vim instance per project. Everything is smoother when it happens in the context of the project’s root directory, from opening files, to running shell commands (ctags in particular), to invoking the build system. With autochdir, these actions are difficult to do correctly, particularly the last two.

Invoking the build

I suspect Emacs’ model of per-buffer working directories has, in a Sapir-Whorf sort of way, been responsible for leading developers towards poorly-designed, recursive Makefiles. Without a global concept of working directory, it’s inconvenient to invoke the build system (M-x compile) in some particular grandparent directory that is the root of the project. If each directory has its own Makefile, it usually makes sense to invoke make in the same directory as the file being edited.

Over the years I’ve been reinventing the same solution to this problem, and it wasn’t until I spent time with Vim and its alternate working directory model that I truly understood the problem. Emacs itself has long had a solution lurking deep in its bowels, unseen by daylight: dominating files. The function I’m talking about is locate-dominating-file:

(locate-dominating-file FILE NAME)

Look up the directory hierarchy from FILE for a directory containing NAME. Stop at the first parent directory containing a file NAME, and return the directory. Return nil if not found. Instead of a string, NAME can also be a predicate taking one argument (a directory) and returning a non-nil value if that directory is the one for which we’re looking.

The trouble of invoking the build system at the project root is that Emacs doesn’t really have a concept of a project root. It doesn’t know where it is or how to find it. The vi model inherited by Vim is to leave the working directory at the project root. While Vim can simulate Emacs’ working directory model, Emacs cannot (currently) simulate Vim’s model.

Instead, by identifying a file name unique to the project’s root (i.e. a “dominating” file) such as Makefile or build.xml, then locate-dominating-file can discover the project root. All that’s left is wrapping M-x compile so that default-directory is temporarily adjusted to the project’s root.

That looks very roughly like this (and needs more work):

(defun my-compile ()
  (interactive)
  ;; Fall back to the current directory when no Makefile is found above.
  (let ((default-directory (or (locate-dominating-file "." "Makefile")
                               default-directory)))
    (compile "make")))

It’s a pattern I’ve used again and again and again, working against the same old friction. By running one Vim instance per project at the project’s root, I get the correct behavior for free.
