A Branchless UTF-8 Decoder

This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. You can find the source code here along with a test suite and benchmark:

In addition to decoding the next code point, it detects any errors and returns a pointer to the next code point. It’s the complete package.

Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.

The usual analogy is laundry. You can have more than one load of laundry in process at a time because laundry is typically a pipelined process. There’s a washing machine stage, dryer stage, and folding stage. One load can be in the washer, a second in the dryer, and a third being folded, all at once. This greatly increases throughput because, under ideal circumstances with a full pipeline, an instruction is completed each clock cycle despite any individual instruction taking many clock cycles to complete.

Branches are the enemy of pipelines. The CPU can’t begin work on the next instruction if it doesn’t know which instruction will be executed next. It must finish computing the branch condition before it can know. To deal with this, pipelined CPUs are also equipped with branch predictors. The predictor makes a guess at which branch will be taken and begins executing instructions along that branch. The prediction is initially made using static heuristics, and later those predictions are improved by learning from previous behavior. This even includes predicting the number of iterations of a loop so that the final iteration isn’t mispredicted.

A mispredicted branch has two dire consequences. First, all the progress on the incorrect branch will need to be discarded. Second, the pipeline will be flushed, and the CPU will be inefficient until the pipeline fills back up with instructions on the correct branch. With a sufficiently deep pipeline, it can easily be more efficient to compute and discard an unneeded result than to avoid computing it in the first place. Eliminating branches means eliminating the hazards of misprediction.

Another hazard for pipelines is dependencies. If an instruction depends on the result of a previous instruction, it may have to wait for the previous instruction to make sufficient progress before it can complete one of its stages. This is known as a pipeline stall, and it is an important consideration in instruction set architecture (ISA) design.

For example, on the x86-64 architecture, storing a 32-bit result in a 64-bit register will automatically clear the upper 32 bits of that register. Any further use of that destination register cannot depend on prior instructions since all bits have been set. This particular optimization was missed in the design of the i386: Writing a 16-bit result to a 32-bit register leaves the upper 16 bits intact, creating false dependencies.

Dependency hazards are mitigated using out-of-order execution. Rather than execute two dependent instructions back to back, which would result in a stall, the CPU may instead execute an independent instruction from further away in between. A good compiler will also try to spread out dependent instructions in its own instruction scheduling.

The effects of out-of-order execution are typically not visible to a single thread, where everything will appear to have executed in order. However, when multiple processes or threads can access the same memory, out-of-order execution can be observed. It’s one of the many challenges of writing multi-threaded software.

The focus of my UTF-8 decoder was to be branchless, but there was one interesting dependency hazard that neither GCC nor Clang was able to resolve on its own. More on that later.

What is UTF-8?

Without getting into the history of it, you can generally think of UTF-8 as a method for encoding a series of 21-bit integers (code points) into a stream of bytes. Two rules constrain the encoding: each code point must be encoded in the shortest form that can hold it, and the surrogate halves (U+D800 through U+DFFF) are not valid code points.

Keeping these two rules in mind, the entire format is summarized by this table:

length byte[0]  byte[1]  byte[2]  byte[3]
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The x placeholders are the bits of the encoded code point.

UTF-8 has some really useful properties:

A straightforward approach to decoding might look something like this:

unsigned char *
utf8_simple(unsigned char *s, long *c)
{
    unsigned char *next;
    if (s[0] < 0x80) {
        *c = s[0];
        next = s + 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        *c = ((long)(s[0] & 0x1f) <<  6) |
             ((long)(s[1] & 0x3f) <<  0);
        next = s + 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        *c = ((long)(s[0] & 0x0f) << 12) |
             ((long)(s[1] & 0x3f) <<  6) |
             ((long)(s[2] & 0x3f) <<  0);
        next = s + 3;
    } else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4)) {
        *c = ((long)(s[0] & 0x07) << 18) |
             ((long)(s[1] & 0x3f) << 12) |
             ((long)(s[2] & 0x3f) <<  6) |
             ((long)(s[3] & 0x3f) <<  0);
        next = s + 4;
    } else {
        *c = -1; // invalid
        next = s + 1; // skip this byte
    }
    if (*c >= 0xd800 && *c <= 0xdfff)
        *c = -1; // surrogate half
    return next;
}

It branches off on the highest bits of the leading byte, extracts all of those x bits from each byte, concatenates those bits, checks if it’s a surrogate half, and returns a pointer to the next character. (This implementation does not check that the highest two bits of each continuation byte are correct.)

The CPU must correctly predict the length of the code point or else it will suffer a hazard. An incorrect guess will stall the pipeline and slow down decoding.

In real-world text this is probably not a serious issue. For the English language, the encoded length is nearly always a single byte. However, even for non-English languages, text is usually accompanied by markup from the ASCII range of characters, and, overall, the encoded lengths will still be fairly consistent. As I said, the CPU predicts branches based on the program’s previous behavior, so this means it will temporarily learn some of the statistical properties of the language being actively decoded. Pretty cool, eh?

Eliminating branches from the decoder side-steps any issues with mispredicting encoded lengths. Only errors in the stream will cause stalls. Since that’s probably the unusual case, the branch predictor will be very successful by continually predicting success. That’s one optimistic CPU.

The branchless decoder

Here’s the interface to my branchless decoder:

void *utf8_decode(void *buf, uint32_t *c, int *e);

I chose void * for the buffer so that it doesn’t care what type was actually chosen to represent the buffer. It could be a uint8_t, char, unsigned char, etc. Doesn’t matter. The decoder accesses it only as bytes.

On the other hand, with this interface you’re forced to use uint32_t to represent code points. You could always change the function to suit your own needs, though.

Errors are returned in e. It’s zero for success and non-zero when an error was detected, without any particular meaning for different values. Error conditions are mixed into this integer, so a zero simply means the absence of error.

This is where you could accuse me of “cheating” a little bit. The caller probably wants to check for errors, and so they will have to branch on e. It seems I’ve just smuggled the branches outside of the decoder.

However, as I pointed out, unless you’re expecting lots of errors, the real cost is branching on encoded lengths. Furthermore, the caller could instead accumulate the errors: count them, or make the error “sticky” by ORing all e values together. Neither of these require a branch. The caller could decode a huge stream and only check for errors at the very end. The only branch would be the main loop (“are we done yet?”), which is trivial to predict with high accuracy.

The first thing the function does is extract the encoded length of the next code point:

    static const char lengths[] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
    };

    unsigned char *s = buf;
    int len = lengths[s[0] >> 3];

Looking back to the UTF-8 table above, only the highest 5 bits determine the length. That’s 32 possible values. The zeros are for invalid prefixes. This will later cause a bit to be set in e.

With the length in hand, it can compute the position of the next code point in the buffer.

    unsigned char *next = s + len + !len;

Originally this expression was the return value, computed at the very end of the function. However, after inspecting the compiler’s assembly output, I decided to move it up, and the result was a solid performance boost. That’s because it spreads out dependent instructions. With the address of the next code point known so early, the instructions that decode the next code point can get started early.

The reason for the !len is so that the pointer is advanced one byte even in the face of an error (length of zero). Adding that !len is actually somewhat costly, though I couldn’t figure out why.

    static const unsigned char masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const int shiftc[] = {0, 18, 12, 6, 0};

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

This reads four bytes regardless of the actual length. Avoiding the unnecessary reads would itself require branching, so this can’t be helped. The unneeded bits are shifted out based on the length. That’s all it takes to decode UTF-8 without branching.

One important consequence of always reading four bytes is that the caller must zero-pad the buffer so that at least three bytes follow the last character. In practice, this means appending three zero bytes to the entire buffer in case the last character is a single byte.

The padding must be zero in order to detect errors. Otherwise the padding might look like legal continuation bytes.

    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shifte[] = {0, 6, 4, 2, 0};

    *e  = (*c < mins[len]) << 6;
    *e |= ((*c >> 11) == 0x1b) << 7;  // surrogate half?
    *e |= (s[1] & 0xc0) >> 2;
    *e |= (s[2] & 0xc0) >> 4;
    *e |= (s[3]       ) >> 6;
    *e ^= 0x2a;
    *e >>= shifte[len];

The first line checks whether the shortest possible encoding was used, setting a bit in e if it wasn’t. For a length of 0, this comparison always sets the error bit.

The second line checks for a surrogate half by checking for a certain prefix.

The next three lines accumulate the highest two bits of each continuation byte into e. Each should be the bits 10. These bits are “compared” to 101010 (0x2a) using XOR. The XOR clears these bits as long as they exactly match.

Finally the continuation prefix bits that don’t matter are shifted out.

The goal

My primary — and totally arbitrary — goal was to beat the performance of Björn Höhrmann’s DFA-based decoder. Under favorable (and artificial) benchmark conditions I had moderate success. You can try it out on your own system by cloning the repository and running make bench.

With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.

The benchmark operates very similarly to my PRNG shootout (e.g. alarm(2)). First a buffer is filled with random UTF-8 data, then the decoder decodes it again and again until the alarm fires. The measurement is the number of bytes decoded.

The number of errors is printed at the end (always 0) in order to force errors to actually get checked for each code point. Otherwise the sneaky compiler omits the error checking from the branchless decoder, making it appear much faster than it really is — a serious letdown once I noticed my error. Since the other decoder is a DFA and error checking is built into its graph, the compiler can’t really omit its error checking.

I called this “favorable” because the buffer being decoded isn’t anything natural. Each time a code point is generated, first a length is chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that length is generated. The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction. Real text has a far more favorable distribution.

uint32_t
randchar(uint64_t *s)
{
    uint32_t r = rand32(s);
    int len = 1 + (r & 0x3);
    r >>= 2;
    switch (len) {
        case 1:
            return r % 128;
        case 2:
            return 128 + r % (2048 - 128);
        case 3:
            return 2048 + r % (65536 - 2048);
        case 4:
            return 65536 + r % (131072 - 65536);
    }
    return 0; // unreachable
}
Given the odd input zero-padding requirement and the artificial parameters of the benchmark, despite the supposed 20% speed boost under GCC, my branchless decoder is not really any better than the DFA decoder in practice. It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.

Update: Bryan Donlan has followed up with a SIMD UTF-8 decoder.

Finding the Best 64-bit Simulation PRNG

I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes.

For simulation purposes, there are two primary factors that drive the selection of a PRNG, and these factors can be at odds with each other:

  1. The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.

  2. PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Just as important is how you use it, but this article will focus only on generating bits.

In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.

Depending on the programming language, the PRNGs found in various standard libraries may be of dubious quality. They’re slower than they need to be, or have poorer quality than required. In some cases, such as rand() in C, the algorithm isn’t specified, and you can’t rely on it for anything outside of trivial examples. In other cases the algorithm and behavior is specified, but you could easily do better yourself.

My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”

For the past few years my main go-to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:

Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.

Shootout results

Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.

                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060

And the actual dieharder outputs:

The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s clearly the fastest, and the output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.

These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.

Speed benchmark

In the speed benchmark, the PRNG is initialized, a 1-second alarm(1) is set, then the PRNG fills a large volatile buffer of 64-bit unsigned integers again and again as quickly as possible until the alarm fires. The amount of memory written is measured as the PRNG’s speed.

The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.

The purpose for making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.

On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.

The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect in my tests.

Due to the PRNGs being inlined, this unrolling makes the benchmark less realistic, and it shows in the results. Using volatile for the buffer helped to counter this effect and reground the results. This is a fuzzy problem, and there’s not really any way to avoid it, but I will also discuss this below.

Statistical benchmark

To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:

prng | dieharder -g200 -a -m4

This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.

There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in actual shootout) got 7 WEAK results in my tests.

The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.

Individual PRNGs

Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.

Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.

Blowfish

One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.

Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.

The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.

In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.

In CBC mode, each output depends on the previous, but in CTR mode all blocks are independent. Work can begin on the next output before the previous output is complete. The x86 architecture uses out-of-order execution to achieve many of its performance gains: Instructions may be executed in a different order than they appear in the program, though their observable effects must generally be ordered correctly. Breaking dependencies between instructions allows out-of-order execution to be fully exercised. It also gives the compiler more freedom in instruction scheduling, though the volatile accesses cannot be reordered with respect to each other (hence it helping to reground the benchmark).

Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.

On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?

In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.

Mersenne Twister (MT19937-64)

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:

Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

Permuted Congruential Generator (PCG)

The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.

PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).

Here’s essentially what the simplified version looks like:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

The line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.

The two constants are entirely my own devising. It’s two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.

Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.

However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.

Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.

Here’s my doubled-up version:

uint64_t
spcg64(uint64_t s[2])
{
    uint64_t m  = 0x9b60933458e17d7d;
    uint64_t a0 = 0xd737232eeccdf7ed;
    uint64_t a1 = 0x8b260b70b8e98891;
    uint64_t p0 = s[0];
    uint64_t p1 = s[1];
    s[0] = p0 * m + a0;
    s[1] = p1 * m + a1;
    int r0 = 29 - (p0 >> 61);
    int r1 = 29 - (p1 >> 61);
    uint64_t high = p0 >> r0;
    uint32_t low  = p1 >> r1;
    return (high << 32) | low;
}

The “full” PCG has some extra shifts that makes it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.

In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.

RC4

This is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.

xoroshiro128+

xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:

uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}

There’s one important caveat: That 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.

xorshift128+ and xorshift64*

These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.

uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.

If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.

uint64_t
xorshift128plus(uint64_t s[2])
{
    uint64_t x = s[0];
    uint64_t y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}

It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.


My own takeaway (until I re-evaluate some years in the future):

Things can change significantly between platforms, though. Here’s the shootout on an ARM Cortex-A53:

                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080

LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.

So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.

Blowpipe: a Blowfish-encrypted, Authenticated Pipe

Blowpipe is a toy crypto tool that creates a Blowfish-encrypted pipe. It doesn’t open any files and instead encrypts and decrypts from standard input to standard output. This pipe can encrypt individual files or even encrypt a network connection (à la netcat).

Most importantly, since Blowpipe is intended to be used as a pipe (duh), it will never output decrypted plaintext that hasn’t been authenticated. That is, it will detect tampering of the encrypted stream and truncate its output, reporting an error, without producing the manipulated data. Some very similar tools that aren’t considered toys lack this important feature, such as aespipe.


Blowpipe came about because I wanted to study Blowfish, a 64-bit block cipher designed by Bruce Schneier in 1993. It’s played an important role in the history of cryptography and has withstood cryptanalysis for 24 years. Its major weakness is its small block size, leaving it vulnerable to birthday attacks regardless of any other property of the cipher. Even in 1993 the 64-bit block size was a bit on the small side, but Blowfish was intended as a drop-in replacement for the Data Encryption Standard (DES) and the International Data Encryption Algorithm (IDEA), other 64-bit block ciphers.

The main reason I’m calling this program a toy is that, outside of legacy interfaces, it’s simply not appropriate to deploy a 64-bit block cipher in 2017. Blowpipe shouldn’t be used to encrypt more than a few tens of GBs of data at a time. Otherwise I’m fairly confident in both my message construction and my implementation. One detail is a little uncertain, and I’ll discuss it later when describing the message format.

A tool that I am confident about is Enchive, though since it’s intended for file encryption, it’s not appropriate for use as a pipe. It doesn’t authenticate until after it has produced most of its output. Enchive does try its best to delete files containing unauthenticated output when authentication fails, but this doesn’t prevent you from consuming this output before it can be deleted, particularly if you pipe the output into another program.


As you might expect, there are two modes of operation: encryption (-E) and decryption (-D). The simplest usage is encrypting and decrypting a file:

$ blowpipe -E < data.gz > data.gz.enc
$ blowpipe -D < data.gz.enc | gunzip > data.txt

In both cases you will be prompted for a passphrase which can be up to 72 bytes in length. The only verification for the key is the first Message Authentication Code (MAC) in the datastream, so Blowpipe cannot tell the difference between damaged ciphertext and an incorrect key.

In a script it would be smart to check Blowpipe’s exit code after decrypting. The output will be truncated should authentication fail somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t clean up for you.

Another use case is securely transmitting files over a network with netcat. In this example I’ll use a pre-shared key file, keyfile. Rather than prompt for a key, Blowpipe will use the raw bytes of a given file. Here’s how I would create a key file:

$ head -c 32 /dev/urandom > keyfile

First the receiver listens on a socket (bind(2)):

$ nc -lp 2000 | blowpipe -D -k keyfile > data.zip

Then the sender connects (connect(2)) and pipes Blowpipe through:

$ blowpipe -E -k keyfile < data.zip | nc -N hostname 2000

If all went well, Blowpipe will exit with 0 on the receiver side.

Blowpipe doesn’t buffer its output (but see -w). It performs one read(2), encrypts whatever it got, prepends a MAC, and calls write(2) on the result. This means it can comfortably transmit live sensitive data across the network:

$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000

Kernel messages will appear on the other end as they’re produced by dmesg. Though keep in mind that the size of each line will be known to eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to disguise the length. Those lengths may leak useful information.


This whole project started when I wanted to play with Blowfish as a small drop-in library. I wasn’t satisfied with the selection, so I figured it would be a good exercise to write my own. Besides, the specification is both an enjoyable and easy read (and recommended). It justifies the need for a new cipher and explains the various design decisions.

I coded from the specification, including writing a script to generate the subkey initialization tables. Subkeys are initialized to the binary representation of pi (the first ~10,000 decimal digits). After a couple hours of work I hooked up the official test vectors to see how I did, and all the tests passed on the first run. This wasn’t reasonable, so I spent a while longer figuring out how I screwed up my tests. Turns out I absolutely nailed it on my first shot. It’s a really great sign for Blowfish that it’s so easy to implement correctly.

Blowfish’s key schedule produces five subkey arrays (an 18-entry P-array and four 256-entry S-boxes) requiring 4,168 bytes of storage. The key schedule is unusually complex: subkeys are repeatedly encrypted with themselves as they are being computed. This complexity inspired the bcrypt password hashing scheme, which essentially works by iterating the key schedule many times in a loop, then encrypting a constant 24-byte string. My bcrypt implementation wasn’t nearly as successful on my first attempt, and it took hours of debugging to match OpenBSD’s output.

The encryption and decryption algorithms are nearly identical, as is typical for, and a feature of, Feistel ciphers. There are no branches (preventing some side-channel attacks), and the only operations are 32-bit XOR and 32-bit addition. This makes it ideal for implementation on 32-bit computers.

One tricky point is that encryption and decryption operate on a pair of 32-bit integers (another giveaway that it’s a Feistel cipher). To put the cipher to practical use, these integers have to be serialized into a byte stream. The specification doesn’t choose a byte order, even for mixing the key into the subkeys. The official test vectors are also 32-bit integers, not byte arrays. An implementer could choose little endian, big endian, or even something else.

However, there’s one place in which this decision is formally made: the official test vectors mix the key into the first subkey in big endian byte order. By luck I happened to choose big endian as well, which is why my tests passed on the first try. OpenBSD’s version of bcrypt also uses big endian for all integer encoding steps, further cementing big endian as the standard way to encode Blowfish integers.

Blowfish library

The Blowpipe repository contains a ready-to-use, public domain Blowfish library written in strictly conforming C99. The interface is just three functions:

void blowfish_init(struct blowfish *, const void *key, int len);
void blowfish_encrypt(struct blowfish *, uint32_t *, uint32_t *);
void blowfish_decrypt(struct blowfish *, uint32_t *, uint32_t *);

Technically the key can be up to 72 bytes long, but the last 16 bytes have an incomplete effect on the subkeys, so only the first 56 bytes should matter. Since bcrypt runs the key schedule multiple times, all 72 bytes have full effect.

The library also includes a bcrypt implementation, though it will only produce the raw password hash, not the base-64 encoded form. The main reason for including bcrypt is to support Blowpipe.

Message format

The main goal of Blowpipe was to build a robust, authenticated encryption tool using only Blowfish as a cryptographic primitive.

  1. It uses bcrypt with a moderately high cost as a key derivation function (KDF). Not terrible, but bcrypt is not a memory-hard KDF, which is important for protecting against brute force attacks on cheap hardware.

  2. Encryption is Blowfish in counter (CTR) mode. A 64-bit counter is incremented and encrypted, producing a keystream. The plaintext is XORed with this keystream like a stream cipher. This allows the last block to be truncated when output and eliminates some padding issues. Since CTR mode is trivially malleable, the MAC becomes even more important. In CTR mode, blowfish_decrypt() is never called. In fact, Blowpipe never uses it.

  3. The authentication scheme is Blowfish-CBC-MAC with a unique key and encrypt-then-authenticate (something I harmlessly got wrong with Enchive). It essentially encrypts the ciphertext again with a different key in Cipher Block Chaining (CBC) mode, but keeps only the final block. That final block is prepended to the ciphertext as the MAC. On decryption the same block is computed again to ensure that it matches. Only someone who knows the MAC key can compute it.
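The CTR construction in step 2 can be sketched in C. This is only a sketch: the struct and function names are my own, and a trivial stand-in mixer replaces the real Blowfish call so the fragment is self-contained. The property it demonstrates, that encryption and decryption are the same XOR, holds no matter what the block function is:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the Blowfish context and blowfish_encrypt(); NOT a
 * cipher, it exists only so this sketch is self-contained. */
struct toyfish { uint32_t key; };

static void
toy_encrypt(struct toyfish *ctx, uint32_t *xl, uint32_t *xr)
{
    *xl ^= ctx->key;
    *xr ^= ~ctx->key;
}

/* CTR mode: encrypt an incrementing 64-bit counter to produce a
 * keystream, then XOR it with the data. Decryption is the very same
 * call, and the final block truncates naturally. */
static void
ctr_xor(struct toyfish *ctx, uint64_t *counter, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len) {
        uint32_t hi = (uint32_t)(*counter >> 32);
        uint32_t lo = (uint32_t)*counter;
        (*counter)++;
        toy_encrypt(ctx, &hi, &lo);
        unsigned char ks[8] = {            /* big endian, as discussed */
            hi >> 24, hi >> 16, hi >> 8, hi,
            lo >> 24, lo >> 16, lo >> 8, lo,
        };
        size_t n = len < 8 ? len : 8;
        for (size_t i = 0; i < n; i++)
            p[i] ^= ks[i];
        p += n;
        len -= n;
    }
}
```

XORing twice with the same starting counter restores the plaintext, which is exactly why blowfish_decrypt() is never needed in this mode.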

Of all three Blowfish uses, I’m least confident about authentication. CBC-MAC is tricky to get right, though I am following the rules: fixed length messages using a different key than encryption.
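To make the CBC-MAC shape concrete, here’s a minimal sketch over 8-byte blocks. Again the block mixer is a self-contained stand-in of my own, not Blowfish; what matters is the structure: XOR each block into the chaining state, encrypt the state, and keep only the final state as the MAC.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in block function; NOT a real cipher. Each round step is
 * invertible, so the whole mixer is a bijection on (xl, xr). */
static void
toy_block(uint32_t key, uint32_t *xl, uint32_t *xr)
{
    for (int i = 0; i < 4; i++) {
        *xl += *xr ^ key;
        *xr += (*xl << 13) | (*xl >> 19);
    }
}

/* CBC-MAC: chain every 8-byte block through the cipher and keep only
 * the final 8-byte state. Any change to any block changes the MAC. */
static void
cbc_mac(uint32_t key, const unsigned char *msg, size_t nblocks,
        uint32_t mac[2])
{
    mac[0] = mac[1] = 0;
    for (size_t i = 0; i < nblocks; i++) {
        const unsigned char *b = msg + i * 8;
        mac[0] ^= (uint32_t)b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3];
        mac[1] ^= (uint32_t)b[4] << 24 | b[5] << 16 | b[6] << 8 | b[7];
        toy_block(key, &mac[0], &mac[1]);
    }
}
```

Verifying a chunk means recomputing this value and comparing it against the MAC in the header; only someone holding the MAC key can produce a matching value.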

Wait a minute. Blowpipe is pipe-oriented and can output data without buffering the entire pipe. How can there be fixed-length messages?

The pipe datastream is broken into 64kB chunks. Each chunk is authenticated with its own MAC. Both the MAC and chunk length are written in the chunk header, and the length is authenticated by the MAC. Furthermore, just like the keystream, the MAC is continued from the previous chunk, preventing chunks from being reordered. Blowpipe can output the content of a chunk and discard it once it’s been authenticated. If any chunk fails to authenticate, it aborts.

This also leads to another useful trick: The pipe is terminated with a zero length chunk, preventing an attacker from appending to the datastream. Everything after the zero-length chunk is discarded. Since the length is authenticated by the MAC, the attacker also cannot truncate the pipe since that would require knowledge of the MAC key.

The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1 byte for the bcrypt cost. The salt is like an initialization vector (IV) that allows keys to be safely reused in different Blowpipe instances. The cost byte is the only distinguishing byte in the stream. Since even the chunk lengths are encrypted, everything else in the datastream should be indistinguishable from random data.


Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I initially wrote it for POSIX (on Linux) of course, but I took an unusual approach when it came time to port it to Windows. Normally I’d invent a generic OS interface that makes the appropriate host system calls. This time I kept the POSIX interface (read(2), write(2), open(2), etc.) and implemented the tiny subset of POSIX that I needed in terms of Win32. That implementation can be found under w32-compat/. I even dropped in a copy of my own getopt().

One really cool feature of this technique is that, on Windows, Blowpipe will still “open” /dev/urandom. It’s intercepted by my own open(2), which in response to that filename actually calls CryptAcquireContext() and pretends like it’s a file. It’s all hidden behind the file descriptor. That’s the unix way.

I’m considering giving Enchive the same treatment since it would simplify and reduce much of the interface code. In fact, this project has taught me a number of ways that Enchive could be improved. That’s the value of writing “toys” such as Blowpipe.

Gap Buffers Are Not Optimized for Multiple Cursors

Gap buffers are a common data structure for representing a text buffer in a text editor. Emacs famously uses gap buffers — long-standing proof that gap buffers are a perfectly sufficient way to represent a text buffer.

A gap buffer is really a pair of buffers where one buffer holds all of the content before the cursor (or point for Emacs), and the other buffer holds the content after the cursor. When the cursor is moved through the buffer, characters are copied from one buffer to the other. Inserts and deletes close to the gap are very efficient.

Typically it’s implemented as a single large buffer, with the pre-cursor content at the beginning, the post-cursor content at the end, and the gap spanning the middle. Here’s an illustration:

The top of the animation is the display of the text content and cursor as the user would see it. The bottom is the gap buffer state, where each character is represented as a gray block, and a literal gap for the cursor.

Ignoring for a moment more complicated concerns such as undo and Unicode, a gap buffer could be represented by something as simple as the following:

struct gapbuf {
    char *buf;
    size_t total;  /* total size of buf */
    size_t front;  /* size of content before cursor */
    size_t gap;    /* size of the gap */
};

This is close to how Emacs represents it. In the structure above, the size of the content after the cursor isn’t tracked directly, but can be computed on the fly from the other three quantities. That is to say, this data structure is normalized.

As an optimization, the cursor could be tracked separately from the gap such that non-destructive cursor movement is essentially free. The difference between cursor and gap would only need to be reconciled for a destructive change — an insert or delete.
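The two core operations, moving the gap and editing at the gap, can be sketched in C against that structure (repeated here so the fragment is self-contained; the function names are my own invention):

```c
#include <stddef.h>
#include <string.h>

struct gapbuf {
    char *buf;
    size_t total;  /* total size of buf */
    size_t front;  /* size of content before cursor */
    size_t gap;    /* size of the gap */
};

/* Move the cursor to absolute position pos: a single memmove(3)
 * shifts the intervening characters across the gap. */
static void
gapbuf_move(struct gapbuf *g, size_t pos)
{
    if (pos < g->front)
        memmove(g->buf + pos + g->gap, g->buf + pos, g->front - pos);
    else
        memmove(g->buf + g->front, g->buf + g->front + g->gap,
                pos - g->front);
    g->front = pos;
}

/* Insert at the cursor: no existing content moves at all; the new
 * character just eats into the gap. A full implementation would also
 * grow buf when the gap runs out. */
static void
gapbuf_insert(struct gapbuf *g, char c)
{
    g->buf[g->front++] = c;
    g->gap--;
}
```

Note the asymmetry: moving the gap costs one memmove(3) proportional to the distance moved, while edits at the gap cost no copying whatsoever.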

A gap buffer certainly isn’t the only way to do it. For example, the original vi used an array of lines, which sort of explains some of its quirky line-oriented idioms. The BSD clone of vi, nvi, uses an entire database to represent buffers. Vim uses a fairly complex rope-like data structure with page-oriented blocks, which may be stored out-of-order in its swap file.

Multiple cursors

Multiple cursors is a fairly recent text editor invention that has gained a lot of popularity in recent years. It seems every major editor either has the feature built in or has a readily-available extension. I myself used Magnar Sveen’s well-polished package for several years. Though obviously the concept didn’t originate in Emacs, or else it would have been called multiple points, which doesn’t roll off the tongue quite the same way.

The concept is simple: If the same operation needs to be done in many different places in a buffer, you place a cursor at each position, then drive them all in parallel using the same commands. It’s super flashy and great for impressing all your friends.

However, as a result of improving my typing skills, I’ve come to the conclusion that multiple cursors is all hat and no cattle. It doesn’t compose well with other editing commands, it doesn’t scale up to large operations, and it’s got all sorts of flaky edge cases (off-screen cursors). Nearly anything you can do with multiple cursors, you can do better with old, well-established editing paradigms.

Somewhere around 99% of my multiple cursors usage was adding a common prefix to a contiguous series of lines. As similar brute force options, Emacs already has rectangular editing, and Vim already has visual block mode.

The most sophisticated, flexible, and robust alternative is a good old macro. You can play it back anywhere it’s needed. You can zip it across a huge buffer. The only downside is that it’s less flashy and so you’ll get invited to a slightly smaller number of parties.

But if you don’t buy my arguments about multiple cursors being tasteless, there’s still a good technical argument: Gap buffers are not designed to work well in the face of multiple cursors!

For example, suppose we have a series of function calls and we’d like to add the same set of arguments to each. It’s a classic situation for a macro or for multiple cursors. Here’s the original code:

foo();
bar();
baz();

The example is tiny so that it will fit in the animations to come. Here’s the desired code:

foo(x, y);
bar(x, y);
baz(x, y);

With multiple cursors you would place a cursor inside each set of parentheses, then type x, y. Visually it looks something like this:

Text is magically inserted in parallel in multiple places at a time. However, if this is a text editor that uses a gap buffer, the situation underneath isn’t quite so magical. The entire edit doesn’t happen at once. First the x is inserted in each location, then the comma, and so on. The edits are not clustered so nicely.

From the gap buffer’s point of view, here’s what it looks like:

For every individual character insertion the buffer has to visit each cursor in turn, performing lots of copying back and forth. The more cursors there are, the worse it gets. For an edit of length n with m cursors, that’s O(n * m) calls to memmove(3). Multiple cursors scales badly.

Compare that to the old school hacker who can’t be bothered with something as tacky and modern (eww!) as multiple cursors, instead choosing to record a macro, then play it back:

The entire edit is done locally before moving on to the next location. It’s perfectly in tune with the gap buffer’s expectations, only needing O(m) calls to memmove(3). Most of the work flows neatly into the gap.

So, don’t waste your time with multiple cursors, especially if you’re using a gap buffer text editor. Instead get more comfortable with your editor’s macro feature. If your editor doesn’t have a good macro feature, get a new editor.

If you want to make your own gap buffer animations, here’s the source code. It includes a tiny gap buffer implementation:

Ten Years of Blogging

As of today, I’ve been blogging for 10 years. In this time I’ve written 302,000 words across 343 articles — a rate of one article every week and a half. These articles form a record of my professional progress, touching on both “hard” technical skills and “soft” communication skills. My older articles are a personal reminder of how far I’ve come. They are proof that I’m not as stagnant as I sometimes feel, and it helps me to sympathize with others who are currently in those earlier stages of their own career.

That index where you can find these 343 articles is sorted newest-first, because it correlates with best-first. It’s a trend I hope to continue.


Before blogging, I had a simple Penn State student page showcasing a few small — and, in retrospect, trivial — side projects (1, 2, 3), none of which went anywhere. Around the beginning of my final semester of college, I was inspired by Mark Dominus’ The Universe of Discourse and Andy Owen’s Friendly Robot Overlord (gone since 2011) to start my own blosxom blog. It would be an outlet to actively discuss my projects. Some time later GitHub was founded, and I switched to a static blog hosted by GitHub Pages, which is where it lives to this day.

It’s been more challenging to manage all this content than I ever anticipated. It’s like maintaining a large piece of software, except it’s naturally more fragile. Any time I make a non-trivial change to the CSS, I have to inspect the archives to check if I broke older articles. If I did, sometimes it’s a matter of further adjusting the CSS. Other times I’ll mass edit a couple hundred articles in order to normalize some particular aspect, such as heading consistency or image usage. (Running a macro over an Emacs Dired buffer is great for this.)

I decided in those early days to Capitalize Every Word of the Title, and I’ve stuck with this convention purely out of consistency even though it’s looked weird to me for years. I don’t want to edit the old titles, and any hard changeover date would be even weirder (in the index listing).

With more foresight and experience, I could have laid down better conventions for myself from the beginning. Besides the universal impossibility of already having experience before it’s needed, there’s also the issue that the internet’s conventions have changed, too. This blog is currently built with HTML5, but this wasn’t always the case — especially considering that the blog predates HTML5. When I switched to HTML5, I also adjusted some of my own conventions to match, since, at the time, I was still writing articles in raw HTML.

The mobile revolution has also arrived since this blog began. Today, about one third of visitors read the blog from a mobile device. I’ve also adjusted the CSS to work well on these devices. To that third of you: I hope you’re enjoying the experience!

Just in case you haven’t tried it, the blog also works really well with terminal-based browsers, such as Lynx and ELinks. Go ahead and give it a shot. The header that normally appears at the top of the page is actually at the bottom of the HTML document structure. It’s out of the way for browsers that ignore CSS.

If that’s not enough, last year I also spent effort making the printed style of my articles look nice. Take a look at the printed version of this article (i.e. print preview, print to PDF), and make sure to turn off the little headers added by the browser. A media selector provides a separate print stylesheet. Chrome / Chromium has consistently had the best results, followed by Firefox. Someday I’d like for browsers to be useful as typesetting engines for static documents — as an alternative to LaTeX and Groff — but they’ve still got a ways to go with paged media. (Trust me, I’ve tried.)

With more foresight, I could have done something better with my permanent URLs. Notice how they’re just dates and don’t include the title. URLs work better when they include human-meaningful context. Ideally I should be able to look at any one of my URLs and know what it’s about. Again, this decision goes all the way back to those early days when I first configured blosxom, not knowing any better.

URLs are forever, and I don’t want to break all my old links. Consistency is better than any sort of correction. I’m also practically limited to one article per day, though this has never been a real problem.


For me, an important motivation for writing is to say something unique about a topic. For example, I’m not interested in writing a tutorial unless either no such tutorial already exists, or there’s some vital aspect all the existing tutorials miss. Each article should add new information to the internet, either raw information or through assembling existing information in a new way or with a unique perspective.

I also almost constantly feel like I’m behind the curve, like I don’t know enough or I don’t have enough skill. As many of you know, the internet is really effective at causing these feelings. Every day lots of talented people are sharing interesting and complicated things across all sorts of topics. For topics that overlap my own desired expertise, those projects have me thinking, “Man, I have no idea how to even start doing that. There’s so much I don’t know, and so much more I need to learn.” Writing articles as I learn is a great way to keep on top of new subjects.

This is tied to another problem: I have a tendency to assume the things I’ve known for a while are common knowledge. This shows up in a few ways. First, if everyone knows what I know, plus they know a bunch of things that I don’t know, then I’ve got a lot of catching up to do. That’s another source of feeling behind the curve.

Second, when writing an article on a topic where I’ve got years of experience, I leave out way too many important details, assuming the reader already knows them. When an article I regard as valuable gets a weak response, it’s probably related to this issue.

Third, after three years of teaching, it seems like it’s becoming more difficult to put myself in the student’s shoes. I’m increasingly further away from my own days learning those early topics, and it’s harder to remember the limitations of my knowledge at that time. Having this blog really helps, and I’ve re-read some of my older articles to recall my mindset at the time.

Another way the blog helps is that it’s like having my own textbook. When teaching a topic to someone — and not necessarily a formal mentee — or even when just having a discussion, I will reference my articles when appropriate. Since they’re designed to say something unique, my article may be the only place to find certain information in a conveniently packaged form.

Finally, the last important motivating factor is that I want to spread my preferred memes. Obviously the way I do things is the Right Way, and the people who do things differently (the Wrong Way) are stupid and wrong. By writing about the sorts of topics and technologies I enjoy — C, low-level design, Emacs, Vim, programming on unix-like systems, my personal programming style — I’m encouraging others to follow my lead. Surely I’m responsible for at least one Emacs convert out there!

“Formal” professional writing

Despite having three or four novels worth of (mostly) technical writing here, my formal workplace writing leaves much to be desired. I’m no standout in this area. In the same period of time I’ve written just a handful of formal memos, each about the same length as a long blog post.

Why so few? These memos are painful to write. For my writing to be officially recognized, it must go through the formal memo process. What this means is that, compared to a blog post, these memos take at least an order of magnitude longer to write.

The process involves more people, dragged out over a long period of time. The writing is a lot less personal, which, in my opinion, makes it drier and less enjoyable. After the initial draft is complete, I have to switch to vastly inferior tools: emailing the same Microsoft Office documents back and forth between various people, without any proper source control. The official, mandated memo template was created by someone who didn’t know how to effectively operate a word processor, who had a poor sense of taste (ALL CAPS HEADINGS), and who obviously had no training in typesetting or style.

At the end of this long process, it’s filed into a system with no practical search capability, and where it will be quietly purged after five years, never to be seen again. Outside of the reviewers who were directly involved in the memo process, somewhere between zero and two people will have actually read it. Literally.

Arguably the memo might be more polished than a blog post. I’m skeptical of this, but suppose it’s true. I’d still much rather have written ten less-polished blog posts than one more-polished memo. That’s also ten shots to produce, by chance, a more valuable article than the single, precious memo.

Let’s broaden the scope to academic papers. Thanks to some great co-workers — all three of whom are smarter and handsomer than me — a year ago I finally got a published academic paper under my belt (and more to come): ROP Gadget Prevalence and Survival under Compiler-based Binary Diversification Schemes (and I said memos were dry!). A ton of work went into this paper, and it’s far more substantial than any memo or single blog post. The process was a lot more pleasant (LaTeX instead of Word), and the results are definitely much more polished than a typical blog post. It reads well and has interesting information to present.

This all sounds great until you consider the impact. According to ACM’s statistics, the paper has been accessed 130 times as of this writing. (Yes, providing an unofficial link to the paper like I just did above doesn’t help, but I ran out of those crappy “free” links. Sue me.) Sure, the PDF might have been passed around in untrackable ways, but I also bet a lot of those accesses were just downloads that were never read. So let’s split the difference and estimate around 130 people read it.

What kind of impact does a blog post have? Talking about these numbers feels a bit taboo, like discussing salaries, but it’s important for the point I’m about to make.

Note that all but the last have been published for less time than our paper. The average time on these pages is between 5 and 6 minutes, so these are actual readers, not just visitors who take one glance and leave. Thanks to the information age, a technical article on an established blog can reach an audience 100 times larger than a journal for a fraction of the effort and cost. There are other benefits, too:

  1. I get immediate feedback in the form of comments and email (open peer review).

  2. The content is available for free (open access). It’s trivial to link and share blog articles.

  3. Even more, this entire blog is in the public domain. If you don’t believe me, check out the public domain dedication in the footer of this page. It’s been there for years, and you can verify that yourself. Every single change to this blog in the past 6 years has been publicly documented (transparency).

  4. When I write about a topic, I make it a goal to provide the code and data to try it for yourself (open data and materials). This code and data is also either in the public domain, or as close to it as possible.

  5. Link aggregators and social media are great at finding the best stuff and making it more visible (censorship resistance). When I have a big hit, it’s often Reddit or Hacker News driving people to my article. Sometimes it’s other blogs.

In 2017, a blog is by far the most effective way to publish and share written information, and it can have more scientific quality than a journal. More people should be publishing through blog articles rather than through traditional forms of publication, which are far less effective.

Since this has proven to be such an effective medium, I’m going to make a promise right here, right now. And as I explained above with transparency, there are no take-backs on this one. If you’re reading this, it’s on the public record forever. I promise to deliver another 10 years of free, interesting, useful content. This stuff is going to be even better, too. On this blog’s 20th anniversary, September 1st, 2027, we’ll see how I did.

Vim vs. Emacs: The Working Directory

Vim and Emacs have different internal models for the current working directory, and these models influence the overall workflow for each editor. They decide how files are opened, how shell commands are executed, and how the build system is operated. These effects even reach outside the editor to influence the overall structure of the project being edited.

In the traditional unix model, which was eventually adopted everywhere else, each process has a particular working directory tracked by the operating system. When a process makes a request to the operating system using a relative path — a path that doesn’t begin with a slash — the operating system uses the process’ working directory to convert the path into an absolute path. When a process forks, its child starts in the same directory. A process can change its working directory at any time using chdir(2), though most programs never need to do it. The most obvious way this system call is exposed to regular users is through the shell’s built-in cd command.

Vim’s spiritual heritage is obviously rooted in vi, one of the classic unix text editors, and the most elaborate text editor standardized by POSIX. Like vi, Vim closely follows the unix model for working directories. At any given time Vim has exactly one working directory. Shell commands that are run within Vim will start in Vim’s working directory. Like a shell, the cd ex command changes and queries Vim’s working directory.

Emacs eschews this model; instead, each buffer has its own working directory, tracked in the buffer-local variable default-directory. Emacs internally simulates working directories for its buffers like an operating system, resolving relative paths itself, giving credence to the idea that Emacs is an operating system (“lacking only a decent editor”). Perhaps this model comes from ye olde lisp machines?

In contrast, Emacs’ M-x cd command manipulates the local variable and has no effect on the Emacs process’ working directory. In fact, Emacs completely hides its operating system working directory from Emacs Lisp. This can cause some trouble if that hidden working directory happens to be sitting on a filesystem you’d like to unmount.

Vim can be configured to simulate Emacs’ model with its autochdir option. When set, Vim will literally chdir(2) each time the user changes buffers, switches windows, etc. To the user, this feels just like Emacs’ model, but this is just a convenience, and the core working directory model is still the same.

Single instance editors

For most of my Emacs career, I’ve stuck to running a single, long-lived Emacs instance no matter how many different tasks I’m touching simultaneously. I start the Emacs daemon shortly after logging in, and it continues running until I log out — typically only when the machine is shut down. It’s common to have multiple Emacs windows (frames) for different tasks, but they’re all bound to the same daemon process.

While with care it’s possible to have a complex, rich Emacs configuration that doesn’t significantly impact Emacs’ startup time, the general consensus is that Emacs is slow to start. But since it has a really solid daemon, this doesn’t matter: hardcore Emacs users only ever start Emacs occasionally. The rest of the time they’re launching emacsclient and connecting to the daemon. Outside of system administration, it’s the most natural way to use Emacs.

The case isn’t so clear for Vim. Vim is so fast that many users fire it up on demand and exit when they’ve finished the immediate task. At the other end of the spectrum, others advocate using a single instance of Vim like running a single Emacs daemon. In my initial dive into Vim, I tried the single-instance, Emacs way of doing things. I set autochdir out of necessity and pretended each buffer had its own working directory.

At least for me, this isn’t the right way to use Vim, and it all comes down to working directories. I want Vim to be anchored at the project root with one Vim instance per project. Everything is smoother when it happens in the context of the project’s root directory, from opening files, to running shell commands (ctags in particular), to invoking the build system. With autochdir, these actions are difficult to do correctly, particularly the last two.

Invoking the build

I suspect the Emacs’ model of per-buffer working directories has, in a Sapir-Whorf sort of way, been responsible for leading developers towards poorly-designed, recursive Makefiles. Without a global concept of working directory, it’s inconvenient to invoke the build system (M-x compile) in some particular grandparent directory that is the root of the project. If each directory has its own Makefile, it usually makes sense to invoke make in the same directory as the file being edited.

Over the years I’ve been reinventing the same solution to this problem, and it wasn’t until I spent time with Vim and its alternate working directory model that I truly understood the problem. Emacs itself has long had a solution lurking deep in its bowels, unseen by daylight: dominating files. The function I’m talking about is locate-dominating-file:

(locate-dominating-file FILE NAME)

Look up the directory hierarchy from FILE for a directory containing NAME. Stop at the first parent directory containing a file NAME, and return the directory. Return nil if not found. Instead of a string, NAME can also be a predicate taking one argument (a directory) and returning a non-nil value if that directory is the one for which we’re looking.

The trouble with invoking the build system at the project root is that Emacs doesn’t really have a concept of a project root. It doesn’t know where it is or how to find it. The vi model inherited by Vim is to leave the working directory at the project root. While Vim can simulate Emacs’ working directory model, Emacs cannot (currently) simulate Vim’s model.

Instead, by identifying a file name unique to the project’s root (i.e. a “dominating” file) such as Makefile or build.xml, locate-dominating-file can discover the project root. All that’s left is wrapping M-x compile so that default-directory is temporarily adjusted to the project’s root.

That looks very roughly like this (and needs more work):

(defun my-compile ()
  (interactive)
  ;; Fall back to the current directory if no Makefile dominates it.
  (let ((default-directory (or (locate-dominating-file "." "Makefile")
                               default-directory)))
    (compile "make")))

It’s a pattern I’ve used again and again and again, working against the same old friction. By running one Vim instance per project at the project’s root, I get the correct behavior for free.

A Tutorial on Portable Makefiles

In my first decade writing Makefiles, I developed the bad habit of liberally using GNU Make’s extensions. I didn’t know the line between GNU Make and the portable features guaranteed by POSIX. Usually it didn’t matter much, but it would become an annoyance when building on non-Linux systems, such as on the various BSDs. I’d have to specifically install GNU Make, then remember to invoke it (i.e. as gmake) instead of the system’s make.

I’ve since become familiar and comfortable with make’s official specification, and I’ve spent the last year writing strictly portable Makefiles. Not only are my builds now portable across all unix-like systems, my Makefiles are cleaner and more robust. Many of the common make extensions — conditionals in particular — lead to fragile, complicated Makefiles and are best avoided anyway. It’s important to be able to trust your build system to do its job correctly.

This tutorial should be suitable for make beginners who have never written their own Makefiles before, as well as experienced developers who want to learn how to write portable Makefiles. Regardless, in order to understand the examples you must be familiar with the usual steps for building programs on the command line (compiler, linker, object files, etc.). I’m not going to suggest any fancy tricks nor provide any sort of standard starting template. Makefiles should be dead simple when the project is small, and grow in a predictable, clean fashion alongside the project.

I’m not going to cover every feature. You’ll need to read the specification for yourself to learn it all. This tutorial will go over the important features as well as the common conventions. It’s important to follow established conventions so that people using your Makefiles will know what to expect and how to accomplish the basic tasks.

If you’re running Debian, or a Debian derivative such as Ubuntu, the bmake and freebsd-buildutils packages will provide the bmake and fmake programs respectively. These alternative make implementations are very useful for testing your Makefiles’ portability, should you accidentally make use of a GNU Make feature. It’s not perfect since each implements some of the same extensions as GNU Make, but it will catch some common mistakes.

What’s in a Makefile?

I am free, no matter what rules surround me. If I find them tolerable, I tolerate them; if I find them too obnoxious, I break them. I am free because I know that I alone am morally responsible for everything I do. ―Robert A. Heinlein

At make’s core are one or more dependency trees, constructed from rules. Each vertex in the tree is called a target. The final products of the build (executable, document, etc.) are the tree roots. A Makefile specifies the dependency trees and supplies the shell commands to produce a target from its prerequisites.

In the example program used throughout this tutorial, an executable called game is linked from three object files, each compiled from a corresponding “.c” source file. The “.c” files are written by hand, not generated by commands, so they have no prerequisites. The syntax for specifying one or more edges in this dependency tree is simple:

target [target...]: [prerequisite...]

While technically multiple targets can be specified in a single rule, this is unusual. Typically each target is specified in its own rule. To specify the tree for the example program:

game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

The order of these rules doesn’t matter. The entire Makefile is parsed before any actions are taken, so the tree’s vertices and edges can be specified in any order. There’s one exception: the first non-special target in a Makefile is the default target. This target is selected implicitly when make is invoked without choosing a target. It should be something sensible, so that a user can blindly run make and get a useful result.
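The default-target rule is easy to check with a throwaway Makefile (the file and target names here are hypothetical, not part of the game example):

```shell
# "foo" is the first target, so it's the default when no target is named.
printf 'foo:\n\t@echo building foo\nbar:\n\t@echo building bar\n' > demo.mk
make -f demo.mk        # prints: building foo
make -f demo.mk bar    # prints: building bar
```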

A target can be specified more than once. Any new prerequisites are appended to the previously-given prerequisites. For example, this Makefile is identical to the previous, though it’s typically not written this way:

game: graphics.o
game: physics.o
game: input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

There are a handful of special targets that are used to change the behavior of make itself. All have uppercase names and start with a period. Names fitting this pattern are reserved for use by make. According to the standard, in order to get reliable POSIX behavior, the first non-comment line of the Makefile must be .POSIX. Since this is a special target, it’s not a candidate for the default target, so game will remain the default target:

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

In practice, even a simple program will have header files, and sources that include a header file should also have an edge on the dependency tree for it. If the header file changes, targets that include it should also be rebuilt.

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h

Adding commands to rules

We’ve constructed a dependency tree, but we still haven’t told make how to actually build each target from its prerequisites. The rules also need to specify the shell commands that produce a target from its prerequisites.

If you were to create the source files in the example and invoke make, you’d find that it actually does know how to build the object files. This is because make is initially configured with certain inference rules, a topic which will be covered later. For now, we’ll add an empty .SUFFIXES special target to the top, erasing all the built-in inference rules.

Commands immediately follow the target/prerequisite line in a rule. Each command line must start with a tab character. This can be awkward if your text editor isn’t configured for it, and it will be awkward if you try to copy the examples from this page.

Each line is run in its own shell, so be mindful of using commands like cd, which won’t affect later lines.
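A quick sketch makes this visible (demo.mk is a hypothetical throwaway file): the cd on the first command line runs in its own shell, so the pwd on the next line still prints the original working directory, not /tmp:

```shell
# Each command line gets its own shell, so the cd is forgotten
# by the time the next line runs.
printf 'where:\n\t@cd /tmp\n\t@pwd\n' > demo.mk
make -f demo.mk
```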

The simplest thing to do is literally specify the same commands you’d type at the shell:

.POSIX:
.SUFFIXES:
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c

Invoking make and choosing targets

I tried to walk into Target, but I missed. ―Mitch Hedberg

When invoking make, it accepts zero or more targets from the dependency tree, and it will build these targets — i.e. run the commands in the target’s rule — if the target is out-of-date. A target is out-of-date if it doesn’t exist or is older than any of its prerequisites.
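Here’s a sketch of that timestamp rule in action, using hypothetical file names:

```shell
# "out" is rebuilt only when it's missing or older than its prerequisite "in".
printf 'out: in\n\t@echo rebuilding\n\t@touch out\n' > demo.mk
touch in
make -f demo.mk   # out doesn't exist yet: prints "rebuilding"
make -f demo.mk   # out is now newer than in: nothing to do
sleep 1           # ensure a newer timestamp on filesystems with 1s resolution
touch in
make -f demo.mk   # in is newer again: prints "rebuilding"
```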

# build the "game" binary (default target)
$ make

# build just the object files
$ make graphics.o physics.o input.o

This effect cascades up the dependency tree and causes further targets to be rebuilt until all of the requested targets are up-to-date. There’s a lot of room for parallelism since different branches of the tree can be updated independently. It’s common for make implementations to support parallel builds with the -j option. This is non-standard, but it’s a fantastic feature that doesn’t require anything special in the Makefile to work correctly.

Similar to parallel builds is make’s -k (“keep going”) option, which is standard. This tells make not to stop on the first error, and to continue updating targets that are unaffected by the error. This is nice for fully populating Vim’s quickfix list or Emacs’ compilation buffer.

It’s common to have multiple targets that should be built by default. If the first rule selects the default target, how do we get multiple default targets? The solution is phony targets. These are called “phony” because there is no corresponding file, and so a phony target is never up-to-date. By convention, a phony “all” target serves as the default target.

I’ll make game a prerequisite of a new “all” target. More real targets could be added as necessary to turn them into defaults. Users of this Makefile will also expect make all to build the entire project.

Another common phony target is “clean” which removes all of the built files. Users will expect make clean to delete all generated files.

.POSIX:
.SUFFIXES:
all: game
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c
clean:
    rm -f game graphics.o physics.o input.o

Customize the build with macros

So far the Makefile hardcodes cc as the compiler, and doesn’t use any compiler flags (warnings, optimization, hardening, etc.). The user should be able to easily control all these things, but right now they’d have to edit the entire Makefile to do so. Perhaps the user has both gcc and clang installed, and wants to choose one or the other without changing which is installed as cc.

To solve this, make has macros that expand into strings when referenced. The convention is to use the macro named CC when talking about the C compiler, CFLAGS when talking about flags passed to the C compiler, LDFLAGS for flags passed to the C compiler when linking, and LDLIBS for flags about libraries when linking. The Makefile should supply defaults as needed.

A macro is expanded with $(...). It’s valid (and normal) to reference a macro that hasn’t been defined, which will be an empty string. This will be the case with LDFLAGS below.

Macro values can contain other macros, which will be expanded recursively each time the macro is expanded. Some make implementations allow the name of the macro being expanded to itself be a macro, which is Turing complete, but this behavior is non-standard.
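A quick sketch of both behaviors — a macro referencing another macro, and an undefined macro expanding to an empty string (demo.mk is a hypothetical throwaway file):

```shell
# GREETING references NAME, which is expanded when GREETING is expanded;
# UNDEFINED was never assigned, so it expands to nothing.
printf 'NAME = world\nGREETING = hello $(NAME)\nall:\n\t@echo "$(GREETING) [$(UNDEFINED)]"\n' > demo.mk
make -f demo.mk   # prints: hello world []
```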

.POSIX:
.SUFFIXES:
CC     = cc
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
    $(CC) -c $(CFLAGS) graphics.c
physics.o: physics.c physics.h
    $(CC) -c $(CFLAGS) physics.c
input.o: input.c input.h graphics.h physics.h
    $(CC) -c $(CFLAGS) input.c
clean:
    rm -f game graphics.o physics.o input.o

Macros are overridden by macro definitions given as command line arguments in the form name=value. This allows the user to select their own build configuration. This is one of make’s most powerful and under-appreciated features.

$ make CC=clang CFLAGS='-O3 -march=native'

If the user doesn’t want to specify these macros on every invocation, they can (cautiously) use make’s -e flag to set overriding macro definitions from the environment.

$ export CC=clang
$ export CFLAGS=-O3
$ make -e all
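Here’s a sketch of the three ways a macro can get its value (demo.mk is hypothetical): the Makefile default, a command-line definition, and — with -e — the environment:

```shell
# Without -e the Makefile's own assignment wins over the environment;
# a command-line definition always wins.
printf 'CC = cc\nshow:\n\t@echo CC is $(CC)\n' > demo.mk
make -f demo.mk              # prints: CC is cc
make -f demo.mk CC=clang     # prints: CC is clang
CC=tcc make -f demo.mk       # prints: CC is cc (environment ignored)
CC=tcc make -e -f demo.mk    # prints: CC is tcc
```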

Some make implementations have other special kinds of macro assignment operators beyond simple assignment (=). These are unnecessary, so don’t worry about them.

Inference rules so that you can stop repeating yourself

The road itself tells us far more than signs do. ―Tom Vanderbilt, Traffic: Why We Drive the Way We Do

There’s repetition across the three object file rules. Wouldn’t it be nice if there were a way to communicate this pattern? Fortunately there is, in the form of inference rules. An inference rule says that a target with a certain extension, built from a prerequisite with another certain extension, is produced a certain way. This will make more sense with an example.

In an inference rule, the target is the two extensions joined together — for example, .c.o for building a .o file from a .c file. The $< macro expands to the prerequisite, which is essential to making inference rules work generically. Unfortunately this macro is not available in target rules, as much as that would be useful.

For example, here’s the inference rule that teaches make how to build an object file from a C source file. This particular rule is pre-defined by make, so you’ll never need to write it yourself, but I’ll include it for completeness:

.c.o:
    $(CC) $(CFLAGS) -c $<

These extensions must be added to .SUFFIXES before they will work. With that, the commands for the rules about object files can be omitted.

.POSIX:
.SUFFIXES:
CC     = cc
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

.SUFFIXES: .c .o
.c.o:
    $(CC) $(CFLAGS) -c $<

The first empty .SUFFIXES clears the suffix list. The second one adds .c and .o to the now-empty suffix list.

Other target conventions

Conventions are, indeed, all that shield us from the shivering void, though often they do so but poorly and desperately. ―Robert Aickman

Users usually expect an “install” target that installs the built program, libraries, man pages, etc. By convention this target should use the PREFIX and DESTDIR macros.

The PREFIX macro should default to /usr/local, and since it’s a macro the user can override it to install elsewhere, such as in their home directory. The user should override it for both building and installing, since the prefix may need to be built into the binary (e.g. -DPREFIX=$(PREFIX)).

The DESTDIR macro is used for staged builds: the project is installed under a fake root directory for the sake of packaging. Unlike PREFIX, it should not be baked into the binary, since the program will never actually run from the staging directory.
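A minimal sketch of how the two macros combine during a staged install (demo.mk and the stage path are hypothetical):

```shell
# DESTDIR prefixes the entire install path, while PREFIX is the
# path the program expects at run time.
printf 'PREFIX = /usr/local\ninstall:\n\t@mkdir -p "$(DESTDIR)$(PREFIX)/bin"\n\t@echo installed to $(DESTDIR)$(PREFIX)/bin\n' > demo.mk
make -f demo.mk DESTDIR=/tmp/stage PREFIX=/usr
# prints: installed to /tmp/stage/usr/bin
```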

.POSIX:
.SUFFIXES:
CC     = cc
LDLIBS = -lm
PREFIX = /usr/local

all: game
install: game
    mkdir -p $(DESTDIR)$(PREFIX)/bin
    mkdir -p $(DESTDIR)$(PREFIX)/share/man/man1
    cp -f game $(DESTDIR)$(PREFIX)/bin
    gzip < game.1 > $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

.SUFFIXES: .c .o
.c.o:
    $(CC) $(CFLAGS) -c $<

For example, a user can install the game under their home directory:

$ make PREFIX=$HOME/.local install

You may also want to provide an “uninstall” phony target that does the opposite.

Other common targets are “mostlyclean” (like “clean” but don’t delete some slow-to-build targets), “distclean” (delete even more than “clean”), “test” or “check” (run the test suite), and “dist” (create a package).

Complexity and growing pains

One of make’s big weak points is scaling up as a project grows in size.

Recursive Makefiles

As your growing project is broken into subdirectories, you may be tempted to put a Makefile in each subdirectory and invoke them recursively.

Don’t use recursive Makefiles. It breaks the dependency tree across separate instances of make and typically results in a fragile build. There’s nothing good about it. Have one Makefile at the root of your project and invoke make there. You may have to teach your text editor how to do this.

When talking about files in subdirectories, just include the subdirectory in the name. Everything will work the same as far as make is concerned, including inference rules.

src/graphics.o: src/graphics.c
src/physics.o: src/physics.c
src/input.o: src/input.c
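To convince yourself that inference rules still apply to targets in subdirectories, here’s a sketch with a do-nothing .c.o rule that just reports what it would build (all file names hypothetical):

```shell
# The .c.o inference rule fires even though the target lives in src/.
mkdir -p src
: > src/input.c
printf '.POSIX:\n.SUFFIXES:\n.SUFFIXES: .c .o\n.c.o:\n\t@echo building $@ from $<\nsrc/input.o: src/input.c\n' > demo.mk
make -f demo.mk src/input.o
# prints: building src/input.o from src/input.c
```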

Out-of-source builds

Keeping your object files separate from your source files is a nice idea. When it comes to make, there’s good news and bad news.

The good news is that make can do this. You can pick whatever file names you like for targets and prerequisites.

obj/input.o: src/input.c

The bad news is that inference rules are not compatible with out-of-source builds. You’ll need to repeat the same commands for each rule as if inference rules didn’t exist. This is tedious for large projects, so you may want to have some sort of “configure” script, even if hand-written, to generate all this for you. This is essentially what CMake is all about. That, plus dependency management.

Dependency management

Another problem with scaling up is tracking the project’s ever-changing dependencies across all the source files. Missing a dependency means the build may not be correct unless you make clean first.

If you go the route of using a script to generate the tedious parts of the Makefile, both GCC and Clang have a nice feature for generating all the Makefile dependencies for you (-MM, -MT), at least for C and C++. There are lots of tutorials for doing this dependency generation on the fly as part of the build, but it’s fragile and slow. Much better to do it all up front and “bake” the dependencies into the Makefile so that make can do its job properly. If the dependencies change, rebuild your Makefile.

For example, here’s what it looks like invoking gcc’s dependency generator against the imaginary input.c for an out-of-source build:

$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h

Notice that the output is in Makefile rule format.

Unfortunately this feature strips the leading paths from the target, so, in practice, using it is always more complicated than it should be (e.g. it requires the use of -MT).

Microsoft’s Nmake

Microsoft has an implementation of make called Nmake, which comes with Visual Studio. It’s nearly a POSIX-compatible make, but necessarily breaks from the standard in some places. Their cl.exe compiler uses .obj as the object file extension and .exe for binaries, both of which differ from the unix world, so it has different built-in inference rules. Windows also lacks a Bourne shell and the standard unix tools, so all of the commands will necessarily be different.

There’s no equivalent of rm -f on Windows, so good luck writing a proper “clean” target. No, del /f isn’t the same.

So while it’s close to POSIX make, it’s not practical to write a Makefile that will simultaneously work properly with both POSIX make and Nmake. These need to be separate Makefiles.

May your Makefiles be portable

It’s nice to have reliable, portable Makefiles that just work anywhere. Code to the standards and you don’t need feature tests or other sorts of special treatment.

Introducing the Pokerware Secure Passphrase Generator

I recently developed Pokerware, an offline passphrase generator that operates in the same spirit as Diceware. The primary difference is that it uses a shuffled deck of playing cards as its entropy source rather than dice. Draw some cards and use them to select a uniformly random word from a list. Unless you’re some sort of tabletop gaming nerd, a deck of cards is more readily available than five 6-sided dice, which would typically need to be borrowed from the Monopoly board collecting dust on the shelf, then rolled two at a time.

There are various flavors of two different word lists here:

Hardware random number generators are difficult to verify and may not actually be as random as they promise, either intentionally or unintentionally. For the particularly paranoid, Diceware and Pokerware are an easily verifiable alternative for generating secure passphrases for cryptographic purposes. At any time, a deck of 52 playing cards is in one of 52! possible arrangements. That’s more than 225 bits of entropy. If you give your deck a thorough shuffle, it will be in an arrangement that has never been seen before and will never be seen again. Pokerware draws on some of these bits to generate passphrases.
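The 225-bit figure is just the base-2 logarithm of the number of deck orderings:

```latex
52! \approx 8.07 \times 10^{67},
\qquad
\log_2 52! \;=\; \sum_{k=2}^{52} \log_2 k \;\approx\; 225.58 \text{ bits}
```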

The Pokerware list has 5,304 words (12.4 bits per word), compared to Diceware’s 7,776 words (12.9 bits per word). My goal was to invent a card-drawing scheme that would uniformly select from a list in the same ballpark as Diceware’s. Much smaller and you’d have to memorize more words for the same passphrase strength. Much larger and the words on the list would be more difficult to memorize, since the list would contain longer and less frequently used words. Diceware strikes a nice balance at five dice.

One important difference for me is that I like my Pokerware word lists a lot more than the two official Diceware lists. My lists only have simple, easy-to-remember words (for American English speakers, at least), without any numbers or other short non-words. Pokerware has two official lists, “formal” and “slang,” since my early testers couldn’t agree on which was better. Rather than make a difficult decision, I took the usual route of making no decision at all.

The “formal” list is derived in part from Google’s Ngram Viewer, with my own additional filters and tweaking. It’s called “formal” because the ngrams come from formal publications and represent more formal kinds of speech.

The “slang” list is derived from every reddit comment between December 2005 and May 2017, tamed by the same additional filters. I have this data on hand, so I may as well put it to use. I figured more casually-used words would be easier to remember. Due to my extra filtering, there’s actually a lot of overlap between these lists, so the differences aren’t too significant.

If you have your own word list, perhaps in a different language, you can use the Makefile in the repository to build your own Pokerware lookup table, both plain text and PDF. The PDF is generated using Groff macros.

Passphrase generation instructions

  1. Thoroughly shuffle the deck.

  2. Draw two cards. Sort them by value, then suit. Suits are in alphabetical order: Clubs, Diamonds, Hearts, Spades.

  3. Draw additional cards until you get a card that doesn’t match the face value of either of your initial two cards. Observe its suit.

  4. Using your two cards and observed suit, look up a word in the table.

  5. Place all cards back in the deck, shuffle, and repeat from step 2 until you have the desired number of words. Each word is worth 12.4 bits of entropy.
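The steps above also account for the exact list size: step 2 selects one of the C(52, 2) sorted two-card combinations, and step 3 contributes one of 4 suits — each suit stays equally likely because the two excluded face values remove exactly two cards from every suit:

```latex
\binom{52}{2} \times 4 \;=\; 1326 \times 4 \;=\; 5304,
\qquad
\log_2 5304 \approx 12.37 \text{ bits per word}
```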

A word of warning about step 4: If you use software to do the word list lookup, beware that it might save your search/command history — and therefore your passphrase — to a file. For example, the less pager will store search history in ~/.lesshst. It’s easy to prevent that one:

$ LESSHISTFILE=- less pokerware-slang.txt

Example word generation

Suppose in step 2 you draw King of Hearts (KH/K♥) and Queen of Clubs (QC/Q♣).

In step 3 you first draw King of Diamonds (KD/K♦), discarding it because it matches the face value of one of your cards from step 2.

Next you draw Four of Spades (4S/4♠), taking spades as your extra suit.

In order, this gives you Queen of Clubs, King of Hearts, and Spades: QCKHS or Q♣K♥♠. This corresponds to “wizard” in the formal word list and would be the first word in your passphrase.

A deck of cards as an office tool

I now have an excuse to keep a deck of cards out on my desk at work. I’ve been using Diceware — or something approximating it, since I’m not so paranoid about hardware RNGs — for passwords for over 8 years now. From now on I’ll deal new passwords from a deck of cards kept within reach. Though typically I need to tweak the results to meet outdated character-composition requirements.

null program

Chris Wellons