nullprogram.com/blog/2018/07/31/
I recently got an itch to design my own non-cryptographic integer hash
function. Firstly, I wanted to better understand how hash
functions work, and the best way to learn is to do. For years I’d been
treating them like magic, shoving input into it and seeing
random-looking, but deterministic, output come out the other
end. Just how is the avalanche effect achieved?
Secondly, could I apply my own particular strengths to craft a hash
function better than the handful of functions I could find online?
Especially the classic ones from Thomas Wang and Bob
Jenkins. Instead of struggling with the mathematics, maybe I
could software engineer my way to victory, working from the advantage
of access to the excessive computational power of today.
Suppose, for example, I wrote tool to generate a random hash
function definition, then JIT compile it to a native function in
memory, then execute that function across various inputs to evaluate
its properties. My tool could rapidly repeat this process in a loop
until it stumbled upon an incredible hash function the world had never
seen. That’s what I actually did. I call it the Hash Prospector:
https://github.com/skeeto/hash-prospector
It only works on x86-64 because it uses the same JIT compiling
technique I’ve discussed before: allocate a page of memory, write
some machine instructions into it, set the page to executable, cast the
page pointer to a function pointer, then call the generated code through
the function pointer.
Generating a hash function
My focus is on integer hash functions: a function that accepts an
n-bit integer and returns an n-bit integer. One of the important
properties of an integer hash function is that it maps its inputs to
outputs 1:1. In other words, there are no collisions. If there’s a
collision, then some outputs aren’t possible, and the function isn’t
making efficient use of its entropy.
This is actually a lot easier than it sounds. As long as every n-bit
integer operation used in the hash function is reversable, then the
hash function has this property. An operation is reversible if, given
its output, you can unambiguously compute its input.
For example, XOR with a constant is trivially reversible: XOR the
output with the same constant to reverse it. Addition with a constant
is reversed by subtraction with the same constant. Since the integer
operations are modular arithmetic, modulo 2^n for n-bit integers,
multiplication by an odd number is reversible. Odd numbers are
coprime with the power-of-two modulus, so there is some modular
multiplicative inverse that reverses the operation.
Bret Mulvey’s hash function article provides a convenient list
of some reversible operations available for constructing integer hash
functions. This list was the catalyst for my little project. Here are
the ones used by the hash prospector:
x = ~x;
x ^= constant;
x *= constant | 1; // e.g. only odd constants
x += constant;
x ^= x >> constant;
x ^= x << constant;
x += x << constant;
x -= x << constant;
x <<<= constant; // left rotation
I’ve come across a couple more useful operations while studying existing
integer hash functions, but I didn’t put these in the prospector.
hash += ~(hash << constant);
hash -= ~(hash << constant);
The prospector picks some operations at random and fills in their
constants randomly within their proper constraints. For example,
here’s an awful hash function I made it generate as an example:
// do NOT use this!
uint32_t
badhash32(uint32_t x)
{
x *= UINT32_C(0x1eca7d79);
x ^= x >> 20;
x = (x << 8) | (x >> 24);
x = ~x;
x ^= x << 5;
x += UINT32_C(0x10afe4e7);
return x;
}
That function is reversible, and it would be relatively
straightforward to define its inverse. However, it has
awful biases and poor avalanche. How do I know this?
The measure of a hash function
There are two key properties I’m looking for in randomly generated hash
functions.
-
High avalanche effect. When I flip one input bit, the output bits
should each flip with a 50% chance.
-
Low bias. Ideally there is no correlation between which output bits
flip for a particular flipped input bit.
Initially I screwed up and only measured the first property. This lead
to some hash functions that seemed to be amazing before close
inspection, since, for a 32-bit hash function, it was flipping over 15
output bits on average. However, the particular bits being flipped
were heavily biased, resulting in obvious patterns in the output.
For example, when hashing a counter starting from zero, the high bits
would follow a regular pattern. 15 to 16 bits were being flipped each
time, but it was always the same bits.
Conveniently it’s easy to measure both properties at the same time. For
an n-bit integer hash function, create an n by n table initialized
to zero. The rows are input bits and the columns are output bits. The
ith row and jth column track the correlation between the ith input
bit and jth output bit.
Then exhaustively iterate over all 2^n inputs, and flip each bit one at
a time. Increment the appropriate element in the table if the output bit
flips.
When you’re done, ideally each element in the table is exactly 2^(n-1).
That is, each output bit was flipped exactly half the time by each input
bit. Therefore the bias of the hash function is the distance (the
error) of the computed table from the ideal table.
For example, the ideal bias table for an 8-bit hash function would be:
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
The hash prospector computes the standard deviation in order to turn
this into a single, normalized measurement. Lower scores are better.
However, there’s still one problem: the input space for a 32-bit hash
function is over 4 billion values. The full test takes my computer about
an hour and a half. Evaluating a 64-bit hash function is right out.
Again, Monte Carlo to the rescue! Rather than sample the entire
space, just sample a random subset. This provides a good estimate in
less than a second, allowing lots of terrible hash functions to be
discarded early. The full test can be saved only for the known good
32-bit candidates. 64-bit functions will only ever receive the estimate.
What did I find?
Once I got the bias issue sorted out, and after hours and hours of
running, followed up with some manual tweaking on my part, the
prospector stumbled across this little gem:
// DO use this one!
uint32_t
prospector32(uint32_t x)
{
x ^= x >> 15;
x *= UINT32_C(0x2c1b3c6d);
x ^= x >> 12;
x *= UINT32_C(0x297a2d39);
x ^= x >> 15;
return x;
}
According to a full (e.g. not estimated) bias evaluation, this function
beats the snot out of most of 32-bit hash functions I could find. It
even comes out ahead of this well known hash function that I believe
originates from the H2 SQL Database. (Update: Thomas Mueller has
confirmed that, indeed, this is his hash function.)
uint32_t
hash32(uint32_t x)
{
x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
x = (x >> 16) ^ x;
return x;
}
It’s still an excellent hash function, just slightly more biased than
mine.
Very briefly, prospector32() was the best 32-bit hash function I could
find, and I thought I had a major breakthrough. Then I noticed the
finalizer function for the 32-bit variant of MurmurHash3. It’s
also a 32-bit hash function:
uint32_t
murmurhash32_mix32(uint32_t x)
{
x ^= x >> 16;
x *= UINT32_C(0x85ebca6b);
x ^= x >> 13;
x *= UINT32_C(0xc2b2ae35);
x ^= x >> 16;
return x;
}
This one is just barely less biased than mine. So I still haven’t
discovered the best 32-bit hash function, only the second best one.
:-)
A pattern emerges
If you’re paying close enough attention, you may have noticed that all
three functions above have the same structure. The prospector had
stumbled upon it all on its own without knowledge of the existing
functions. It may not be so obvious for the second function, but here it
is refactored:
uint32_t
hash32(uint32_t x)
{
x ^= x >> 16;
x *= UINT32_C(0x45d9f3b);
x ^= x >> 16;
x *= UINT32_C(0x45d9f3b);
x ^= x >> 16;
return x;
}
I hadn’t noticed this until after the prospector had come across it on
its own. The pattern for all three is XOR-right-shift, multiply,
XOR-right-shift, multiply, XOR-right-shift. There’s something
particularly useful about this multiply-xorshift construction. The
XOR-right-shift diffuses bits rightward and the multiply diffuses bits
leftward. I like to think it’s “sloshing” the bits right, left, right,
left.
It seems that multiplication is particularly good at diffusion, so it
makes perfect sense to exploit it in non-cryptographic hash functions,
especially since modern CPUs are so fast at it. Despite this, it’s not
used much in cryptography due to issues with completing it in constant
time.
I like to think of this construction in terms of a five-tuple. For the
three functions it’s the following:
(15, 0x2c1b3c6d, 12, 0x297a2d39, 15) // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16) // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16) // murmurhash32_mix32()
The prospector actually found lots of decent functions following this
pattern, especially where the middle shift is smaller than the outer
shift. Thinking of it in terms of this tuple, I specifically directed
it to try different tuple constants. That’s what I meant by
“tweaking.” Eventually my new function popped out with its really low
bias.
The prospector has a template option (-p) if you want to try it
yourself:
$ ./prospector -p xorr,mul,xorr,mul,xorr
If you really have your heart set on certain constants, such as my
specific selection of shifts, you can lock those in while randomizing
the other constants:
$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15
Or the other way around:
$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr
My function seems a little strange using shifts of 15 bits rather than
a nice, round 16 bits. However, changing those constants to 16
increases the bias. Similarly, neither of the two 32-bit constants is
a prime number, but nudging those constants to the nearest prime
increases the bias. These parameters really do seem to be a local
minima in the bias, and using prime numbers isn’t important.
What about 64-bit integer hash functions?
So far I haven’t been able to improve on 64-bit hash functions. The main
function to beat is SplittableRandom / SplitMix64:
uint64_t
splittable64(uint64_t x)
{
x ^= x >> 30;
x *= UINT64_C(0xbf58476d1ce4e5b9);
x ^= x >> 27;
x *= UINT64_C(0x94d049bb133111eb);
x ^= x >> 31;
return x;
}
I also came across this one:
uint64_t
hash64(uint64_t x)
{
x ^= x >> 32;
x *= UINT64_C(0xd6e8feb86659fd93);
x ^= x >> 32;
x *= UINT64_C(0xd6e8feb86659fd93);
x ^= x >> 32;
return x;
}
Again, these follow the same construction as before. There really is
something special about it, and many other people have noticed, too.
Both functions have about the same bias. (Remember, I can only estimate
the bias for 64-bit hash functions.) The prospector has found lots of
functions with about the same bias, but nothing provably better. Until
it does, I have no new 64-bit integer hash functions to offer.
String hash
I’m also experimenting with using my hash function as a sort of
primitive for a string hash function. Here I’m using my function in
the loop to mix in one byte at a time, and finishing it with the same
finalizer as MurmurHash3.
uint32_t
prospector32s(const void *buf, uint32_t len, uint32_t key)
{
uint32_t hash = key;
const unsigned char *p = buf;
for (uint32_t i = 0; i < len; i++) {
hash += p[i];
hash ^= hash >> 15;
hash *= UINT32_C(0x2c1b3c6d);
hash ^= hash >> 12;
hash *= UINT32_C(0x297a2d39);
hash ^= hash >> 15;
}
hash ^= len;
hash ^= hash >> 16;
hash *= UINT32_C(0x85ebca6b);
hash ^= hash >> 13;
hash *= UINT32_C(0xc2b2ae35);
hash ^= hash >> 16;
return hash + key;
}
It has the typical amount of collisions when running it on a large
dictionary, so it seems decent enough but I don’t know if this
hash function is worth much. More experimentation needed.
Beyond random search
Right now the prospector does a completely random, unstructured search
hoping to stumble upon something good by chance. Perhaps it would be
worth using a genetic algorithm to breed those 5-tuples towards
optimum? Others have had success in this area with simulated
annealing.
There’s probably more to exploit from the multiply-xorshift construction
that keeps popping up. If anything, the prospector is searching too
broadly, looking at constructions that could never really compete no
matter what the constants. In addition to everything above, I’ve been
looking for good 32-bit hash functions that don’t use any 32-bit
constants, but I’m really not finding any with a competitively low bias.
Update after one week
About one week after publishing this article I found an even better hash
function. I believe this is the least biased 32-bit integer hash
function of this form ever devised. It’s even less biased than the
MurmurHash3 finalizer.
// exact bias: 0.17353355999581582
uint32_t
lowbias32(uint32_t x)
{
x ^= x >> 16;
x *= UINT32_C(0x7feb352d);
x ^= x >> 15;
x *= UINT32_C(0x846ca68b);
x ^= x >> 16;
return x;
}
// inverse
uint32_t
lowbias32_r(uint32_t x)
{
x ^= x >> 16;
x *= UINT32_C(0x43021123);
x ^= x >> 15 ^ x >> 30;
x *= UINT32_C(0x1d69e2a5);
x ^= x >> 16;
return x;
}
If you’re willing to use an additional round of multiply-xorshift, this
next function actually reaches the theoretical bias limit (bias =
~0.021) as exhibited by a perfect integer hash function:
// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
x ^= x >> 17;
x *= UINT32_C(0xed5ad4bb);
x ^= x >> 11;
x *= UINT32_C(0xac4c1b51);
x ^= x >> 15;
x *= UINT32_C(0x31848bab);
x ^= x >> 14;
return x;
}
It’s statistically indistinguishable from a random permutation of all
32-bit integers.
nullprogram.com/blog/2018/07/20/
In several places, the C and C++ language specifications use a
curious, and fairly controversial, phrase: undefined behavior. For
certain program constructs, the specification prescribes no specific
behavior, instead allowing anything to happen. Such constructs
are considered erroneous, and so the result depends on the particulars
of the platform and implementation. The original purpose of undefined
behavior was for implementation flexibility. In other words, it’s
slack that allows a compiler to produce appropriate and efficient code
for its target platform.
Specifying a particular behavior would have put unnecessary burden on
implementations — especially in the earlier days of computing — making
for inefficient programs on some platforms. For example, if the result
of dereferencing a null pointer was defined to trap — to cause the
program to halt with an error — then platforms that do not have
hardware trapping, such as those without virtual memory, would be
required to instrument, in software, each pointer dereference.
In the 21st century, undefined behavior has taken on a somewhat
different meaning. Optimizers use it — or abuse it depending on your
point of view — to lift constraints that would otherwise
inhibit more aggressive optimizations. It’s not so much a
fundamentally different application of undefined behavior, but it does
take the concept to an extreme.
The reasoning works like this: A program that evaluates a construct
whose behavior is undefined cannot, by definition, have any meaningful
behavior, and so that program would be useless. As a result,
compilers assume programs never invoke undefined behavior and
use those assumptions to prove its optimizations.
Under this newer interpretation, mistakes involving undefined behavior
are more punishing and surprising than before. Programs
that seem to make some sense when run on a particular architecture may
actually compile into a binary with a security vulnerability due to
conclusions reached from an analysis of its undefined behavior.
This can be frustrating if your programs are intended to run on a very
specific platform. In this situation, all behavior really could be
locked down and specified in a reasonable, predictable way. Such a
language would be like an extended, less portable version of C or C++.
But your toolchain still insists on running your program on the
abstract machine rather than the hardware you actually care about.
However, even in this situation undefined behavior can still be
desirable. I will provide a couple of examples in this article.
Signed integer overflow
To start things off, let’s look at one of my all time favorite examples
of useful undefined behavior, a situation involving signed integer
overflow. The result of a signed integer overflow isn’t just
unspecified, it’s undefined behavior. Full stop.
This goes beyond a simple matter of whether or not the underlying
machine uses a two’s complement representation. From the perspective of
the abstract machine, just the act a signed integer overflowing is
enough to throw everything out the window, even if the overflowed result
is never actually used in the program.
On the other hand, unsigned integer overflow is defined — or, more
accurately, defined to wrap, not overflow. Both the undefined signed
overflow and defined unsigned overflow are useful in different
situations.
For example, here’s a fairly common situation, much like what actually
happened in bzip2. Consider this function that does substring
comparison:
int
cmp_signed(int i1, int i2, unsigned char *buf)
{
for (;;) {
int c1 = buf[i1];
int c2 = buf[i2];
if (c1 != c2)
return c1 - c2;
i1++;
i2++;
}
}
int
cmp_unsigned(unsigned i1, unsigned i2, unsigned char *buf)
{
for (;;) {
int c1 = buf[i1];
int c2 = buf[i2];
if (c1 != c2)
return c1 - c2;
i1++;
i2++;
}
}
In this function, the indices i1 and i2 will always be some small,
non-negative value. Since it’s non-negative, it should be unsigned,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually don’t want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.
The constraint is that the behavior of i1 or i2 overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior. On x86-64, where int is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.
In the signed case, incrementing the integers cannot overflow since that
would be undefined behavior. This permits the compiler to perform the
increment only in 64-bit precision without truncation if it would be
more efficient, which, in this case, it is.
Here’s the output of Clang 6.0.0 with -Os on x86-64. Pay close
attention to the main loop, which I named .loop:
cmp_signed:
movsxd rdi, edi ; use i1 as a 64-bit integer
mov al, [rdx + rdi]
movsxd rsi, esi ; use i2 as a 64-bit integer
mov cl, [rdx + rsi]
jmp .check
.loop: mov al, [rdx + rdi + 1]
mov cl, [rdx + rsi + 1]
inc rdx ; increment only the base pointer
.check: cmp al, cl
je .loop
movzx eax, al
movzx ecx, cl
sub eax, ecx ; return c1 - c2
ret
cmp_unsigned:
mov eax, edi
mov al, [rdx + rax]
mov ecx, esi
mov cl, [rdx + rcx]
cmp al, cl
jne .ret
inc edi
inc esi
.loop: mov eax, edi ; truncated i1 overflow
mov al, [rdx + rax]
mov ecx, esi ; truncated i2 overflow
mov cl, [rdx + rcx]
inc edi ; increment i1
inc esi ; increment i2
cmp al, cl
je .loop
.ret: movzx eax, al
movzx ecx, cl
sub eax, ecx
ret
As unsigned values, i1 and i2 can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, buf, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).
The signed integer helps to communicate the narrow contract of the
function — the limited range of i1 and i2 — to the compiler. In a
variant of C where signed integer overflow is defined (i.e. -fwrapv),
this capability is lost. In fact, using -fwrapv deoptimizes the signed
version of this function.
Side note: Using size_t (an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.
Strict aliasing
Another controversial undefined behavior is strict aliasing.
This particular term doesn’t actually appear anywhere in the C
specification, but it’s the popular name for C’s aliasing rules. In
short, variables with types that aren’t compatible are not allowed to
alias through pointers.
Here’s the classic example:
int
foo(int *a, int *b)
{
*b = 0; // store
*a = 1; // store
return *b; // load
}
Naively one might assume the return *b could be optimized to a simple
return 0. However, since a and b have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.
If foo has a narrow contract that forbids a and b to alias, we
have a couple of options for helping our compiler.
First, we could manually resolve the aliasing issue by returning 0
explicitly. In more complicated functions this might mean making local
copies of values, working only with those local copies, then storing the
results back before returning. Then aliasing would no longer matter.
int
foo(int *a, int *b)
{
*b = 0;
*a = 1;
return 0;
}
Second, C99 introduced a restrict qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to memcpy() are qualified with restrict as of C99.
Passing aliasing pointers through restrict parameters is undefined
behavior, e.g. this doesn’t ever happen as far as a compiler is
concerned.
int foo(int *restrict a, int *restrict b);
The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, int and long are never compatible even
when they have the same representation.
int foo(int *a, long *b);
If you use an extended or modified version of C without strict
aliasing (-fno-strict-aliasing), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.
What irritates a lot of people is that compilers will still
apply the strict aliasing rule even when it’s trivial for the compiler
to prove that aliasing is occurring:
/* note: forbidden */
long a;
int *b = (int *)&a;
It’s not just a simple matter of making exceptions for these cases.
The language specification would need to define all the rules about
when and where incompatible types are permitted to alias, and
developers would have to understand all these rules if they wanted to
take advantage of the exceptions. It can’t just come down to trusting
that the compiler is smart enough to see the aliasing when it’s
sufficiently simple. It would need to be carefully defined.
Besides, there are probably conforming, portable solutions
that, with contemporary compilers, will safely compile to the efficient
code you actually want anyway.
There is one special exception for strict aliasing: char * is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing, but also when you want to avoid
it. Writing through a char * pointer could force the compiler to
generate additional, unnecessary loads.
In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: uint8_t is not necessarily unsigned char.
That’s just one possible typedef definition for it. It could instead
typedef to, say, some internal __byte type.
In other words, technically speaking, uint8_t does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
worrying the compiler about aliasing issues with other pointers, this
would be the tool to accomplish it. Unfortunately there’s far too much
existing code that violates this part of strict aliasing that no
toolchain is willing to exploit it for optimization purposes.
Other undefined behaviors
Some kinds of undefined behavior don’t have performance or portability
benefits. They’re only there to make the compiler’s job a little
simpler. Today, most of these are caught trivially at compile time as
syntax or semantic issues (i.e. a pointer cast to a float).
Some others are obvious about their performance benefits and don’t
require much explanation. For example, it’s undefined behavior to
index out of bounds (with some special exceptions for one past the
end), meaning compilers are not obligated to generate those checks,
instead relying on the programmer to arrange, by whatever means, that
it doesn’t happen.
Undefined behavior is like nitro, a dangerous, volatile substance that
makes things go really, really fast. You could argue that it’s too
dangerous to use in practice, but the aggressive use of undefined
behavior is not without merit.
nullprogram.com/blog/2018/06/23/
The ptrace(2) (“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing strace — system call trace. With Ptrace, tracers
can pause tracees, inspect and set registers and memory, monitor
system calls, or even intercept system calls.
By intercept, I mean that the tracer can mutate system call arguments,
mutate the system call return value, or even block certain system calls.
Reading between the lines, this means a tracer can fully service system
calls itself. This is particularly interesting because it also means a
tracer can emulate an entire foreign operating system. This is done
without any special help from the kernel beyond Ptrace.
The catch is that a process can only have one tracer attached at a time,
so it’s not possible emulate a foreign operating system while also
debugging that process with, say, GDB. The other issue is that emulated
systems calls will have higher overhead.
For this article I’m going to focus on Linux’s Ptrace on
x86-64, and I’ll be taking advantage of a few Linux-specific extensions.
For the article I’ll also be omitting error checks, but the full source
code listings will have them.
You can find runnable code for the examples in this article here:
https://github.com/skeeto/ptrace-examples
strace
Before getting into the really interesting stuff, let’s start by
reviewing a bare bones implementation of strace. It’s no
DTrace, but strace is still incredibly useful.
Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The ptrace(2)
prototype generally looks something like this, though the specific
types may be different.
long ptrace(int request, pid_t pid, void *addr, void *data);
The pid is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.
The request field selects a specific Ptrace function, just like the
ioctl(2) interface. For strace, only two are needed:
PTRACE_TRACEME: This process is to be traced by its parent.
PTRACE_SYSCALL: Continue, but stop at the next system call
entrance or exit.
PTRACE_GETREGS: Get a copy of the tracee’s registers.
The other two fields, addr and data, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.
The strace interface is essentially a prefix to another command.
$ strace [strace options] program [arguments]
My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is fork(2) and exec(2) the
tracee process on the tail of argv. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.
pid_t pid = fork();
switch (pid) {
case -1: /* error */
FATAL("%s", strerror(errno));
case 0: /* child */
ptrace(PTRACE_TRACEME, 0, 0, 0);
execvp(argv[1], argv + 1);
FATAL("%s", strerror(errno));
}
The parent waits for the child’s PTRACE_TRACEME using wait(2). When
wait(2) returns, the child will be paused.
Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
PTRACE_O_TRACEFORK.
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);
All that’s left is a simple, endless loop that catches on system calls
one at a time. The body of the loop has four steps:
- Wait for the process to enter the next system call.
- Print a representation of the system call.
- Allow the system call to execute and wait for the return.
- Print the system call return value.
The PTRACE_SYSCALL request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
wait(2) is needed to wait for the tracee to enter the desired state.
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
When wait(2) returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, the operating system has not yet serviced this system call.
This detail will be important later.
The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, the system call number is
passed in rax, and the arguments (up to 6) are passed in
rdi, rsi, rdx, r10, r8, and r9. Reading the registers is
another Ptrace call, though there’s no need to wait(2) since the
tracee isn’t changing state.
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
long syscall = regs.orig_rax;
fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
syscall,
(long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
(long)regs.r10, (long)regs.r8, (long)regs.r9);
There’s one caveat. For internal kernel purposes, the system
call number is stored in orig_rax rather than rax. All the other
system call arguments are straightforward.
Next it’s another PTRACE_SYSCALL and wait(2), then another
PTRACE_GETREGS to fetch the result. The result is stored in rax.
ptrace(PTRACE_GETREGS, pid, 0, ®s);
fprintf(stderr, " = %ld\n", (long)regs.rax);
The output from this simple program is very crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use process_vm_readv(2) to
read those buffers from the tracee in order to print them appropriately.
However, this does lay the groundwork for system call interception.
System call interception
Suppose we want to use Ptrace to implement something like OpenBSD’s
pledge(2), in which a process pledges to use only a
restricted set of system calls. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they processing input
and only a small set of system calls are needed.
Before entering this main loop, a process can limit itself to the few
operations that it needs. If the program has a flaw allowing it
to be exploited by bad input, the pledge significantly limits what the
exploit can accomplish.
Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call exit(2) in
the tracer. Since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.
The tricky part is that there’s no way to abort a system call once
it’s started. When tracer returns from wait(2) on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.
However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” EPERM error
in errno via the normal in-band signaling.
for (;;) {
/* Enter next system call */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
/* Is this system call permitted? */
int blocked = 0;
if (is_syscall_blocked(regs.orig_rax)) {
blocked = 1;
regs.orig_rax = -1; // set to invalid syscall
ptrace(PTRACE_SETREGS, pid, 0, ®s);
}
/* Run system call and stop on exit */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
if (blocked) {
/* errno = EPERM */
regs.rax = -EPERM; // Operation not permitted
ptrace(PTRACE_SETREGS, pid, 0, ®s);
}
}
This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (open(2)) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
to the tracee to dynamically drop privileges.
How could the tracee communicate to the tracer? Use an artificial
system call!
Creating an artificial system call
For my new pledge-like system call — which I call xpledge() to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.
#define SYS_xpledge 10000
Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s pledge(2), which uses a string interface.
Actually designing robust and secure sets of privileges is really
complicated, as the pledge(2) manpage shows. Here’s the entire
interface and implementation of the system call for the tracee:
#define _GNU_SOURCE
#include <unistd.h>
#define XPLEDGE_RDWR (1 << 0)
#define XPLEDGE_OPEN (1 << 1)
#define xpledge(arg) syscall(SYS_xpledge, arg)
If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. brk(2)). The
PLEDGE_RDWR bit allows various read and write system calls
(read(2), readv(2), pread(2), preadv(2), etc.). The
PLEDGE_OPEN bit allows open(2).
To prevent privileges from being escalated back, pledge() blocks
itself — though this also prevents dropping more privileges later down
the line.
In the xpledge tracer, I just need to check for this system call:
/* Handle entrance */
switch (regs.orig_rax) {
case SYS_pledge:
register_pledge(regs.rdi);
break;
}
The operating system will return ENOSYS (Function not implemented)
since this isn’t a real system call. So on the way out I overwrite
this with a success (0).
/* Handle exit */
switch (regs.orig_rax) {
case SYS_pledge:
ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
break;
}
I wrote a little test program that opens /dev/urandom, makes a read,
tries to pledge, then tries to open /dev/urandom a second time, then
confirms it can read from the original /dev/urandom file descriptor.
Running without a pledge tracer, the output looks like this:
$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
Making an invalid system call doesn’t crash an application. It just
fails, which is a rather convenient fallback. When run under the
tracer, it looks like this:
$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
The pledge succeeds but the second fopen(3) does not since the tracer
blocked it with EPERM.
This concept could be taken much further, to, say, change file paths or
return fake results. A tracer could effectively chroot its tracee,
prepending some chroot path to the root of any path passed through a
system call. It could even lie to the process about what user it is,
claiming that it’s running as root. In fact, this is exactly how the
Fakeroot NG program works.
Foreign system emulation
Suppose you don’t just want to intercept some system calls, but
all system calls. You’ve got a binary intended to run on another
operating system, so none of the system calls it makes will ever
work.
You could manage all this using only what I’ve described so far. The
tracer would always replace the system call number with a dummy, allow
it to fail, then service the system call itself. But that’s really
inefficient. That’s essentially three context switches for each system
call: one to stop on the entrance, one to make the always-failing
system call, and one to stop on the exit.
The Linux version of PTrace has had a more efficient operation for
this technique since 2005: PTRACE_SYSEMU. PTrace stops only once
per a system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.
for (;;) {
ptrace(PTRACE_SYSEMU, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
switch (regs.orig_rax) {
case OS_read:
/* ... */
case OS_write:
/* ... */
case OS_open:
/* ... */
case OS_exit:
/* ... */
/* ... and so on ... */
}
}
To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this PTRACE_SYSEMU
tracer, a loader (to take the place of exec(2)), and whatever system
libraries the binary needs (or only run static binaries).
In fact, this sounds like a fun weekend project.
See also
nullprogram.com/blog/2018/06/10/
In the past year I’ve written a number of minimalist C libraries,
particularly header libraries. The distinction for “minimalist” is, of
course, completely arbitrary and subjective. My definition in this
context isn’t about the library’s functionality being stupidly
trivial or even necessarily simple. I’m talking about
interface (API) complexity and the library’s run time requirements.
Complex functionality can, in some cases, be tucked behind a simple
interface.
In this article I’ll give my definition for minimalist C API, then take
you through some of my own recent examples.
Minimalist properties
A minimalist C library would generally have these properties.
(1) Small number of functions, perhaps even as little as one.
This one’s pretty obvious. More functions means more surface area in
the interface. Since these functions typically interact, the
relationship between complexity and number of functions will be
superlinear.
(2) No dynamic memory allocations.
The library mustn’t call malloc() internally. It’s up to the caller
to allocate memory for the library. What’s nice about this is that
it’s completely up to the application exactly how memory is allocated.
Maybe it’s using a custom allocator, or it’s not linked against the
standard library.
A common approach is for the application to provide allocation functions
to the library — e.g. function pointers at run time, or define functions
with specific, expected names. The library would call these instead of
malloc() and free(). While that’s perfectly reasonable, it’s not
really minimalist, so I’m not including this technique.
Instead a minimalist API is designed such that it’s natural for the
application to make the allocations itself. Perhaps the library only
needs a single, fixed allocation for all its operations. Or maybe the
application specifies its requirements and the library communicates
how much memory is needed to meet those requirements. I’ll give
specific examples shortly.
One nice result of this property is that it eliminates one of the
common failure conditions: the out of memory error. If the library
doesn’t allocate memory, then it can’t run out of it!
Another convenient, minor outcome is the lack of casts from void * to
the appropriate type (e.g. on the return from malloc()). These casts
are implicit in C but must be made explicit in C++. Often, completely by
accident, my minimalist C libraries can be compiled as C++ without any
changes. This is only a minor benefit since these casts could be made
explicit in C, too, if C++ compatibility was desired. It’s just ugly.
In simple terms, the library mustn’t use functions from stdio.h —
with the exception of the sprintf() family. Like with memory
allocation, it leaves input and output to the application, letting it
decide exactly how, where, and when information comes and goes.
Like with memory allocation, maybe the application prefers not to use
the C standard library’s buffered IO. Perhaps the application is using
cooperative or green threads, and it would be bad for the
library to block internally on IO.
Also like avoiding memory allocation, a library that doesn’t perform IO
can’t have IO errors. Combined, this means it’s quite possible that a
minimalist library may have no error cases at all. Eliminating those
error handling paths makes the library a lot simpler. The one major
error condition left that’s difficult to eliminate are those pesky
integer overflow checks.
Communicating IO preferences to libraries can be a real problem with
C, since the standard library lacks generic input and output. Putting
FILE * pointers directly into an API mingles it with the C standard
library in potentially bad ways. Passing file names as strings is an
option, but this limits IO to files — versus, say, sockets. On POSIX
systems, at least it could talk about IO in terms of file descriptors,
but even that’s not entirely flexible — e.g. output to a memory
buffer, or anything not sufficiently file-like.
Again, a common way to deal with this is for the application to
provide IO function pointers to the library. But a minimalist
library’s API would be designed such that not even this is needed,
instead operating strictly on buffers. I’ll also have a couple
examples of this shortly.
With IO and memory allocation out of the picture, another frequent,
accidental result is no dependency on the C standard library. The only
significant functionality left in the standard library are the
mathematical functions (math.h), float parsing, and a few of
the string functions (string.h), like memset() and memmove().
These are valuable since they’re handled specially by the
compiler.
(4) Define at most one structure, and perhaps even none.
More types means more complexity, perhaps even more so than having lots
of functions. Some minimalist libraries can be so straightforward that
they can operate solely on simple, homogeneous buffers. I’ll show some
examples of this, too.
As I said initially, minimalism is about interface, not implementation.
The library is free to define as many structures internally as it needs
since the application won’t be concerned with them.
One common way to avoid complicated types in an API is to make them
opaque. The structures aren’t defined in the API, and instead the
application only touches pointers, making them like handles.
struct foo;
struct foo *foo_create(...);
int foo_method(struct foo *, ...);
void foo_destroy(struct foo *);
However, this is difficult to pull off when the library doesn’t allocate
its own memory.
Bitmap library
The first example is a library for creating bitmap (BMP) images. As you
may already know, I strongly prefer Netpbm, which is so simple
that it doesn’t even need a library. But nothing is quite so universally
supported as BMP.
24-bit BMP (Bitmap) ANSI C header library
This library is a perfect example of minimalist properties 2, 3, and 4.
It also doesn’t use any of the C standard library, though only by
accident.
It’s not a general purpose BMP library. It only supports 24-bit true
color, ignoring most BMP features such as palettes. Color is
represented as a 24-bit integer, packed 0xRRGGBB.
unsigned long bmp_size(long width, long height);
void bmp_init(void *, long width, long height);
void bmp_set(void *, long x, long y, unsigned long color);
unsigned long bmp_get(const void *, long x, long y);
Strictly speaking, even the bmp_get() function could be tossed since
the library is not intended to load external bitmap images. The
application really shouldn’t need to read back previously set pixels.
There is no allocation, no IO, and no data structures. The application
indicates the dimensions of image it wants to create, and the library
says how large of a buffer it needs. The remaining functions all operate
on this opaque buffer. To write the image out, the application only
needs to dump the buffer to a file.
Here’s a complete, strict error checking example of its usage:
#define RED 0xff0000UL
#define BLUE 0x0000ffUL
unsigned long size = bmp_size(width, height);
if (!size || size > SIZE_MAX) die("invalid dimensions");
void *bmp = calloc(size, 1);
if (!bmp) die("out of memory");
bmp_init(bmp, width, height);
/* Checkerboard pattern */
for (long y = 0; y < height; y++)
for (long x = 0; x < width; x++)
bmp_set(bmp, x, y, x % 2 == y % 2 ? RED : BLUE);
if (!fwrite(bmp, size, 1, out))
die("output error");
free(bmp);
The only library function that can fail is bmp_size(). When the
given image dimensions would overflow one of the BMP header fields, it
returns zero to indicate as such.
In bmp_set(), how does it know the dimensions of the image so that
it can find the pixel? It reads that from the buffer just like a BMP
reader would — and in a endian-agnostic manner. There are
no bounds checks — that’s the caller’s job — so it only needs to read
the image’s width in order to find the pixel’s location.
Since IO is under control of the application, it can always choose load
the original buffer contents back from a file, allowing a minimal sort
of BMP loading. However, this only works for trusted input as there
are no validation checks on the buffer.
32-bit integer hash set library
The second example is an integer hash set library. It uses closed
hashing. I initially wrote this for r/dailyprogrammer solution and
then formalized it into a little reusable library.
C99 32-bit integer hash set header library
Here’s the entire API:
int set32_z(uint32_t max);
void set32_insert(uint32_t *table, int z, uint32_t v);
void set32_remove(uint32_t *table, int z, uint32_t v);
int set32_contains(uint32_t *table, int z, uint32_t v);
Again, it’s a good example of properties 2, 3, and 4. Like the BMP
library, the application indicates the maximum number of integers it
will store in the hash set, and the library returns the power of two
number of uint32_t it needs to allocate (and zero-initialize).
In this API I’m just barely skirting not defining a data structure.
The caller must pass both the table pointer and the power of two size,
and these two values would normally be bundled together into a
structure.
int z = set32_z(max);
unsigned long long n = 1ULL << z;
if (n > SIZE_MAX) die("table too large");
uint32_t *table = calloc(sizeof(*table), n);
if (!table) die("out of memory");
set32_insert(table, z, value);
if (set32_contains(table, z, value))
/* ... */;
set32_remove(table, z, value);
free(table);
Iteration is straightforward, which is why it’s not in the API: visit
each element in the allocated buffer. Zeroes are empty slots.
If a different maximum number of elements is needed, the application
initializes a new, separate table, then iterates over the old table
inserting each integer in turn.
Perhaps the most interesting part of the API is that it has no errors.
No function can fail.
Also, like the BMP library, it accidentally doesn’t use the standard
library, except for a typedef from stdint.h.
Fantasy name generator
Nearly a decade ago I cloned in Perl the RinkWorks Fantasy Name
Generator. This version was slow and terrible, and I’m sometimes
tempted to just delete it.
A few years later I rewrote it in JavaScript using an entirely
different approach. In order to improve performance, it has a template
compilation step. The compiled template is a hierarchical composition of
simple generator objects. It’s much faster, and easily enabled some
extensions to the syntax.
Germán Méndez Bravo ported the JavaScript version to C++. This
C++ implementation was recently adopted into IVAN, a
roguelike game.
This recent commotion made me realize something: I hadn’t yet
implemented it in C! So I did.
Fantasy name generator ANSI C header library
The entire API is just a single function with four possible return
values. It’s a perfect example of minimalist property 1.
#define NAMEGEN_SUCCESS 0
#define NAMEGEN_TRUNCATED 1 /* Output was truncated */
#define NAMEGEN_INVALID 2 /* Pattern is invalid */
#define NAMEGEN_TOO_DEEP 3 /* Exceeds maximum nesting depth */
int namegen(char *dest,
size_t len,
const char *pattern,
unsigned long *seed);
There’s no template compilation step, and it generates names straight
from the template.
There are three kinds of errors.
-
If the output buffer wasn’t large enough, it warns about the name
being truncated.
-
The template could be invalid — e.g. incorrectly paired brackets.
-
The template could have too much nesting. I decided to hard code the
maximum nesting depth to a generous 32 levels. This limitation makes
the generator a lot simpler without any practical impact. It also
protects against unbounded memory usage — particularly stack
overflows — by arbitrarily complex patterns. This means it’s
perfectly safe to generate names from untrusted, arbitrarily long
input patterns.
Here’s a usage example:
char name[64];
unsigned long seed = 0xb9584b61UL;
namegen(name, sizeof(name), "!sV'i (the |)!id", &seed);
/* name = "Engia'pin the Doltolph" */
The generator supports UTF-8, almost by accident. (I’d have to go out of
my way not to support it.)
Despite the lack of a compilation step, which requires the parsing the
template for each generated name, it’s an order of magnitude faster
than the C++ version, which caught me by surprise. The high
performance is due to name generation being a single pass over the
template using reservoir sampling.
Internally it maintains a stack of “reset” pointers, each pointing
into the output buffer where the current nesting level began its
output. Each time it hits an alternation (|), it generates a random
number and decides whether or not to use the new option. The first
time it’s a 1/2 chance it chooses the new option. The second time, a
1/3 chance. The third time a 1/4 chance, and so on. When the new
option is selected, the reset pointer is used to “undo” any previous
output for the current nesting level.
The reservoir sampling means it needs to generate more random numbers
(once per option) than the JavaScript and C++ version (once per nesting
level). However, it uses its own, fast internal PRNG rather than
rand(). Generating these random numbers is basically free.
Not using rand() means that, like the previous libraries, it doesn’t
need anything from the standard library. It also has better quality
results since the typical standard library rand() is total rubbish,
both in terms of speed and quality (and typically has a PLT
penalty). Finally it means the results are identical across all
platforms for the same template and seed, which is one reason it’s part
of the API.
Another slight performance boost comes from the representation of
pattern substitutions, i.e. i will select a random “idiot” name from a
fixed selection of strings. The obvious representation is an array of
string pointers, as seen in the C++ version. However, there are a lot of
these little strings, which makes for a lot of pointers cluttering up
the relocation table. Instead, I packed it all into few small
pointerless tables, which on x86-64 are accessed efficiently via
RIP-relative addressing. It’s efficient, though not friendly to
modification.
I’m very happy with how this library turned out.
UTF-7 encoder and decoder
The last example is a UTF-7 encoder and decoder. UTF-7 is a method
for encoding arbitrary Unicode text within ASCII text, created as a
nasty hack to allow Unicode messages to be sent over ASCII-limited email
infrastructure. The gist of it is that the Unicode parts of a message
are encoded as UTF-16, then base64 encoded, then interpolated into the
ASCII stream between delimiters.
Einstein (allegedly) said “If you can’t explain it to a six year old,
you don’t understand it yourself.” The analog for programming is to
replace the six year old with a computer, and explaining an idea to a
computer is done by writing a program. I wanted to understand UTF-7, so
I implemented it.
A UTF-7 stream encoder and decoder in ANSI C
Here’s the entire API. It’s modeled a little after the zlib API.
/* utf7_encode() special code points */
#define UTF7_FLUSH -1L
/* return codes */
#define UTF7_OK -1
#define UTF7_FULL -2
#define UTF7_INCOMPLETE -3
#define UTF7_INVALID -4
struct utf7 {
char *buf;
size_t len;
/* then some "private" internal fields */
};
void utf7_init(struct utf7 *, const char *indirect);
int utf7_encode(struct utf7 *, long codepoint);
long utf7_decode(struct utf7 *);
Finally a library that defines a structure! The other fields (not shown)
hold important state information, but the application is only concerned
with buf and len: an input or output buffer. The same structure is
used for encoding and decoding, though only for one task at a time.
Following the minimalist library principle, there is no memory
allocation. When encoding a UTF-7 stream, the application’s job is to
point buf to an output buffer, indicating its length with len. Then
it feeds code points one at a time into the encoder. When the output is
full, it returns UTF7_FULL. The application must provide a new buffer
and try again.
This example usage is more complicated than I anticipated it would be.
Properly pumping code points through the encoder requires a loop (or
at least a second attempt).
char buffer[1024];
struct utf7 ctx;
utf7_init(&ctx, 0);
ctx.buf = buffer;
ctx.len = sizeof(buffer));
/* Assumes "wide character" input is Unicode */
for (;;) {
wint_t c = fgetwc(stdin);
if (c == WEOF)
break;
while (utf7_encode(ctx, c) != UTF7_OK) {
/* Flush output and reset buffer */
fwrite(buffer, sizeof(buffer), 1, stdout);
ctx.buf = buffer;
ctx.len = sizeof(buffer));
}
}
/* Flush all pending output */
while (utf7_encode(ctx, UTF7_FLUSH) != UTF7_OK) {
fwrite(buffer, sizeof(buffer), 1, stdout);
ctx.buf = buffer;
ctx.len = sizeof(buffer));
}
/* Write remaining output */
fwrite(buffer, sizeof(buffer) - ctx.len, 1, stdout);
/* Check for errors */
if (fflush(stdout))
die("output error");
if (ferror(stdin))
die("input error");
Flushing (UTF7_FLUSH) is necessary since, due to base64 encoding,
adjacent Unicode characters usually share a base64 character. Just
because a code point was absorbed into the encoder doesn’t mean it was
written into the output buffer. The encoding for that character may
depend on the next character to come. The special “flush” input forces
this out. It’s valid to flush in the middle of a stream, though this
may penalize encoding efficiency (e.g. the output may be
larger than necessary).
It’s not possible for the encoder to fail, so there are no error
conditions to worry about from the library.
Decoding is a different matter. It works almost in reverse from the
encoder: buf points to the input and the decoder is pumped to
return one code point at a time. It returns one of:
-
A non-negative value: a valid code point (including ASCII).
-
UTF7_OK: Input was exhausted. Stopping here would be valid. This is
what you should get when there’s no more input.
-
UTF7_INVALID: The input was invalid. buf points at the invalid byte.
-
UTF7_INCOMPLETE: Input was exhausted, but more is expected. If
there is no more input, then the input must have been truncated,
which is an error.
So there are two possible errors for two kinds of invalid input.
Parsing errors are unavoidable when parsing input.
Again, this library accidentally doesn’t require the standard library.
It doesn’t even depend on the compiler’s locale being
compatible with ASCII since none of its internal tables use string or
character literals. It behaves exactly the same across all conforming
platforms.
More examples
I had a few more examples in mind, but this article has gone on long
enough.
Instead I’ll save these for other articles!
nullprogram.com/blog/2018/05/31/
Emacs 26.1 was recently released. As you would expect from a
major release, it comes with lots of new goodies. Being a bit of an
Emacs Lisp enthusiast, the two most interesting new features
are generators (iter) and native threads
(thread).
Correction: Generators were actually introduced in Emacs 25.1
(Sept. 2016), not Emacs 26.1. Doh!
Generators
Generators are one of those cool language features that provide a lot of
power at a small implementation cost. They’re like a constrained form of
coroutines, but, unlike coroutines, they’re typically built entirely on
top of first-class functions (e.g. closures). This means no additional
run-time support is needed in order to add generators to a language.
The only complication is the changes the compiler. Generators are not
compiled the same way as normal functions despite looking so similar.
What’s perhaps coolest of all about lisp-family generators, including
Emacs Lisp, is that the compiler component can be implemented
entirely with macros. The compiler need not be modified at all,
making generators no more than a library, and not actually part of the
language. That’s exactly how they’ve been implemented in Emacs Lisp
(emacs-lisp/generator.el).
So what’s a generator? It’s a function that returns an iterator
object. When an iterator object is invoked (e.g. iter-next) it
evaluates the body of the generator. Each iterator is independent.
What makes them unusual (and useful) is that the evaluation is
paused in the middle of the body to return a value, saving all the
internal state in the iterator. Normally pausing in the middle of
functions isn’t possible, which is what requires the special compiler
support.
Emacs Lisp generators appear to be most closely modeled after Python
generators, though it also shares some similarities to
JavaScript generators. What makes it most like Python is the use
of signals for flow control — something I’m not personally enthused
about (though see also). When a Python generator
completes, it throws a StopItertion exception. In Emacs Lisp, it’s
an iter-end-of-sequence signal. A signal is out-of-band and avoids
the issue relying on some special in-band value to communicate the end
of iteration.
In contrast, JavaScript’s solution is to return a “rich” object wrapping
the actual yield value. This object has a done field that communicates
whether iteration has completed. This avoids the use of exceptions for
flow control, but the caller has to unpack the rich object.
Fortunately the flow control issue isn’t normally exposed to Emacs Lisp
code. Most of the time you’ll use the iter-do macro or (my preference)
the new cl-loop keyword iter-by.
To illustrate how a generator works, here’s a really simple iterator
that iterates over a list:
(iter-defun walk (list)
(while list
(iter-yield (pop list))))
Here’s how it might be used:
(setf i (walk '(:a :b :c)))
(iter-next i) ; => :a
(iter-next i) ; => :b
(iter-next i) ; => :c
(iter-next i) ; error: iter-end-of-sequence
The iterator object itself is opaque and you shouldn’t rely on any
part of its structure. That being said, I’m a firm believer that we
should understand how things work underneath the hood so that we can
make the most effective use of at them. No program should rely on the
particulars of the iterator object internals for correctness, but a
well-written program should employ them in a way that best exploits
their expected implementation.
Currently iterator objects are closures, and iter-next invokes the
closure with its own internal protocol. It asks the closure to return
the next value (:next operation), and iter-close asks it to clean
itself up (:close operation).
Since they’re just closures, another really cool thing about Emacs
Lisp generators is that iterator objects are generally readable.
That is, you can serialize them out with print and bring them back to
life with read, even in another instance of Emacs. They exist
independently of the original generator function. This will not work if
one of the values captured in the iterator object is not readable (e.g.
buffers).
How does pausing work? Well, one of other exciting new features of
Emacs 26 is the introduction of a jump table opcode, switch. I’d
lamented in the past that large cond and cl-case expressions could
be a lot more efficient if Emacs’ byte code supported jump tables. It
turns an O(n) sequence of comparisons into an O(1) lookup and jump.
It’s essentially the perfect foundation for a generator since it can
be used to jump straight back to the position where evaluation was
paused.
Buuut, generators do not currently use jump tables. The generator
library predates the new switch opcode, and, being independent of it,
its author, Daniel Colascione, went with the best option at the time.
Chunks of code between yields are packaged as individual closures. These
closures are linked together a bit like nodes in a graph, creating a
sort of state machine. To get the next value, the iterator object
invokes the closure representing the next state.
I’ve manually macro expanded the walk generator above into a form
that roughly resembles the expansion of iter-defun:
(defun walk (list)
(let (state)
(cl-flet* ((state-2 ()
(signal 'iter-end-of-sequence nil))
(state-1 ()
(prog1 (pop list)
(when (null list)
(setf state #'state-2))))
(state-0 ()
(if (null list)
(state-2)
(setf state #'state-1)
(state-1))))
(setf state #'state-0)
(lambda ()
(funcall state)))))
This omits the protocol I mentioned, and it doesn’t have yield results
(values passed to the iterator). The actual expansion is a whole lot
messier and less optimal than this, but hopefully my hand-rolled
generator is illustrative enough. Without the protocol, this iterator is
stepped using funcall rather than iter-next.
The state variable keeps track of where in the body of the generator
this iterator is currently “paused.” Continuing the iterator is
therefore just a matter of invoking the closure that represents this
state. Each state closure may update state to point to a new part of
the generator body. The terminal state is obviously state-2. Notice
how state transitions occur around branches.
I had said generators can be implemented as a library in Emacs Lisp.
Unfortunately theres a hole in this: unwind-protect. It’s not valid to
yield inside an unwind-protect form. Unlike, say, a throw-catch,
there’s no mechanism to trap an unwinding stack so that it can be
restarted later. The state closure needs to return and fall through the
unwind-protect.
A jump table version of the generator might look like the following.
I’ve used cl-labels since it allows for recursion.
(defun walk (list)
(let ((state 0))
(cl-labels
((closure ()
(cl-case state
(0 (if (null list)
(setf state 2)
(setf state 1))
(closure))
(1 (prog1 (pop list)
(when (null list)
(setf state 2))))
(2 (signal 'iter-end-of-sequence nil)))))
#'closure)))
When byte compiled on Emacs 26, that cl-case is turned into a jump
table. This “switch” form is closer to how generators are implemented in
other languages.
Iterator objects can share state between themselves if they
close over a common environment (or, of course, use the same global
variables).
(setf foo
(let ((list '(:a :b :c)))
(list
(funcall
(iter-lambda ()
(while list
(iter-yield (pop list)))))
(funcall
(iter-lambda ()
(while list
(iter-yield (pop list))))))))
(iter-next (nth 0 foo)) ; => :a
(iter-next (nth 1 foo)) ; => :b
(iter-next (nth 0 foo)) ; => :c
For years there has been a very crude way to “pause” a function and
allow other functions to run: accept-process-output. It only works in
the context of processes, but five years ago this was sufficient for me
to build primitives on top of it. Unlike this old process
function, generators do not block threads, including the user interface,
which is really important.
Threads
Emacs 26 also bring us threads, which have been attached in a very
bolted on fashion. It’s not much more than a subset of pthreads: shared
memory threads, recursive mutexes, and condition variables. The
interfaces look just like they do in pthreads, and there hasn’t been
much done to integrate more naturally into the Emacs Lisp ecosystem.
This is also only the first step in bringing threading to Emacs Lisp.
Right now there’s effectively a global interpreter lock (GIL), and
threads only run one at a time cooperatively. Like with generators, the
Python influence is obvious. In theory, sometime in the future this
interpreter lock will be removed, making way for actual concurrency.
This is, again, where I think it’s useful to contrast with JavaScript,
which was also initially designed to be single-threaded. Low-level
threading primitives weren’t exposed — though mostly because
JavaScript typically runs sandboxed and there’s no safe way to expose
those primitives. Instead it got a web worker API that exposes
concurrency at a much higher level, along with an efficient interface
for thread coordination.
For Emacs Lisp, I’d prefer something safer, more like the JavaScript
approach. Low-level pthreads are now a great way to wreck Emacs with
deadlocks (with no C-g escape). Playing around with the new
threading API for just a few days, I’ve already had to restart Emacs a
bunch of times. Bugs in Emacs Lisp are normally a lot more forgiving.
One important detail that has been designed well is that dynamic
bindings are thread-local. This is really essential for correct
behavior. This is also an easy way to create thread-local storage
(TLS): dynamically bind variables in the thread’s entrance function.
;;; -*- lexical-binding: t; -*-
(defvar foo-counter-tls)
(defvar foo-path-tls)
(defun foo-make-thread (path)
(make-thread
(lambda ()
(let ((foo-counter-tls 0)
(foo-name-tls path))
...))))
However, cl-letf “bindings” are not thread-local, which makes
this otherwise incredibly useful macro quite dangerous in the
presence of threads. This is one way that the new threading API feels
bolted on.
Building generators on threads
In my stack clashing article I showed a few different ways to
add coroutine support to C. One method spawned per-coroutine threads,
and coordinated using semaphores. With the new threads API in Emacs,
it’s possible to do exactly the same thing.
Since generators are just a limited form of coroutines, this means
threads offer another, very different way to implement them. The
threads API doesn’t provide semaphores, but condition variables can fill
in for them. To “pause” in the middle of the generator, just wait on a
condition variable.
So, naturally, I just had to see if I could make it work. I call it a
“thread iterator” or “thriter.” The API is very similar to iter:
https://github.com/skeeto/thriter
This is merely a proof of concept so don’t actually use this library
for anything. These thread-based generators are about 5x slower than
iter generators, and they’re a lot more heavy-weight, needing an
entire thread per iterator object. This makes thriter-close all the
more important. On the other hand, these generators have no problem
yielding inside unwind-protect.
Originally this article was going to dive into the details of how
these thread-iterators worked, but thriter turned out to be quite a
bit more complicated than I anticipated, especially as I worked
towards feature matching iter.
The gist of it is that each side of a next/yield transaction gets its
own condition variable, but share a common mutex. Values are passed
between the threads using slots on the iterator object. The side that
isn’t currently running waits on a condition variable until the other
side frees it, after which the releaser waits on its own condition
variable for the result. This is similar to asynchronous requests in
Emacs dynamic modules.
Rather than use signals to indicate completion, I modeled it after
JavaScript generators. Iterators return a cons cell. The car indicates
continuation and the cdr holds the yield result. To terminate an
iterator early (thriter-close or garbage collection), thread-signal
is used to essentially “cancel” the thread and knock it off the
condition variable.
Since threads aren’t (and shouldn’t be) garbage collected, failing to
run a thread-iterator to completion would normally cause a memory leak,
as the thread sits there forever waiting on a “next” that will never
come. To deal with this, there’s a finalizer is attached to the
iterator object in such a way that it’s not visible to the thread. A
lost iterator is eventually cleaned up by the garbage collector, but, as
usual with finalizers, this is only a last resort.
The future of threads
This thread-iterator project was my initial, little experiment with
Emacs Lisp threads, similar to why I connected a joystick to Emacs
using a dynamic module. While I don’t expect the current thread
API to go away, it’s not really suitable for general use in its raw
form. Bugs in Emacs Lisp programs should virtually never bring down
Emacs and require a restart. Outside of threads, the few situations
that break this rule are very easy to avoid (and very obvious that
something dangerous is happening). Dynamic modules are dangerous by
necessity, but concurrency doesn’t have to be.
There really needs to be a safe, high-level API with clean thread
isolation. Perhaps this higher-level API will eventually build on top of
the low-level threading API.
nullprogram.com/blog/2018/05/27/
Update: There’s a good discussion on Hacker News.
Over on GitHub, David Yu has an interesting performance benchmark for
function calls of various Foreign Function Interfaces (FFI):
https://github.com/dyu/ffi-overhead
He created a shared object (.so) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.
For the C “FFI” he used standard dynamic linking, not dlopen(). This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.
The most surprising result of the benchmark is that
LuaJIT’s FFI is substantially faster than C. It’s about
25% faster than a native C function call to a shared object function.
How could a weakly and dynamically typed scripting language come out
ahead on a benchmark? Is this accurate?
It’s actually quite reasonable. The benchmark was run on Linux, so the
performance penalty we’re seeing comes the Procedure Linkage Table
(PLT). I’ve put together a really simple experiment to demonstrate the
same effect in plain old C:
https://github.com/skeeto/dynamic-function-benchmark
Here are the results on an Intel i7-6700 (Skylake):
plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
These are three different types of function calls:
- Through the PLT
- An indirect function call (via
dlsym(3))
- A direct function call (via a JIT-compiled function)
As shown, the last one is the fastest. It’s typically not an option
for C programs, but it’s natural in the presence of a JIT compiler,
including, apparently, LuaJIT.
In my benchmark, the function being called is named empty():
And to compile it into a shared object:
$ cc -shared -fPIC -Os -o empty.so empty.c
Just as in my PRNG shootout, the benchmark calls this function
repeatedly as many times as possible before an alarm goes off.
Procedure Linkage Tables
When a program or library calls a function in another shared object,
the compiler cannot know where that function will be located in
memory. That information isn’t known until run time, after the program
and its dependencies are loaded into memory. These are usually at
randomized locations — e.g. Address Space Layout Randomization
(ASLR).
How is this resolved? Well, there are a couple of options.
One option is to make a note about each such call in the binary’s
metadata. The run-time dynamic linker can then patch in the correct
address at each call site. How exactly this would work depends on the
particular code model used when compiling the binary.
The downside to this approach is slower loading, larger binaries, and
less sharing of code pages between different processes. It’s
slower loading because every dynamic call site needs to be patched
before the program can begin execution. The binary is larger because
each of these call sites needs an entry in the relocation table. And the
lack of sharing is due to the code pages being modified.
On the other hand, the overhead for dynamic function calls would be
eliminated, giving JIT-like performance as seen in the benchmark.
The second option is to route all dynamic calls through a table. The
original call site calls into a stub in this table, which jumps to the
actual dynamic function. With this approach the code does not need to
be patched, meaning it’s trivially shared between processes.
Only one place needs to be patched per dynamic function: the entries
in the table. Even more, these patches can be performed lazily, on
the first function call, making the load time even faster.
On systems using ELF binaries, this table is called the Procedure
Linkage Table (PLT). The PLT itself doesn’t actually get patched —
it’s mapped read-only along with the rest of the code. Instead the
Global Offset Table (GOT) gets patched. The PLT stub fetches the
dynamic function address from the GOT and indirectly jumps to that
address. To lazily load function addresses, these GOT entries are
initialized with an address of a function that locates the target
symbol, updates the GOT with that address, and then jumps to that
function. Subsequent calls use the lazily discovered address.

The downside of a PLT is extra overhead per dynamic function call,
which is what shows up in the benchmark. Since the benchmark only
measures function calls, this appears to be pretty significant, but in
practice it’s usually drowned out in noise.
Here’s the benchmark:
/* Cleared by an alarm signal. */
volatile sig_atomic_t running;
static long
plt_benchmark(void)
{
long count;
for (count = 0; running; count++)
empty();
return count;
}
Since empty() is in the shared object, that call goes through the PLT.
Indirect dynamic calls
Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
dlsym(3).
void *h = dlopen("path/to/lib.so", RTLD_NOW);
void (*f)(void) = dlsym("f");
f();
Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then dlsym(3) may actually return the address of
the PLT stub.)
However, this is still an indirect function call. On conventional
architectures, direct function calls have an immediate relative
address. That is, the target of the call is some hard-coded offset from
the call site. The CPU can see well ahead of time where the call is
going.
An indirect function call has more overhead. First, the address has to
be stored somewhere. Even if that somewhere is just a register, it
increases register pressure by using up a register. Second, it
provokes the CPU’s branch predictor since the call target isn’t
static, making for extra bookkeeping in the CPU. In the worst case the
function call may even cause a pipeline stall.
Here’s the benchmark:
volatile sig_atomic_t running;
static long
indirect_benchmark(void (*f)(void))
{
long count;
for (count = 0; running; count++)
f();
return count;
}
The function passed to this benchmark is fetched with dlsym(3) so the
compiler can’t do something tricky like convert that indirect
call back into a direct call.
If the body of the loop was complicated enough that there was register
pressure, thereby requiring the address to be spilled onto the stack,
this benchmark might not fare as well against the PLT benchmark.
Direct function calls
The first two types of dynamic function calls are simple and easy to
use. Direct calls to dynamic functions is trickier business since it
requires modifying code at run time. In my benchmark I put together a
little JIT compiler to generate the direct call.
There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually nearby the target function, empty(). If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.
To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this requires parsing virtual files under /proc.
Here’s what my JIT’s memory allocation looks like. It assumes
reasonable behavior for uintptr_t casts:
static void
jit_compile(struct jit_func *f, void (*empty)(void))
{
uintptr_t addr = (uintptr_t)empty;
void *desired = (void *)((addr - SAFETY_MARGIN) & PAGEMASK);
/* ... */
unsigned char *p = mmap(desired, len, prot, flags, fd, 0);
/* ... */
}
It allocates two pages, one writable and the other containing
non-writable code. Similar to my closure library, the lower
page is writable and holds the running variable that gets cleared by
the alarm. It needed to be nearby the JIT code in order to be an
efficient RIP-relative access, just like the other two benchmark
functions. The upper page contains this assembly:
jit_benchmark:
push rbx
xor ebx, ebx
.loop: mov eax, [rel running]
test eax, eax
je .done
call empty
inc ebx
jmp .loop
.done: mov eax, ebx
pop rbx
ret
The call empty is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the end of the instruction):
// call empty
uintptr_t rel = (uintptr_t)empty - (uintptr_t)p - 5;
*p++ = 0xe8;
*p++ = rel >> 0;
*p++ = rel >> 8;
*p++ = rel >> 16;
*p++ = rel >> 24;
If empty() wasn’t in a shared object and instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for plt_benchmark(), assuming somehow it didn’t inline
empty().
Ironically, calling the JIT-compiled code requires an indirect call
(e.g. via a function pointer), and there’s no way around this. What
are you going to do, JIT compile another function that makes the
direct call? Fortunately this doesn’t matter since the part being
measured in the loop is only a direct call.
It’s no mystery
Given these results, it’s really no mystery that LuaJIT can generate
more efficient dynamic function calls than a PLT, even if they still
end up being indirect calls. In my benchmark, the non-PLT indirect
calls were 28% faster than the PLT, and the direct calls 43% faster
than the PLT. That’s a small edge that JIT-enabled programs have over
plain old native programs, though it comes at the cost of absolutely
no code sharing between processes.
nullprogram.com/blog/2018/05/01/
Update: There are discussions on Reddit and on Hacker
News.
So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification lead
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.
In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.
x86-64 ABI ambiguity
The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:
double
zero_multiply(double x)
{
return x * 0.0;
}
The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)
As a result, both GCC and Clang perform the multiply:
zero_multiply:
xorpd xmm1, xmm1
mulsd xmm0, xmm1
ret
The -ffast-math option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
consistency:
zero_multiply:
xorps xmm0, xmm0
ret
Side note: -ffast-math doesn’t necessarily mean “less precise.”
Sometimes it will actually improve precision.
Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a short:
double
zero_multiply_short(short x)
{
return x * 0.0;
}
It’s no longer possible for the argument to be one of those special
values. The short will be promoted to one of 65,535 possible double
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (-Os):
zero_multiply_short:
movsx edi, di ; sign-extend 16-bit argument
xorps xmm1, xmm1 ; xmm1 = 0.0
cvtsi2sd xmm0, edi ; convert int to double
mulsd xmm0, xmm1
ret
Clang also misses this optimization:
zero_multiply_short:
cvtsi2sd xmm1, edi
xorpd xmm0, xmm0
mulsd xmm0, xmm1
ret
But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (movsx)? Clang is treating that
short argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?
It turns out that the x86-64 ABI didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification should have said, but
currently it does not.
Fortunately GCC also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: Clang and ICC are not ABI-compatible on
x86-64.
I don’t use ICC, so that particular issue wouldn’t bite me, but if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.
Floating point precision
Without looking it up or trying it, what does this function return?
Think carefully.
int
float_compare(void)
{
float x = 1.3f;
return x == 1.3f;
}
Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.
$ gcc -std=c99 -m32 cmp.c # float_compare() == 0
$ clang -std=c99 -m32 cmp.c # float_compare() == 1
So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations all did it differently. The C99 specification
cleaned this all up and introduced FLT_EVAL_METHOD.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.
Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in long double
precision and truncated afterward (FLT_EVAL_METHOD == 2).
In float_compare() the left-hand side is truncated to a float by the
assignment, but the right-hand side, despite being a float literal,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!
The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if simulated using
integers), and great pains have been made to work around
the performance consequences of this. Until recently, Mono had
similar issues.
The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (FLT_EVAL_METHOD == 0). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the backwards compatible GCC on the old x86.
I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.
Built-in Function Elimination
I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, new_image(), that allocates a greyscale image
for, say, some multimedia library.
static unsigned char *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, shade, w * h);
}
}
return p;
}
It’s a static function because this would be part of some slick
header library (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even checks for integer overflow before allocating anything.
I write a unit test to make sure it detects overflow. This function
should return 0.
/* expected return == 0 */
int
test_new_image_overflow(void)
{
void *p = new_image(2, SIZE_MAX, 0);
return !!p;
}
So far my test passes. Good.
I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make malloc() fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a malloc(SIZE_MAX), e.g. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exibytes of memory, and it leaves space for nothing else, including
the program itself.
/* expected return == 0 */
int
test_new_image_oom(void)
{
void *p = new_image(1, SIZE_MAX, 0xff);
return !!p;
}
I compile with GCC, test passes. I compile with Clang and the test
fails. That is, the test somehow managed to allocate 16 exibytes of
memory, and initialize it. Wat?
Disassembling the test reveals what’s going on:
test_new_image_overflow:
xor eax, eax
ret
test_new_image_oom:
mov eax, 1
ret
The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with malloc() became dead code and
was trivially eliminated.
In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the memset(), so it eliminated the
allocation altogether and then simulated a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.
I soon realized I can take this further and trick Clang into
performing an invalid optimization, revealing a bug. Consider
this slightly-optimized version that uses calloc() when the shade is
zero (black). The calloc() function does its own overflow check, so
new_image() doesn’t need to do it.
static void *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (shade == 0) { // shortcut
p = calloc(w, h);
} else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, color, w * h);
}
}
return p;
}
With this change, my overflow unit test is now also failing. The
situation is even worse than before. The calloc() is being
eliminated despite the overflow, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, this could introduce a vulnerability in a
real program. The OpenBSD folks are so worried about this sort of
thing that they’ve disabled this optimization.
Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).
struct table {
unsigned *counter;
unsigned *values;
};
static int
table_init(struct table *t, size_t n)
{
t->counter = calloc(n, sizeof(*t->counter));
if (t->counter) {
/* Overflow already tested above */
t->values = malloc(n * sizeof(*t->values));
if (!t->values) {
free(t->counter);
return 0; // fail
}
return 1; // success
}
return 0; // fail
}
This function relies on the overflow test in calloc() for the second
malloc() allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the counter table, and Clang is able to
statically determine this fact, it may eliminate the calloc(). This
would also eliminate the overflow test, introducing a
vulnerability. If an attacker can control n, then they can
overwrite arbitrary memory through that values pointer.
The takeaway
Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.
The unit test issues in my real program, which was a bit more
sophisticated than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider how far that can go. I’ve gotten that even from
observing AI I’ve written myself, and I know for sure no human
taught it some particularly clever trick.
My favorite AI story along these lines is about an AI that learned
how to play games on the Nintendo Entertainment System. It
didn’t understand the games it was playing. It’s optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?
Just before the end it would pause the game and wait… forever.
nullprogram.com/blog/2018/04/13/
My first exposure to C and C++ was a little over 20 years ago. I
remember it being some version of Borland C++, either 4.x
or 5.x, running on Windows 95. I didn’t have a mentor, so I
did the best I could slowly working through what was probably a poorly
written beginner C++ book, typing out the examples and exercises with
little understanding. Since I didn’t learn much from the experience,
there was a 7 or 8 year gap before I’d revisit C and C++ in college.

I thought it would be interesting to revisit this software, to
reevaluate it from a far more experienced perspective. Keep in mind
that C++ wasn’t even standardized yet, and the most recent C standard
was from 1989. Given this, what was it like to be a professional
software developer using a Borland toolchain on Windows 20 years ago?
Was it miserable, made bearable only by ignorance of how much better
the tooling could be? Or maybe it actually wasn’t so bad, and these
tools are better than I expect?
Ultimately my conclusion is that it’s a little bit of both. There are
some significant capability gaps compared to today, but the core
toolchain itself is actually quite reasonable, especially for the mid
1990s.
The setup
Before getting into the evaluation, let’s discuss how I got it all up
and running. While it’s technically possible to run Windows 95 on a
modern x86-64 machine thanks to the architecture’s extreme backwards
compatibility, it’s more compatible, simpler, and safer to
virtualize it. Most importantly, I can emulate older hardware that
will have better driver support.
Despite that early start in Windows all those years ago, I’m primarily
a Linux user. The premier virtualization solution on Linux these days
is KVM, a kernel module that turns Linux into a hypervisor and
makes efficient use of hardware virtualization extensions.
Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m
using QEmu (with KVM disabled), a hardware emulator closely
associated with KVM. Since it doesn’t take advantage of hardware
virtualization extensions, it will be slower. This is fine since my
goal is to emulate slow, 20+ year old hardware anyway.
There’s very little practical difference between Windows 95 and
Windows 98. Since Windows 98 runs a lot smoother virtualized, I
decided to go with that instead. This will be perfectly sufficient for
my toolchain evaluation.
Software
To get started, I’ll need an installer for Windows 98. I thought this
would be difficult to find, but there’s a copy available on the
Internet Archive. I don’t know how “legitimate” this is, but it works.
Since it’s running in a virtual machine without network access, I also
don’t really care if this copy is somehow infected with malware.
Internet Archive: Windows 98 Second Edition
Also on the Internet Archive is a complete copy of Borland C++ 5.02,
with the same caveats of legitimacy. It works, which is good enough for
my purposes.
Internet Archive: Borland C++ 5.02
Thank you Internet Archive!
Hardware
I’ve got my software, now to set up the virtualized hardware. First I
create a drive image:
$ qemu-image create -fqcow2 win98.img 8G
I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a
virtual hard drive with modern sizes would probably break the
installer. This sort of issue is a common theme among old software,
where there may be complaints about negative available disk space due
to signed integer overflow.
I decided to give the machine 256MB of memory (-m 256). This is also a
little excessive, but I wanted to be sure memory didn’t limit Borland’s
capabilities. This amount of memory is close to the upper bound, and
going much beyond will likely cause problems with Windows 98.
For the CPU I settled on a Pentium (-cpu pentium). My original goal
was to go a little simpler with a 486 (-cpu 486), but the Windows 98
installer kept crashing when I tried this.
I experimented with different configurations for the network card, but
I couldn’t get anything to work. So I’ve disabled networking (-net
none). The only reason I’d want this is that it would be easier to
move files in and out of the virtual machine.
Finally, here’s how I ran QEmu. The last two lines are only needed when
installing.
$ qemu-system-x86_64 \
-localtime \
-cpu pentium \
-no-acpi \
-no-hpet \
-m 256 \
-hda win98.img \
-soundhw sb16 \
-vga cirrus \
-net none \
-cdrom "Windows 98 Second Edition.iso" \
-boot d

Installation
Installation is just a matter of following the instructions. You’ll
need that product key listed on the Internet Archive site.

That copy of Borland is just a big .zip file. This presents two
problems.
-
Without network access, I’ll need to figure out how to get this
inside the virtual machine.
-
This version of Windows doesn’t come with software to unzip this
file. I’d need to find and install an unzip tool first.
Fortunately I can kill two birds with one stone by converting that .zip
archive into a .iso and mounting it in the virtual machine.
unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"
Then in the QEmu console (C-A-2) I attach it:
change ide1-cd0 borland.iso
This little trick of generating .iso files and mounting them is how I
will be moving all the other files into the virtual machine.
Borland C++
The first thing I did was play around with with Borland IDE. This is
what I would have been using 20 years ago.

Despite being Borland C++, I’m personally most interested in its ANSI
C compiler. As I already pointed out, this software pre-dates C++’s
standardization, and a lot has changed over the past two decades. On the
other hand, C hasn’t really changed all that much. The 1999 update to
the C standard (e.g. “C99”) was big and important, but otherwise little
has changed. The biggest drawback is the lack of “declare anywhere”
variables, including in for-loop initializers. Otherwise it’s the same
as writing C today.
To test drive the IDE, I made a couple of test projects, built and ran
them with different options, and poked around with the debugger. The
debugger is actually pretty decent, especially for the 1990s. It can be
operated via the IDE or standalone, so I could use it without firing up
the IDE and making a project.
The toolchain includes an assembler, and I can inspect the compiler’s
assembly output. To nobody’s surprise this is Intel-flavored assembly,
which is very welcome. Imagining myself as a software developer
in the mid 1990s, this means I can see exactly what the compiler’s doing
as well as write some of the performance sensitive parts in assembly if
necessary.
The built-in editor is the worst part of the IDE, which is unfortunate
since it really spoils the whole experience. It’s easy to jump between
warnings and errors, it has incremental search, and it has good syntax
highlighting. But these are the only positive things I can say about it.
If I had to work with this editor full-time, I’d spend my days pretty
irritated.
Like with the debugger, the Borland people did a good job modularizing
their development tools. As part of the installation process, all of the
Borland command line tools are added to the system PATH (reminder:
this is a single-user system). This includes compiler, linker,
assembler, debugger, and even an incomplete implementation of
make.
With this, I can essentially pretend the IDE doesn’t exist and replace
that crummy editor with something better: Vim.
The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3,
released in 2010. I download those binaries, trim a few things from my
.vimrc, and smuggle it all into my virtual machine via a
virtual CD. I’ve now got a powerful text editor in Windows 98 and my
situation has drastically improved.

Since I hardly use features added since Vim 7.3, this feels right at
home to me. I can invoke the build from Vim, and it
can populate the quickfix list from Borland’s output, so I could
actually be fairly productive in these circumstances! I’m honestly
really impressed with how well this all works together.
At this point I only have two significant annoyances:
-
Borland’s command line tools belong to that category of irritating
programs that print their version banner on every invocation.
There’s not even a command line switch to turn this off. All this
noise is quickly tiresome. The Visual Studio toolchain does
the same thing by default, though it can be turned off (-nologo).
I dislike that some GNU tools also commit this sin, but at least
GNU limits this to interactive programs.
-
The Windows/DOS command shell and console is even worse than it
is today. I didn’t think that was possible. This is back when
it was still genuinely DOS and not just pretending to be (e.g. in
NT). The worst part by far is the lack of command history. There’s
no using the up-arrow to get previous commands. There’s no tab
completion. Forward slash is not a substitute for backslash in
paths. If I wanted to improve my productivity, replacing this
console and shell would be the first priority.
Update: In an email, Aristotle Pagaltzis informed me that Windows 98
comes with DOSKEY.COM, which provides command history for
COMMAND.EXE. Alternatively there’s Enhanced DOSKEY.com, an
open source, alternative implementation that also provides tab
completion for commands and filesnames. This makes the console a lot
more usable (and, honestly, in some ways better than the modern
defaults).
Building Enchive with Borland
Last year I wrote a backup encryption tool called Enchive,
and I still use it regularly. One of my design goals was high
portability since it may be needed to decrypt something important in
the distant future. It should be as bit-rot-proof as
possible. In software, the best way to future-proof is to
past-proof.
If I had a time machine that could send source code back in time, and
I sent Enchive to a competant developer 20 years ago, would they be
able to compile it and run it? If the answer is yes, then that means
Enchive already has 20 years of future-proofing built into it.
To accomplish this, Enchive is 3,300 lines of strict ANSI C,
1989-style, with no dependencies other than the C standard library and
a handful of operating system functions — e.g. functionality not in
the C standard library. In practice, any ANSI C compiler targeting
either POSIX, or Windows 95 or later, should be able to compile it.
My Windows 98 virtual machine includes an ANSI C compiler, and can be
used to simulate this time machine. I generated an “amalgamation” build
(make amalgamation) — essentially a concatenation of all the source
files — and sent this into the virtual machine. Before Borland was able
to compile it, I needed to make three small changes.
First, Enchive includes stdint.h to get fixed-width integers needed
for the encryption routines. This header comes from C99, and C89 has
no equivalent. I anticipated this problem from the beginning and made
it easy for the person performing the build to correct it. This header
is included exactly once, in config.h, and this is placed at the top
of the amalgamation build. The include only needs to be replaced with
a handful of manual typedefs. For Borland that looks like this:
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned long uint32_t;
typedef unsigned __int64 uint64_t;
typedef long int32_t;
typedef __int64 int64_t;
#define INT8_C(n) (n)
#define INT16_C(n) (n)
#define INT32_C(n) (n##U)
Second, in more recent versions of Windows, GetFileAttributes() can
return the value INVALID_FILE_ATTRIBUTES. Checking for an error that
cannot happen is harmless, but this value isn’t defined in Borland’s
SDK. I only had to eliminate that check.
Third, the CryptGenRandom() interface isn’t defined in
Borland’s SDK. This is used by Enchive to generate keys. MSDN reports
this function wasn’t available until Windows XP, but it’s definitely
there in Windows 98, exported by ADVAPI32.dll. I’m able to call it,
though it always reports an error. Perhaps it’s been disabled in this
version due to cryptographic export restrictions?
Regardless of what’s wrong, I ripped this out and replaced it with a
fatal error. This version of Enchive can’t generate new keys — unless
derived from a passphrase — nor encrypt files, including the use of a
protection key to encrypt the secret key. However, it can decrypt
files, which is the important part that needs to be future-proofed.
With this three changes — which took me about 10 minutes to sort out —
Enchive builds and runs, and it correctly decrypts files I encrypted on
Linux. So Enchive has at least 20 years of past-proofing! The
screenshot at the top of this article shows it running successfully in
an MS-DOS console window.
What’s wrong? What’s missing?
I mentioned that there were some gaps. The most obvious is the lack of
the standard POSIX utilities, especially a decent shell. I don’t know if
any had been ported to Windows in the mid 1990s. But that could be
solved one way or another without too much trouble, even if it meant
doing some of that myself.
No, the biggest capability I’d miss, and which wouldn’t be easily
obtained, is Git, or a least a decent source control system. I really
don’t want to work without proper source control. Git’s support for
Windows is second tier, and the port to modern Windows is already a
bit of a hack. Getting it to run in Windows 98 would probably be a
challenge, especially if I had to compile it with Borland.
The other major issue is the lack of stability. In this experiment, I’ve
been seeing this screen a lot:

I remember Windows crashing a lot back in those days, and it certainly
had a bad reputation for being unstable, but this is far worse than I
remembered. While the hardware emulator may be somewhat at fault here,
keep in mind that I never installed third party drivers. Most of these
crashes are Windows’ fault. I found I can reliably bring the whole
system down with a single GetProcAddress() call on a system DLL. The
only way I can imagine this instability was so tolerated back then was
general ignorance that computing could be so much better.
I was tempted to write this article in Vim on Windows 98, but all this
crashing made me too nervous. I didn’t want some stupid filesystem
corruption to wipe out my work. Too risky.
A better alternative
If I was stuck working in Windows 98 — or was at least targeting it as a
platform — but had access to a modern tooling ecosystem, could I do
better than Borland? Yes! Programs built by Mingw-w64 can be
run even as far back as Windows 95.
Now, there’s a catch. I thought it would be this simple:
$ i686-w64-mingw32-gcc -Os hello.c
But when I brought the resulting binary into the virtual machine it
crashed when ran it: illegal instruction. Turns out it contained a
conditional move (cmov) which is an instruction not available until
the Pentium Pro (686). The “pentium” emulation is just a 586.
I tried to disable cmov by picking the specific architecture:
$ i686-w64-mingw32-gcc -march=pentium -Os hello.c
This still didn’t work because the statically-linked part of the CRT
contained the cmov. I’d have to recompile that as well.
I could have switched the QEmu options to “upgrade” to a Pentium Pro,
but remember that my goal was really the 486. Fortunately this was easy
to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number
of times before, so I knew it wouldn’t be difficult.
I could go step by step, but it’s all fairly well documented in the
Mingw-64 “howto-build” document. I used GCC 7.3 (the latest version),
and for the target I picked “i486-w64-mingw32”. When it was done I could
compile binaries on Linux to run in my Windows 98 virtual machine:
$ i486-w64-mingw32-gcc -Os hello.c
This should enable quite a bit of modern software to run inside my
virtual machine if I so wanted. I didn’t actually try this (yet?),
but, to take this concept all the way, I could use this cross-compiler
to cross-compile Mingw-w64 itself to run inside the virtual machine,
directly replacing Borland C++.
And the only thing I’d miss about Borland is its debugger.