Articles tagged tutorial at null program

Lessons learned from my first dive into WebAssembly

2025-04-04T04:01:20Z

It began as a water sort puzzle solver, constructed similarly to my British Square solver. It was nearly playable, so I added a user interface with SDL2. My wife enjoyed it on her desktop, but wished to play on her phone. So then I needed to either rewrite it in JavaScript and hope the solver was still fast enough for real-time use, or figure out WebAssembly (Wasm). I succeeded, and now my game runs in browsers (source). Like before, next I ported my pkg-config clone to the Wasm System Interface (WASI), whipped up a proof-of-concept UI, and it too runs in browsers. Neither uses a language runtime, resulting in little 8kB and 28kB Wasm binaries respectively. In this article I share my experiences and techniques.

Wasm is a specification defining an abstract stack machine with a Harvard architecture, and related formats. There are just four types, i32, i64, f32, and f64. It also has “linear” octet-addressable memory starting at zero, with no alignment restrictions on loads and stores. Address zero is a valid, writable address, which resurfaces some old-school, high-level language challenges regarding null pointers. There are 32-bit and 64-bit flavors, though the latter remains experimental. That suits me: I appreciate smaller pointers on 64-bit hosts, and I wish I could opt into it more often (e.g. x32).

As browser tech goes, they chose an apt name: WebAssembly is to the web as JavaScript is to Java.

There are distinct components at play, and much of the online discussion doesn’t do a great job drawing lines between them:

Wasm module: A compiled and linked image — like ELF or PE — containing sections for code, types, globals, import table, export table, and so on. The export table lists the module’s entry points. It has an optional start section indicating which function initializes a loaded image. (In practice almost nobody actually uses the start section.) A Wasm module can only affect the outside world through imported functions. Wasm itself defines no external interfaces for Wasm programs, not even printing or logging.
Wasm runtime: Loads Wasm modules, linking import table entries into the module. Because Wasm modules include types, the runtime can type check this linkage at load time. With imports resolved, it executes the start function, if any, then executes zero or more of its entry points, which hopefully invoke import functions in such a way as to produce useful results, or perhaps simply return useful outputs.
Wasm compiler: Converts a high-level language to low-level Wasm. In order to do so, it requires some kind of Application Binary Interface (ABI) to map the high-level language concepts onto the machine. This typically introduces additional execution elements, and it’s important that we distinguish them from the abstract machine’s execution elements. Clang is the only compiler we’ll be discussing in this article, though there are many. During compilation the function indices are yet unknown and so references will need to be patched in by a linker.
Wasm linker: Settles the shape of the Wasm module and links up the functions emitted by the compiler. LLVM comes with wasm-ld, and it goes hand-in-hand with Clang as a compiler.
Language runtime: Unless you’re hand-writing raw Wasm, your high-level language probably has a standard library with operating system interfaces. C standard library, POSIX interfaces, etc. This runtime likely maps onto some standardized set of imports, most likely the aforementioned WASI, which defines a set of POSIX-like functions that Wasm modules may import. Because I think we could do better, as usual around here, in this article we’re going to eschew the language runtime and code directly against raw WASI. You still have easy access to hash tables and dynamic arrays.

A combination of compiler-linker-runtime is conventionally called a toolchain. However, because almost any Clang installation can target Wasm out-of-the-box, and we’re skipping the language runtime, you can compile any of the programs discussed in this article, including my game, with nothing more than Clang (invoking wasm-ld implicitly). If you have a Wasm runtime, which includes your browser, you can run them, too! Though this article will mostly focus on WASI, and you’ll need a WASI-capable runtime to run those examples, which doesn’t include browsers (short of implementing the API with JavaScript).

I wasn’t particularly happy with the Wasm runtimes I tried, so I cannot enthusiastically recommend one. I’d love if I could point to one and say, “Use the same Clang to compile the runtime that you’re using to compile Wasm!” Alas, I had issues compiling, the runtime was buggy, or WASI was incomplete. However, wazero (Go) was the easiest for me to use and it worked well enough, so I will use it in examples:

$ go install github.com/tetratelabs/wazero/cmd/wazero@latest

The Wasm Binary Toolkit (WABT) is good to have on hand when working with Wasm, particularly wasm2wat to inspect Wasm modules, sort of like objdump or readelf. It converts Wasm to the WebAssembly Text Format (WAT).

Learning Wasm I had quite some difficulty finding information. Outside of the Wasm specification, which, despite its length, is merely a narrow slice of the ecosystem, important technical details are scattered all over the place. Some are only available as source code, some are buried in comments on GitHub issues, and some are lost behind dead links as repositories have moved. Large parts of LLVM are undocumented beyond a mention of existence. WASI has no documentation in a web-friendly format (so I have nothing to link from here when I mention its system calls), just some IDL sources in a Git repository. An old wasi.h was the most readable, complete source of truth I could find.

Fortunately Wasm is old enough that LLMs are well-versed in it, and simply asking questions, or for usage examples, was more effective than searching online. If you’re stumped on how to achieve something in the Wasm ecosystem, try asking a state-of-the-art LLM for help.

Example programs

Let’s go over concrete examples to lay some foundations. Consider this simple C function:

float norm(float x, float y)
{
    return x*x + y*y;
}

To compile to Wasm (32-bit) with Clang, we use the --target=wasm32 option:

$ clang -c --target=wasm32 -O example.c

The object file example.o is in Wasm format, so WABT can examine it. Here’s the output of wasm2wat -f, where -f produces output in the “folded” format, which is how I prefer to read it.

(module
  (type (;0;) (func (param f32 f32) (result f32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $norm (type 0) (param f32 f32) (result f32)
    (f32.add
      (f32.mul
        (local.get 0)
        (local.get 0))
      (f32.mul
        (local.get 1)
        (local.get 1)))))

We can see the ABI taking shape: Clang has predictably mapped float into f32. It similarly maps char, short, int and long onto i32. In 64-bit Wasm, the Clang ABI is LP64 and maps long onto i64. There’s also a $norm function which takes two f32 parameters and returns an f32.

Getting a little more complex:

__attribute((import_name("f")))
void f(int *);

__attribute((export_name("example")))
void example(int x)
{
    f(&x);
}

The import_name function attribute indicates the module will not define it, even in another translation unit, and that it intends to import it. That is, wasm-ld will place it in the import table. The export_name function attribute indicates it’s an entry point, and so wasm-ld will list it in the export table. Linking it will make things a little clearer:

$ clang --target=wasm32 -nostdlib -Wl,--no-entry -O example.c

The -nostdlib is because we won’t be using a language runtime, and --no-entry to tell the linker not to implicitly export a function (default: _start) as an entry point. You might think this is connected with the Wasm start function, but wasm-ld does not support the start section at all! We’ll have use for an entry point later. The folded WAT:

(module $a.out
  (type (;0;) (func (param i32)))
  (import "env" "f" (func $f (type 0)))
  (func $example (type 0) (param i32)
    (local i32)
    (global.set $__stack_pointer
      (local.tee 1
        (i32.sub
          (global.get $__stack_pointer)
          (i32.const 16))))
    (i32.store offset=12
      (local.get 1)
      (local.get 0))
    (call $f
      (i32.add
        (local.get 1)
        (i32.const 12)))
    (global.set $__stack_pointer
      (i32.add
        (local.get 1)
        (i32.const 16))))
  (table (;0;) 1 1 funcref)
  (memory (;0;) 2)
  (global $__stack_pointer (mut i32) (i32.const 66560))
  (export "memory" (memory 0))
  (export "example" (func $example)))

There’s a lot to unfold:

Pointers were mapped onto i32. Pointers are a high-level concept, and linear memory is addressed by an integral offset. This is typical of assembly after all.
There’s now a __stack_pointer, which is part of the Clang ABI, not Wasm. The Wasm abstract machine is a stack machine, but that stack doesn’t exist in linear memory. So you cannot take the address of values on the Wasm stack. There are lots of things C needs from a stack that Wasm doesn’t provide. So, in addition to the Wasm stack, Clang maintains another downward-growing stack in linear memory for these purposes, and the __stack_pointer global is the stack register of its ABI. We can see it’s allocated something like 64kB for the stack. (It’s a little more because program data is placed below the stack.)
It should be mostly readable without knowing Wasm: The function subtracts a 16-byte stack frame, stores a copy of the argument in it, then uses its memory offset for the first parameter to the import f. Why 16 bytes when it only needs 4? Because the stack is kept 16-byte aligned. Before returning, the function restores the stack pointer.
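The 16-byte rounding is easy to sketch in C. This only illustrates the arithmetic, not Clang's actual frame layout logic, and frame_size is a hypothetical helper:

```c
#include <stddef.h>

// Hypothetical helper: round a frame's byte count up to the next
// multiple of 16, matching the alignment seen in the WAT above.
static size_t frame_size(size_t nbytes)
{
    return (nbytes + 15) & ~(size_t)15;
}
```

So a frame holding one 4-byte int still occupies 16 bytes, which is the constant subtracted from __stack_pointer in the listing.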

As mentioned earlier, address zero is valid as far as the Wasm runtime is concerned, though dereferences are still undefined in C. This makes it more difficult to catch bugs. Given a null pointer this function would most likely read a zero at address zero and the program keeps running:

int get(int *p)
{
    return *p;
}

In WAT:

(func $get (type 0) (param i32) (result i32)
  (i32.load
    (local.get 0)))

Since the “hardware” won’t fault for us, ask Clang to do it instead:

$ clang ... -fsanitize=undefined -fsanitize-trap ...

Now in WAT:

(module
  (type (;0;) (func (param i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $get (type 0) (param i32) (result i32)
    (block  ;; label = @1
      (block  ;; label = @2
        (br_if 0 (;@2;)
          (i32.eqz
            (local.get 0)))
        (br_if 1 (;@1;)
          (i32.eqz
            (i32.and
              (local.get 0)
              (i32.const 3)))))
      (unreachable))
    (i32.load
      (local.get 0))))

Given a null pointer, get executes the unreachable instruction, causing the runtime to trap. In practice this is unrecoverable. Consider: nothing will restore __stack_pointer, and so the stack will “leak” the existing frames. (This can be worked around by exporting __stack_pointer and __stack_high via the --export linker flag, then restoring the stack pointer in the runtime after traps.)
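For comparison, the checks in that listing can be written by hand. A minimal sketch, where get_checked is a hypothetical name and __builtin_trap is the GCC/Clang built-in:

```c
#include <stdint.h>

// Hand-written counterpart of the instrumented get(): trap on a null
// or misaligned pointer instead of loading through it.
int get_checked(int *p)
{
    if (!p || ((uintptr_t)p & (_Alignof(int) - 1))) {
        __builtin_trap();
    }
    return *p;
}
```

On wasm32, where int is 4-byte aligned, the mask works out to the same 3 that appears in the WAT.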

Wasm was extended with bulk memory operations, and so there are single instructions for memset and memmove, which Clang maps onto the built-ins:

void clear(void *buf, long len)
{
    __builtin_memset(buf, 0, len);
}

(Below LLVM 20 you will need the undocumented -mbulk-memory option.) In WAT we see this as memory.fill:

(module
  (type (;0;) (func (param i32 i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $clear (type 0) (param i32 i32)
    (block  ;; label = @1
      (br_if 0 (;@1;)
        (i32.eqz
          (local.get 1)))
      (memory.fill
        (local.get 0)
        (i32.const 0)
        (local.get 1)))))

That’s great! I wish this worked so well outside of Wasm. It’s one reason w64devkit has -lmemory, after all. Similarly __builtin_trap() maps onto the unreachable instruction, so we can reliably generate those as well.

What about structures? They’re passed by address. Parameter structures go on the stack, then their addresses are passed. To return a structure, a function accepts an implicit out parameter in which to write the return. This isn’t unusual, except that it’s challenging to manage across module boundaries, i.e. in imports and exports, because caller and callee are in different address spaces. It’s especially tricky to return a structure from an export, as the caller must somehow allocate space in the callee’s address space for the result. The multi-value extension solves this, but using it in C involves an ABI change, which is still experimental.
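The lowering can be imitated in plain C. In this sketch Vec2, add, and add_lowered are made-up names, and the second function only approximates what the compiler arranges behind the scenes:

```c
typedef struct {
    float x, y;
} Vec2;

// As written in C: structures passed and returned by value.
Vec2 add(Vec2 a, Vec2 b)
{
    Vec2 r = {a.x + b.x, a.y + b.y};
    return r;
}

// Roughly what the ABI does: each structure travels as an address,
// and the result is written through an extra out parameter.
void add_lowered(Vec2 *ret, const Vec2 *a, const Vec2 *b)
{
    ret->x = a->x + b->x;
    ret->y = a->y + b->y;
}
```

Seen this way, the trouble at module boundaries is plain: all three pointers must be offsets into the callee's linear memory.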

Water Sort Game

Something you might not have expected: My water sort game imports no functions! It only exports three functions:

void      game_init(i32 seed);
DrawList *game_render(i32 width, i32 height, i32 mousex, i32 mousey);
void      game_update(i32 input, i32 mousex, i32 mousey, i64 now);

The game uses IMGUI-style rendering. The caller passes in the inputs, and the game returns a kind of display list telling it what to draw. In the SDL version these turn into SDL renderer calls. In the web version, these turn into canvas draws, and “mouse” inputs may be touch events. It plays and feels the same on both platforms. Simple!
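The layout of the display list isn't shown here, so the following is purely a hypothetical sketch of the pattern. Every name and field (DrawOp, DrawList, render_demo) is invented for illustration and is not the game's actual interface:

```c
#include <stdint.h>

typedef struct {
    int32_t  x, y, w, h;
    uint32_t color;  // 0xRRGGBB
} DrawOp;

typedef struct {
    int32_t len;
    DrawOp  ops[64];
} DrawList;

// Hypothetical renderer: describes what to draw, but draws nothing.
DrawList *render_demo(int32_t width, int32_t height)
{
    static DrawList dl;
    dl.len = 0;
    // Background fill, then a centered rectangle.
    dl.ops[dl.len++] = (DrawOp){0, 0, width, height, 0x202020};
    dl.ops[dl.len++] = (DrawOp){width/4, height/4, width/2, height/2, 0x3366cc};
    return &dl;
}
```

Each platform walks the returned list and issues its own draw calls. Because the list lives in the module's own memory, the caller only ever reads from it, which suits a module with no imports.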

I didn’t realize it at the time, but building the SDL version first was critical to my productivity. Debugging Wasm programs is really dang hard! Wasm tooling has yet to catch up with 1995, let alone 2025. Source-level debugging is still experimental and impractical. Developing applications on the Wasm platform is about as ergonomic as developing in MS-DOS. Instead, develop on a platform much better suited for it, then port your application to Wasm after you’ve got the issues worked out. The less Wasm-specific code you write, the better, even if it means writing more code overall. Treat it as you would some weird embedded target.

The game comes with 10,000 seeds. I generated ~200 million puzzles, sorted them by difficulty, and skimmed the top 10k most challenging. In the game they’re still sorted by increasing difficulty, so it gets harder as you make progress.

Wasm System Interface

WASI allows us to get a little more hands on. Let’s start with a Hello World program. A WASI application exports a traditional _start entry point which returns nothing and takes no arguments. I’m also going to set up some basic typedefs:

typedef unsigned char       u8;
typedef   signed int        i32;
typedef   signed long long  i64;
typedef   signed long       iz;

void _start(void)
{
}

wasm-ld will automatically export this function, so we don’t need an export_name attribute. This program successfully does nothing:

$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm && echo ok
ok

To write output WASI defines fd_write():

typedef struct {
    u8 *buf;
    iz  len;
} IoVec;

#define WASI(s) __attribute((import_module("wasi_unstable"),import_name(s)))
WASI("fd_write")  i32  fd_write(i32, IoVec *, iz, iz *);

Technically those iz variables are supposed to be size_t, passed through Wasm as i32, but this is a foreign function, I know the ABI, and so I can do as I please. I absolutely love that WASI barely uses null-terminated strings, not even for paths, which is a breath of fresh air, but they still marred the API with unsigned sizes. Which I choose to ignore.

This function is shaped like POSIX writev(). I’ve also set it up for import, including a module name. The oldest, most stable version of WASI is called wasi_unstable. (I suppose it shouldn’t be surprising that finding information in this ecosystem is difficult.)

Every returning WASI function returns an errno value, with zero as success rather than some kind of in-band signaling. Hence the final out parameter unlike POSIX writev().

Armed with this function, let’s use it:

void _start(void)
{
    u8    msg[] = "hello world\n";
    IoVec iov   = {msg, sizeof(msg)-1};
    iz    len   = 0;
    fd_write(1, &iov, 1, &len);
}

Then:

$ clang --target=wasm32 -nostdlib -o hello.wasm hello.c
$ wazero run hello.wasm
hello world

Keep going and you’ll have something like printf before long. If the write fails, we should probably communicate the error with at least the exit status. Because _start doesn’t return a status, we need to exit, for which we have proc_exit. It doesn’t return, so no errno return value.

WASI("proc_exit") void proc_exit(i32);

void _start(void)
{
    // ...
    i32 err = fd_write(1, &iov, 1, &len);
    proc_exit(!!err);
}
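As a first step toward the printf-like function mentioned above, here is a minimal sketch of formatting a signed integer without a language runtime. format_i32 is a hypothetical helper, not part of WASI. Its result could be handed to fd_write through an IoVec.

```c
typedef unsigned char u8;
typedef   signed int  i32;
typedef   signed long iz;

// Writes the decimal form of x at the end of buf[0..cap) and returns
// a pointer to its first character, storing the length in *len. The
// caller must supply cap >= 11, enough for "-2147483648".
u8 *format_i32(u8 *buf, iz cap, i32 x, iz *len)
{
    u8 *end = buf + cap;
    u8 *beg = end;
    i32 t = x<0 ? x : -x;  // work in negatives so nothing overflows
    do {
        *--beg = (u8)('0' - t%10);
    } while (t /= 10);
    if (x < 0) {
        *--beg = '-';
    }
    *len = end - beg;
    return beg;
}
```

Digits are produced from the least significant end, so the buffer fills backwards and nothing needs reversing.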

To get the command line arguments, call args_sizes_get to get the size, allocate some memory, then args_get to read the arguments. Same goes for the environment with a similar pair of functions. The sizes do not include a null pointer terminator, which is sensible.

Now that you know how to find and use these functions, you don’t need me to go through each one. However, opening files is a special, complicated case:

WASI("path_open") i32 path_open(i32,i32,u8*,iz,i32,i64,i64,i32,i32*);

That’s 9 parameters — and I had thought Win32 CreateFileW was over the top. It’s even more complex than it looks. It works more like POSIX openat(), except there’s no current working directory and so no AT_FDCWD. Every file and directory is opened relative to another directory, and absolute paths are invalid. If there’s no AT_FDCWD, how does one open the first directory? That’s called a preopen and it’s core to the file system security mechanism of WASI.

The Wasm runtime preopens zero or more directories before starting the program and assigns them the lowest numbered file descriptors starting at file descriptor 3 (after standard input, output, and error). A program intending to use path_open must first traverse the file descriptors, probing for preopens with fd_prestat_get and retrieving their path name with fd_prestat_dir_name. This name may or may not map back onto a real system path, and so this is a kind of virtual file system for the Wasm module. The probe stops on the first error.

To open an absolute path, it must find a matching preopen, then from it construct a path relative to that directory. This part I much dislike, as the module must contain complex path parsing functionality even in the simple case. Opening files is the most complex piece of the whole API.
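The matching step might look something like the following rough sketch. Path and relative_to are hypothetical names. It assumes absolute, slash-separated names with no . or .. components, and a real program would need more care.

```c
typedef signed long iz;

typedef struct {
    const char *data;
    iz          len;
    _Bool       ok;
} Path;

// If path lies under the preopen directory dir, return the remainder
// of path relative to dir with ok set. Otherwise ok is false.
Path relative_to(const char *dir, iz dirlen, const char *path, iz pathlen)
{
    Path r = {0};
    // Ignore trailing slashes on the preopen name, so "/" becomes "".
    for (; dirlen && dir[dirlen-1]=='/'; dirlen--) {}
    if (pathlen < dirlen) {
        return r;
    }
    for (iz i = 0; i < dirlen; i++) {
        if (dir[i] != path[i]) {
            return r;
        }
    }
    // The match must end on a component boundary: "/usr" is not a
    // parent of "/usrlocal".
    if (pathlen>dirlen && path[dirlen]!='/') {
        return r;
    }
    // Drop the separator(s) so that the result is relative.
    iz i = dirlen;
    for (; i<pathlen && path[i]=='/'; i++) {}
    r.data = path + i;
    r.len  = pathlen - i;
    r.ok   = 1;
    return r;
}
```

The program would try each preopen in turn and pass the first successful remainder, along with that preopen's file descriptor, to path_open.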

I mentioned before that program data is below the Clang stack. With the stack growing down, this sounds like a bad idea. A stack overflow quietly clobbers your data, and is difficult to recognize. More sensible to put the stack at the bottom so that it overflows off the bottom of memory and causes a fast fault. Fortunately there’s a switch for that:

$ clang --target=wasm32 ... -Wl,--stack-first ...

This is what you want by default. The actual default layout is left over from an early design flaw in wasm-ld, and it’s an oversight that it has not yet been corrected.

u-config

The above is in action in the u-config Wasm port. You can download the Wasm module, pkg-config.wasm, used in the web demo to run it in your favorite WASI-capable Wasm runtime:

$ wazero run pkg-config.wasm --modversion pkg-config
0.33.3

Though there are no preopens, so it cannot read any files. The -mount option maps real file system paths to preopens. This mounts the entire root file system read-only (ro) as /.

$ wazero run -mount /::ro pkg-config.wasm --cflags sdl2
-I/usr/include/SDL2 -D_REENTRANT

I doubt this is useful for anything, but it was a vehicle for learning and trying Wasm, and the results are pretty neat.

In the next article I discuss allocating the allocator.

Robust Wavefront OBJ model parsing in C

2025-03-02T23:22:58Z

Wavefront OBJ is a line-oriented, text format for 3D geometry. It’s widely supported by modeling software, easy to parse, and trivial to emit, much like Netpbm for 2D image data. Poke around hobby 3D graphics projects and you’re likely to find a bespoke OBJ parser. While typically only loading their own model data, so robustness doesn’t much matter, they usually have hard limitations and don’t stand up to fuzz testing. This article presents a robust, partial OBJ parser in C with no hard-coded limitations, written from scratch. Like similar articles, it’s not really about OBJ but demonstrating some techniques you’ve probably never seen before.

If you’d like to see the ready-to-run full source: objrender.c. All images are screenshots of this program.

First let’s establish the requirements. By robust I mean no undefined behavior for any input, valid or invalid; no out of bounds accesses, no signed overflows. Input is otherwise not validated. Invalid input may load as valid by chance, which will render as either garbage or nothing. The behavior will also not vary by locale.

We’re also only worried about vertices, normals, and triangle faces with normals. In OBJ these are v, vn, and f elements. Normals let us light the model effectively while checking our work. A cube fitting this subset of OBJ might look like:

v  -1.00 -1.00 -1.00
v  -1.00 +1.00 -1.00
v  +1.00 +1.00 -1.00
v  +1.00 -1.00 -1.00
v  -1.00 -1.00 +1.00
v  -1.00 +1.00 +1.00
v  +1.00 +1.00 +1.00
v  +1.00 -1.00 +1.00

vn +1.00  0.00  0.00
vn -1.00  0.00  0.00
vn  0.00 +1.00  0.00
vn  0.00 -1.00  0.00
vn  0.00  0.00 +1.00
vn  0.00  0.00 -1.00

f   3//1  7//1  8//1
f   3//1  8//1  4//1
f   1//2  5//2  6//2
f   1//2  6//2  2//2
f   7//3  3//3  2//3
f   7//3  2//3  6//3
f   4//4  8//4  5//4
f   4//4  5//4  1//4
f   8//5  7//5  6//5
f   8//5  6//5  5//5
f   3//6  4//6  1//6
f   3//6  1//6  2//6

Take note:

Some fields are separated by more than one space.
Vertices and normals are fractional (floating point).
Faces use 1-indexing instead of 0-indexing.
Faces in this model lack a texture index, hence // (empty).

Inputs may have other data, but we’ll skip over it, including face texture indices, or face elements beyond the third. Some of the models I’d like to test have relative indices, so I want to support those, too. A relative index refers backwards from the last vertex, so the order of the lines in an OBJ matters. For example, the cube faces above could have instead been written:

f  -6//-6 -2//-6 -1//-6
f  -6//-6 -1//-6 -5//-6
f  -8//-5 -4//-5 -3//-5
f  -8//-5 -3//-5 -7//-5
f  -2//-4 -6//-4 -7//-4
f  -2//-4 -7//-4 -3//-4
f  -5//-3 -1//-3 -4//-3
f  -5//-3 -4//-3 -8//-3
f  -1//-2 -2//-2 -3//-2
f  -1//-2 -3//-2 -4//-2
f  -6//-1 -5//-1 -8//-1
f  -6//-1 -8//-1 -7//-1

Due to this the parser cannot be blind to line order, and it must handle negative indices. Relative indexing has the nice effect that we can group faces, and those groups are relocatable. We can reorder them without renumbering the faces, or concatenate models just by concatenating their OBJ files.
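Converting a relative index to an absolute one is a single addition. A minimal sketch, where absindex is a hypothetical helper and count is the non-negative number of vertices (or normals) seen so far:

```c
#include <stdint.h>

// Relative index -1 names the most recent element, so it maps to
// count, and -count maps to 1. Positive indices pass through as-is.
int32_t absindex(int32_t index, int32_t count)
{
    return index<0 ? index + 1 + count : index;
}
```

With the cube's 8 vertices and 6 normals, -6//-6 resolves to 3//1, matching the first face of the original listing.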

The fundamentals

To start off, we’ll be using an arena of course, trivializing memory management while swiping aside all hard-coded limits. A quick reminder of the interface:

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

// Always returns an aligned pointer inside the arena. Allocations are
// zeroed. Does not return on OOM (never returns a null pointer).
void *alloc(Arena *, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align);
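Only the interface is given here. One possible implementation, as a sketch: it assumes align is a power of two, size is positive, count is non-negative, and that aborting is an acceptable out-of-memory policy.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Padding needed to bring beg up to the requested alignment.
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    // Dividing instead of multiplying keeps the check overflow-free.
    if (avail < 0 || count > avail/size) {
        abort();  // out of memory: never return a null pointer
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}
```

Allocation is a bump of beg, and freeing everything at once is a matter of discarding or resetting the Arena.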

Also, no null terminated strings, perhaps the main source of problems with bespoke parsers.

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

Pointer arithmetic is error prone, so the tricky stuff is relegated to a handful of functions, each of which can be exhaustively validated almost at a glance:

Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len  = beg ? end-beg : 0;
    return r;
}

_Bool equals(Str a, Str b)
{
    return a.len==b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}

Str trimleft(Str s)
{
    for (; s.len && *s.data<=' '; s.data++, s.len--) {}
    return s;
}

Str trimright(Str s)
{
    for (; s.len && s.data[s.len-1]<=' '; s.len--) {}
    return s;
}

Str substring(Str s, ptrdiff_t i)
{
    if (i) {
        s.data += i;
        s.len  -= i;
    }
    return s;
}

Each avoids the purposeless special cases around null pointers (i.e. zero-initialized Str objects) that would otherwise work out naturally. The space character and all control characters are treated as whitespace for simplicity. When I started writing this parser, I didn’t define all these functions up front. I defined them as needed. (A good standard library would have provided similar definitions out-of-the-box.) If you’re worried about misuse, add the appropriate assertions.

A powerful and useful string function I’ve discovered, and which I use in every string-heavy program, is cut, a concept I shamelessly stole from the Go standard library:

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;  // null pointer special case
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok   = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

It slices, it dices, it juliennes! Need to iterate over lines? Cut it up:

    Cut c = {0};
    c.tail = input;
    while (c.tail.len) {
        c = cut(c.tail, '\n');
        Str line = c.head;
        // ... process line ...
    }

Need to iterate over the fields in a line? Cut the line on the field separator. Then cut the field on the element separator. No allocation, no mutation (strtok).
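To make the two-level cut concrete, here is a small sketch that fetches the i-th whitespace-separated field of a line. The definitions from above are repeated so that it stands alone, and field is a hypothetical helper.

```c
#include <stddef.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

typedef struct {
    Str   head;
    Str   tail;
    _Bool ok;
} Cut;

static Str span(char *beg, char *end)
{
    Str r = {0};
    r.data = beg;
    r.len  = beg ? end-beg : 0;
    return r;
}

static Str trimleft(Str s)
{
    for (; s.len && *s.data<=' '; s.data++, s.len--) {}
    return s;
}

static Cut cut(Str s, char c)
{
    Cut r = {0};
    if (!s.len) return r;
    char *beg = s.data;
    char *end = s.data + s.len;
    char *cut = beg;
    for (; cut<end && *cut!=c; cut++) {}
    r.ok   = cut < end;
    r.head = span(beg, cut);
    r.tail = span(cut+r.ok, end);
    return r;
}

// Returns field i (0-based) of line, or an empty Str if the line has
// too few fields. Runs of spaces count as one separator.
Str field(Str line, ptrdiff_t i)
{
    Cut c = {0};
    c.tail = line;
    for (ptrdiff_t n = 0; n <= i; n++) {
        c = cut(trimleft(c.tail), ' ');
    }
    return c.head;
}
```

Given the line "f  3//1  7//1  8//1", field 2 is "7//1". Cutting that on '/' yields "7", then an empty texture slot, then "1".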

Reading input

Unlike a program designed to process arbitrarily large inputs, the intention here is to load the entire model into memory. We don’t need to fiddle around with loading a line of input at a time (fgets, getline, etc.), the usual approach with OBJ parsers. If the OBJ source cannot fit in memory, then the model won’t fit in memory. This greatly simplifies the parser, not to mention making it faster, while lifting hard-coded limits like maximum line length.

The simple arena I use makes whole-file loading so easy. Read straight into the arena without checking the file size (ftell, etc.), which means streaming inputs (i.e. pipes) work automatically.

Str loadfile(Arena *a, FILE *f)
{
    Str r  = {0};
    r.data = a->beg;
    r.len  = a->end - a->beg;
    r.len  = fread(r.data, 1, r.len, f);
    return r;
}

Without buffered input, you may need a loop around the read:

Str loadfile(Arena *a, int fd)
{
    Str r = {0};
    r.data = a->beg;
    ptrdiff_t cap = a->end - a->beg;
    for (;;) {
        ptrdiff_t n = read(fd, r.data+r.len, cap-r.len);
        if (n < 1) {
            return r;  // ignoring read errors
        }
        r.len += n;
    }
}

You might consider triggering an out-of-memory error if the arena was filled to the brim, which almost certainly means the input was truncated. Though that’s likely to happen anyway because the next allocation from that arena will fail.

Side note: When using a multi GB arena, issuing such huge read requests stress tests the underlying IO system. I’ve found libc bugs this way. In this case I used SDL2 for the demo, and SDL lost the ability to read files after I increased the arena size to 4GB in order to test a gigantic model (“Power Plant”). I’ve run into this before, and I assumed it was another Microsoft CRT bug. After investigating deeper for this article, I learned it’s an ancient SDL bug that’s made it all the way into SDL3. -Wconversion warns about it, but was accidentally squelched in the 64-bit port back in 2009. It seems nobody else loads files this way, so watch out for platform bugs if you use this technique!

Parsing data

In practice, rendering systems limit counts to the 32-bit range, which is reasonable. So in the OBJ parser, vertex and normal indices will be 32-bit integers. Negatives will be needed for at least relative indexing. Parsing from a Str means null-terminated functions like strtol are off limits. So here’s a function to parse a signed integer out of a Str:

int32_t parseint(Str s)
{
    uint32_t r    = 0;
    int32_t  sign = 1;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+':            break;
        case '-': sign = -1; break;
        default : r = 10*r + s.data[i] - '0';
        }
    }
    return r * sign;
}

The uint32_t means it’s free to overflow. If it overflows, the input was invalid. If it doesn’t hold an integer, the input was invalid. In either case it will read a harmless, garbage result. Despite being unsigned, it works just fine with negative inputs thanks to two’s complement.
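A few sample inputs make this behavior concrete. The block repeats parseint so that it stands alone. parseint_check is a hypothetical helper, and the garbage value for "1x" assumes ASCII.

```c
#include <stddef.h>
#include <stdint.h>

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

int32_t parseint(Str s)
{
    uint32_t r    = 0;
    int32_t  sign = 1;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+':            break;
        case '-': sign = -1; break;
        default : r = 10*r + s.data[i] - '0';
        }
    }
    return r * sign;
}

// Returns 1 if parseint behaves as described on a few inputs.
int parseint_check(void)
{
    return parseint(S("42"))         ==  42 &&  // plain
           parseint(S("+7"))         ==   7 &&  // explicit sign
           parseint(S("-12"))        == -12 &&  // negative
           parseint(S(""))           ==   0 &&  // empty field
           parseint(S("1x"))         ==  82 &&  // garbage in, garbage out
           parseint(S("4294967297")) ==   1;    // wraps modulo 2^32
}
```

None of the invalid inputs needs a special case: they produce some integer and the parser moves on.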

For floats I didn’t intend to parse exponential notation, but some models I wanted to test actually did use it — probably by accident — so I added it anyway. That requires a function to compute the exponent.

float expt10(int32_t e)
{
    float   y = 1.0f;
    float   x = e<0 ? 0.1f : e>0 ? 10.0f : 1.0f;
    int32_t n = e<0 ? e : -e;
    for (; n < -1; n /= 2) {
        y *= n%2 ? x : 1.0f;
        x *= x;
    }
    return x * y;
}

That’s exponentiation by squaring, avoiding signed overflow on the exponent. Traditionally a negative exponent is inverted, but applying unary - to an arbitrary integer might overflow (consider -2147483648). So instead I iterate from the negative end. The negative range is larger than the positive, after all. Finally we can parse floats:

float parsefloat(Str s)
{
    float r    = 0.0f;
    float sign = 1.0f;
    float exp  = 0.0f;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        switch (s.data[i]) {
        case '+':            break;
        case '-': sign = -1; break;
        case '.': exp  =  1; break;
        case 'E':
        case 'e': exp  = exp ? exp : 1.0f;
                  exp *= expt10(parseint(substring(s, i+1)));
                  i    = s.len;
                  break;
        default : r = 10.0f*r + (s.data[i] - '0');
                  exp *= 0.1f;
        }
    }
    return sign * r * (exp ? exp : 1.0f);
}

Probably not as precise as strtof, but good enough for loading a model. It’s also ~30% faster for this purpose than my system’s strtof. If it hits an exponent, it combines parseint and expt10 to augment the result so far. At least for all the models I tried, the exponent only appeared for tiny values. They round to zero with no visible effects, so you can cut the implementation by more than half in one fell swoop if you wish (no more expt10 nor substring either):

        switch (s.data[i]) {
        // ...
        case 'E':
        case 'e': return 0;  // probably small *shrug*
        // ...
        }

Why not strtof? That has the rather annoying requirement that input is null terminated, which is not the case here. Worse, it’s affected by the locale and doesn’t behave consistently nor reliably.

A vertex is three floats separated by whitespace. So combine cut and parsefloat to parse one.

typedef struct {
    float v[3];
} Vert;

Vert parsevert(Str s)
{
    Vert r = {0};
    Cut c = cut(trimleft(s), ' ');
    r.v[0] = parsefloat(c.head);
    c = cut(trimleft(c.tail), ' ');
    r.v[1] = parsefloat(c.head);
    c = cut(trimleft(c.tail), ' ');
    r.v[2] = parsefloat(c.head);
    return r;
}

cut parses a field between every space, including empty fields between adjacent spaces, so trimleft discards extra space before cutting. If the line ends early, this passes empty strings into parsefloat which come out as zeros. No special checks required for invalid input.

Faces are a set of three vertex indices and three normal indices, and parse almost the same way. Relative indices are immediately converted to absolute indices using the number of vertices/normals so far.

typedef struct {
    int32_t v[3];
    int32_t n[3];
} Face;

static Face parseface(Str s, ptrdiff_t nverts, ptrdiff_t nnorms)
{
    Face r      = {0};
    Cut  fields = {0};
    fields.tail = s;
    for (int i = 0; i < 3; i++) {
        fields = cut(trimleft(fields.tail), ' ');
        Cut elem = cut(fields.head, '/');
        r.v[i] = parseint(elem.head);
        elem = cut(elem.tail, '/');  // skip texture
        elem = cut(elem.tail, '/');
        r.n[i] = parseint(elem.head);

        // Process relative subscripts
        if (r.v[i] < 0) {
            r.v[i] = (int32_t)(r.v[i] + 1 + nverts);
        }
        if (r.n[i] < 0) {
            r.n[i] = (int32_t)(r.n[i] + 1 + nnorms);
        }
    }
    return r;
}

Since nverts must be non-negative, and a relative index is negative by definition, adding them together can never overflow. If there are too many vertices, the result might be truncated, as indicated by the cast. That’s fine. Just invalid input.

There’s an interesting interview question here: Consider this alternative to the above, maintaining the explicit cast to dismiss the -Wconversion warning.

            r.v[i] += (int32_t)(1 + nverts);

Is it equivalent? Can this overflow? (Answers: No and yes.) If yes, under what conditions? Unfortunately a fuzz test would never hit it.
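One way to explore the question without invoking undefined behavior is to model both steps in 64-bit arithmetic. The following is only an illustrative sketch, not part of the parser: wrap32 and alt_overflows are hypothetical helpers, int64_t stands in for a 64-bit ptrdiff_t, and the cast is assumed to truncate in the usual two's complement fashion.

```c
#include <assert.h>
#include <stdint.h>

// Model the (int32_t) cast as two's complement truncation, computed
// with unsigned arithmetic so that the model itself is fully defined.
static int64_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)(uint64_t)x;  // reduce modulo 2^32
    return u < 0x80000000u ? (int64_t)u : (int64_t)u - 0x100000000;
}

// Returns 1 if "rel += (int32_t)(1 + nverts)" would overflow the
// 32-bit addition for the relative index rel, otherwise 0.
int alt_overflows(int32_t rel, int64_t nverts)
{
    assert(rel < 0 && nverts >= 0 && nverts < INT64_MAX);
    int64_t addend = wrap32(1 + nverts);  // the truncated right-hand side
    int64_t sum    = rel + addend;        // the addition, done in 64 bits
    return sum < INT32_MIN || sum > INT32_MAX;
}
```

Under these assumptions, alt_overflows(-1, 10) reports no overflow, while alt_overflows(-1, INT32_MAX) does: 1 + nverts truncates to INT32_MIN, and adding a negative index pushes the sum out of range. Getting there takes more than two billion vertex lines, which is far beyond what a fuzzer will generate.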

Putting it together

For this case, a model is three arrays of vertices, normals, and indices. While faces only support 32-bit indexing, I use ptrdiff_t in order to skip overflow checks. There cannot possibly be more vertices than bytes of source, so these counts cannot overflow.

typedef struct {
    Vert     *verts;
    ptrdiff_t nverts;
    Vert     *norms;
    ptrdiff_t nnorms;
    Face     *faces;
    ptrdiff_t nfaces;
} Model;

Model parseobj(Arena *, Str);

They’d probably look a little nicer as dynamic arrays, but we won’t need that machinery. That’s because the parser makes two passes over the OBJ source, the first time to count:

    Model m     = {0};
    Cut   lines = {0};

    lines.tail = obj;
    while (lines.tail.len) {
        lines = cut(lines.tail, '\n');
        Cut fields = cut(trimright(lines.head), ' ');
        Str kind = fields.head;
        if (equals(S("v"), kind)) {
            m.nverts++;
        } else if (equals(S("vn"), kind)) {
            m.nnorms++;
        } else if (equals(S("f"), kind)) {
            m.nfaces++;
        }
    }

It’s a lightweight pass, skipping over the numeric data. With that information collected, we can allocate the model:

    m.verts  = new(a, m.nverts, Vert);
    m.norms  = new(a, m.nnorms, Vert);
    m.faces  = new(a, m.nfaces, Face);
    m.nverts = m.nnorms = m.nfaces = 0;

On the next pass we call parsevert and parseface to fill it out.

    lines.tail = obj;
    while (lines.tail.len) {
        lines = cut(lines.tail, '\n');
        Cut fields = cut(trimright(lines.head), ' ');
        Str kind = fields.head;
        if (equals(S("v"), kind)) {
            m.verts[m.nverts++] = parsevert(fields.tail);
        } else if (equals(S("vn"), kind)) {
            m.norms[m.nnorms++] = parsevert(fields.tail);
        } else if (equals(S("f"), kind)) {
            m.faces[m.nfaces++] = parseface(fields.tail, m.nverts, m.nnorms);
        }
    }

At this point the model is parsed, though it’s not necessarily consistent. Face indices may still be out of range. The next step is to transform it into a more useful representation.

Transformation

Rendering the model is the easiest way to verify it came out alright, and it’s generally useful for debugging problems. Because it basically does all the hard work for us, and doesn’t require ridiculous contortions to access, I’m going to render with old school OpenGL 1.1. It provides a glInterleavedArrays function with a bunch of predefined formats. The one that interests me is GL_N3F_V3F, where each vertex is a normal and a position. Each face is three such elements. I came up with this:

typedef struct {  // GL_N3F_V3F
    Vert n, v;
} N3FV3F[3];

typedef struct {
    N3FV3F   *data;
    ptrdiff_t len;
} N3FV3Fs;

// Transform a model into a GL_N3F_V3F representation.
N3FV3Fs n3fv3fize(Arena *, Model);

If you’re being precise you’d use GLfloat, but this is good enough for me. By using a different arena for this step, we can discard the OBJ data once it’s in the “local” format. For example:

    Arena perm    = {...};
    Arena scratch = {...};

    N3FV3Fs *scene = new(&perm, nmodels, N3FV3Fs);
    for (int i = 0; i < nmodels; i++) {
        Arena temp  = scratch;  // free OBJ at end of iteration
        Str   obj   = loadfile(&temp, path[i]);
        Model model = parseobj(&temp, obj);
        scene[i]    = n3fv3fize(&perm, model);
    }

The conversion allocates the GL_N3F_V3F array, discards invalid faces, and copies the valid faces into the array:

N3FV3Fs n3fv3fize(Arena *a, Model m)
{
    N3FV3Fs r = {0};
    r.data = new(a, m.nfaces, N3FV3F);
    for (ptrdiff_t f = 0; f < m.nfaces; f++) {
        _Bool valid = 1;
        for (int i = 0; i < 3; i++) {
            valid &= m.faces[f].v[i]>0 && m.faces[f].v[i]<=m.nverts;
            valid &= m.faces[f].n[i]>0 && m.faces[f].n[i]<=m.nnorms;
        }

        if (valid) {
            ptrdiff_t t = r.len++;
            for (int i = 0; i < 3; i++) {
                r.data[t][i].n = m.norms[m.faces[f].n[i]-1];
                r.data[t][i].v = m.verts[m.faces[f].v[i]-1];
            }
        }
    }
    return r;
}

Here’s what that looks like in OpenGL with suzanne.obj and bmw.obj:

This was a fun little project, and perhaps you learned a new technique or two after checking it out.

Tips for more effective fuzz testing with AFL++

2025-02-05T18:03:55Z

Fuzz testing is incredibly effective for mechanically discovering software defects, yet remains underused and neglected. Pick any program that must gracefully accept complex input, written in any language, which has not yet been fuzzed, and fuzz testing usually reveals at least one bug. At least one program currently installed on your own computer certainly qualifies. Perhaps even most of them. Everything is broken and low-hanging fruit is everywhere. After fuzz testing ~1,000 projects over the past six years, I’ve accumulated tips for picking that fruit. The checklist format has worked well in the past (1, 2), so I’ll use it again. This article discusses AFL++ on source-available C and C++ targets, running on glibc-based Linux distributions, currently the indisputable best fuzzing platform for C and C++.

My tips complement the official, upstream documentation, so consult it, too:

Performance Tips on the AFL++ website
Technical “whitepaper” for afl-fuzz

Even if a program has been fuzz tested, applying the techniques in this article may reveal defects missed by previous fuzz testing.

(1) Configure sanitizers and assertions

More assertions means more effective fuzzing, and sanitizers are a kind of automatically-inserted assertions. By default, fuzz with both Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan):

$ afl-gcc-fast -g3 -fsanitize=address,undefined ...

ASan’s default configuration is not ideal, and should be adjusted via the ASAN_OPTIONS environment variable. If customized at all, AFL++ requires at least these options:

export ASAN_OPTIONS="abort_on_error=1:halt_on_error=1:symbolize=0"

Except symbolize=0, this ought to be the ASan default. When debugging a discovered crash, you’ll want UBSan set up the same way so that it behaves under a debugger. To improve fuzzing, make ASan even more sensitive to defects by detecting use-after-return bugs. It slows fuzzing slightly, but it’s well worth the cost:

ASAN_OPTIONS+=":detect_stack_use_after_return=1"

By default ASan fills the first 4KiB of fresh allocations with a pattern, to help detect use-after-free bugs. That’s not nearly enough for fuzzing. Crank it up to completely fill virtually all allocations with a pattern:

ASAN_OPTIONS+=":max_malloc_fill_size=$((1<<30))"

In the default configuration, if a program allocates more than 4KiB with malloc then, say, uses strlen on the uninitialized memory, no bug will be detected. There’s almost certainly a zero somewhere after 4KiB. Until I noticed it, the 4KiB limit hid a number of bugs from my fuzz testing. Per (4), fully filling allocations with a pattern better isolates tests when using persistent mode.

When fuzzing C++ and linking GCC’s libstdc++, consider -D_GLIBCXX_DEBUG. ASan cannot “see” out-of-bounds accesses within a container’s capacity, and the extra assertions fill in the gaps. Mind that it changes the ABI, though fuzz testing will instantly highlight such mismatches.

(2) Prefer the persistent mode

While AFL++ can fuzz many programs in-place without writing a single line of code (afl-gcc, afl-clang), prefer AFL++’s persistent mode (afl-gcc-fast, afl-clang-fast). It’s typically an order of magnitude faster and worth the effort. Though it also has pitfalls (see (4), (5)). I keep a file on hand, fuzztmpl.c — the progenitor of all my fuzz testers:

#include <unistd.h>

__AFL_FUZZ_INIT();

int main(void)
{
    __AFL_INIT();
    char *src = 0;
    unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;
    while (__AFL_LOOP(10000)) {
        int len = __AFL_FUZZ_TESTCASE_LEN;
        src = realloc(src, len);
        memcpy(src, buf, len);
        // ... send src to target ...
    }
}

I :r this into my Vim buffer, then modify as needed. It’s a stripped and improved version of the official template, which itself has a serious flaw (see (5)). There are unstated constraints about the position of buf and len in the code, so if in doubt, refer to the original template.

(3) Include source files, not header files

We’re well into the 21st century. Nobody is compiling software on 16-bit machines anymore. Don’t get hung up on the one translation unit (TU) per source file mindset. When fuzz testing, we need at most two TUs: One TU for instrumented code and one TU for uninstrumented code. In most cases the latter takes the form of a library (libc, libstdc++, etc.) and we don’t need to think about it.

Fuzz testing typically requires only a subset of the program. Including just those sources straight in the template is both effective and simple. In my template I put includes just above unistd.h so that the header isn’t visible to the sources unless they include it themselves.

#include "src/utils.c"
#include "src/parser.c"
#include <unistd.h>

I know, if you’ve never seen this before it looks bonkers. This isn’t what they taught you in college. Trust me, this simple technique will save you a thousand lines of build configuration. Otherwise you’ll need to manage different object files between fuzz testing and otherwise.

Perhaps more importantly, you can now fuzz test any arbitrary function in the program, including static functions! They’re all right there in the same TU. You’re not limited to public-facing interfaces. Perhaps you can skip (7) and test against a better internal interface. It also gives you direct access to static variables so that you can clear/reset them between tests, per (4).

Programs are often not designed for fuzz testing, or testing generally, and it may be difficult to tease apart tightly-coupled components. Many of the programs I’ve fuzz tested look like this. This technique lets you take a hacksaw to the program and substitute troublesome symbols just for fuzz testing without modifying a single original source line. For example, if the source I’m testing contains a main function, I can remove it:

#define main oldmain
#  include "src/utils.c"
#  include "src/parser.c"
#undef main
#include <unistd.h>

Sure, better to improve the program so that such hacks are unnecessary, but in most cases I’m fuzz testing as part of a drive-by review of some open source project. It allows me to quickly discover defects in the original, unmodified program, and produces simpler bug reports like, “Compile with ASan, open this 50-byte file, and then the program will crash.”

(4) Isolate fuzz tests from each other

Tests should be unaffected by previous tests. This is challenging in persistent mode, sometimes even impractical. That means resetting all global state, even something like the internal strtok buffer if that function is used. Add fuzz testing to your list of reasons to eschew global variables.
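As a minimal sketch of what resetting looks like, suppose a target keeps a running count in a static variable and tokenizes with strtok. Both count_tokens and reset_globals are hypothetical names invented for this illustration. The harness would call the reset at the top of each persistent-mode iteration:

```c
#include <assert.h>
#include <string.h>

// Hypothetical target with hidden global state: a running token count.
static int ntokens;

int count_tokens(char *s)
{
    for (char *t = strtok(s, " "); t; t = strtok(0, " ")) {
        ntokens++;
    }
    return ntokens;
}

// Hypothetical harness helper: restore every global to its initial
// state so that one test cannot influence the next.
void reset_globals(void)
{
    static char empty[] = "";
    ntokens = 0;
    char *t = strtok(empty, " ");  // re-point strtok's saved position
    assert(!t);                    // an empty string has no tokens
}
```

Without the reset, the second input would report the sum of both tests, and a crash depending on that count would only reproduce with the exact sequence of inputs that preceded it.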

It’s mitigated by (1), but otherwise uninitialized heap memory may hold contents from previous tests, breaking isolation. Besides interference with fuzzing instrumentation, bugs found this way are wickedly difficult to reproduce.

Don’t pass uninitialized memory into a test, e.g. an output parameter allocated on the stack. Zero-initialize or fill it with a pattern. If it accepts an arena, fill it with a pattern before each test.
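A pattern fill takes only a few lines. This sketch assumes a two-pointer arena of the shape used in my other articles (beg and end); the poison helper is a hypothetical name, and 0xa5 is an arbitrary non-zero byte:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

// Fill the arena's free space with a non-zero pattern so that stale
// data from an earlier test cannot masquerade as valid input.
void poison(Arena a)
{
    assert(a.beg <= a.end);
    memset(a.beg, 0xa5, (size_t)(a.end - a.beg));
}
```

Call it on the arena just before handing it to the target in each iteration. The same memset works on an output parameter allocated on the stack.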

Typically you have little control over heap addresses, which likely vary across tests and depend on the behavior of previous tests. If the program depends on address values, this may affect the results and make reproduction difficult, so watch for that.

(5) Do not test directly on the fuzz test buffer

Passing buf and len straight into the target is the most common mistake, especially when fuzzing better-designed C programs, and particularly because the official template encourages it.

    myprogram(buf, len);  // BAD!

While it’s a great sign the program doesn’t depend on null termination, it creates a subtle trap. The underlying buffer allocated by AFL++ is larger than len, and ASan will not detect read overflows on inputs! Instead pass a copy sized to fit, which is the purpose of src in my template. Adjust the type of src as needed.

If the program expects null-terminated input then you’ll need to do this anyway in order to append the null byte. If it accepts an “owning” type like std::string, then it’s also already done on your behalf. With “non-owning” views like std::string_view you’ll still want your own size-fit copy.

If you see a program’s checked in fuzz test using buf directly, make this change and see if anything new pops out. It’s worked for me on a number of occasions.

(6) Don’t bother freeing memory

In general, avoid doing work irrelevant to the fuzz test. The official tips say to “use a simpler target” and “instrument just what you need,” and keeping destructors out of the tests helps in both cases. Unless the program is especially memory-hungry, you won’t run out of memory before AFL++ resets the target process.

If not for (1), it also helps with isolation (4), as different tests are less likely contaminated with uninitialized memory from previous tests.

As an exception, if you want your destructor included in the fuzz test, then use it in the test. Also, it’s easy to exhaust non-memory resources, particularly file descriptors, and you may need to clean those up in order to fuzz test reliably.

Of course, if the target uses arena allocation then none of this matters! It also makes for perfect isolation, as even addresses won’t vary between tests.

(7) Use a memory file descriptor to back named paths

Many interfaces are, shall we say, not so well-designed and only accept input from a named file system path, insisting on opening and reading the file themselves. Testing such interfaces presents challenges, especially if you’re interested in parallel fuzzing. Fortunately there’s usually an easy out: Create a memory file descriptor and use its /proc name.

int fd = memfd_create("fuzz", 0);
assert(fd == 3);
while (...) {
    // ...
    ftruncate(fd, 0);
    pwrite(fd, buf, len, 0);
    myprogram("/proc/self/fd/3");
}

With standard input as 0, output as 1, and error as 2, I’ve assumed the memory file descriptor will land on 3, which makes the test code a little simpler. If it’s not 3 then something’s probably gone wrong anyway, and aborting is the best option. If you don’t want to assume, use snprintf or whatever to construct the path name from fd.
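For the non-assuming route, something like the following would do. It is a sketch with a hypothetical helper name, fdpath, and it keeps the abort-if-something-is-wrong policy by asserting that the path fits:

```c
#include <assert.h>
#include <stdio.h>

// Write the /proc path for fd into buf and return buf.
char *fdpath(char *buf, size_t cap, int fd)
{
    int n = snprintf(buf, cap, "/proc/self/fd/%d", fd);
    assert(n > 0 && (size_t)n < cap);  // abort rather than truncate
    return buf;
}
```

A 32-byte buffer is plenty for any int, so the call becomes myprogram(fdpath(path, sizeof(path), fd)) with char path[32] declared outside the loop.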

Using pwrite (instead of write) leaves the file description offset at the beginning of the file.

Thanks to the memory file descriptor, fuzz test data doesn’t land in permanent storage, so less wear and tear on your SSD from the occasional flush. Because of /proc, the file is unique to the process despite the common path name, so no problems parallel fuzzing. No cleanup needed, either.

If the program wants a file descriptor — i.e. it wants a socket because you’re fuzzing some internal function — pass the file descriptor directly:

    myprogram(fd);

If it accepts a FILE *, you could fopen the /proc path, but better to use fmemopen to create a FILE * on the object:

    myprogram(fmemopen(buf, len, "rb"));

Note how, per (6), we don’t need to bother with fclose because it’s not associated with a file descriptor.
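Here is a small, self-contained sketch of that arrangement using POSIX fmemopen. The target count_bytes and the wrapper fuzz_one are hypothetical stand-ins, and src is the size-fit copy from (5):

```c
#define _POSIX_C_SOURCE 200809L  // request the fmemopen declaration
#include <assert.h>
#include <stdio.h>

// Hypothetical target that only accepts a FILE *: counts input bytes.
long count_bytes(FILE *f)
{
    long n = 0;
    while (fgetc(f) != EOF) {
        n++;
    }
    return n;
}

// One fuzz iteration: wrap the size-fit copy in a read-only stream.
long fuzz_one(char *src, size_t len)
{
    FILE *f = fmemopen(src, len, "rb");
    assert(f);              // some C libraries reject len == 0
    return count_bytes(f);  // per (6), no fclose
}
```

If the C library in use refuses zero-length buffers, skip empty inputs in the harness rather than letting the assertion register as a crash.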

(8) Configure the target for smaller buffers

A common sight in diseased programs is “generous” fixed buffer sizes:

#define MY_MAX_BUFFER_LENGTH 65536

void example(...)
{
    char path[PATH_MAX];  // typically 4,096
    char buf[MY_MAX_BUFFER_LENGTH];
    // ...
}

These huge buffers tend to hide bugs. Turn those stones over! It takes a lot of fuzzing time to max them out and excite the unhappy paths — or the super-unhappy paths, overflows. Better if the fuzz test can reach worst case conditions quickly and explore the execution paths out of it.

So when you see these, cut them way down, possibly using (3). Change 65536 to, say, 16 and see what happens. If fuzzing finds a crash on the short buffer, typically extending the input to crash on the original buffer size is straightforward, e.g. repeat one of the bytes even more than it already repeats.

Conclusion and samples

Hopefully something here will help you catch a defect that would have otherwise gone unnoticed. Even better, perhaps awareness of these fuzzing techniques will prevent the bug in the first place. Thanks to my template, some solid tooling, and the know-how in this article, I can whip up a fuzz test in a couple of minutes. But that ease means I discard it just as casually, and so I don’t take time to capture and catalog most. If you’d like to see some samples, I do have an old, short list. Perhaps after another kiloproject of fuzz testing I’ll pick up more techniques.

Examples of quick hash tables and dynamic arrays in C

2025-01-19T04:10:33Z

This article durably captures my reddit comment showing techniques for std::unordered_map and std::vector equivalents in C programs. The core, important features of these data structures require only a dozen or so lines of code apiece. They compile quickly, and tend to run faster in debug builds than release builds of their C++ equivalents. What they lack in genericity they compensate in simplicity. Nothing here will be new. Everything has been covered in greater detail previously, which I will reference when appropriate.

For a concrete goal, we will build a data structure representing a process environment, along with related functionality to make it more interesting. That is, we’ll build a string-to-string map.

Allocator

The foundation is our allocator, a simple bump allocator, so we’ll start there:

#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);  // TODO: OOM policy
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

Allocating through the new macro eliminates several classes of common defects in C programs. If we get our types mixed up we get errors, or at least warnings. Our size calculations cannot overflow. We cannot accidentally use uninitialized memory. We cannot leak memory; deallocating is implicit. The main downside is that it doesn’t fit some less common allocator requirements.

Strings

Next, a string representation. Classic null-terminated strings are an error-prone paradigm, so we’ll use counted strings instead:

#define S(s)    (Str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

This is equivalent to a std::string_view in C++. The macro allows us to efficiently convert string literals into Str objects. Because our data structures are backed by arenas, we won’t care whether a particular string is backed by a static string, arena, memory map, etc. We’ll also need a function to compare strings for equality:

_Bool equals(Str a, Str b)
{
    if (a.len != b.len) {
        return 0;
    }
    return !a.len || !memcmp(a.data, b.data, a.len);
}

!a.len appears superfluous, but it’s necessary: memcmp arbitrarily forbids null pointers, and we may be passed a zero-initialized Str. Though this is scheduled to be corrected.

We’ll need a string hash function, too:

uint64_t hash64(Str s)
{
    uint64_t h = 0x100;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

This is an FNV-style hash. The “basis” keeps strings of nulls from getting stuck at zero, and the multiplier is my favorite prime number. Character data is fixed to 0–255 rather than allowing the signedness of char to influence the results. As a multiplicative hash, the high bits are mixed better than the low bits, and our maps will take that into account.

Flat hash map

We have a couple string-to-string map options. The more restrictive, but more efficient — in terms of memory use and speed — is a Mask-Step-Index (MSI) hash table. I don’t think it fits our problem as well as the next option, particularly because it puts a hard limit on unique keys, but it’s worth evaluating. Let’s call it FlatEnv:

enum { ENVEXP = 10 };  // support up to 1,000 unique keys
typedef struct {
    Str keys[1<<ENVEXP];
    Str vals[1<<ENVEXP];
} FlatEnv;

It’s nothing more than two fixed-length arrays, storing keys and values separately. Keys with null pointers are empty slots, so a zero-initialized FlatEnv is an empty table. They come out of an arena ready-to-use:

    FlatEnv *env = new(a, 1, FlatEnv);  // new, empty environment

Now we leverage equals and hash64 for a double-hashed, open address search on the keys array:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key);
    uint32_t mask = (1<<ENVEXP) - 1;
    uint32_t step = (hash>>(64 - ENVEXP)) | 1;
    for (int32_t i = hash;;) {
        i = (i + step) & mask;
        if (!env->keys[i].data) {
            env->keys[i] = key;
            return env->vals + i;
        } else if (equals(env->keys[i], key)) {
            return env->vals + i;
        }
    }
}

By returning a pointer to the unmodified value slot, this function covers both lookup and insertion. So that’s the entire hash table implementation. To insert, the caller assigns the slot. For mere lookup, check the slot for a null pointer.

    FlatEnv *env = new(a, 1, FlatEnv);

    // insert
    *flatlookup(env, S("hello")) = S("world");

    // lookup
    Str val = *flatlookup(env, key);
    if (val.data) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val.len, val.data);
    }

To iterate over the map entries, iterate over the arrays, skipping null entries. Per the ENVEXP comment, it’s hard-coded to support up to 1,000 unique keys (1,024 slots, leaving some to spare). The table itself doesn’t enforce this limit and will turn into an infinite loop if you insert too many keys. To support scaling, we could design the map to have dynamic table sizes, track the number of unique keys, and resize the table (allocate new arrays) when the load factor crosses a threshold. Resizing sounds messy and complicated, so fortunately there’s another option.
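As a rough sketch of that iteration, here is a hypothetical flatcount that visits every slot and counts the live entries. The types are repeated so the block stands alone. Note that flatlookup stores the key even on a mere lookup, so a slot only counts as an entry once its value has been assigned:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

enum { ENVEXP = 10 };
typedef struct {
    Str keys[1<<ENVEXP];
    Str vals[1<<ENVEXP];
} FlatEnv;

// Count entries by walking the arrays, skipping empty slots and keys
// that were looked up but never given a value.
ptrdiff_t flatcount(FlatEnv *env)
{
    assert(env);
    ptrdiff_t n = 0;
    for (int i = 0; i < 1<<ENVEXP; i++) {
        if (env->keys[i].data && env->vals[i].data) {
            n++;
        }
    }
    return n;
}
```

The loop body is the place to do real work with env->keys[i] and env->vals[i]. A similar count of occupied keys, maintained incrementally instead of recomputed, is what a table would consult to enforce the 1,000 key limit.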

Hierarchical hash map

If the number of keys is unbounded, hash tries work better. Trees scale well, and we can allocate nodes out of the arena as it grows. We’ll use a 4-ary trie, a good default that balances size and performance:

typedef struct Env Env;
struct Env {
    Env *child[4];
    Str  key;
    Str  value;
};

An empty map is just a null pointer, and so, again, these maps come ready-to-use in their zero state:

    Env *env = 0;  // new, empty environment

The implementation is just as brief:

Str *lookup(Env **env, Str key, Arena *a)
{
    for (uint64_t h = hash64(key); *env; h <<= 2) {
        if (equals(key, (*env)->key)) {
            return &(*env)->value;
        }
        env = &(*env)->child[h>>62];
    }
    if (!a) return 0;
    *env = new(a, 1, Env);
    (*env)->key = key;
    return &(*env)->value;
}

Like before, this covers both lookup and insertion, though the mode is determined explicitly by the arena pointer. Without an arena, it’s a lookup, which doesn’t require allocation. With an arena, it creates an entry if necessary and, like before, returns a pointer into the map so that the caller can assign it. Usage differs only slightly:

    Env *env = 0;

    // insert
    *lookup(&env, S("hello"), &scratch) = S("world");

    // lookup
    Str *val = lookup(&env, key, 0);
    if (val) {
        printf("%.*s = %.*s\n", (int)key.len, key.data,
                                (int)val->len, val->data);
    }

We’ll come back around to iteration later.

String concatenation

Next I’d like a function that takes an Env and produces an envp data structure as expected by execve(2). Then we can use this map as the environment in a child process. We’ll need some string manipulation, particularly string concatenation. The core is a copy function:

Str copy(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, s.len, char);
    if (r.len) memcpy(r.data, s.data, r.len);
    return r;
}

Like with memcmp, because it’s memcpy we need to handle the arbitrary special case around null pointers should the input be a zero Str. Now we can easily concatenate strings, in-place if possible:

Str concat(Arena *a, Str head, Str tail)
{
    if (!head.data || head.data+head.len != a->beg) {
        head = copy(a, head);
    }
    head.len += copy(a, tail).len;
    return head;
}

Yet again, !head.data is a special check because pointer arithmetic on null (i.e. adding zero to null) is arbitrarily disallowed. Worrying about this is exhausting, isn’t it? That language fix can’t come soon enough. This one’s already fixed in C++.

That’s enough to get the ball rolling on FlatEnv:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    int    cap  = 1<<ENVEXP;
    char **envp = new(a, cap, char *);
    int    len  = 0;
    for (int i = 0; i < cap; i++) {
        if (env->vals[i].data) {
            Str pair = env->keys[i];
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, env->vals[i]);
            pair = concat(a, pair, S("\0"));
            envp[len++] = pair.data;
        }
    }
    return envp;
}

Simple, right? Traditional string handling in C is an error-prone pain, but with a better set of primitives it’s a breeze. Plus we’re doing this all with essentially no runtime. In use this might look like:

void shellexec(char *cmd, FlatEnv *env, Arena scratch)
{
    char  *argv[] = {"sh", "-c", cmd, 0};
    char **envp   = flat_to_envp(env, &scratch);
    execve("/bin/sh", argv, envp);
}

By virtue of the scratch arena, the envp object is automatically freed should execve fail. (If that should even matter.) Considering this, if you’re itching to write the fastest shell ever devised, arena allocation and the techniques in this article would probably get you most of the way there. Nobody writes shells this way.

Dynamic arrays

To implement the envp conversion for the hash trie Env, let’s add one more tool to our toolbox: dynamic arrays. Our std::vector equivalent. We’ll start with a familiar slice header:

typedef struct {
    char    **data;
    ptrdiff_t len;
    ptrdiff_t cap;
} EnvpSlice;

The bad news is that we don’t have templates, and so we’ll need to define one such structure for each type of which we want a dynamic array. This one is set up to create an envp array. The good news is that manipulation occurs through generic code, so everything else is reusable.

I want a push macro that creates an empty slot in which to insert a new value, evaluating to a pointer to this slot. Usually that means incrementing len, but when out of room it will need to expand the underlying storage. It’s clearer to start with example usage. Imagine using it with the previous flat_to_envp:

char **flat_to_envp(FlatEnv *env, Arena *a)
{
    EnvpSlice r = {0};
    for (int i = 0; i < 1<<ENVEXP; i++) {
        if (env->vals[i].data) {
            // ... concat as before ...
            *push(a, &r) = pair.data;
        }
    }
    push(a, &r);  // terminal null pointer
    return r.data;
}

Continuing the theme, a zero-initialized slice is a ready-to-use empty slice, and most begin life this way. The immediate dereference on push is just like those calls to lookup. If expansion is needed, the push macro’s job is to pull fields off the slice, pass them into a helper function which agnostically, strict-aliasing-legally, manipulates the slice header:

void *push_(Arena *, void *data, ptrdiff_t *pcap, ptrdiff_t size);

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, sizeof(*(s)->data)), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

The internals of that helper look an awful lot like concat, with the same in-place-if-possible behavior:

enum { SLICE_INITIAL_CAP = 4 };

void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);

    if (!data || a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, cap*size);
        data = copy;
    }

    ptrdiff_t extend = cap ? cap : SLICE_INITIAL_CAP;
    alloc(a, extend, size, 1);  // already aligned
    *pcap = cap + extend;
    return data;
}

(Update: Aleh pointed out an inefficiency in the original code: applying alignment in the second alloc may introduce unnecessary fragmentation. This has been corrected above.)

For unfathomable reasons, standard C does not permit _Alignof on expressions, so slice data is simply pointer-aligned. (The more shrewd might consider max_align_t.) Like concatenation, we copy the object to the beginning of the arena if necessary, and extend the allocation by allocating the usual way, being careful not to increment the capacity until after it succeeds.

Update: NRK points out we can use __typeof__ (extension) or typeof (C23), to work around this syntactical limitation of _Alignof. Convert the align local variable into a parameter:

void *push_(..., ptrdiff_t align);

Then in the macro pass it via _Alignof(__typeof__(…)):

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, \
          sizeof(*(s)->data), _Alignof(__typeof__(*(s)->data))), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

Spelled as an extension, it already works with all major C compilers from the past decade, and without requiring special compiler flags.

We can now use push on any structure with data, len, and cap fields of the appropriate types.
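To illustrate the reuse, here is a sketch that pushes onto a slice of a different element type. The I32s struct and the squares function are hypothetical, made up for this example. The allocator, push_, and the first push macro are repeated from above so that it compiles alone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} Arena;

void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    assert(count < (a->end - a->beg - pad)/size);
    void *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, count*size);
}

void *push_(Arena *a, void *data, ptrdiff_t *pcap, ptrdiff_t size)
{
    ptrdiff_t cap   = *pcap;
    ptrdiff_t align = _Alignof(void *);
    if (!data || a->beg != (char *)data + cap*size) {
        void *copy = alloc(a, cap, size, align);
        if (data) memcpy(copy, data, cap*size);
        data = copy;
    }
    ptrdiff_t extend = cap ? cap : 4;
    alloc(a, extend, size, 1);  // already aligned
    *pcap = cap + extend;
    return data;
}

#define push(a, s) \
  ((s)->len == (s)->cap \
    ? (s)->data = push_((a), (s)->data, &(s)->cap, sizeof(*(s)->data)), \
      (s)->data + (s)->len++ \
    : (s)->data + (s)->len++)

// Hypothetical slice of 32-bit integers: same three fields, new type.
typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} I32s;

// Collect the squares of 0 through n-1 into a growing slice.
I32s squares(Arena *a, int32_t n)
{
    I32s r = {0};
    for (int32_t i = 0; i < n; i++) {
        *push(a, &r) = i * i;
    }
    return r;
}
```

Only the struct definition is new. Since nothing else allocates from the arena between pushes, every expansion after the first extends the slice in place.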

Putting it all together

With that in place, we can define a simple, recursive version of the envp builder for Env:

#define countof(a)  ((ptrdiff_t)(sizeof(a) / sizeof(*(a))))

EnvpSlice env_to_envp_(EnvpSlice r, Env *env, Arena *a)
{
    if (env) {
        Str pair = env->key;
        pair = concat(a, pair, S("="));
        pair = concat(a, pair, env->value);
        pair = concat(a, pair, S("\0"));
        *push(a, &r) = pair.data;
        for (int i = 0; i < countof(env->child); i++) {
            r = env_to_envp_(r, env->child[i], a);
        }
    }
    return r;
}

char **env_to_envp(Env *env, Arena *a)
{
    EnvpSlice r = {0};
    r = env_to_envp_(r, env, a);
    push(a, &r);  // null pointer terminator
    return r.data;
}

As is often the case, the recursive part doesn’t fit the final interface, so the core is a helper, and the caller-facing part is an adapter. I’m not entirely comfortable with this function, though. When working with huge environments (over ~100k entries), the recursive implementation will non-deterministically blow the stack if the trie winds up lopsided. Or deterministically for chosen pathological inputs, because the hash function isn’t seeded.

Instead we could use a stack data structure backed by the arena to traverse the trie. If passed a secondary scratch arena, we’d use that arena for this stack, but I’m sticking to the original interface. Here’s what that looks like, with an extra trick thrown in just to show off:

char **env_to_envp_safe(Env *env, Arena *a)
{
    EnvpSlice r = {0};

    typedef struct {
        Env *env;
        int  index;
    } Frame;
    Frame init[16];  // small size optimization

    struct {
        Frame    *data;
        ptrdiff_t len;
        ptrdiff_t cap;
    } stack = {init, 0, countof(init)};

    *push(a, &stack) = (Frame){env, 0};
    while (stack.len) {
        Frame *top = stack.data + stack.len - 1;

        if (!top->env) {
            stack.len--;

        } else if (top->index == countof(top->env->child)) {
            Str pair = top->env->key;
            pair = concat(a, pair, S("="));
            pair = concat(a, pair, top->env->value);
            pair = concat(a, pair, S("\0"));
            *push(a, &r) = pair.data;
            stack.len--;

        } else {
            int i = top->index++;
            *push(a, &stack) = (Frame){top->env->child[i], 0};
        }
    }

    push(a, &r);
    return r.data;
}

The init array is a form of small-size optimization. It’s used at first, and sufficient for nearly all inputs. So no stack litter in the arena. If it’s not enough, then push will automatically move the stack into the arena. I think that’s a super duper neato trick!

Alternative to this, and as discussed in the original hash trie article, we could instead add a next field to Env as an intrusive linked list that chains the nodes together in insertion order. Or another way to look at it, Env is a linked list with an intrusive hash trie for O(log n) searches on the list. That’s a lot simpler, has other useful properties, and only costs one extra pointer per entry. And we wouldn’t need slices, which was my motivation for choosing the non-linked-list approach above.

Hash hardening (bonus)

Okay, I lied, this is something new. Think of it as your special treat for sticking with me so far.

Hash map non-determinism comes with a classic security vulnerability: If populated with untrusted keys, an attacker could choose colliding keys and produce worst case behavior in the hash map. That is, MSI hash tables reduce to linear scans, and hash tries reduce to linked lists. Worse, the recursive envp function blows the stack, though we already solved that issue.

If we want to foil such attacks, we can seed the hash so that an attacker cannot devise collisions. They’d need to discover the seed. We might even call that seed a “key,” but this is a non-cryptographic hash so I’m going to avoid that term. The usual implementation of this concept involves generating a seed, sometimes per table, and storing it somewhere. However, we can leverage an existing security mechanism, gaining this feature at basically no cost: Address Space Layout Randomization (ASLR). First, let’s augment the string hash function:

uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

In flatlookup we can use the address of the FlatEnv as our seed:

Str *flatlookup(FlatEnv *env, Str key)
{
    uint64_t hash = hash64(key, (uintptr_t)env);
    // ...
}

Recall it’s allocated out of our arena (via new), and ASLR gives our arena a random offset. On top of that, a FlatEnv seed depends precisely on the amount of memory allocated earlier. An environment variable name or value being slightly longer or shorter will reshuffle the whole table if allocated in the arena before the FlatEnv.
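
To see why the seed matters, note that each step of hash64 (XOR with a byte, multiply by an odd constant) is invertible modulo 2^64. For a fixed key, two different seeds therefore always produce two different hashes. The sketch below repeats hash64 so it compiles on its own. The Str layout is an assumed stand-in for the article's type, and first_branch is a hypothetical helper that assumes a 4-way trie selecting a child from the top two hash bits.

```c
#include <stddef.h>
#include <stdint.h>

// Minimal stand-in for the article's Str type (assumed layout).
typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

// Same function as above, repeated so this block is self-contained.
uint64_t hash64(Str s, uint64_t seed)
{
    uint64_t h = seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i] & 255;
        h *= 1111111111111111111;
    }
    return h;
}

// Hypothetical helper: which of four children a key selects at the
// root, assuming the trie consumes the top two bits of the hash.
int first_branch(Str key, uint64_t seed)
{
    return (int)(hash64(key, seed) >> 62);
}
```

An attacker who precomputes colliding keys for one seed gains nothing once the seed changes, since the set of keys that collide changes with it.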

It’s slightly trickier with hash tries. The root pointer isn’t required to be fixed. For example:

    Env *env = 0;
    // ... insert keys ...
    Env *myenv = env;
    // ... lookup keys in myenv ...

We could disallow this, but it would be easy to forget (e.g. while you’re refactoring and not thinking about it) and difficult to detect. Difficult-to-detect bugs keep me awake at night. Instead we can use the root node to seed the trie:

Str *lookup(Env **env, Str key, Arena *a)
{
    uint64_t seed = env ? (uintptr_t)*env : 0;
    for (uint64_t h = hash64(key, seed); *env; h <<= 2) {
    // ...
}

At first this seems like it couldn’t work, like a chicken-and-egg problem. There’s no root node at first, so we can’t know the seed yet. Though think about it a little longer and it should be obvious: The hash is unused when inserting the very first element. It simply becomes the root of the trie. The seed is irrelevant until the second insert, at which point we’ve established a seed. This delay establishing the seed means hash tries are even more randomized.

With the proper tools and representations, working in C isn’t difficult even if you need containers and string manipulation. Aside from memcmp and memcpy — each easily replaceable — we did all this without runtime assistance, not even its allocator. What a pleasant way to work!

Source from this article in runnable form, which I used to test my samples: example.c

Everything I've learned so far about running local LLMs

2024-11-10T05:05:20Z

This article was discussed on Hacker News.

Over the past month I’ve been exploring the rapidly evolving world of Large Language Models (LLM). It’s now accessible enough to run a LLM on a Raspberry Pi smarter than the original ChatGPT (November 2022). A modest desktop or laptop supports even smarter AI. It’s also private, offline, unlimited, and registration-free. The technology is improving at breakneck speed, and information is outdated in a matter of months. This article snapshots my practical, hands-on knowledge and experiences — information I wish I had when starting. Keep in mind that I’m a LLM layman, I have no novel insights to share, and it’s likely I’ve misunderstood certain aspects. In a year this article will mostly be a historical footnote, which is simultaneously exciting and scary.

In case you’ve been living under a rock (as an under-the-rock inhabitant myself, welcome!), LLMs are neural networks that underwent a breakthrough in 2022 when trained for conversational “chat.” Through it, users converse with an artificial intelligence indistinguishable from a human, which smashes the Turing test and can be wickedly creative. Interacting with one for the first time is unsettling, a feeling which will last for days. When you bought your most recent home computer, you probably did not expect to have a meaningful conversation with it.

I’ve found this experience reminiscent of the desktop computing revolution of the 1990s, where your newly purchased computer seemed obsolete by the time you got it home from the store. There are new developments each week, and as a rule I ignore almost any information more than a year old. The best way to keep up has been r/LocalLLaMa. Everything is hyped to the stratosphere, so take claims with a grain of salt.

I’m wary of vendor lock-in, having experienced the rug pulled out from under me by services shutting down, changing, or otherwise dropping my use case. I want the option to continue, even if it means changing providers. So for a couple of years I’d ignored LLMs. The “closed” models, accessible only as a service, have the classic lock-in problem, including silent degradation. That changed when I learned I can run models close to the state-of-the-art on my own hardware, the exact opposite of vendor lock-in.

This article is about running LLMs, not fine-tuning, and definitely not training. It’s also only about text, and not vision, voice, or other “multimodal” capabilities, which aren’t nearly so useful to me personally.

To run a LLM on your own hardware you need software and a model.

The software

I’ve exclusively used the astounding llama.cpp. Other options exist, but for basic CPU inference — that is, generating tokens using a CPU rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In particular, no Python fiddling that plagues much of the ecosystem. On Windows it will be a 5MB llama-server.exe with no runtime dependencies. From just two files, EXE and GGUF (model), both designed to load via memory map, you could likely still run the same LLM 25 years from now, in exactly the same way, out-of-the-box on some future Windows OS.

Full disclosure: I’m biased because the official Windows build process is w64devkit. What can I say? These folks have good taste! That being said, you should only do CPU inference if GPU inference is impractical. It works reasonably up to ~10B parameter models on a desktop or laptop, but it’s slower. My primary use case is not built with w64devkit because I’m using CUDA for inference, which requires a MSVC toolchain. Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.

The bottleneck for GPU inference is video RAM, or VRAM. These models are, well, large. The more RAM you have, the larger the model and the longer the context window. Larger models are smarter, and longer contexts let you process more information at once. GPU inference is not worth it below 8GB of VRAM. If “GPU poor”, stick with CPU inference. On the plus side, it’s simpler and easier to get started with CPU inference.

There are many utilities in llama.cpp, but this article is concerned with just one: llama-server is the program you want to run. It’s an HTTP server (default port 8080) with a chat UI at its root, and APIs for use by programs, including other user interfaces. A typical invocation:

$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf

The context size is the largest number of tokens the LLM can handle at once, input plus output. Contexts typically range from 8K to 128K tokens, and depending on the model’s tokenizer, normal English text is ~1.6 tokens per word as counted by wc -w. If the model supports a large context you may run out of memory. If so, set a smaller context size, like --ctx-size $((1<<13)) (i.e. 8K tokens).
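
As a rough illustration of that arithmetic, here is a sketch that estimates whether a text fits in a context. The function names are hypothetical, the 1.6 tokens-per-word ratio is the approximate figure above and varies by tokenizer, and the word counter only approximates wc -w.

```c
#include <stddef.h>

// Count whitespace-separated words, similar in spirit to wc -w.
ptrdiff_t count_words(const char *s)
{
    ptrdiff_t words = 0;
    int inword = 0;
    for (; *s; s++) {
        int space = *s == ' ' || *s == '\t' || *s == '\n' || *s == '\r';
        if (space) {
            inword = 0;
        } else if (!inword) {
            inword = 1;
            words++;
        }
    }
    return words;
}

// Rough token estimate at ~1.6 tokens per word, rounded up.
ptrdiff_t estimate_tokens(ptrdiff_t words)
{
    return (words*16 + 9) / 10;
}

// Does the input fit in the context while reserving room for output?
int fits_context(ptrdiff_t words, ptrdiff_t ctx, ptrdiff_t reserve)
{
    return estimate_tokens(words) + reserve <= ctx;
}
```

By this estimate a 5,000-word document is about 8,000 tokens, which nearly fills an 8K context before the model has produced any output.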

I do not yet understand what flash attention is about, and I don’t know why --flash-attn/-fa is not the default (lower accuracy?), but you should always request it because it reduces memory requirements when active and is well worth the cost.

If the server started successfully, visit it (http://localhost:8080/) to try it out. Though of course you’ll need a model first.

The models

Hugging Face (HF) is “the GitHub of LLMs.” It’s an incredible service that has earned that title. “Small” models are around a few GBs, large models are hundreds of GBs, and HF hosts it all for free. With a few exceptions that do not matter in practice, you don’t even need to sign up to download models! (I’ve been so impressed that after a few days they got a penny-pincher like me to pay for a pro account.) That means you can immediately download and try any of the stuff I’m about to discuss.

If you look now, you’ll wonder, “There’s a lot of stuff here, so what the heck am I supposed to download?” That was me one month ago. For llama.cpp, the answer is GGUF. None of the models are natively in GGUF. Instead GGUFs are in a repository with “GGUF” in the name, usually by a third party: one of the heroic, prolific GGUF quantizers.

(Note how nowhere does the official documentation define what “GGUF” stands for. Get used to that. This is a technological frontier, and if the information exists at all, it’s not in the obvious place. If you’re considering asking your LLM about this once it’s running: Sweet summer child, we’ll soon talk about why that doesn’t work. As far as I can tell, “GGUF” has no authoritative definition (update: the U stands for “Unified”, but the rest is still ambiguous).)

Since llama.cpp is named after Meta’s flagship model, their model is a reasonable start, though it’s not my personal favorite. The latest is Llama 3.2, but at the moment only the 1B and 3B models (that is, ~1 billion and ~3 billion parameters) work in llama.cpp. Those are a little too small to be of much use, and your computer can likely do better if it’s not a Raspberry Pi, even with CPU inference. Llama 3.1 8B is a better option. (If you’ve got at least 24GB of VRAM then maybe you can even do Llama 3.1 70B.)

If you search for Llama 3.1 8B you’ll find two options, one qualified “instruct” and one with no qualifier. Instruct means it was trained to follow instructions, i.e. to chat, and that’s nearly always what you want. The other is the “base” model which can only continue a text. (Technically the instruct model is still just completion, but we’ll get to that later.) It would be great if base models were qualified “Base” but, for dumb path dependency reasons, they’re usually not.

You will not find GGUF in the “Files” for the instruct model, nor can you download the model without signing up in order to agree to the community license. Go back to the search, add GGUF, and look for the matching GGUF model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF. bartowski is one of the prolific and well-regarded GGUF quantizers. Not only will this be in the right format for llama.cpp, you won’t need to sign up.

In “Files” you will now see many GGUFs. These are different quantizations of the same model. The original model has bfloat16 tensors, but for merely running the model we can throw away most of that precision with minimal damage. It will be a tiny bit dumber and less knowledgeable, but will require substantially fewer resources. The general recommendation, which fits my experience, is to use Q4_K_M, a 4-bit quantization. In general, better to run a 4-bit quant of a larger model than an 8-bit quant of a smaller model. Once you’ve got the basics understood, experiment with different quants and see what you like!
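
To get a feel for the savings, here is a back-of-the-envelope sketch. The weight_bytes function is a hypothetical helper, and the result is only a lower bound under simplifying assumptions: it counts the weights alone at a nominal bit width, while real quantized files carry extra scaling data and the context needs memory on top.

```c
#include <stdint.h>

// Approximate bytes needed for the weights alone: parameter count
// times nominal bits per weight, divided by eight bits per byte.
int64_t weight_bytes(int64_t params, int bits)
{
    return params * bits / 8;
}
```

For an 8B model that is roughly 16GB at 16 bits per weight versus roughly 4GB at 4 bits, which is the difference between not fitting and fitting in 8GB of VRAM.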

My favorite models

Models are trained for different trade-offs and differ in strengths and weaknesses, so no model is best at everything — especially on “GPU-poor” configurations. My desktop system has an RTX 3050 Ti with 8GB VRAM, and its limitations have shaped my choices. I can comfortably run ~10B models, and ~30B models just barely enough to test their capabilities. For ~70B I rely on third-party hosts. My “t/s” numbers are all on this system running 4-bit quants.

This list omits “instruct” from the model name, but assume the instruct model unless I say otherwise. A few are bona fide open source, at least as far as LLMs practically can be, and I’ve noted the license when that’s the case. The rest place restrictions on both use and distribution.

Mistral-Nemo-2407 (12B) [Apache 2.0]

A collaboration between Mistral AI and Nvidia (“Nemo”), the most well-rounded ~10B model I’ve used, and my default. Inference starts at a comfortable 30 t/s. Its strengths are writing and proofreading, and it can review code nearly as well as ~70B models. It was trained for a context length of 128K, but its effective context length is closer to 16K, a limitation I’ve personally observed.

The “2407” is a date (July 2024) as version number, a versioning scheme I wholeheartedly support. A date tells you about its knowledge cut-off and tech level. It sorts well. Otherwise LLM versioning is a mess. Just as open source is bad with naming, AI companies do not comprehend versioning.

Qwen2.5-14B [Apache 2.0]

Qwen models, by Alibaba Cloud, impressively punch above their weight at all sizes. 14B inference starts at 11 t/s, with capabilities on par with Mistral Nemo. If I could run 72B on my own hardware, it would probably be my default. I’ve been trying it through Hugging Face’s inference API. There’s a 32B model, but it’s impractical for my hardware, so I haven’t spent much time with it.

Gemma-2-2B

Google’s model is popular, perhaps due to its playful demeanor. For me, the 2B model is great for fast translation. It’s amazing that LLMs have nearly obsoleted Google Translate, and you can run it on your home computer. Though it’s more resource-intensive, and refuses to translate texts it finds offensive, which sounds like a plot element from a sci-fi story. In my translation script, I send it text marked up with HTML. Simply asking Gemma to preserve the markup Just Works! The 9B model is even better, but slower, and I’d use it instead of 2B for translating my own messages into another language.

Phi3.5-Mini (4B) [MIT]

Microsoft’s niche is training on synthetic data. The result is a model that does well in tests, but doesn’t work so well in practice. For me, its strength is document evaluation. I’ve loaded the context with up to 40K-token documents (it helps that it’s a 4B model) and successfully queried accurate summaries and data listings.

SmolLM2-360M [Apache 2.0]

Hugging Face doesn’t just host models; their 360M model is unusually good for its size. It fits on my 2008-era, 1G RAM, Celeron, and 32-bit operating system laptop. It also runs well on older Raspberry Pis. It’s creative, fast, converses competently, can write poetry, and is a fun toy in cramped spaces.

Mixtral-8x7B (48B) [Apache 2.0]

Another Mistral AI model, and more of a runner up. 48B seems too large, but this is a Mixture of Experts (MoE) model. Inference uses only 13B parameters at a time. It’s reasonably suited to CPU inference on a machine with at least 32G of RAM. The model retains more of its training inputs, more like a database, but for reasons we’ll see soon, it isn’t as useful as it might seem.

Llama-3.1-70B and Llama-3.1-Nemotron-70B

More models I cannot run myself, but which I access remotely. The latter bears “Nemo” because it’s an Nvidia fine-tune. If I could run 70B models myself, Nemotron might just be my default. I’d need to spend more time evaluating it against Qwen2.5-72B.

Most of these models have abliterated or “uncensored” versions, in which refusal is partially fine-tuned out at a cost of model degradation. Refusals are annoying, such as Gemma refusing to translate texts it dislikes, but they don’t happen often enough for me to make that trade-off. Maybe I’m just boring. Also refusals seem to decrease with larger contexts, as though “in for a penny, in for a pound.”

The next group are “coder” models trained for programming. In particular, they have fill-in-the-middle (FIM) training for generating code inside an existing program. I’ll discuss what that entails in a moment. As far as I can tell, they’re no better at code review nor other instruct-oriented tasks. It’s the opposite: FIM training is done in the base model, with instruct training applied later on top, so instruct works against FIM! In other words, base model FIM outputs are markedly better, though you lose the ability to converse with them.

There will be a section on evaluation later, but I want to note now that LLMs produce mediocre code, even at the state-of-the-art. The rankings here are relative to other models, not about overall capability.

DeepSeek-Coder-V2-Lite (16B)

A self-titled MoE model from DeepSeek. It uses 2B parameters during inference, making it as fast as Gemma 2 2B but as smart as Mistral Nemo, striking a great balance, especially because it out-competes ~30B models at code generation. If I’m playing around with FIM, this is my default choice.

Qwen2.5-Coder-7B [Apache 2.0]

Qwen Coder is a close second. Output is nearly as good, but slightly slower since it’s not MoE. It’s a better choice than DeepSeek if you’re memory-constrained. While writing this article, Alibaba Cloud released a new Qwen2.5-Coder-7B but failed to increment the version number, which is horribly confusing. The community has taken to calling it Qwen2.5.1. Remember what I said about AI companies and versions? (Update: One day after publication, 14B and 32B coder models were released. I tried both, and neither is quite as good as DeepSeek-Coder-V2-Lite, so my rankings are unchanged.)

Granite-8B-Code [Apache 2.0]

IBM’s line of models is named Granite. In general Granite models are disappointing, except that they’re unusually good at FIM. It’s tied in second place with Qwen2.5 7B in my experience.

I also evaluated CodeLlama, CodeGemma, Codestral, and StarCoder. Their FIM outputs were so poor as to be effectively worthless at that task, and I found no reason to use these models. The negative effects of instruct training were most pronounced for CodeLlama.

The user interfaces

I pointed out Llama.cpp’s built-in UI, and I’d used similar UIs with other LLM software. As is typical, no UI is to my liking, especially in matters of productivity, so I built my own, Illume. This command line program converts standard input into an API query, makes the query, and streams the response to standard output. Should be simple enough to integrate into any extensible text editor, but I only needed it for Vim. Vimscript is miserable, probably the second worst programming language I’ve ever touched, so my goal was to write as little as possible.

I created Illume to scratch my own itch, to support my exploration of the LLM ecosystem. I actively break things and add features as needed, and I make no promises about interface stability. You probably don’t want to use it.

Lines that begin with ! are directives interpreted by Illume, chosen because it’s unlikely to appear in normal text. A conversation alternates between !user and !assistant in a buffer.

!user
Write a Haiku about time travelers disguised as frogs.

!assistant
Green, leaping through time,
Frog tongues lick the future's rim,
Disguised in pond's guise.

It’s still a text editor buffer, so I can edit the assistant response, reword my original request, etc. before continuing the conversation. For composing fiction, I can request it to continue some text (which does not require instruct training):

!completion
Din the Wizard stalked the dim castle

I can stop it, make changes, add my own writing, and keep going. I ought to spend more time practicing with it. If you introduce out-of-story note syntax, the LLM will pick up on it, and then you can use notes to guide the LLM’s writing.

While the main target is llama.cpp, I query different APIs, implemented by different LLM software, with incompatibilities across APIs (a parameter required by one API is forbidden by another), so directives must be flexible and powerful. So directives can set arbitrary HTTP and JSON parameters. Illume doesn’t try to abstract the API, but exposes it at a low level, so effective use requires knowing the remote API. For example, the “profile” for talking to llama.cpp looks like this:

!api http://localhost:8080/v1
!:cache_prompt true

Where cache_prompt is a llama.cpp-specific JSON parameter (!:). Prompt caching is nearly always better enabled, yet for some reason it’s disabled by default. Other APIs refuse requests with this parameter, so then I must omit or otherwise disable it. The Hugging Face “profile” looks like this:

!api https://api-inference.huggingface.co/models/{model}/v1
!:model Qwen/Qwen2.5-72B-Instruct
!>x-use-cache false

For the sake of HF, Illume can interpolate JSON parameters into the URL. The HF API also caches aggressively. I never want this, so I supply an HTTP parameter (!>) to turn it off.
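
To make the directive syntax concrete, here is a minimal sketch of how a buffer line might be classified. This is not Illume’s actual code: the LineKind enum and classify_line are hypothetical names, and the only things taken from the profiles above are the !, !:, and !> prefixes.

```c
typedef enum {
    LINE_TEXT,       // ordinary buffer text
    LINE_JSON,       // "!:name value" sets a JSON parameter
    LINE_HTTP,       // "!>name value" sets an HTTP parameter
    LINE_DIRECTIVE,  // any other "!name ..." directive
} LineKind;

// Classify one line of the buffer by its leading characters.
LineKind classify_line(const char *line)
{
    if (line[0] != '!') return LINE_TEXT;
    if (line[1] == ':') return LINE_JSON;
    if (line[1] == '>') return LINE_HTTP;
    return LINE_DIRECTIVE;
}
```

Keeping the JSON and HTTP cases separate from the named directives is what lets a profile pass through parameters the tool itself knows nothing about.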

Unique to llama.cpp is an /infill endpoint for FIM. It requires a model with extra metadata, trained a certain way, but this is usually not the case. So while Illume can use /infill, I also added FIM configuration so, after reading the model’s documentation and configuring Illume for that model’s FIM behavior, I can do FIM completion through the normal completion API on any FIM-trained model, even on non-llama.cpp APIs.

Fill-in-the-Middle (FIM) tokens

It’s time to discuss FIM. To get to the bottom of FIM I needed to go to the source of truth, the original FIM paper: Efficient Training of Language Models to Fill in the Middle. This allowed me to understand how these models are FIM-trained, at least enough to put that training to use. Even so, model documentation tends to be thin on FIM because they expect you to run their code.

Ultimately an LLM can only predict the next token. So pick some special tokens that don’t appear in inputs, and use them to delimit a prefix, suffix, and middle (PSM), or sometimes ordered suffix-prefix-middle (SPM), in a large training corpus. Later in inference we can use those tokens to provide a prefix and suffix, and let it “predict” the middle. Crazy, but this actually works!

<PRE>{prefix}<SUF>{suffix}<MID>

For example when filling the parentheses of dist = sqrt(x*x + y*y):

<PRE>dist = sqrt(<SUF>)<MID>x*x + y*y

To have the LLM fill in the parentheses, we’d stop at <MID> and let the LLM predict from there. Note how <SUF> is essentially the cursor. By the way, this is basically how instruct training works, but instead of prefix and suffix, special tokens delimit instructions and conversation.

Some LLM folks interpret the paper quite literally and use <PRE>, etc.
for their FIM tokens, although these look nothing like their other special
tokens. More thoughtful trainers picked <|fim_prefix|>, etc. Illume
accepts FIM templates, and I wrote templates for the popular models. For
example, here’s Qwen (PSM):

<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>


Mistral AI prefers square brackets, SPM, and no “middle” token:

[SUFFIX]{suffix}[PREFIX]{prefix}


With these templates I could access the FIM training in models unsupported
by llama.cpp’s /infill API.
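
Expanding such a template is plain string substitution. Here is a minimal sketch, assuming the {prefix} and {suffix} placeholders shown in the templates above. The fim_format function is a hypothetical name and not how Illume is implemented.

```c
#include <stddef.h>
#include <string.h>

// Expand {prefix} and {suffix} in a FIM template into dst. Returns the
// length written, excluding the null terminator, or -1 if dst is too
// small to hold the result.
ptrdiff_t fim_format(char *dst, ptrdiff_t cap, const char *tmpl,
                     const char *prefix, const char *suffix)
{
    if (cap < 1) {
        return -1;
    }
    ptrdiff_t len = 0;
    while (*tmpl) {
        const char *src = tmpl;
        size_t n = 1;  // default: copy one template byte verbatim
        if (!strncmp(tmpl, "{prefix}", 8)) {
            src = prefix;
            n = strlen(prefix);
            tmpl += 8;
        } else if (!strncmp(tmpl, "{suffix}", 8)) {
            src = suffix;
            n = strlen(suffix);
            tmpl += 8;
        } else {
            tmpl += 1;
        }
        if (n >= (size_t)(cap - len)) {  // keep room for the terminator
            return -1;
        }
        memcpy(dst + len, src, n);
        len += (ptrdiff_t)n;
    }
    dst[len] = 0;
    return len;
}
```

With the Qwen template, a prefix of "dist = sqrt(" and a suffix of ")" expand to a prompt ending in <|fim_middle|>, and whatever the model generates next is the middle. The same call with the Mistral template puts the suffix first and ends with the prefix.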

Besides just failing the prompt, the biggest problem I’ve had with FIM is
LLMs not knowing when to stop. For example, if I ask it to fill out this
function (i.e. assign something to r):

def norm(x: float, y: float) -> float:
    return r


(Side note: Static types, including the hints here, produce better results
from LLMs, acting as guardrails.) It’s not unusual to get something like:

def norm(x: float, y: float) -> float:
    r = sqrt(x*x + y*y)
    return r

def norm3(x: float, y: float, z: float) -> float:
    r = sqrt(x*x + y*y + z*z)
    return r

def norm4(x: float, y: float, z: float, w: float) -> float:
    r = sqrt(x*x + y*y + z*z + w*w)
    return r


Where the original return r became the return for norm4. Technically
it fits the prompt, but it’s obviously not what I want. So be ready to
mash the “stop” button when it gets out of control. The three coder models
I recommended exhibit this behavior less often. It might be more robust to
combine it with a non-LLM system that understands the code semantically
and automatically stops generation when the LLM begins generating tokens
in a higher scope. That would make more coder models viable, but this goes
beyond my own fiddling.
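
As a crude stand-in for that idea, here is a sketch that cuts generated text at the first line indented less than the cursor. It is purely textual, so it only approximates scope for indentation-based languages like Python, it counts only spaces, and keep_length is a hypothetical name. A real system would need to understand the code semantically.

```c
#include <stddef.h>

// Return how many bytes of generated text to keep: everything before
// the first non-blank line indented less than the cursor's line. The
// first line is exempt because it continues the cursor's own line.
ptrdiff_t keep_length(const char *text, int indent)
{
    ptrdiff_t i = 0;
    while (text[i] && text[i] != '\n') i++;  // skip the first line
    while (text[i] == '\n') {
        ptrdiff_t start = i + 1;             // start of the next line
        ptrdiff_t j = start;
        while (text[j] == ' ') j++;
        int blank = text[j] == '\n' || !text[j];
        if (!blank && j - start < indent) {
            return start;                    // left the scope: cut here
        }
        i = j;
        while (text[i] && text[i] != '\n') i++;
    }
    return i;
}
```

Applied to the runaway output above with an indent of 4, this would cut just before def norm3. It would still keep the duplicated return r, so matching the generated text against the suffix remains a separate problem.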

Figuring out FIM and putting it into action revealed to me that FIM is
still in its early stages, and hardly anyone is generating code via FIM. I
guess everyone’s just using plain old completion?

So what are LLMs good for?

LLMs are fun, but what productive uses do they have? That’s a question
I’ve been trying to answer this past month, and it’s come up shorter than
I hoped. It might be useful to establish boundaries — tasks that LLMs
definitely cannot do.

First, LLMs are no good if correctness cannot be readily verified.
They are untrustworthy hallucinators. Often if you’re in position to
verify LLM output, you didn’t need it in the first place. This is why
Mixtral, with its large “database” of knowledge, isn’t so useful. It also
means it’s reckless and irresponsible to inject LLM output into search
results — just shameful.

LLM enthusiasts, who ought to know better, fall into this trap anyway and
propagate hallucinations. It makes discourse around LLMs less trustworthy
than normal, and I need to approach LLM information with extra skepticism.
Case in point: Recall how “GGUF” doesn’t have an authoritative definition.
Search for one and you’ll find an obvious hallucination that made it all
the way into official IBM documentation. I won’t repeat it here so as not to
make things worse.

Second, LLMs have goldfish-sized working memory. That is, they’re held
back by small context lengths. Some models are trained on larger contexts,
but their effective context length is usually much smaller. In
practice, an LLM can hold several book chapters worth of comprehension “in
its head” at a time. For code it’s 2k or 3k lines (code is token-dense).
That’s the most you can work with at once. Compared to a human, it’s tiny.
There are tools like retrieval-augmented generation and fine-tuning
to mitigate it… slightly.

Third, LLMs are poor programmers. At best they write code at maybe the
level of an undergraduate student who’s read a lot of documentation. That sounds
better than it is. The typical fresh graduate enters the workforce knowing
practically nothing about software engineering. Day one on the job is the
first day of their real education. In that sense, LLMs today
haven’t even begun their education.

To be fair, that LLMs work as well as they do is amazing! Thrown into the
middle of a program in my unconventional style, LLMs figure it out
and make use of the custom interfaces. (Caveat: My code and writing are in
the training data of most of these LLMs.) So the more context, the better,
within the effective context length. The challenge is getting something
useful out of an LLM in less time than writing it myself.

Writing new code is the easy part. The hard part is maintaining code,
and writing new code with that maintenance in mind. Even when an LLM
produces code that works, there’s no thought to maintenance, nor could
there be. In general the reliability of generated code follows the inverse
square law by length, and generating more than a dozen lines at a time is
fraught. I really tried, but never saw LLM output beyond 2–3 lines of code
which I would consider acceptable.

Quality varies substantially by language. LLMs are better at Python than
C, and better at C than assembly. I suspect it’s related to the difficulty
of the language and the quality of the input. It’s trained on lots of
terrible C — the internet is loaded with it after all — and probably the
only labeled x86 assembly it’s seen is crummy beginner tutorials. Ask it
to use SDL2 and it reliably produces the common mistakes because
it’s been trained to do so.

What about boilerplate? That’s something an LLM could probably do with a
low error rate, and perhaps there’s merit to it. Though the fastest way to
deal with boilerplate is to not write it at all. Change your problem to
not require boilerplate.

Without taking my word for it, consider how it shows up in the economics:
If AI companies could deliver the productivity gains they claim, they
wouldn’t sell AI. They’d keep it to themselves and gobble up the software
industry. Or consider the software products produced by companies on the
bleeding edge of AI. It’s still the same old, bloated web garbage everyone
else is building. (My LLM research has involved navigating their awful web
sites, and it’s made me bitter.)

In code generation, hallucinations are less concerning. You already knew
what you wanted when you asked, so you can review it, and your compiler
will help catch problems you miss (e.g. calling a hallucinated method).
However, small context and poor code generation remain roadblocks, and I
haven’t yet made this work effectively.

So then, what can I do with LLMs? A list is apt because LLMs love lists:


  
    Proofreading has been most useful for me. I give it a document such as
an email or this article (~8,000 tokens), tell it to look over grammar,
call out passive voice, and so on, and suggest changes. I accept or
reject its suggestions and move on. Most suggestions will be poor, and
this very article was long enough that even ~70B models suggested
changes to hallucinated sentences. Regardless, there’s signal in the
noise, and it fits within the limitations outlined above. I’m still
trying to apply this technique (“find bugs, please”) to code review, but
so far success is elusive.
  
  
    Writing short fiction. Hallucinations are not a problem; they’re a
feature! Context lengths are the limiting factor, though perhaps you can
stretch it by supplying chapter summaries, also written by LLM. I’m
still exploring this. If you’re feeling lazy, tell it to offer you three
possible story branches at each turn, and you pick the most interesting.
Or even tell it to combine two of them! LLMs are clever and will figure
it out. Some genres work better than others, and concrete works better
than abstract. (I wonder if professional writers judge its writing as
poor as I judge its programming.)
  
  
    Generative fun. Have an argument with Benjamin Franklin (note: this
probably violates the Acceptable Use Policy of some models), hang
out with a character from your favorite book, or generate a new scene of
Falstaff’s blustering antics. Talking to historical figures
has been educational: The character says something unexpected, I look it
up the old-fashioned way to see what it’s about, then learn something
new.
  
  
    Language translation. I’ve been browsing foreign language subreddits
through Gemma-2-2B translation, and it’s been insightful. (I had no idea
German speakers were so distrustful of artificial sweeteners.)
  


Despite the short list of useful applications, this is the most excited
I’ve been about a new technology in years!

Guidelines for computing sizes and subscripts

2024-05-24T22:25:10Z

Occasionally we need to compute the size of an object that does not yet exist, or a subscript that may fall out of bounds. It’s easy to miss the edge cases where results overflow, creating a nasty, subtle bug, even in the presence of type safety. Ideally such computations happen in specialized code, such as inside an allocator (calloc, reallocarray) and not outside by the allocatee (i.e. malloc). Mitigations exist with different trade-offs: arbitrary precision, or using a wider fixed integer — i.e. 128-bit integers on 64-bit hosts. In the typical case, working only with fixed size-type integers, I’ve come up with a set of guidelines to avoid overflows in the edge cases.

1. Range check before computing a result. No exceptions.
2. Do not cast unless you know a priori the operand is in range.
3. Never mix unsigned and signed operands. Prefer signed. If you need to convert an operand, see (2).
4. Do not add unless you know a priori the result is in range.
5. Do not multiply unless you know a priori the result is in range.
6. Do not subtract unless you know a priori both signed operands are non-negative. For unsigned, that the second operand is not larger than the first (treat it like (4)).
7. Do not divide unless you know a priori the denominator is positive.
8. Make it correct first. Make it fast later, if needed.

These guidelines are also useful when reviewing code, tracking in your mind whether the invariants are held at each step. If not, you’ve likely found a bug. If in doubt, use assertions to document and check invariants. I compiled this list during code review, so for me that’s where it’s most useful.

Range check, then compute

Not strictly necessary when overflow is well-defined, i.e. wraparound, but it’s like defensive driving. It’s simpler and clearer to check with basic arithmetic rather than reason from a wraparound, i.e. a negative result. Checked math functions are fine, too, if you check the overflow boolean before accessing the result.

// bad
len++;
if (len <= 0) error();

// good
if (len == MAX) error();
len++;
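The checked math variant might look like the following sketch. It assumes GCC or Clang, whose `__builtin_add_overflow` reports overflow through its boolean return value. The `increment` helper is a hypothetical name for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: increment *len, reporting failure instead of
// overflowing. Assumes GCC or Clang for __builtin_add_overflow.
static bool
increment(ptrdiff_t *len)
{
    ptrdiff_t r;
    if (__builtin_add_overflow(*len, (ptrdiff_t)1, &r)) {
        return false;  // overflowed: do not touch r
    }
    *len = r;          // only read the result after the check
    return true;
}
```

The result is stored only after the overflow boolean has been checked, so a failed increment leaves the length unchanged.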

Casting

Casting from signed to unsigned, it’s as simple as knowing the value is non-negative, which is likely if you’re following (1). If a negative size has appeared, there’s already been a bug earlier in the program, and the only reasonable course of action is to abort, not handle it like an error.
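As a minimal sketch of that policy, a conversion helper can assert the invariant before casting. The name `to_size` is hypothetical, and using `assert` as the abort mechanism is my assumption.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: convert a signed length to size_t. A negative
// length means a bug already happened upstream, so this aborts via
// assert rather than reporting an error to the caller.
static size_t
to_size(ptrdiff_t len)
{
    assert(len >= 0);
    return (size_t)len;
}
```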

Addition

To check if addition will overflow, subtract one of the operands from the maximum value.

if (b > MAX - a) error();
r = a + b;
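Wrapped into a function, the same check might look like this sketch. It assumes sizes are `ptrdiff_t` and that both operands are already known to be non-negative. The name `add_sizes` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: add two non-negative sizes. Returns false
// instead of overflowing.
static bool
add_sizes(ptrdiff_t a, ptrdiff_t b, ptrdiff_t *r)
{
    assert(a >= 0);
    assert(b >= 0);
    if (b > PTRDIFF_MAX - a) {  // subtract before adding
        return false;
    }
    *r = a + b;
    return true;
}
```

Because `a` is non-negative, `PTRDIFF_MAX - a` cannot itself overflow.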

In pointer arithmetic addition, it’s a common mistake to compute the result pointer then compare it to the bounds. If the check failed, then the pointer already overflowed, i.e. undefined behavior. Major pieces of software, like glibc, are riddled with such pointer overflows. (Now that you’re aware of it, you’ll start noticing it everywhere. Sorry.)

// bad: never do this
beg += size;
if (beg > end) error();

To do this correctly, check integers not pointers. Like before, subtract before adding.

available = end - beg;
if (size > available) error();
beg += size;

Mind mixing signed and unsigned operands for the comparison operator (3), e.g. an unsigned size on the left and signed difference on the right.
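Here is a sketch of the integer-first check as a function, keeping both sides of the comparison signed. The name `take` and its interface are hypothetical.

```c
#include <stddef.h>

// Hypothetical helper: reserve size bytes from the region [*beg, end).
// Returns the start of the reservation, or NULL if it does not fit.
static char *
take(char **beg, char *end, ptrdiff_t size)
{
    ptrdiff_t available = end - *beg;  // signed difference
    if (size < 0 || size > available) {
        return NULL;                   // checked before moving the pointer
    }
    char *r = *beg;
    *beg += size;                      // known to stay within bounds
    return r;
}
```

The pointer only moves after the integer comparison passes, so it never points outside the region.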

Multiplication and division

If you’re working this out on your own, multiplication seems tricky until you’ve internalized a simple pattern. Just as we subtracted before adding, we need to divide before multiplying. Divide the maximum value by one of the operands:

if (a>0 && b>MAX/a) error();
r = a * b;

It’s often permitted for one or both to be zero, so mind divide-by-zero, which is handled above by the first condition. Sometimes size must be positive, e.g. the result of the sizeof operator in C, in which case we should prefer it as the denominator.

assert(size  >  0);
assert(count >= 0);
if (count > MAX/size) error();
total = count * size;
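The same pattern as a function, sketched under the assumption that sizes are `ptrdiff_t`. The name `total_size` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: compute count * size, returning false instead
// of overflowing. The positive operand serves as the denominator.
static bool
total_size(ptrdiff_t count, ptrdiff_t size, ptrdiff_t *total)
{
    assert(size  >  0);
    assert(count >= 0);
    if (count > PTRDIFF_MAX/size) {  // divide before multiplying
        return false;
    }
    *total = count * size;
    return true;
}
```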

With arena allocation there are usually two concerns. First, will it overflow when computing the total size, i.e. count * size? Second, is the total size within the arena capacity. Naively that’s two checks, but we can kill two birds with one stone: Check both at once by using the current arena capacity as the maximum value when considering overflow.

if (count > (end - beg)/size) error();
total = count * size;

One condition pulling double duty.

Subtraction

With signed sizes, the negative range is a long “runway” allowing a single unchecked subtraction before overflow might occur. In essence, we were exploiting this in order to check addition. The most common mistake with unsigned subtraction is not accounting for overflow when going below zero.

// note: signed "i" only
for (i = end - stride; i >= beg; i -= stride) ...

This loop will go awry if i is unsigned and beg <= stride.
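If the index must be unsigned, one way to restructure the loop is to compare the remaining distance against the stride before subtracting. This is a sketch of one possible rewrite, not the only one. It assumes `beg <= end` and `stride > 0`, and `count_steps` is a hypothetical name that just counts the visited positions.

```c
#include <stddef.h>

// Hypothetical helper: walk from end down toward beg in steps of
// stride using unsigned arithmetic, visiting the same positions as
// the signed loop. Assumes beg <= end and stride > 0.
static size_t
count_steps(size_t beg, size_t end, size_t stride)
{
    size_t n = 0;
    size_t i = end;
    while (i - beg >= stride) {  // i >= beg holds, so no wraparound
        i -= stride;             // result is still >= beg
        n++;                     // ... visit position i here ...
    }
    return n;
}
```

The subtraction `i - beg` is safe because the loop maintains `i >= beg`, and `i -= stride` is safe because the condition just established that at least `stride` remains.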

In special cases we can get away with a second subtraction without an overflow check if we know some properties of our operands. For example, my arena allocators look like this:

padding = -beg & (align - 1);
if (count >= (end - beg - padding)/size) error();

That’s two subtractions in a row. However, end - beg describes the size of a realized object, and align is a small constant (e.g. 2^(0–6)). It could only overflow if the entirety of memory was occupied by the arena.

Bonus, advanced note: This check is actually pulling triple duty. Notice that I used >= instead of >. The arena can’t fill exactly to the brim, but it handles the extreme edge case where count is zero, the arena is nearly full, but the bump pointer is unaligned. The result of subtracting padding is negative, which rounds to zero by integer division, and would pass a > check. That wouldn’t be a problem except that aligning the bump pointer would break the invariant beg <= end.
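Putting the padding computation and the combined check together, a complete allocation function might look like the sketch below. The `Arena` struct, the `arena_alloc` name, and the zeroing of returned memory are my assumptions for illustration. Only the two lines computing `padding` and the `>=` check come from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical arena: a bump pointer and the end of the region.
typedef struct {
    char *beg;
    char *end;
} Arena;

// Allocate count zeroed objects of the given size and alignment.
// Returns NULL if the request overflows or does not fit.
static void *
arena_alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    assert(count >= 0);
    assert(size  >  0);
    assert(align >  0 && (align & (align - 1)) == 0);  // power of two

    ptrdiff_t padding =
        (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (count >= (a->end - a->beg - padding)/size) {
        return NULL;  // overflow, out of space, or unalignable
    }
    char *r = a->beg + padding;
    a->beg = r + count*size;  // in range thanks to the check above
    return memset(r, 0, (size_t)(count*size));
}
```

The multiplication happens only after the division-based check, so `count*size` is known to fit within the remaining arena.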

Try it for yourself

Next time you’re reviewing code that computes sizes or subscripts, bring the list up and see how well it follows the guidelines. If it misses one, try to contrive an input that causes an overflow. If it follows guidelines and you can still contrive such an input, then perhaps the list could use another item!

Conventions for Command Line Options

2020-08-01T00:34:23Z

This article was discussed on Hacker News and critiqued on Wandering Thoughts (2, 3).

Command line interfaces have varied throughout their brief history but have largely converged to some common, sound conventions. The core originates from unix, and the Linux ecosystem extended it, particularly via the GNU project. Unfortunately some tools initially appear to follow the conventions, but subtly get them wrong, usually for no practical benefit. I believe in many cases the authors simply didn’t know any better, so I’d like to review the conventions.

Short Options

The simplest case is the short option flag. An option is a hyphen — specifically HYPHEN-MINUS U+002D — followed by one alphanumeric character. Capital letters are acceptable. The letters themselves have conventional meanings and are worth following if possible.

program -a -b -c

Flags can be grouped together into one program argument. This is both convenient and unambiguous. It’s also one of those often missed details when programs use hand-coded argument parsers, and the lack of support irritates me.

program -abc
program -acb

The next simplest case is short options that take arguments. The argument follows the option.

program -i input.txt -o output.txt

The space is optional, so the option and argument can be packed together into one program argument. Since the argument is required, this is still unambiguous. This is another often-missed feature in hand-coded parsers.

program -iinput.txt -ooutput.txt

This does not prohibit grouping. When grouped, the option accepting an argument must be last.

program -abco output.txt
program -abcooutput.txt

This technique is used to create another category, optional option arguments. The option’s argument can be optional but still unambiguous so long as the space is always omitted when the argument is present.

program -c       # omitted
program -cblue   # provided
program -c blue  # omitted (blue is a new argument)

program -c -x   # two separate flags
program -c-x    # -c with argument "-x"

Optional option arguments should be used judiciously since they can be surprising, but they have their uses.

Options can typically appear in any order — something parsers often achieve via permutation — but non-options typically follow options.

program -a -b foo bar
program -b -a foo bar

GNU-style programs usually allow options and non-options to be mixed, though I don’t consider this to be essential.

program -a foo -b bar
program foo -a -b bar
program foo bar -a -b

If a non-option looks like an option because it starts with a hyphen, use -- to demarcate options from non-options.

program -a -b -- -x foo bar

An advantage of requiring that non-options follow options is that the first non-option demarcates the two groups, so -- is less often needed.

# note: without argument permutation
program -a -b foo -x bar  # 2 options, 3 non-options
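To make the short option rules concrete, here is a minimal parser sketch in C. It is not the author's parser: the `struct shortopt` state and `shortopt_next` function are hypothetical names, and the spec notation (`x` for a flag, `x:` for a required argument, `x::` for an optional argument) is borrowed from GNU getopt. It handles grouping, attached and detached arguments, optional arguments, and `--`. It does not permute arguments or handle long options.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical parser state. Set argc/argv, index = 1, and pos = 0.
struct shortopt {
    int    argc;
    char **argv;
    int    index;  // next argv element to examine
    int    pos;    // offset inside a grouped argument, 0 if not in one
    char  *arg;    // argument of the last option returned, or NULL
};

// Return the next option character, -1 when options are exhausted,
// '?' for an unknown option, or ':' for a missing required argument.
static int
shortopt_next(struct shortopt *s, const char *spec)
{
    s->arg = NULL;
    if (s->pos == 0) {
        if (s->index >= s->argc) return -1;
        char *a = s->argv[s->index];
        if (a[0] != '-' || a[1] == 0) return -1;  // first non-option
        if (a[1] == '-' && a[2] == 0) {           // "--" ends options
            s->index++;
            return -1;
        }
        s->pos = 1;
    }

    char *a = s->argv[s->index];
    int c = (unsigned char)a[s->pos++];
    char *rest = a + s->pos;
    const char *p = c == ':' ? NULL : strchr(spec, c);

    if (!p || p[1] != ':') {   // unknown option or plain flag
        if (!*rest) {          // reached the end of this group
            s->index++;
            s->pos = 0;
        }
        return p ? c : '?';
    }

    // The option takes an argument: the rest of this element, if any.
    s->index++;
    s->pos = 0;
    if (*rest) {
        s->arg = rest;         // e.g. -ooutput.txt or -cblue
    } else if (p[2] != ':') {  // required: consume the next element
        if (s->index >= s->argc) return ':';
        s->arg = s->argv[s->index++];
    }
    return c;
}
```

When the function returns -1, `index` refers to the first non-option argument. An optional argument is only recognized when attached, which is what keeps `-c blue` unambiguous.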

Long options

Since short options can be cryptic, and there are such a limited number of them, more complex programs support long options. A long option starts with two hyphens followed by one or more alphanumeric, lowercase words. Hyphens separate words. Using two hyphens prevents long options from being confused for grouped short options.

program --reverse --ignore-backups

Occasionally flags are paired with a mutually exclusive inverse flag that begins with --no-. This avoids a future flag day where the default is changed in the release that also adds the flag implementing the original behavior.

program --sort
program --no-sort

Long options can similarly accept arguments.

program --output output.txt --block-size 1024

These may optionally be connected to the argument with an equals sign =, much like omitting the space for a short option argument.

program --output=output.txt --block-size=1024

Like before, this opens up the doors for optional option arguments. Due to the required = this is still unambiguous.

program --color --reverse
program --color=never --reverse

The -- retains its original behavior of disambiguating option-like non-option arguments:

program --reverse -- --foo bar
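Recognizing a long option and its `=`-attached argument takes only a few lines. The sketch below splits one command line argument into a name length and an optional value. The name `split_long` and its interface are hypothetical, and it leaves matching the name against known options to the caller.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: if arg is a long option, store the length of
// its name (the part after "--") and a pointer to the text after "=",
// or NULL when there is no "=". Returns 1 for a long option, else 0.
static int
split_long(const char *arg, size_t *namelen, const char **value)
{
    if (strncmp(arg, "--", 2) != 0 || arg[2] == 0) {
        return 0;  // not a long option (a bare "--" is the delimiter)
    }
    const char *name = arg + 2;
    const char *eq = strchr(name, '=');
    if (eq) {
        *namelen = (size_t)(eq - name);
        *value = eq + 1;   // attached argument, possibly empty
    } else {
        *namelen = strlen(name);
        *value = NULL;     // no attached argument
    }
    return 1;
}
```

For an option with an optional argument, a NULL value means the argument was omitted, matching the `--color` versus `--color=never` example.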

Subcommands

Some programs, such as Git, have subcommands each with their own options. The main program itself may still have its own options distinct from subcommand options. The program’s options come before the subcommand and subcommand options follow the subcommand. Options are never permuted around the subcommand.

program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz

Above, the -a, -b, and -c options are for program, and the others are for subcommand. So, really, the subcommand is another command line of its own.

Option parsing libraries

There’s little excuse for not getting these conventions right assuming you’re interested in following the conventions. Short options can be parsed correctly in just ~60 lines of C code. Long options are just slightly more complex.

GNU’s getopt_long() supports long option abbreviation — with no way to disable it (!) — but this should be avoided.

Go’s flag package intentionally deviates from the conventions. It only supports long option semantics, via a single hyphen. This makes it impossible to support grouping even if all options are only one letter. Also, the only way to combine option and argument into a single command line argument is with =. It’s sound, but I miss both features every time I write programs in Go. That’s why I wrote my own argument parser. Not only does it have a nicer feature set, I like the API a lot more, too.

Python’s primary option parsing library is argparse, and I just can’t stand it. Despite appearing to follow convention, it actually breaks convention and its behavior is unsound. For instance, the following program has two options, --foo and --bar. The --foo option accepts an optional argument, and the --bar option is a simple flag.

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--foo', type=str, nargs='?', default='X')
parser.add_argument('--bar', action='store_true')
print(parser.parse_args(sys.argv[1:]))

Here are some example runs:

$ python parse.py
Namespace(bar=False, foo='X')

$ python parse.py --foo
Namespace(bar=False, foo=None)

$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')

$ python parse.py --bar --foo
Namespace(bar=True, foo=None)

$ python parse.py --foo arg
Namespace(bar=False, foo='arg')

Everything looks good except the last. If the --foo argument is optional then why did it consume arg? What happens if I follow it with --bar? Will it consume it as the argument?

$ python parse.py --foo --bar
Namespace(bar=True, foo=None)

Nope! Unlike arg, it left --bar alone, so instead of following the unambiguous conventions, it has its own ambiguous semantics and attempts to remedy them with a “smart” heuristic: “If an optional argument looks like an option, then it must be an option!” Non-option arguments can never follow an option with an optional argument, which makes that feature pretty useless. Since argparse does not properly support --, that does not help.

$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg

Please, stick to the conventions unless you have really good reasons to break them!

How to Read UTF-8 Passwords on the Windows Console

2020-05-04T02:14:34Z

This article was discussed on Hacker News.

Suppose you’re writing a command line program that prompts the user for a password or passphrase, and Windows is one of the supported platforms (even very old versions). This program uses UTF-8 for its string representation, as it should, and so ideally it receives the password from the user encoded as UTF-8. On most platforms this is, for the most part, automatic. However, on Windows finding the correct answer to this problem is a maze where all the signs lead towards dead ends. I recently navigated this maze and found the way out.

I knew it was possible because my passphrase2pgp tool has been using the golang.org/x/crypto/ssh/terminal package, which gets it very nearly perfect. Though they were still fixing subtle bugs as recently as 6 months ago.

The first step is to ignore just about everything you find online, because it’s either wrong or it’s solving a slightly different problem. I’ll discuss the dead ends later and focus on the solution first. Ultimately I want to implement this on Windows:

// Display prompt then read zero-terminated, UTF-8 password.
// Return password length with terminator, or zero on error.
int read_password(char *buf, int len, const char *prompt);

I chose int for the length rather than size_t because it’s a password and should not even approach INT_MAX.

The correct way

For the impatient: complete, working, ready-to-use example

On a unix-like system, the program would:

open(2) the special /dev/tty file for reading and writing
write(2) the prompt
tcgetattr(3) and tcsetattr(3) to disable ECHO
read(2) a line of input
Restore the old terminal attributes with tcsetattr(3)
close(2) the file
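Those steps can be sketched with POSIX calls as follows. This is an illustration under assumptions, not the code from the complete example: the `read_password_from` name and its extra `tty` parameter (pass "/dev/tty" in practice) are hypothetical, and error handling is kept minimal.

```c
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

// Hypothetical helper: prompt on the given terminal device and read a
// password with echo disabled. Returns the password length including
// the terminator, or zero on error (including input that is too long).
static int
read_password_from(const char *tty, char *buf, int len, const char *prompt)
{
    if (len < 2) return 0;
    int fd = open(tty, O_RDWR | O_NOCTTY);
    if (fd < 0) return 0;

    size_t plen = strlen(prompt);
    struct termios orig;
    if (write(fd, prompt, plen) != (ssize_t)plen ||
        tcgetattr(fd, &orig) != 0) {
        close(fd);
        return 0;
    }

    struct termios noecho = orig;
    noecho.c_lflag &= ~(tcflag_t)ECHO;
    if (tcsetattr(fd, TCSAFLUSH, &noecho) != 0) {
        close(fd);
        return 0;
    }

    ssize_t n = read(fd, buf, (size_t)len - 1);  // one line of input
    tcsetattr(fd, TCSAFLUSH, &orig);             // restore echo
    if (write(fd, "\n", 1) != 1) {
        // the typed newline was not echoed; nothing useful to do here
    }
    close(fd);

    if (n <= 0 || buf[n - 1] != '\n') {
        return 0;  // error, end of input, or truncated line
    }
    buf[n - 1] = 0;  // replace the newline with the terminator
    return (int)n;
}
```

Calling `read_password_from("/dev/tty", buf, sizeof(buf), "password: ")` talks to the user's terminal even when standard input and output are redirected.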

A great advantage of this approach is that it doesn’t depend on standard input and standard output. Either or both can be redirected elsewhere, and this function still interacts with the user’s terminal. The Windows version will have the same advantage.

Despite some tempting shortcuts that don’t work, the steps on Windows are basically the same but with different names. There are a couple subtleties and extra steps. I’ll be ignoring errors in my code snippets below, but the complete example has full error handling.

Create console handles

Instead of /dev/tty, the program opens two files: CONIN$ and CONOUT$ using CreateFileA(). Note: The “A” stands for ANSI, as opposed to “W” for wide (Unicode). This refers to the encoding of the file name, not to how the file contents are encoded. CONIN$ is opened for both reading and writing because write permissions are needed to change the console’s mode.

HANDLE hi = CreateFileA(
    "CONIN$",
    GENERIC_READ | GENERIC_WRITE,
    0,
    0,
    OPEN_EXISTING,
    0,
    0
);
HANDLE ho = CreateFileA(
    "CONOUT$",
    GENERIC_WRITE,
    0,
    0,
    OPEN_EXISTING,
    0,
    0
);

Print the prompt

To write the prompt, call WriteConsoleA() on the output handle. On its own, this assumes the prompt is plain ASCII (i.e. "password: "), not UTF-8 (i.e. "contraseña: "):

WriteConsoleA(ho, prompt, strlen(prompt), 0, 0);

If the prompt may contain UTF-8 data, perhaps because it displays a username or isn’t in English, you have two options:

Convert the prompt to UTF-16 and call WriteConsoleW() instead.
Use SetConsoleOutputCP() with CP_UTF8 (65001). This is a global (to the console) setting and should be restored when done.

Disable echo

Next use GetConsoleMode() and SetConsoleMode() to disable echo. The console usually has ENABLE_PROCESSED_INPUT already set, which tells the console to handle CTRL-C and such, but I set it explicitly just in case. I also set ENABLE_LINE_INPUT so that the user can use backspace and so that the entire line is delivered at once.

DWORD orig = 0;
GetConsoleMode(hi, &orig);

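DWORD mode = orig;
mode |= ENABLE_PROCESSED_INPUT;  // console handles CTRL-C and such
mode |= ENABLE_LINE_INPUT;       // backspace works, whole line at once
mode &= ~ENABLE_ECHO_INPUT;      // do not echo the password
SetConsoleMode(hi, mode);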

There are reports that ENABLE_LINE_INPUT limits reads to 254 bytes, but I was unable to reproduce it. My full example can read huge passwords without trouble.

The old mode is saved in orig so that it can be restored later.

Read the password

Here’s where you have to pay the piper. As of the date of this article, the Windows API offers no method for reading UTF-8 input from the console. Give up on that hope now. If you use the “ANSI” functions to read input under any configuration, they will do the usual Windows thing of silently mangling your input.

So you must use the UTF-16 API, ReadConsoleW(), and then encode it yourself. Fortunately Win32 provides a UTF-8 encoder, WideCharToMultiByte(), which will even handle surrogate pairs for all those people who like putting PILE OF POO (U+1F4A9) in their passwords:

SIZE_T wbuf_len = (len - 1 + 2)*sizeof(WCHAR);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread;
ReadConsoleW(hi, wbuf, len - 1 + 2, &nread, 0);
wbuf[nread-2] = 0;  // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0);
SecureZeroMemory(wbuf, wbuf_len);
HeapFree(GetProcessHeap(), 0, wbuf);

I use SecureZeroMemory() to erase the UTF-16 version of the password before freeing the buffer. The + 2 in the allocation is for the CRLF line ending that will later be chopped off. The error handling version checks that the input did indeed end with CRLF. Otherwise it was truncated (too long).
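To illustrate what the encoding step involves, here is a portable sketch of UTF-16 to UTF-8 conversion with surrogate pair handling. It is not how WideCharToMultiByte() is implemented, and `utf16_to_utf8` is a hypothetical name. As a deliberate simplification, this sketch rejects unpaired surrogates outright.

```c
#include <stdint.h>

// Hypothetical helper: encode zero-terminated UTF-16 as zero-terminated
// UTF-8. Returns the number of bytes written including the terminator,
// or zero if the output does not fit or a surrogate is unpaired.
static int
utf16_to_utf8(const uint16_t *src, char *dst, int len)
{
    unsigned char *out = (unsigned char *)dst;
    int n = 0;
    for (;;) {
        uint32_t c = *src++;
        if (c >= 0xd800 && c <= 0xdbff) {         // high surrogate
            uint32_t lo = *src;
            if (lo < 0xdc00 || lo > 0xdfff) return 0;
            src++;
            c = 0x10000 + ((c - 0xd800) << 10) + (lo - 0xdc00);
        } else if (c >= 0xdc00 && c <= 0xdfff) {  // stray low surrogate
            return 0;
        }

        int need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (need > len - n) return 0;             // check, then write
        switch (need) {
        case 1:
            out[n+0] = (unsigned char)c;
            break;
        case 2:
            out[n+0] = (unsigned char)(0xc0 | (c >> 6));
            out[n+1] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        case 3:
            out[n+0] = (unsigned char)(0xe0 | (c >> 12));
            out[n+1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            out[n+2] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        case 4:
            out[n+0] = (unsigned char)(0xf0 | (c >> 18));
            out[n+1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            out[n+2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            out[n+3] = (unsigned char)(0x80 | (c & 0x3f));
            break;
        }
        n += need;
        if (c == 0) return n;  // terminator has been written
    }
}
```

A surrogate pair such as the one for U+1F4A9 becomes a single four-byte UTF-8 sequence rather than two three-byte ones.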

Clean up

Finally print a newline since the user-typed one wasn’t echoed, restore the old console mode, close the console handles, and return the final encoded length:

WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;

The error checking version doesn’t check for errors from any of these functions since either they cannot fail, or there’s nothing reasonable to do in the event of an error.

Dead ends

If you look around the Win32 API you might notice SetConsoleCP(). A reasonable person might think that setting the “code page” to UTF-8 (CP_UTF8) might configure the console to encode input in UTF-8. The good news is Windows will no longer mangle your input as before. The bad news is that it will be mangled differently.

You might think you can use the CRT function _setmode() with _O_U8TEXT on the FILE * connected to the console. This does nothing useful. (The only use for _setmode() is with _O_BINARY, to disable braindead character translation on standard input and output.) The best you’ll be able to do with the CRT is the same sort of wide character read using non-standard functions, followed by conversion to UTF-8.

CredUICmdLinePromptForCredentials() promises to be both a mouthful of a function name, and a prepacked solution to this problem. It only delivers on the first. This function seems to have broken some time ago and nobody at Microsoft noticed, probably because nobody has ever used this function. I couldn’t find a working example, nor a use in any real application. When I tried to use it, I got a nonsense error code, and it never worked. There’s a GUI version of this function that does work, and it’s a viable alternative for certain situations, though not mine.

At my most desperate, I hoped ENABLE_VIRTUAL_TERMINAL_PROCESSING would be a magical switch. On Windows 10 it magically enables some ANSI escape sequences. The documentation in no way suggests it would work, and I confirmed by experimentation that it does not. Pity.

I spent a lot of time searching down these dead ends until finally settling with ReadConsoleW() above. I hoped it would be more automatic, but I’m glad I have at least some solution figured out.

Render Multimedia in Pure C

2017-11-03T22:31:15Z

Update 2020: I’ve produced many more examples over the years (even more).

In a previous article I demonstrated video filtering with C and a unix pipeline. Thanks to the ubiquitous support for the ridiculously simple Netpbm formats — specifically the “Portable PixMap” (.ppm, P6) binary format — it’s trivial to parse and produce image data in any language without image libraries. Video decoders and encoders at the ends of the pipeline do the heavy lifting of processing the complicated video formats actually used to store and transmit video.

Naturally this same technique can be used to produce new video in a simple program. All that’s needed are a few functions to render artifacts — lines, shapes, etc. — to an RGB buffer. With a bit of basic sound synthesis, the same concept can be applied to create audio in a separate audio stream — in this case using the simple (but not as simple as Netpbm) WAV format. Put them together and a small, standalone program can create multimedia.

Here’s the demonstration video I’ll be going through in this article. It animates and visualizes various in-place sorting algorithms (see also). The elements are rendered as colored dots, ordered by hue, with red at 12 o’clock. A dot’s distance from the center is proportional to its corresponding element’s distance from its correct position. Each dot emits a sinusoidal tone with a unique frequency when it swaps places in a particular frame.

Original credit for this visualization concept goes to w0rthy.

All of the source code (less than 600 lines of C), ready to run, can be found here:

https://github.com/skeeto/sort-circle

On any modern computer, rendering is real-time, even at 60 FPS, so you may be able to pipe the program’s output directly into your media player of choice. (If not, consider getting a better media player!)

$ ./sort | mpv --no-correct-pts --fps=60 -

VLC requires some help from ppmtoy4m:

$ ./sort | ppmtoy4m -F60:1 | vlc -

Or you can just encode it to another format. Recent versions of libavformat can input PPM images directly, which means x264 can read the program’s output directly:

$ ./sort | x264 --fps 60 -o video.mp4 /dev/stdin

By default there is no audio output. I wish there was a nice way to embed audio with the video stream, but this requires a container and that would destroy all the simplicity of this project. So instead, the -a option captures the audio in a separate file. Use ffmpeg to combine the audio and video into a single media file:

$ ./sort -a audio.wav | x264 --fps 60 -o video.mp4 /dev/stdin
$ ffmpeg -i video.mp4 -i audio.wav -vcodec copy -acodec mp3 \
         combined.mp4

You might think you’ll be clever by using mkfifo (i.e. a named pipe) to pipe both audio and video into ffmpeg at the same time. This will only result in a deadlock since neither program is prepared for this. One will be blocked writing one stream while the other is blocked reading on the other stream.

Several years ago my intern and I used the exact same pure C rendering technique to produce these raytracer videos:

I also used this technique to illustrate gap buffers.

Pixel format and rendering

This program really only has one purpose: rendering a sorting video with a fixed, square resolution. So rather than write generic image rendering functions, some assumptions will be hard coded. For example, the video size will just be hard coded and assumed square, making it simpler and faster. I chose 800x800 as the default:

#define S     800

Rather than define some sort of color struct with red, green, and blue fields, color will be represented by a 24-bit integer (long). I arbitrarily chose red to be the most significant 8 bits. This has nothing to do with the order of the individual channels in Netpbm since these integers are never dumped out. (This would have stupid byte-order issues anyway.) “Color literals” are particularly convenient and familiar in this format. For example, the constant for pink: 0xff7f7fUL.

In practice the color channels will be operated upon separately, so here are a couple of helper functions to convert the channels between this format and normalized floats (0.0–1.0).

static void
rgb_split(unsigned long c, float *r, float *g, float *b)
{
    *r = ((c >> 16) / 255.0f);
    *g = (((c >> 8) & 0xff) / 255.0f);
    *b = ((c & 0xff) / 255.0f);
}

static unsigned long
rgb_join(float r, float g, float b)
{
    unsigned long ir = roundf(r * 255.0f);
    unsigned long ig = roundf(g * 255.0f);
    unsigned long ib = roundf(b * 255.0f);
    return (ir << 16) | (ig << 8) | ib;
}

Originally I decided the integer form would be sRGB, and these functions handled the conversion to and from sRGB. Since it had no noticeable effect on the output video, I discarded it. In more sophisticated rendering you may want to take this into account.

The RGB buffer where images are rendered is just a plain old byte buffer with the same pixel format as PPM. The ppm_set() function writes a color to a particular pixel in the buffer, assumed to be S by S pixels. The complement to this function is ppm_get(), which will be needed for blending.

static void
ppm_set(unsigned char *buf, int x, int y, unsigned long color)
{
    buf[y * S * 3 + x * 3 + 0] = color >> 16;
    buf[y * S * 3 + x * 3 + 1] = color >>  8;
    buf[y * S * 3 + x * 3 + 2] = color >>  0;
}

static unsigned long
ppm_get(unsigned char *buf, int x, int y)
{
    unsigned long r = buf[y * S * 3 + x * 3 + 0];
    unsigned long g = buf[y * S * 3 + x * 3 + 1];
    unsigned long b = buf[y * S * 3 + x * 3 + 2];
    return (r << 16) | (g << 8) | b;
}

Since the buffer is already in the right format, writing an image is dead simple. I like to flush after each frame so that observers generally see clean, complete frames. It helps in debugging.

static void
ppm_write(const unsigned char *buf, FILE *f)
{
    fprintf(f, "P6\n%d %d\n255\n", S, S);
    fwrite(buf, S * 3, S, f);
    fflush(f);
}

Dot rendering

If you zoom into one of those dots, you may notice it has a nice smooth edge. Here’s one rendered at 30x the normal resolution. I did not render, then scale this image in another piece of software. This is straight out of the C program.

In an early version of this program I used a dumb dot rendering routine. It took a color and a hard, integer pixel coordinate. All the pixels within a certain distance of this coordinate were set to the color, everything else was left alone. This had two bad effects:

Dots jittered as they moved around since their positions were rounded to the nearest pixel for rendering. A dot would be centered on one pixel, then suddenly centered on another pixel. This looked bad even when those pixels were adjacent.
There’s no blending between dots when they overlap, making the lack of anti-aliasing even more pronounced.

Instead the dot’s position is computed in floating point and is actually rendered as if it were between pixels. This is done with a shader-like routine that uses smoothstep — just as found in shader languages — to give the dot a smooth edge. That edge is blended into the image, whether that’s the background or a previously-rendered dot. The input to the smoothstep is the distance from the floating point coordinate to the center (or corner?) of the pixel being rendered, maintaining that between-pixel smoothness.
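The smoothstep function itself is not shown in this article. A sketch following the usual shader-language definition (clamp, then the cubic 3t² − 2t³) would look like this. I am assuming the program's version behaves the same way. Note that the formula also works when the first edge is larger than the second, which simply reverses the ramp.

```c
// Sketch of smoothstep as defined in shader languages: 0 at edge0,
// 1 at edge1, with a smooth cubic ramp in between. Passing
// edge0 > edge1 reverses the ramp.
static float
smoothstep(float edge0, float edge1, float x)
{
    float t = (x - edge0) / (edge1 - edge0);
    if (t < 0.0f) t = 0.0f;  // clamp to [0, 1]
    if (t > 1.0f) t = 1.0f;
    return t * t * (3.0f - 2.0f * t);
}
```

With an outer radius as the first edge and an inner radius as the second, the result is 1 inside the inner radius, 0 outside the outer radius, and a smooth falloff in between.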

Rather than dump the whole function here, let’s look at it piece by piece. I have two new constants to define the inner dot radius and the outer dot radius. It’s smooth between these radii.

#define R0    (S / 400.0f)  // dot inner radius
#define R1    (S / 200.0f)  // dot outer radius

The dot-drawing function takes the image buffer, the dot’s coordinates, and its foreground color.

static void
ppm_dot(unsigned char *buf, float x, float y, unsigned long fgc);

The first thing to do is extract the color components.

    float fr, fg, fb;
    rgb_split(fgc, &fr, &fg, &fb);

Next determine the range of pixels over which the dot will be drawn. These are based on the two radii and will be used for looping.

    int miny = floorf(y - R1 - 1);
    int maxy = ceilf(y + R1 + 1);
    int minx = floorf(x - R1 - 1);
    int maxx = ceilf(x + R1 + 1);

Here’s the loop structure. Everything else will be inside the innermost loop. The dx and dy are the floating point distances from the center of the dot.

    for (int py = miny; py <= maxy; py++) {
        float dy = py - y;
        for (int px = minx; px <= maxx; px++) {
            float dx = px - x;
            /* ... */
        }
    }

Use the x and y distances to compute the distance and smoothstep value, which will be the alpha. Within the inner radius the color is on 100%. Outside the outer radius it’s 0%. Elsewhere it’s something in between.

            float d = sqrtf(dy * dy + dx * dx);
            float a = smoothstep(R1, R0, d);

Get the background color, extract its components, and blend the foreground and background according to the computed alpha value. Finally write the pixel back into the buffer.

            unsigned long bgc = ppm_get(buf, px, py);
            float br, bg, bb;
            rgb_split(bgc, &br, &bg, &bb);

            float r = a * fr + (1 - a) * br;
            float g = a * fg + (1 - a) * bg;
            float b = a * fb + (1 - a) * bb;
            ppm_set(buf, px, py, rgb_join(r, g, b));

That’s all it takes to render a smooth dot anywhere in the image.

Rendering the array

The array being sorted is just a global variable. This simplifies some of the sorting functions since a few are implemented recursively. They can call for a frame to be rendered without needing to pass the full array. With the dot-drawing routine done, rendering a frame is easy:

#define N     360           // number of dots

static int array[N];

static void
frame(void)
{
    static unsigned char buf[S * S * 3];
    memset(buf, 0, sizeof(buf));
    for (int i = 0; i < N; i++) {
        float delta = abs(i - array[i]) / (N / 2.0f);
        float x = -sinf(i * 2.0f * PI / N);
        float y = -cosf(i * 2.0f * PI / N);
        float r = S * 15.0f / 32.0f * (1.0f - delta);
        float px = r * x + S / 2.0f;
        float py = r * y + S / 2.0f;
        ppm_dot(buf, px, py, hue(array[i]));
    }
    ppm_write(buf, stdout);
}

The buffer is static since it will be rather large, especially if S is cranked up. Otherwise it’s likely to overflow the stack. The memset() fills it with black. If you wanted a different background color, here’s where you change it.

For each element, compute its delta from the proper array position, which becomes its distance from the center of the image. The angle is based on its actual position. The hue() function (not shown in this article) returns the color for the given element.
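
To make the radius arithmetic concrete, here is a minimal sketch that isolates it from frame(). The name dot_radius and its n and s parameters are hypothetical stand-ins for the article’s N and S constants; the formula itself is copied from the loop above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: radius (in pixels) of the dot for array slot i
 * holding `value`, with n elements drawn into an s-by-s image. */
static float
dot_radius(int i, int value, int n, int s)
{
    assert(n > 0 && s > 0);
    float delta = abs(i - value) / (n / 2.0f);  /* 0 when the element is home */
    return s * 15.0f / 32.0f * (1.0f - delta);  /* largest when delta is 0 */
}
```

With n = 360 and s = 800, an element in its proper slot sits on the outer ring at radius 375, while one that is 180 slots from home lands on the center. An element more than n/2 slots away gets a delta above 1 and therefore a negative radius, which puts it on the opposite side of the center.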

With the frame() function complete, all I need is a sorting function that calls frame() at appropriate times. Here are a couple of examples:

static void
shuffle(int array[N], uint64_t *rng)
{
    for (int i = N - 1; i > 0; i--) {
        uint32_t r = pcg32(rng) % (i + 1);
        swap(array, i, r);
        frame();
    }
}

static void
sort_bubble(int array[N])
{
    int c;
    do {
        c = 0;
        for (int i = 1; i < N; i++) {
            if (array[i - 1] > array[i]) {
                swap(array, i - 1, i);
                c = 1;
            }
        }
        frame();
    } while (c);
}

Synthesizing audio

To add audio I need to keep track of which elements were swapped in this frame. When producing a frame I need to generate and mix tones for each element that was swapped.

Notice the swap() function above? That’s not just for convenience. That’s also how things are tracked for the audio.

static int swaps[N];

static void
swap(int a[N], int i, int j)
{
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
    swaps[(a - array) + i]++;
    swaps[(a - array) + j]++;
}

Before we get ahead of ourselves I need to write a WAV header. Without getting into the purpose of each field, just note that the header has 13 fields, followed immediately by 16-bit little-endian PCM samples. There will be only one channel (mono).

#define HZ    44100         // audio sample rate

static void
wav_init(FILE *f)
{
    emit_u32be(0x52494646UL, f); // "RIFF"
    emit_u32le(0xffffffffUL, f); // file length
    emit_u32be(0x57415645UL, f); // "WAVE"
    emit_u32be(0x666d7420UL, f); // "fmt "
    emit_u32le(16,           f); // struct size
    emit_u16le(1,            f); // PCM
    emit_u16le(1,            f); // mono
    emit_u32le(HZ,           f); // sample rate (i.e. 44.1 kHz)
    emit_u32le(HZ * 2,       f); // byte rate
    emit_u16le(2,            f); // block size
    emit_u16le(16,           f); // bits per sample
    emit_u32be(0x64617461UL, f); // "data"
    emit_u32le(0xffffffffUL, f); // byte length
}

Rather than tackle the annoying problem of figuring out the total length of the audio ahead of time, I just wave my hands and write the maximum possible number of bytes (0xffffffff). Most software that can read WAV files will understand this to mean the entire rest of the file contains samples.
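
The emit_* helpers are not shown in this part of the article. As a rough sketch of what they might look like, assuming they simply write one byte at a time with fputc() in the stated byte order (the bodies below are my assumption, not necessarily the original implementation):

```c
#include <assert.h>
#include <stdio.h>

/* Assumed body: write a 16-bit value, least significant byte first. */
static void
emit_u16le(unsigned v, FILE *f)
{
    assert(f);
    fputc((v >> 0) & 0xff, f);
    fputc((v >> 8) & 0xff, f);
}

/* Assumed body: write a 32-bit value, least significant byte first. */
static void
emit_u32le(unsigned long v, FILE *f)
{
    assert(f);
    fputc((v >>  0) & 0xff, f);
    fputc((v >>  8) & 0xff, f);
    fputc((v >> 16) & 0xff, f);
    fputc((v >> 24) & 0xff, f);
}

/* Assumed body: write a 32-bit value, most significant byte first.
 * This is what lets a tag like "RIFF" be written as 0x52494646. */
static void
emit_u32be(unsigned long v, FILE *f)
{
    assert(f);
    fputc((v >> 24) & 0xff, f);
    fputc((v >> 16) & 0xff, f);
    fputc((v >>  8) & 0xff, f);
    fputc((v >>  0) & 0xff, f);
}
```

Because the parameters are unsigned, passing a negative int (such as a signed sample) converts it modulo a power of two, so the low 16 bits come out as the two’s complement encoding that 16-bit PCM expects.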

With the header out of the way all I have to do is write 1/60th of a second worth of samples to this file each time a frame is produced. That’s 735 samples (1,470 bytes) at 44.1kHz.

The simplest place to do audio synthesis is in frame() right after rendering the image.

#define FPS   60            // output framerate
#define MINHZ 20            // lowest tone
#define MAXHZ 1000          // highest tone

static void
frame(void)
{
    /* ... rendering ... */

    /* ... synthesis ... */
}

With the largest tone frequency at 1kHz, Nyquist says we only need to sample at 2kHz. 8kHz is a very common sample rate and gives some overhead space, making it a good choice. However, I found that audio encoding software was a lot happier to accept the standard CD sample rate of 44.1kHz, so I stuck with that.

The first thing to do is to allocate and zero a buffer for this frame’s samples.

    int nsamples = HZ / FPS;
    static float samples[HZ / FPS];
    memset(samples, 0, sizeof(samples));

Next determine how many “voices” there are in this frame. This is used to mix the samples by averaging them. If an element was swapped more than once this frame, it’s a little louder than the others — i.e. it’s played twice at the same time, in phase.

    int voices = 0;
    for (int i = 0; i < N; i++)
        voices += swaps[i];

Here’s the most complicated part. I use sinf() to produce the sinusoidal wave based on the element’s frequency. I also use a parabola as an envelope to shape the beginning and ending of this tone so that it fades in and fades out. Otherwise you get the nasty, high-frequency “pop” sound as the wave is given a hard cut off.

    for (int i = 0; i < N; i++) {
        if (swaps[i]) {
            float hz = i * (MAXHZ - MINHZ) / (float)N + MINHZ;
            for (int j = 0; j < nsamples; j++) {
                float u = 1.0f - j / (float)(nsamples - 1);
                float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);
                float envelope = parabola * parabola * parabola;
                float v = sinf(j * 2.0f * PI / HZ * hz) * envelope;
                samples[j] += swaps[i] * v / voices;
            }
        }
    }
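
To see the shape of the envelope on its own, here is the same arithmetic pulled out into a standalone function. The function name is hypothetical; the three expressions are taken directly from the loop above.

```c
#include <assert.h>

/* Hypothetical helper: envelope gain for sample j of an nsamples-long tone.
 * A parabola through (0,0), (0.5,1), (1,0), cubed to sharpen the fade. */
static float
envelope(int j, int nsamples)
{
    assert(nsamples > 1 && j >= 0 && j < nsamples);
    float u = 1.0f - j / (float)(nsamples - 1);          /* runs from 1 down to 0 */
    float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);   /* 0 at both ends */
    return parabola * parabola * parabola;
}
```

For a 735-sample frame the gain is exactly zero at the first and last sample and reaches 1 at the middle sample, so each tone starts and ends in silence.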

Finally I write out each sample as a signed 16-bit value. I flush the frame audio just like I flushed the frame image, keeping them somewhat in sync from an outsider’s perspective.

    for (int i = 0; i < nsamples; i++) {
        int s = samples[i] * 0x7fff;
        emit_u16le(s, wav);
    }
    fflush(wav);

Before returning, reset the swap counter for the next frame.

    memset(swaps, 0, sizeof(swaps));

Font rendering

You may have noticed there was text rendered in the corner of the video announcing the sort function. There’s font bitmap data in font.h which gets sampled to render that text. It’s not terribly complicated, but you’ll have to study the code on your own to see how that works.

Learning more

This simple video rendering technique has served me well for some years now. All it takes is a bit of knowledge about rendering. I learned quite a bit just from watching Handmade Hero, where Casey writes a software renderer from scratch, then implements a nearly identical renderer with OpenGL. The more I learn about rendering, the better this technique works.

Before writing this post I spent some time experimenting with using a media player as an interface to a game. For example, rather than render the game using OpenGL or similar, render it as PPM frames and send it to the media player to be displayed, just as game consoles drive television sets. Unfortunately the latency is horrible (multiple seconds), so that idea just doesn’t work. So while this technique is fast enough for real time rendering, it’s no good for interaction.

A Tutorial on Portable Makefiles

2017-08-20T03:03:51Z

In my first decade writing Makefiles, I developed the bad habit of liberally using GNU Make’s extensions. I didn’t know the line between GNU Make and the portable features guaranteed by POSIX. Usually it didn’t matter much, but it would become an annoyance when building on non-Linux systems, such as on the various BSDs. I’d have to specifically install GNU Make, then remember to invoke it (i.e. as gmake) instead of the system’s make.

I’ve since become familiar and comfortable with make’s official specification, and I’ve spent the last year writing strictly portable Makefiles. Not only are my builds now portable across all unix-like systems, my Makefiles are cleaner and more robust. Many of the common make extensions, conditionals in particular, lead to fragile, complicated Makefiles and are best avoided anyway. It’s important to be able to trust your build system to do its job correctly.

This tutorial should be suitable for make beginners who have never written their own Makefiles before, as well as experienced developers who want to learn how to write portable Makefiles. Regardless, in order to understand the examples you must be familiar with the usual steps for building programs on the command line (compiler, linker, object files, etc.). I’m not going to suggest any fancy tricks nor provide any sort of standard starting template. Makefiles should be dead simple when the project is small, and grow in a predictable, clean fashion alongside the project.

I’m not going to cover every feature. You’ll need to read the specification for yourself to learn it all. This tutorial will go over the important features as well as the common conventions. It’s important to follow established conventions so that people using your Makefiles will know what to expect and how to accomplish the basic tasks.

If you’re running Debian, or a Debian derivative such as Ubuntu, the bmake and freebsd-buildutils packages will provide the bmake and fmake programs respectively. These alternative make implementations are very useful for testing your Makefiles’ portability, should you accidentally make use of a GNU Make feature. It’s not perfect since each implements some of the same extensions as GNU Make, but it will catch some common mistakes.

What’s in a Makefile?

I am free, no matter what rules surround me. If I find them tolerable, I tolerate them; if I find them too obnoxious, I break them. I am free because I know that I alone am morally responsible for everything I do. ―Robert A. Heinlein

At make’s core are one or more dependency trees, constructed from rules. Each vertex in the tree is called a target. The final products of the build (executable, document, etc.) are the tree roots. A Makefile specifies the dependency trees and supplies the shell commands to produce a target from its prerequisites.

In this illustration, the “.c” files are source files that are written by hand, not generated by commands, so they have no prerequisites. The syntax for specifying one or more edges in this dependency tree is simple:

target [target...]: [prerequisite...]

While technically multiple targets can be specified in a single rule, this is unusual. Typically each target is specified in its own rule. To specify the tree in the illustration above:

game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

The order of these rules doesn’t matter. The entire Makefile is parsed before any actions are taken, so the tree’s vertices and edges can be specified in any order. There’s one exception: the first non-special target in a Makefile is the default target. This target is selected implicitly when make is invoked without choosing a target. It should be something sensible, so that a user can blindly run make and get a useful result.

A target can be specified more than once. Any new prerequisites are appended to the previously-given prerequisites. For example, this Makefile is identical to the previous, though it’s typically not written this way:

game: graphics.o
game: physics.o
game: input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

There are a handful of special targets that are used to change the behavior of make itself. All have uppercase names and start with a period. Names fitting this pattern are reserved for use by make. According to the standard, in order to get reliable POSIX behavior, the first non-comment line of the Makefile must be .POSIX. Since this is a special target, it’s not a candidate for the default target, so game will remain the default target:

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c

In practice, even a simple program will have header files, and sources that include a header file should also have an edge on the dependency tree for it. If the header file changes, targets that include it should also be rebuilt.

.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h

Adding commands to rules

We’ve constructed a dependency tree, but we still haven’t told make how to actually build any targets from its prerequisites. The rules also need to specify the shell commands that produce a target from its prerequisites.

If you were to create the source files in the example and invoke make, you would find that it actually does know how to build the object files. This is because make is initially configured with certain inference rules, a topic which will be covered later. For now, we’ll add the .SUFFIXES special target to the top, erasing all the built-in inference rules.

Commands immediately follow the target/prerequisite line in a rule. Each command line must start with a tab character. This can be awkward if your text editor isn’t configured for it, and it will be awkward if you try to copy the examples from this page.

Each line is run in its own shell, so be mindful of using commands like cd, which won’t affect later lines.

The simplest thing to do is literally specify the same commands you’d type at the shell:

.POSIX:
.SUFFIXES:
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c

Invoking make and choosing targets

I tried to walk into Target, but I missed. ―Mitch Hedberg

When invoking make, it accepts zero or more targets from the dependency tree, and it will build these targets (i.e. run the commands in the target’s rule) if the target is out-of-date. A target is out-of-date if it is older than any of its prerequisites.

# build the "game" binary (default target)
$ make

# build just the object files
$ make graphics.o physics.o input.o

This effect cascades up the dependency tree and causes further targets to be rebuilt until all of the requested targets are up-to-date. There’s a lot of room for parallelism since different branches of the tree can be updated independently. It’s common for make implementations to support parallel builds with the -j option. This is non-standard, but it’s a fantastic feature that doesn’t require anything special in the Makefile to work correctly.

Similar to parallel builds is make’s -k (“keep going”) option, which is standard. This tells make not to stop on the first error, and to continue updating targets that are unaffected by the error. This is nice for fully populating Vim’s quickfix list or Emacs’ compilation buffer.
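
The out-of-date test at the heart of all this can be sketched in a few lines of C. This is only an illustration of the rule, not how any particular make is implemented: the function names are hypothetical, timestamps are compared at the one-second granularity of st_mtime, and a missing prerequisite is simply treated as a reason to rebuild, where a real make would first try to build it.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Pure rule: a target is out-of-date if it is older than any prerequisite. */
static int
older_than_any(time_t target, const time_t *prereqs, int n)
{
    for (int i = 0; i < n; i++)
        if (target < prereqs[i])
            return 1;
    return 0;
}

/* Apply the rule to files on disk. A missing target always needs building. */
static int
out_of_date(const char *target, const char *const *prereqs, int n)
{
    assert(target && n >= 0);
    struct stat t;
    if (stat(target, &t) != 0)
        return 1;                      /* target does not exist yet */
    for (int i = 0; i < n; i++) {
        struct stat p;
        if (stat(prereqs[i], &p) != 0)
            return 1;                  /* simplification: see note above */
        if (older_than_any(t.st_mtime, &p.st_mtime, 1))
            return 1;
    }
    return 0;
}
```

Applying this check from the leaves of the tree toward the roots gives the cascading rebuild described above: once an object file is rebuilt it becomes newer than the binary, which in turn makes the binary out-of-date.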

It’s common to have multiple targets that should be built by default. If the first rule selects the default target, how do we solve the problem of needing multiple default targets? The convention is to use phony targets. These are called “phony” because there is no corresponding file, and so phony targets are never up-to-date. It’s convention for a phony “all” target to be the default target.

I’ll make game a prerequisite of a new “all” target. More real targets could be added as necessary to turn them into defaults. Users of this Makefile will also expect make all to build the entire project.

Another common phony target is “clean” which removes all of the built files. Users will expect make clean to delete all generated files.

.POSIX:
.SUFFIXES:
all: game
game: graphics.o physics.o input.o
    cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
    cc -c graphics.c
physics.o: physics.c physics.h
    cc -c physics.c
input.o: input.c input.h graphics.h physics.h
    cc -c input.c
clean:
    rm -f game graphics.o physics.o input.o

Customize the build with macros

So far the Makefile hardcodes cc as the compiler, and doesn’t use any compiler flags (warnings, optimization, hardening, etc.). The user should be able to easily control all these things, but right now they’d have to edit the entire Makefile to do so. Perhaps the user has both gcc and clang installed, and wants to choose one or the other without changing which is installed as cc.

To solve this, make has macros that expand into strings when referenced. The convention is to use the macro named CC when talking about the C compiler, CFLAGS when talking about flags passed to the C compiler, LDFLAGS for flags passed to the C compiler when linking, and LDLIBS for flags about libraries when linking. The Makefile should supply defaults as needed.

A macro is expanded with $(...). It’s valid (and normal) to reference a macro that hasn’t been defined, which will be an empty string. This will be the case with LDFLAGS below.

Macro values can contain other macros, which will be expanded recursively each time the macro is expanded. Some make implementations allow the name of the macro being expanded to itself be a macro, which is Turing complete, but this behavior is non-standard.

.POSIX:
.SUFFIXES:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
    $(CC) -c $(CFLAGS) graphics.c
physics.o: physics.c physics.h
    $(CC) -c $(CFLAGS) physics.c
input.o: input.c input.h graphics.h physics.h
    $(CC) -c $(CFLAGS) input.c
clean:
    rm -f game graphics.o physics.o input.o

Macros are overridden by macro definitions given as command line arguments in the form name=value. This allows the user to select their own build configuration. This is one of make’s most powerful and under-appreciated features.

$ make CC=clang CFLAGS='-O3 -march=native'

If the user doesn’t want to specify these macros on every invocation, they can (cautiously) use make’s -e flag to set overriding macro definitions from the environment.

$ export CC=clang
$ export CFLAGS=-O3
$ make -e all

Some make implementations have other special kinds of macro assignment operators beyond simple assignment (=). These are unnecessary, so don’t worry about them.

Inference rules so that you can stop repeating yourself

The road itself tells us far more than signs do. ―Tom Vanderbilt, Traffic: Why We Drive the Way We Do

There’s repetition across the three different object files. Wouldn’t it be nice if there was a way to communicate this pattern? Fortunately there is, in the form of inference rules. It says that a target with a certain extension, with a prerequisite with another certain extension, is built a certain way. This will make more sense with an example.

In an inference rule, the target indicates the extensions. The $< macro expands to the prerequisite, which is essential to making inference rules work generically. Unfortunately this macro is not available in target rules, as much as that would be useful.

For example, here’s an inference rule that teaches make how to build an object file from a C source file. This particular rule is one that is pre-defined by make, so you’ll never need to write this one yourself. I’ll include it for completeness.

.c.o:
    $(CC) $(CFLAGS) -c $<

These extensions must be added to .SUFFIXES before they will work. With that, the commands for the rules about object files can be omitted.

.POSIX:
.SUFFIXES:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm

all: game
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

.SUFFIXES: .c .o
.c.o:
    $(CC) $(CFLAGS) -c $<

The first empty .SUFFIXES clears the suffix list. The second one adds .c and .o to the now-empty suffix list.

Other target conventions

Conventions are, indeed, all that shield us from the shivering void, though often they do so but poorly and desperately. ―Robert Aickman

Users usually expect an “install” target that installs the built program, libraries, man pages, etc. By convention this target should use the PREFIX and DESTDIR macros.

The PREFIX macro should default to /usr/local, and since it’s a macro the user can override it to install elsewhere, such as in their home directory. The user should override it for both building and installing, since the prefix may need to be built into the binary (e.g. -DPREFIX=$(PREFIX)).
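
On the C side, a prefix built into the binary might be consumed like this. It is a sketch under assumptions: the share/game directory and the data_path name are hypothetical, and the macro must arrive as a string literal, which in practice means quoting it on the compiler command line (something like -DPREFIX='"$(PREFIX)"').

```c
#include <assert.h>
#include <stdio.h>

#ifndef PREFIX
#  define PREFIX "/usr/local"  /* same default as the Makefile */
#endif

/* Hypothetical helper: build the installed path of a data file.
 * Returns the path length, or -1 if it did not fit in buf. */
static int
data_path(char *buf, size_t len, const char *name)
{
    assert(buf && name);
    int n = snprintf(buf, len, "%s/share/game/%s", PREFIX, name);
    return (n >= 0 && (size_t)n < len) ? n : -1;
}
```

This is why the prefix needs to be the same at build time and install time: the binary looks for its files under the path that was compiled in, not under wherever it happened to be copied.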

The DESTDIR macro is used for staged builds, so that it gets installed under a fake root directory for the sake of packaging. Unlike PREFIX, it will not actually be run from this directory.

.POSIX:
CC     = cc
CFLAGS = -W -O
LDLIBS = -lm
PREFIX = /usr/local

all: game
install: game
    mkdir -p $(DESTDIR)$(PREFIX)/bin
    mkdir -p $(DESTDIR)$(PREFIX)/share/man/man1
    cp -f game $(DESTDIR)$(PREFIX)/bin
    gzip < game.1 > $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
game: graphics.o physics.o input.o
    $(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
    rm -f game graphics.o physics.o input.o

You may also want to provide an “uninstall” phony target that does the opposite.

make PREFIX=$HOME/.local install

Other common targets are “mostlyclean” (like “clean” but don’t delete some slow-to-build targets), “distclean” (delete even more than “clean”), “test” or “check” (run the test suite), and “dist” (create a package).

Complexity and growing pains

One of make’s big weak points is scaling up as a project grows in size.

Recursive Makefiles

As your growing project is broken into subdirectories, you may be tempted to put a Makefile in each subdirectory and invoke them recursively.

Don’t use recursive Makefiles. It breaks the dependency tree across separate instances of make and typically results in a fragile build. There’s nothing good about it. Have one Makefile at the root of your project and invoke make there. You may have to teach your text editor how to do this.

When talking about files in subdirectories, just include the subdirectory in the name. Everything will work the same as far as make is concerned, including inference rules.

src/graphics.o: src/graphics.c
src/physics.o: src/physics.c
src/input.o: src/input.c

Out-of-source builds

Keeping your object files separate from your source files is a nice idea. When it comes to make, there’s good news and bad news.

The good news is that make can do this. You can pick whatever file names you like for targets and prerequisites.

obj/input.o: src/input.c

The bad news is that inference rules are not compatible with out-of-source builds. You’ll need to repeat the same commands for each rule as if inference rules didn’t exist. This is tedious for large projects, so you may want to have some sort of “configure” script, even if hand-written, to generate all this for you. This is essentially what CMake is all about. That, plus dependency management.

Dependency management

Another problem with scaling up is tracking the project’s ever-changing dependencies across all the source files. Missing a dependency means the build may not be correct unless you make clean first.

If you go the route of using a script to generate the tedious parts of the Makefile, both GCC and Clang have a nice feature for generating all the Makefile dependencies for you (-MM, -MT), at least for C and C++. There are lots of tutorials for doing this dependency generation on the fly as part of the build, but it’s fragile and slow. Much better to do it all up front and “bake” the dependencies into the Makefile so that make can do its job properly. If the dependencies change, rebuild your Makefile.

For example, here’s what it looks like invoking gcc’s dependency generator against the imaginary input.c for an out-of-source build:

$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h

Notice the output is in Makefile’s rule format.

Unfortunately this feature strips the leading paths from the target, so, in practice, using it is always more complicated than it should be (e.g. it requires the use of -MT).

Microsoft’s Nmake

Microsoft has an implementation of make called Nmake, which comes with Visual Studio. It’s nearly a POSIX-compatible make, but necessarily breaks from the standard in some places. Their cl.exe compiler uses .obj as the object file extension and .exe for binaries, both of which differ from the unix world, so it has different built-in inference rules. Windows also lacks a Bourne shell and the standard unix tools, so all of the commands will necessarily be different.

There’s no equivalent of rm -f on Windows, so good luck writing a proper “clean” target. No, del /f isn’t the same.

So while it’s close to POSIX make, it’s not practical to write a Makefile that will simultaneously work properly with both POSIX make and Nmake. These need to be separate Makefiles.

May your Makefiles be portable

It’s nice to have reliable, portable Makefiles that just work anywhere. Code to the standards and you don’t need feature tests or other sorts of special treatment.

Rolling Shutter Simulation in C

2017-07-02T18:35:16Z

The most recent Smarter Every Day (#172) explains a phenomenon that results from rolling shutter. You’ve likely seen this effect in some of your own digital photographs. When a CMOS digital camera captures a picture, it reads one row of the sensor at a time. If the subject of the picture is a fast-moving object (relative to the camera), then the subject will change significantly while the image is being captured, giving strange, unreal results:

In the Smarter Every Day video, Destin illustrates the effect by simulating rolling shutter using a short video clip. In each frame of the video, a few additional rows are locked in place, showing the effect in slow motion, making it easier to understand.

At the end of the video he thanks a friend for figuring out how to get After Effects to simulate rolling shutter. After thinking about this for a moment, I figured I could easily accomplish this myself with just a bit of C, without any libraries. The video above this paragraph is the result.

I previously described a technique to edit and manipulate video without any formal video editing tools. A unix pipeline is sufficient for doing minor video editing, especially without sound. The program at the front of the pipe decodes the video into a raw, uncompressed format, such as YUV4MPEG or PPM. The tools in the middle losslessly manipulate this data to achieve the desired effect (watermark, scaling, etc.). Finally, the tool at the end encodes the video into a standard format.

$ decode video.mp4 | xform-a | xform-b | encode out.mp4

For the “decode” program I’ll be using ffmpeg now that it’s back in the Debian repositories. You can throw a video in virtually any format at it and it will write PPM frames to standard output. For the encoder I’ll be using the x264 command line program, though ffmpeg could handle this part as well. Without any filters in the middle, this example will just re-encode a video:

$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    x264 -o output.mp4 /dev/stdin

The filter tools in the middle only need to read and write in the raw image format. They’re a little bit like shaders, and they’re easy to write. In this case, I’ll write a C program that simulates rolling shutter. The filter could be written in any language that can read and write binary data from standard input to standard output.

Update: It appears that input PPM streams are a rather recent feature of libavformat (a.k.a. lavf, used by x264). Support for PPM input first appeared in libavformat 3.1 (released June 26th, 2016). If you’re using an older version of libavformat, you’ll need to stick ppmtoy4m in front of x264 in the processing pipeline.

$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    ppmtoy4m | \
    x264 -o output.mp4 /dev/stdin

Video filtering in C

In the past, my go-to for raw video data has been loose PPM frames and YUV4MPEG streams (via ppmtoy4m). Fortunately, over the years a lot of tools have gained the ability to manipulate streams of PPM images, which is a much more convenient format. Despite being raw video data, YUV4MPEG is still a fairly complex format with lots of options and annoying colorspace concerns. PPM is simple RGB without complications. The header is just text:

P6
<width> <height>
<maxdepth>

The maximum depth is virtually always 255. A smaller value reduces the image’s dynamic range without reducing the size. A larger value involves byte-order issues (endian). For video frame data, the file will typically look like:

P6
1920 1080
255

Unfortunately the format is actually a little more flexible than this. Except for the new line (LF, 0x0A) after the maximum depth, the whitespace is arbitrary and comments starting with # are permitted. Since the tools I’m using won’t produce comments, I’m going to ignore that detail. I’ll also assume the maximum depth is always 255.
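
For completeness, here is a rough sketch of a header parser that does tolerate comments and arbitrary whitespace. It is not part of the filter described in this article, and all three function names are hypothetical. It also simplifies the format slightly by only accepting comments between fields, not in the middle of a number.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Skip whitespace and '#' comments; return the next significant byte or EOF. */
static int
ppm_skip(FILE *f)
{
    for (;;) {
        int c = getc(f);
        if (c == '#') {
            do c = getc(f); while (c != EOF && c != '\n');
            if (c == EOF)
                return EOF;
        } else if (c == EOF || !isspace(c)) {
            return c;
        }
    }
}

/* Parse one decimal field and the single whitespace byte that ends it.
 * Returns -1 on malformed or unreasonably large input. */
static long
ppm_field(FILE *f)
{
    int c = ppm_skip(f);
    if (c == EOF || !isdigit(c))
        return -1;
    long v = 0;
    while (c != EOF && isdigit(c)) {
        if (v > 100000000L)
            return -1;                 /* overflow guard */
        v = v * 10 + (c - '0');
        c = getc(f);
    }
    return (c == EOF || isspace(c)) ? v : -1;
}

/* Read "P6", width, height, and maximum depth. Returns 1 on success,
 * leaving the stream positioned at the first byte of pixel data. */
static int
ppm_read_header(FILE *f, long *width, long *height, long *maxval)
{
    assert(f && width && height && maxval);
    if (getc(f) != 'P' || getc(f) != '6')
        return 0;
    *width  = ppm_field(f);
    *height = ppm_field(f);
    *maxval = ppm_field(f);
    return *width > 0 && *height > 0 && *maxval > 0 && *maxval < 65536;
}
```

The important detail is that exactly one whitespace byte is consumed after the maximum depth. Pixel data is binary, so skipping any further “whitespace” could eat real pixel bytes.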

Here’s the structure I used to represent a PPM image, just one frame of video. I’m using a flexible array member to pack the data at the end of the structure.

struct frame {
    size_t width;
    size_t height;
    unsigned char data[];
};

Next a function to allocate a frame:

static struct frame *
frame_create(size_t width, size_t height)
{
    struct frame *f = malloc(sizeof(*f) + width * height * 3);
    f->width = width;
    f->height = height;
    return f;
}

We’ll need a way to write the frames we’ve created.

static void
frame_write(struct frame *f)
{
    printf("P6\n%zu %zu\n255\n", f->width, f->height);
    fwrite(f->data, f->width * f->height, 3, stdout);
}

Finally, a function to read a frame, reusing an existing buffer if possible. The most complex part of the whole program is just parsing the PPM header. The %*c in the scanf() specifically consumes the line feed immediately following the maximum depth.

static struct frame *
frame_read(struct frame *f)
{
    size_t width, height;
    if (scanf("P6 %zu%zu%*d%*c", &width, &height) < 2) {
        free(f);
        return 0;
    }
    if (!f || f->width != width || f->height != height) {
        free(f);
        f = frame_create(width, height);
    }
    fread(f->data, width * height, 3, stdin);
    return f;
}

Since this program will only be part of a pipeline, I’m not worried about checking the results of fwrite() and fread(). The process will be killed by the shell if something goes wrong with the pipes. However, if we’re out of video data and get an EOF, scanf() will fail, indicating the EOF, which is normal and can be handled cleanly.

An identity filter

That’s all the infrastructure we need to build an identity filter that passes frames through unchanged:

int main(void)
{
    struct frame *frame = 0;
    while ((frame = frame_read(frame)))
        frame_write(frame);
}

Processing a frame is just a matter of adding some stuff to the body of the while loop.
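
As a small illustration of the kind of “stuff” that could go in that loop, here is a hypothetical filter that inverts every pixel in place. The struct is repeated from above only so the sketch stands alone; frame_invert is not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the article's struct frame, repeated for self-containment. */
struct frame {
    size_t width;
    size_t height;
    unsigned char data[];
};

/* Hypothetical filter: invert each of the three channels of every pixel. */
static void
frame_invert(struct frame *f)
{
    assert(f);
    size_t n = f->width * f->height * 3;   /* bytes of RGB data */
    for (size_t i = 0; i < n; i++)
        f->data[i] = 255 - f->data[i];
}
```

Calling frame_invert(frame) just before frame_write(frame) in the loop would turn the identity filter into a negative filter, assuming the maximum depth is 255 as discussed earlier.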

A rolling shutter filter

For the rolling shutter filter, in addition to the input frame we need an image to hold the result of the rolling shutter. Each input frame will be copied into the rolling shutter frame, but a little less will be copied from each frame, locking a little bit more of the image in place.

int
main(void)
{
    int shutter_step = 3;
    size_t shutter = 0;
    struct frame *f = frame_read(0);
    if (!f)
        return 1;  /* no input frames at all */
    struct frame *out = frame_create(f->width, f->height);
    while (shutter < f->height && (f = frame_read(f))) {
        size_t offset = shutter * f->width * 3;
        size_t length = f->height * f->width * 3 - offset;
        memcpy(out->data + offset, f->data + offset, length);
        frame_write(out);
        shutter += shutter_step;
    }
    free(out);
    free(f);
}

The shutter_step controls how many rows are captured per frame of video. Generally capturing one row per frame is too slow for the simulation. For a 1080p video, that’s 1,080 frames for the entire simulation: 18 seconds at 60 FPS or 36 seconds at 30 FPS. If this program were to accept command line arguments, controlling the shutter rate would be one of the options.
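
The duration arithmetic generalizes to any step size. A quick sketch, with hypothetical function names, assuming the input video has at least as many frames as the simulation needs:

```c
#include <assert.h>
#include <stddef.h>

/* Frames written before the shutter has swept every row: the loop above
 * runs while shutter < height and advances by step each time. */
static size_t
shutter_frames(size_t height, size_t step)
{
    assert(step > 0);
    return (height + step - 1) / step;  /* ceiling division */
}

/* Length of the resulting simulation in seconds at a given frame rate. */
static double
shutter_seconds(size_t height, size_t step, double fps)
{
    assert(fps > 0);
    return shutter_frames(height, step) / fps;
}
```

With the step of 3 used in the listing above, a 1080p simulation takes 360 frames, or 6 seconds at 60 FPS.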

Putting it all together:

$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
    ./rolling-shutter | \
    x264 -o output.mp4 /dev/stdin

Here are some of the results for different shutter rates: 1, 3, 5, 8, 10, and 15 rows per frame. Feel free to right-click and “View Video” to see the full resolution video.

Source and original input

This post contains the full source in parts, but here it is all together:

rshutter.c

Here’s the original video, filmed by my wife using her Nikon D5500, in case you want to try it for yourself:

It took much longer to figure out the string-pulling contraption to slowly spin the fan at a constant rate than it took to write the C filter program.

Followup Links

On Hacker News, morecoffee shared a video of the second order effect (direct link), where the rolling shutter speed changes over time.

A deeper analysis of rolling shutter: Playing detective with rolling shutter photos.

Building and Installing Software in $HOME

2017-06-19T02:34:39Z

For more than 5 years now I’ve kept a private “root” filesystem within my home directory under $HOME/.local/. Within are the standard /usr directories, such as bin/, include/, lib/, etc., containing my own software, libraries, and man pages. These are first-class citizens, indistinguishable from the system-installed programs and libraries. With one exception (setuid programs), none of this requires root privileges.

Installing software in $HOME serves two important purposes, both of which are indispensable to me on a regular basis.

No root access: Sometimes I’m using a system administered by someone else, and I don’t have root access.

This prevents me from installing packaged software myself through the system’s package manager. Building and installing the software myself in my home directory, without involvement from the system administrator, neatly works around this issue. As a software developer, it’s already perfectly normal for me to build and run custom software, and this is just an extension of that behavior.

In the most desperate situation, all I need from the sysadmin is a decent C compiler and at least a minimal POSIX environment. I can bootstrap anything I might need, both libraries and programs, including a better C compiler along the way. This is one major strength of open source software.

I have noticed one alarming trend: Both GCC (since 4.8) and Clang are written in C++, so it’s becoming less and less reasonable to bootstrap a C++ compiler from a C compiler, or even from a C++ compiler that’s more than a few years old. So you may also need your sysadmin to supply a fairly recent C++ compiler if you want to bootstrap an environment that includes C++. I’ve had to avoid some C++ software (such as CMake) for this reason.

Custom software builds: Even if I am root, I may still want to install software not available through the package manager, a version not available in the package manager, or a version with custom patches.

In theory this is what /usr/local is all about. It’s typically the location for software not managed by the system’s package manager. However, I think it’s cleaner to put this in $HOME/.local, so long as other system users don’t need it.

For example, I have an installation of each version of Emacs from 24.3 (the oldest version worth supporting) through the latest stable release, each suffixed with its version number, under $HOME/.local. This is useful for quickly running a test suite under different releases.

$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...

Another example is NetHack, which I prefer to play with a couple of custom patches (Menucolors, wchar). The install to $HOME/.local is also captured as a patch.

$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install

Normally NetHack wants to be setuid (e.g. run as the “games” user) in order to restrict access to high scores, saves, and bones — saved levels where a player died, to be inserted randomly into other players’ games. This prevents cheating, but requires root to set up. Fortunately, when I install NetHack in my home directory, this isn’t a feature I actually care about, so I can ignore it.

Mutt is in a similar situation, since it wants to install a special setgid program (mutt_dotlock) that synchronizes mailbox access. All MUAs need something like this.

Everything described below is relevant to basically any modern unix-like system: Linux, BSD, etc. I personally install software in $HOME across a variety of systems and, fortunately, it mostly works the same way everywhere. This is probably in large part due to everyone standardizing around the GCC and GNU binutils interfaces, even if the system compiler is actually LLVM/Clang.

Configuring for $HOME installs

Out of the box, installing things in $HOME/.local won’t do anything useful. You need to set up some environment variables in your shell configuration (e.g. .profile or .bashrc) to tell various programs, such as your shell, about it. The most obvious variable is $PATH:

export PATH=$HOME/.local/bin:$PATH

Notice I put it in the front of the list. This is because I want my home directory programs to override system programs with the same name. For what other reason would I install a program with the same name if not to override the system program?
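To illustrate why order matters, here is a simplified sketch of the lookup a shell performs. The function path_find() is hypothetical, it skips empty entries (which real shells treat as the current directory), and it uses access(2) with X_OK as a stand-in for the full set of checks a shell makes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Search the colon-separated directory list `path` for an executable
 * named `name`. The first match wins, which is why entries at the front
 * of $PATH override later ones. Writes the full path to `out` and
 * returns 1, or returns 0 if nothing matches. */
int
path_find(const char *path, const char *name, char *out, size_t outlen)
{
    assert(path && name && out);
    for (const char *p = path;;) {
        size_t len = strcspn(p, ":");  /* length of this entry */
        if (len > 0) {
            int n = snprintf(out, outlen, "%.*s/%s", (int)len, p, name);
            if (n > 0 && (size_t)n < outlen && access(out, X_OK) == 0)
                return 1;
        }
        if (!p[len])
            return 0;  /* reached the end of the list */
        p += len + 1;
    }
}
```

Called with a list that begins with $HOME/.local/bin, a program installed there is found before the search ever reaches /usr/bin.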

In the simplest situation this is good enough, but in practice you’ll probably need to set a few more things. If you install libraries in your home directory and expect to use them just as if they were installed on the system, you’ll need to tell the compiler where else to look for those headers and libraries, both for C and C++.

export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib

The first two are like the -I compiler option and the third is like the -L linker option, except you usually won’t need to use them explicitly. Unfortunately LIBRARY_PATH doesn’t override the system library paths, so in some cases, you will need to explicitly set -L. Otherwise you will still end up linking against the system library rather than the custom packaged version. I really wish GCC and Clang didn’t behave this way.

Some software uses pkg-config to determine its compiler and linker flags, and your home directory will contain some of the needed information. So set that up too:

export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig

Run-time linker

Finally, when you install libraries in your home directory, the run-time dynamic linker will need to know where to find them. There are three ways to deal with this:

The crude, easy way: LD_LIBRARY_PATH.
The elegant, difficult way: ELF runpath.
Screw it, just statically link the bugger. (Not always possible.)

For the crude way, point the run-time linker at your lib/ and you’re done:

export LD_LIBRARY_PATH=$HOME/.local/lib

However, this is like using a shotgun to kill a fly. If you install a library in your home directory that is also installed on the system, and then run a system program, it may be linked against your library rather than the library installed on the system as was originally intended. This could have detrimental effects.

The precision method is to set the ELF “runpath” value. It’s like a per-binary LD_LIBRARY_PATH. The run-time linker uses this path first in its search for libraries, and it will only have an effect on that particular program/library. This also applies to dlopen().

Some software will configure the runpath by default in their build system, but often you need to configure this yourself. The simplest way is to set the LD_RUN_PATH environment variable when building software. Another option is to manually pass -rpath options to the linker via LDFLAGS. It’s used directly like this:

$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux

Verify with readelf:

$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]

ELF supports a special $ORIGIN “variable” set to the binary’s location. This allows the program and associated libraries to be installed anywhere without changes, so long as they have the same relative position to each other. (Note the quotes to prevent shell interpolation.)

$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux

There is one situation where runpath won’t work: when you want a system-installed program to find a home directory library with dlopen() — e.g. as an extension to that program. You either need to ensure it uses a relative or absolute path (i.e. the argument to dlopen() contains a slash) or you must use LD_LIBRARY_PATH.

Personally, I always use the Worse is Better LD_LIBRARY_PATH shotgun. Occasionally it’s caused some annoying issues, but the vast majority of the time it gets the job done with little fuss. This is just my personal development environment, after all, not a production server.

Manual pages

Another potentially tricky issue is man pages. When a program or library installs a man page in your home directory, it would certainly be nice to access it with man just like it was installed on the system. Fortunately, Debian and Debian-derived systems, using a mechanism I haven’t yet figured out, discover home directory man pages automatically without any assistance. No configuration needed.

It’s more complicated on other systems, such as the BSDs. You’ll need to set the MANPATH variable to include $HOME/.local/share/man. It’s unset by default and it overrides the system settings, which means you need to manually include the system paths. The manpath program can help with this … if it’s available.

export MANPATH=$HOME/.local/share/man:$(manpath)

I haven’t figured out a portable way to deal with this issue, so I mostly ignore it.

How to install software in $HOME

While I’ve pooh-poohed autoconf in the past, the standard configure script usually makes it trivial to build and install software in $HOME. The key ingredient is the --prefix option:

$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install

Most of the time it’s that simple! If you’re linking against your own libraries and want to use runpath, it’s a little more complicated:

$ ./configure --prefix=$HOME/.local \
              LDFLAGS="-Wl,-rpath=$HOME/.local/lib"

For CMake, there’s CMAKE_INSTALL_PREFIX:

$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..

The CMake builds I’ve seen use ELF runpath by default, and no further configuration may be required to make that work. I’m sure that’s not always the case, though.

Some software is just a single, static, standalone binary with everything baked in. It doesn’t need to be given a prefix, and installation is as simple as copying the binary into place. For example, Enchive works like this:

$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin

Some software uses its own unique configuration interface. I can respect that, but it does add some friction for users who now have something additional and non-transferable to learn. I demonstrated a NetHack build above, which has a configuration much more involved than it really should be. Another example is LuaJIT, which uses make variables that must be provided consistently on every invocation:

$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install

(You can use the “install” target to both build and install, but I wanted to illustrate the repetition of PREFIX.)

Some libraries aren’t so smart about pkg-config and need some handholding — for example, ncurses. I mention it because it’s required for both Vim and Emacs, among many others, so I’m often building it myself. It ignores --prefix and needs to be told a second time where to install things:

$ ./configure --prefix=$HOME/.local \
              --enable-pc-files \
              --with-pkg-config-libdir=$PKG_CONFIG_PATH

Another issue is that a whole lot of software has been hardcoded for ncurses 5.x (i.e. ncurses5-config), and it requires hacks/patching to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x for this reason.

Learning through experience

I could go on and on like this, discussing the quirks for the various libraries and programs that I use. Over the years I’ve gotten used to many of these issues, committing the solutions to memory. Unfortunately, even within the same version of a piece of software, the quirks can change between major operating system releases, so I’m continuously learning my way around new issues. It’s really given me an appreciation for all the hard work that package maintainers put into customizing and maintaining software builds to fit properly into a larger ecosystem.

Raw Linux Threads via System Calls

2015-05-15T17:33:40Z

This article has a followup.

Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.

Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.

It’s so simple that it takes fewer than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86-64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).

I’ve put the complete demo here if you want to see it all at once:

Pure assembly, library-free Linux threading demo

An x86-64 Primer

I want you to be able to follow along even if you aren’t familiar with x86-64 assembly, so here’s a short primer of the relevant pieces. If you already know x86-64 assembly, feel free to skip to the next section.

x86-64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won’t need them for threading.

rsp : stack pointer
rbp : “base” pointer (still used in debugging and profiling)
rax rbx rcx rdx : general purpose (notice: a, b, c, d)
rdi rsi : “destination” and “source”, now meaningless names
r8 r9 r10 r11 r12 r13 r14 r15 : added for x86-64

The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32 bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32 bits, then to 64 bits. Historically each of these registers had a specific, unique purpose, but on x86-64 they’re almost completely interchangeable.
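If you think in C, the narrower names behave like truncating casts of the full register. A tiny sketch, with function names of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

/* Model a 64-bit register such as rax as a uint64_t and view its
 * narrower aliases as truncating casts. */
uint32_t
low32(uint64_t r)  /* like eax for rax */
{
    return (uint32_t)r;
}

uint16_t
low16(uint64_t r)  /* like ax for rax */
{
    return (uint16_t)r;
}
```

With rax holding 0x1122334455667788, eax reads as 0x55667788 and ax as 0x7788. This only models reads; writes to the narrower names have their own rules that aren’t covered here.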

There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.

The Stack

The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.

The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86-64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order:

rdi, rsi, rdx, rcx, r8, r9

So, for example, to perform a function call like foo(1, 2, 3), store 1, 2 and 3 in rdi, rsi, and rdx, then call the function. The mov instruction stores the source (second) operand in its destination (first) operand. The call instruction pushes the current value of rip onto the stack, then sets rip (jumps) to the address of the target function. When the callee is ready to return, it uses the ret instruction to pop the original rip value off the stack and back into rip, returning control to the caller.

    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    call foo

Called functions must preserve the contents of these registers (the same value must be stored when the function returns):

rbx, rsp, rbp, r12, r13, r14, r15

System Calls

When making a system call, the argument registers are slightly different. Notice rcx has been changed to r10.

rdi, rsi, rdx, r10, r8, r9

Each system call has an integer identifying it. This number is different on each platform, but, in Linux’s case, it will never change. Instead of call, rax is set to the number of the desired system call and the syscall instruction makes the request to the OS kernel. Prior to x86-64, this was done with an old-fashioned interrupt. Because interrupts are slow, a special, statically-positioned “vsyscall” page (now deprecated as a security hazard), later vDSO, is provided to allow certain system calls to be made as function calls. We’ll only need the syscall instruction in this article.

So, for example, the write() system call has this C prototype.

ssize_t write(int fd, const void *buf, size_t count);

On x86-64, the write() system call is at the top of the system call table as call 1 (read() is 0). Standard output is file descriptor 1 by default (standard input is 0). The following bit of code will write 10 bytes of data from the memory address buffer (a symbol defined elsewhere in the assembly program) to standard output. The number of bytes written, or a negated errno value on error, will be returned in rax.

    mov rdi, 1        ; fd
    mov rsi, buffer
    mov rdx, 10       ; 10 bytes
    mov rax, 1        ; SYS_write
    syscall
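For comparison, roughly the same request can be made from C by system call number. This is a sketch assuming Linux with glibc, and raw_write() is a name invented here. It goes through glibc’s generic syscall() function, so it isn’t library-free like the assembly, and it differs in one detail: the kernel reports errors as a negated errno value in rax, while syscall() converts that to a -1 return with errno set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke write(2) by system call number via glibc's generic syscall()
 * wrapper instead of the write() library function. */
long
raw_write(int fd, const void *buf, size_t count)
{
    assert(buf || count == 0);
    return syscall(SYS_write, fd, buf, count);
}
```

A call like raw_write(1, "hello\n", 6) loads the same argument registers as the assembly above, just with the compiler and glibc doing the mov instructions.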

Effective Addresses

There’s one last thing you need to know: registers often hold a memory address (i.e. a pointer), and you need a way to read the data behind that address. In NASM syntax, wrap the register in brackets (e.g. [rax]), which, if you’re familiar with C, would be the same as dereferencing the pointer.

These bracket expressions, called an effective address, may be limited mathematical expressions to offset that base address entirely within a single instruction. This expression can include another register (index), a power-of-two scale factor (bit shift), and an immediate signed offset. For example, [rax + rdx*8 + 12]. If rax is a pointer to a struct, and rdx is an array index to an element in an array in that struct, only a single instruction is needed to read that element. NASM is smart enough to allow the assembly programmer to break this mold a little bit with more complex expressions, so long as it can reduce it to the [base + index*2^exp + offset] form.
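Here is the same arithmetic in C, as a sketch. The struct layout is hypothetical and chosen so that the array of 8-byte elements starts at byte 12, matching the [rax + rdx*8 + 12] example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a 12-byte header followed by 8-byte elements,
 * so element i lives at base + i*8 + 12. */
struct pair { uint32_t a, b; };
struct record {
    char header[12];
    struct pair items[4];
};

/* The arithmetic behind [base + index*2^exp + offset]. */
uintptr_t
effective_address(uintptr_t base, uintptr_t index, int exp, intptr_t offset)
{
    assert(exp >= 0 && exp <= 3);  /* x86 scale factors: 1, 2, 4, 8 */
    return base + (index << exp) + (uintptr_t)offset;
}
```

For a struct record r, effective_address((uintptr_t)&r, 2, 3, 12) is the address of r.items[2], which is what a C compiler computes for &r.items[i].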

The details of addressing aren’t important for this article, so don’t worry too much about it if that didn’t make sense.

Allocating a Stack

Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. That leaves the stack, and before we can spawn a new thread, we need to allocate a stack, which is nothing more than a memory buffer.

The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.

You can’t just read from and write to arbitrary addresses in virtual memory; you first have to ask the kernel to allocate pages. There are two system calls on Linux to do this:

brk(): Extends (or shrinks) the heap of a running process, typically located somewhere shortly after the .bss segment. Many allocators will do this for small or initial allocations. This is a less optimal choice for thread stacks because the stacks will be very near other important data, near other stacks, and lack a guard page (by default). It would be somewhat easier for an attacker to exploit a buffer overflow. A guard page is a locked-down page just past the absolute end of the stack that will trigger a segmentation fault on a stack overflow, rather than allow a stack overflow to trash other memory undetected. A guard page could still be created manually with mprotect(). Also, there’s no room for these stacks to grow.
mmap(): Use an anonymous mapping to allocate a contiguous set of pages at some randomized memory location. As we’ll see, you can even tell the kernel specifically that you’re going to use this memory as a stack. Also, this is simpler than using brk() anyway.
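As an illustration of the manual guard page idea, here is a C sketch that maps one extra page and locks it down with mprotect(). The function name stack_create_guarded() is hypothetical, and it’s a different approach from asking the kernel for a stack-style mapping: an ordinary anonymous mapping with an explicit guard page at the low end.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate `size` usable bytes of stack plus one inaccessible guard page
 * below it. Returns the lowest usable address, or NULL on failure.
 * `size` is assumed to be a multiple of the page size. */
void *
stack_create_guarded(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(size > 0 && size % page == 0);
    char *base = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* The stack grows downward, so an overflow runs into the lowest page. */
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, size + page);
        return NULL;
    }
    return base + page;
}
```

Any access to the page just below the returned address faults immediately instead of silently trashing a neighboring allocation.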

On x86-64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.

void *stack_create(void);

The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.

void *mmap(void *addr, size_t length, int prot, int flags);

For flags, we’ll choose a private, anonymous mapping that, being a stack, grows downward. Even with that last flag, the system call will still return the bottom address of the mapping, which will be important to remember later. It’s just a simple matter of setting the arguments in the registers and making the system call.

%define SYS_mmap	9
%define STACK_SIZE	(4096 * 1024)	; 4 MB

stack_create:
    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

Now we can allocate new stacks (or stack-sized buffers) as needed.

Spawning a Thread

Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.

long sys_clone(unsigned long flags, void *child_stack);

Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.

long thread_create(void (*)(void));

The function pointer argument is passed via rdi, per the ABI. Store this for safekeeping on the stack (push) in preparation for calling stack_create(). When it returns, the address of the low end of the stack will be in rax.

thread_create:
    push rdi
    call stack_create
    lea rsi, [rax + STACK_SIZE - 8]
    pop qword [rsi]
    mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
             CLONE_PARENT | CLONE_THREAD | CLONE_IO
    mov rax, SYS_clone
    syscall
    ret

The second argument to clone() is a pointer to the high address of the stack (specifically, just above the stack). So we need to add STACK_SIZE to rax to get the high end. This is done with the lea instruction: load effective address. Despite the brackets, it doesn’t actually read memory at that address, but instead stores the address in the destination register (rsi). I’ve moved it back by 8 bytes because I’m going to place the thread function pointer at the “top” of the new stack in the next instruction. You’ll see why in a moment.

Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.
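In C terms, the lea and pop pair amounts to the pointer arithmetic below. This is only a sketch of the address computation with names I made up (stack_seed, thread_fn, demo_thread); it doesn’t spawn anything.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*thread_fn)(void);

/* Hypothetical thread entry point, used only to have something to store. */
void
demo_thread(void)
{
}

/* Given the low address and size of a fresh stack, reserve the topmost
 * pointer-sized slot (8 bytes on x86-64) and store `fn` in it. The
 * returned pointer is the value the new thread's rsp should have, so
 * that a following ret pops `fn` and jumps to it. */
thread_fn *
stack_seed(void *low, size_t size, thread_fn fn)
{
    assert(low && size >= sizeof(thread_fn));
    assert(size % sizeof(thread_fn) == 0);
    thread_fn *slot = (thread_fn *)((char *)low + size - sizeof(thread_fn));
    *slot = fn;
    return slot;
}
```

For a stack of eight pointer-sized slots, the seeded slot is the last one, and everything below it is free for the new thread to use.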

As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the child by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.

CLONE_THREAD: Put the new process in the same thread group.
CLONE_VM: Runs in the same virtual memory space.
CLONE_PARENT: Share a parent with the caller.
CLONE_SIGHAND: Share signal handlers.
CLONE_FS, CLONE_FILES, CLONE_IO: Share filesystem information, the file descriptor table, and I/O context.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).

Now here’s the really cool part, and the reason branching isn’t needed. There’s no reason to check rax to determine if we are the original thread (in which case we return to the caller) or if we’re the new thread (in which case we jump to the thread function). Remember how we seeded the new stack with the thread function? When the new thread returns (ret), it will jump to the thread function with a completely empty stack. The original thread, using the original stack, will return to the caller.

The value returned by thread_create() is the process ID of the new thread, which is essentially the thread object (e.g. Pthread’s pthread_t).
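For readers who’d rather experiment from C first, here is a rough sketch using glibc’s clone() wrapper instead of the raw system call. It is not the technique above: the wrapper takes a function and an argument and handles the stack seeding itself, and I’ve deliberately reduced the flags to CLONE_VM plus SIGCHLD so that the parent can wait on the child with plain waitpid(). The names spawn_and_wait, child_main, and shared_value are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_value;  /* visible to both tasks via CLONE_VM */

static int
child_main(void *arg)
{
    shared_value = *(int *)arg;
    return 0;  /* glibc's wrapper turns this into the exit status */
}

/* Run child_main in a task sharing our address space, wait for it, and
 * return what it stored in shared_value (or -1 on failure). */
int
spawn_and_wait(int value)
{
    size_t size = 64 * 1024;
    char *stack = malloc(size);
    if (!stack)
        return -1;
    /* clone() wants the high end of the stack since it grows downward. */
    pid_t pid = clone(child_main, stack + size, CLONE_VM | SIGCHLD, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    int status;
    pid_t r = waitpid(pid, &status, 0);
    free(stack);
    if (r != pid)
        return -1;
    return shared_value;
}
```

Because of CLONE_VM, the store made by the child is visible to the parent once waitpid() returns, which is the “threads are processes that share an address space” idea in miniature.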

Cleaning Up

The thread function has to be careful not to return (ret) since there’s nowhere to return. It will fall off the stack and terminate the program with a segmentation fault. Remember that threads are just processes? It must use the exit() syscall to terminate. This won’t terminate the other threads.

%define SYS_exit	60

exit:
    mov rax, SYS_exit
    syscall

Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.

More Exploration

If you found this interesting, be sure to check out the full demo link at the top of this article. Now with the ability to spawn threads, it’s a great opportunity to explore and experiment with x86’s synchronization primitives, such as the lock instruction prefix, xadd, and compare-and-exchange (cmpxchg). I’ll discuss these in a future article.

A Basic Just-In-Time Compiler

2015-03-19T04:57:55Z

This article was discussed on Hacker News and on reddit.

Monday’s /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It’s given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it’s an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0 then,

u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
…
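For reference, the interpreted approach the challenge asks for might look something like the sketch below. It’s not my submission, the function name recurrence_step() is made up, and it assumes well-formed input of single-character operators each followed by a decimal integer, with no division by zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Apply each operation in `ops` (e.g. "+2 *3 -5") to `u` in order and
 * return the result, i.e. compute u(n + 1) from u(n). */
long
recurrence_step(const char *ops, long u)
{
    const char *p = ops;
    while (*p) {
        while (*p == ' ')
            p++;
        if (!*p)
            break;
        char op = *p++;
        char *end;
        long operand = strtol(p, &end, 10);
        assert(end != p);  /* an operand must follow the operator */
        switch (op) {
        case '+': u += operand; break;
        case '-': u -= operand; break;
        case '*': u *= operand; break;
        case '/': assert(operand != 0); u /= operand; break;
        default:  assert(!"unknown operator");
        }
        p = end;
    }
    return u;
}
```

Starting from u(0) = 0 and calling it repeatedly with "+2 *3 -5" reproduces the sequence above: 1, 4, 13, 40, 121. Every step re-parses the string, which is exactly the overhead a JIT compiler removes.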

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there’s less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will cause a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

To keep things simple here, I’m just assuming the page size is 4kB. In a real program, we’d use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.
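A minimal sketch of that run-time query on a POSIX system follows. The helper names page_size() and round_to_pages() are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Query the system's base page size at run time (POSIX). */
size_t
page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);
    assert(size > 0);
    return (size_t)size;
}

/* Round `n` up to a whole number of pages, since memory maps are
 * handed out at page granularity. */
size_t
round_to_pages(size_t n)
{
    size_t page = page_size();
    return (n + page - 1) / page * page;
}
```

A more careful asmbuf would size its code array from page_size() rather than a compile-time constant.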

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It’s anonymous because it’s not backed by a file.

struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}

Windows doesn’t have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here’s the equivalent in Win32.

struct asmbuf *
asmbuf_create(void)
{
    DWORD type = MEM_RESERVE | MEM_COMMIT;
    return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}

Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

Or on Win32 (that last parameter is not allowed to be NULL),

void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}

Finally, instead of free() it gets unmapped.

void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}

And on Win32,

void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, 0, MEM_RELEASE);
}
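The listings above ignore return values to keep things short. One detail worth knowing is that mmap() reports failure with MAP_FAILED, which is not NULL. Below is a hedged sketch of POSIX variants that report errors to the caller. The _checked names are hypothetical and not part of the original program.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Returns NULL on failure. mmap() signals failure with MAP_FAILED,
 * so translate that into NULL for the caller. */
static struct asmbuf *
asmbuf_create_checked(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *p = mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 on success, -1 if the protections could not be changed. */
static int
asmbuf_finalize_checked(struct asmbuf *buf)
{
    assert(buf != NULL);
    return mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

/* Returns 0 on success, -1 on failure. */
static int
asmbuf_free_checked(struct asmbuf *buf)
{
    assert(buf != NULL);
    return munmap(buf, PAGE_SIZE);
}
```

A caller would bail out if creation returns NULL or if finalization returns non-zero, rather than jumping into a buffer that never became executable.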

I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

void asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
void asmbuf_immediate(struct asmbuf *, int size, const void *value);
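For readers who want something concrete, here is one way these two functions could be written. It is a sketch under my own assumptions (the bounds assertions in particular), not necessarily the definitions used in the real program. The instruction is written most significant byte first so that it matches the order ndisasm prints, while the immediate is copied exactly as it sits in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Append the low `size` bytes of `ins`, most significant byte first,
 * so 0x4889f8 lands in the buffer as 48 89 f8. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    assert(size >= 1 && size <= 8);
    assert(buf->count + size <= sizeof(buf->code));
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Append `size` bytes copied from `value` in host byte order, which
 * on x86-64 is the little endian order the CPU expects. */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    assert(size >= 0);
    assert(buf->count + size <= sizeof(buf->code));
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```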

Calling Conventions

We’re only going to be concerned with three of x86-64’s many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we’ll be able to call from C like a normal function. Here’s what its prototype will look like. It takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that’s where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   rax, rdi

There’s a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.

ret

So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.

Fortunately there’s a much easier way. We’ll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to nasm. It will produce a raw binary with the machine code, which we’ll disassemble with ndisasm (the NASM disassembler).

$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret

That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // ret

For each operation, we’ll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator’s instruction if it’s small (32 bits or fewer), but I’m keeping it simple. To sneakily capture the “template” for this instruction I’m going to use 0x0123456789abcdef as the operand.

mov   rdi, 0x0123456789abcdef

Which disassembled with ndisasm is,

00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301

Notice the operand listed little endian immediately after the instruction. That’s also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   rdi, operand
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
    case '+':
        asmbuf_ins(buf, 3, 0x4801f8);   // add   rax, rdi
        break;
    case '-':
        asmbuf_ins(buf, 3, 0x4829f8);   // sub   rax, rdi
        break;
    case '*':
        asmbuf_ins(buf, 4, 0x480fafc7); // imul  rax, rdi
        break;
    case '/':
        asmbuf_ins(buf, 2, 0x4899);     // cqo   (sign-extend rax into rdx)
        asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  rdi
        break;
}

As an exercise, try adding support for the modulus operator (%), XOR (^), and bit shifts (<, >). With the addition of these operators, you could define a decent PRNG as a recurrence relation. It will also eliminate the closed form solution to this problem so that we actually have a reason to do all this! Or, alternatively, switch it all to floating point.

Calling the Generated Code

Once we’re all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.

I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

Interactive Programming in C

2014-12-23T05:43:41Z

I’m a huge fan of interactive programming (see: JavaScript, Java, Lisp, Clojure). That is, modifying and extending a program while it’s running. For certain kinds of non-batch applications, it takes much of the tedium out of testing and tweaking during development. Until last week I didn’t know how to apply interactive programming to C. How does one go about redefining functions in a running C program?

Last week in Handmade Hero (days 21-25), Casey Muratori added interactive programming to the game engine. This is especially useful in game development, where the developer might want to tweak, say, a boss fight without having to restart the entire game after each tweak. Now that I’ve seen it done, it seems so obvious. The secret is to build almost the entire application as a shared library.

This puts a serious constraint on the design of the program: it cannot keep any state in global or static variables, though this should be avoided anyway. Global state will be lost each time the shared library is reloaded. In some situations, this can also restrict use of the C standard library, including functions like malloc(), depending on how these functions are implemented or linked. For example, if the C standard library is statically linked, functions with global state may introduce global state into the shared library. It’s difficult to know what’s safe to use. This works fine in Handmade Hero because the core game, the part loaded as a shared library, makes no use of external libraries, including the standard library.

Additionally, the shared library must be careful with its use of function pointers. The functions being pointed at will no longer exist after a reload. This is a real issue when combining interactive programming with object oriented C.

An example with the Game of Life

To demonstrate how this works, let’s go through an example. I wrote a simple ncurses Game of Life demo that’s easy to modify. You can get the entire source here if you’d like to play around with it yourself on a Unix-like system.

https://github.com/skeeto/interactive-c-demo

Quick start:

In a terminal run make then ./main. Press r to randomize and q to quit.
Edit game.c to change the Game of Life rules, add colors, etc.
In a second terminal run make. Your changes will be reflected immediately in the original program!

As of this writing, Handmade Hero is being written on Windows, so Casey is using a DLL and the Win32 API, but the same technique can be applied on Linux, or any other Unix-like system, using libdl. That’s what I’ll be using here.

The program will be broken into two parts: the Game of Life shared library (“game”) and a wrapper (“main”) whose job is only to load the shared library, reload it when it updates, and call it at a regular interval. The wrapper is agnostic about the operation of the “game” portion, so it could be re-used almost untouched in another project.

To avoid maintaining a whole bunch of function pointer assignments in several places, the API to the “game” is enclosed in a struct. This also eliminates warnings from the C compiler about mixing data and function pointers. The layout and contents of the game_state struct are private to the game itself. The wrapper will only handle a pointer to this struct.

struct game_state;

struct game_api {
    struct game_state *(*init)();
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

In the demo the API is made of 5 functions. The first 4 are primarily concerned with loading and unloading.

init(): Allocate and return a state to be passed to every other API call. This will be called once when the program starts and never again, even after reloading. If we were concerned about using malloc() in the shared library, the wrapper would be responsible for performing the actual memory allocation.
finalize(): The opposite of init(), to free all resources held by the game state.
reload(): Called immediately after the library is reloaded. This is the chance to sneak in some additional initialization in the running program. Normally this function will be empty. It’s only used temporarily during development.
unload(): Called just before the library is unloaded, before a new version is loaded. This is a chance to prepare the state for use by the next version of the library. This can be used to update structs and such, if you wanted to be really careful. This would also normally be empty.
step(): Called at a regular interval to run the game. A real game will likely have a few more functions like this.

The library will provide a filled out API struct as a global variable, GAME_API. This is the only exported symbol in the entire shared library! All functions will be declared static, including the ones referenced by the structure.

const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step
};
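To make the shape of the library side concrete, here is a minimal sketch of what a game.c along these lines might contain. The ticks field and the stop-after-three-steps rule are invented purely for illustration. Everything except GAME_API is static, and all state lives in the heap-allocated struct rather than in globals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

/* Private to the library: the wrapper only ever sees a pointer. */
struct game_state {
    long ticks;   /* hypothetical field, for illustration only */
};

static struct game_state *game_init(void)
{
    return calloc(1, sizeof(struct game_state));
}

static void game_finalize(struct game_state *state) { free(state); }
static void game_reload(struct game_state *state) { (void)state; }
static void game_unload(struct game_state *state) { (void)state; }

/* Returning false asks the wrapper's main loop to stop. */
static bool game_step(struct game_state *state)
{
    assert(state != NULL);
    state->ticks++;
    return state->ticks < 3;   /* demo only: stop after a few steps */
}

/* The one exported symbol. */
const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step
};
```

Because the tick counter lives in the state struct, it survives a reload. The same counter kept in a static variable would reset every time the library is replaced.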

dlopen, dlsym, and dlclose

The wrapper is focused on calling dlopen(), dlsym(), and dlclose() in the right order at the right time. The game will be compiled to the file libgame.so, so that’s what will be loaded. It’s written in the source with a ./ to force the name to be used as a filename. The wrapper keeps track of everything in a game struct.

const char *GAME_LIBRARY = "./libgame.so";

struct game {
    void *handle;
    ino_t id;
    struct game_api api;
    struct game_state *state;
};

The handle is the value returned by dlopen(). The id is the inode of the shared library, as returned by stat(). The rest is defined above. Why the inode? We could use a timestamp instead, but that’s indirect. What we really care about is if the shared object file is actually a different file than the one that was loaded. The file will never be updated in place; it will be replaced by the compiler/linker, so the timestamp isn’t what’s important.

Using the inode is a much simpler situation than in Handmade Hero. Due to Windows’ broken file locking behavior, the game DLL can’t be replaced while it’s being used. To work around this limitation, the build system and the loader have to rely on randomly-generated filenames.

void game_load(struct game *game)

The purpose of the game_load() function is to load the game API into a game struct, but only if either it hasn’t been loaded yet or if it’s been updated. Since it has several independent failure conditions, let’s examine it in parts.

struct stat attr;
if ((stat(GAME_LIBRARY, &attr) == 0) && (game->id != attr.st_ino)) {

First, use stat() to determine if the library’s inode is different than the one that’s already loaded. The id field will be 0 initially, so as long as stat() succeeds, this will load the library the first time.

    if (game->handle) {
        game->api.unload(game->state);
        dlclose(game->handle);
    }

If a library is already loaded, unload it first, being sure to call unload() to inform the library that it’s being updated. It’s critically important that dlclose() happens before dlopen(). On my system, dlopen() looks only at the string it’s given, not the file behind it. Even though the file has been replaced on the filesystem, dlopen() will see that the string matches a library already opened and return a pointer to the old library. (Is this a bug?) The handles are reference counted internally by libdl.

    void *handle = dlopen(GAME_LIBRARY, RTLD_NOW);

Finally load the game library. There’s a race condition here that cannot be helped due to limitations of dlopen(). The library may have been updated again since the call to stat(). Since we can’t ask dlopen() about the inode of the library it opened, we can’t know. But as this is only used during development, not in production, it’s not a big deal.

    if (handle) {
        game->handle = handle;
        game->id = attr.st_ino;
        /* ... more below ... */
    } else {
        game->handle = NULL;
        game->id = 0;
    }

If dlopen() fails, it will return NULL. In the case of ELF, this will happen if the compiler/linker is still in the process of writing out the shared library. Since the unload was already done, this means no game will be loaded when game_load() returns. The user of the struct needs to be prepared for this eventuality. It will need to try loading again later (e.g. in a few milliseconds). It may be worth filling the API with stub functions when no library is loaded.

    const struct game_api *api = dlsym(game->handle, "GAME_API");
    if (api != NULL) {
        game->api = *api;
        if (game->state == NULL)
            game->state = game->api.init();
        game->api.reload(game->state);
    } else {
        dlclose(game->handle);
        game->handle = NULL;
        game->id = 0;
    }

When the library loads without error, look up the GAME_API struct that was mentioned before and copy it into the local struct. Copying rather than using the pointer avoids one more layer of redirection when making function calls. The game state is initialized if it hasn’t been already, and the reload() function is called to inform the game it’s just been reloaded.

If looking up the GAME_API fails, close the handle and consider it a failure.
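The stub idea mentioned earlier could look something like the following sketch. The names GAME_API_STUB and game_api_reset() are hypothetical. The intent is that both failure paths call game_api_reset(&game->api), so the function pointers are always safe to call even when no library is loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

/* Do-nothing stand-ins used while no library is loaded. */
static struct game_state *stub_init(void) { return NULL; }
static void stub_noop(struct game_state *state) { (void)state; }

/* Keep the main loop running until a real library shows up. */
static bool stub_step(struct game_state *state)
{
    (void)state;
    return true;
}

static const struct game_api GAME_API_STUB = {
    .init     = stub_init,
    .finalize = stub_noop,
    .reload   = stub_noop,
    .unload   = stub_noop,
    .step     = stub_step
};

/* Point every entry of `api` at the stubs. */
static void
game_api_reset(struct game_api *api)
{
    assert(api != NULL);
    *api = GAME_API_STUB;
}
```

With this in place the wrapper's loop would not need to test the handle before calling step(), at the cost of one extra struct copy on each failed load.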

The main loop calls game_load() each time around. And that’s it!

int main(void)
{
    struct game game = {0};
    for (;;) {
        game_load(&game);
        if (game.handle)
            if (!game.api.step(game.state))
                break;
        usleep(100000);
    }
    game_unload(&game);
    return 0;
}

Now that I have this technique in my toolbelt, it has me itching to develop a proper, full game in C with OpenGL and all, perhaps in another Ludum Dare. The ability to develop interactively is very appealing.

How to build DOS COM files with GCC

2014-12-09T23:50:10Z

Update 2018: René Rebe builds upon this article in an interesting follow-up video (part 2).

Update 2020: DOS Defender was featured on GET OFF MY LAWN.

This past weekend I participated in Ludum Dare #31. Before the theme was even announced, due to recent fascination I wanted to make an old school DOS game. DOSBox would be the target platform since it’s the most practical way to run DOS applications anymore, despite modern x86 CPUs still being fully backwards compatible all the way back to the 16-bit 8086.

I successfully created and submitted a DOS game called DOS Defender. It’s a 32-bit 80386 real mode DOS COM program. All assets are embedded in the executable and there are no external dependencies, so the entire game is packed into that 10kB binary.

https://github.com/skeeto/dosdefender-ld31
DOSDEF.COM (10kB, v1.1.0, run in DOSBox)

You’ll need a joystick/gamepad in order to play. I included mouse support in the Ludum Dare release in order to make it easier to review, but this was removed because it doesn’t work well.

The most technically interesting part is that I didn’t need any DOS development tools to create this! I only used my everyday Linux C compiler (gcc). It’s not actually possible to build DOS Defender in DOS. Instead, I’m treating DOS as an embedded platform, which is the only form in which DOS still exists today. Along with DOSBox and DOSEMU, this is a pretty comfortable toolchain.

If all you care about is how to do this yourself, skip to the “Tricking GCC” section, where we’ll write a “Hello, World” DOS COM program with Linux’s GCC.

Finding the right tools

I didn’t have GCC in mind when I started this project. What really triggered all of this was that I had noticed Debian’s bcc package, Bruce’s C Compiler, that builds 16-bit 8086 binaries. It’s kept around for compiling x86 bootloaders and such, but it can also be used to compile DOS COM files, which was the part that interested me.

For some background: the Intel 8086 was a 16-bit microprocessor released in 1978. It had none of the fancy features of today’s CPUs: no memory protection, no floating point instructions, and only up to 1MB of RAM addressable. All modern x86 desktops and laptops can still pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same limited addressing and all. That’s some serious backwards compatibility. This feature is called real mode. It’s the mode in which all x86 computers boot. Modern operating systems switch to protected mode as soon as possible, which provides virtual addressing and safe multi-tasking. DOS is not one of these operating systems.

Unfortunately, bcc is not an ANSI C compiler. It supports a subset of K&R C, along with inline x86 assembly. Unlike other 8086 C compilers, it has no notion of “far” or “long” pointers, so inline assembly is required to access other memory segments (VGA, clock, etc.). Side note: the remnants of these 8086 “long pointers” still exist today in the Win32 API: LPSTR, LPWORD, LPDWORD, etc. The inline assembly isn’t anywhere near as nice as GCC’s inline assembly. The assembly code has to manually load variables from the stack so, since bcc supports two different calling conventions, the assembly ends up being hard-coded to one calling convention or the other.

Given all its limitations, I went looking for alternatives.

DJGPP

DJGPP is the DOS port of GCC. It’s a very impressive project, bringing almost all of POSIX to DOS. The DOS ports of many programs are built with DJGPP. In order to achieve this, it only produces 32-bit protected mode programs. If a protected mode program needs to manipulate hardware (e.g. VGA), it must make requests to a DOS Protected Mode Interface (DPMI) service. If I used DJGPP, I couldn’t make a single, standalone binary as I had wanted, since I’d need to include a DPMI server. There’s also a performance penalty for making DPMI requests.

Getting a DJGPP toolchain working can be difficult, to put it kindly. Fortunately I found a useful project, build-djgpp, that makes it easy, at least on Linux.

Either there’s a serious bug or the official DJGPP binaries have become infected again, because in my testing I kept getting the “Not COFF: check for viruses” error message when running my programs in DOSBox. To double check that it’s not an infection on my own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as a clean room. It’s impossible for this ARM-based device to get infected with an x86 virus. It still had the same problem, and all the binary hashes matched up between the machines, so it’s not my fault.

So given the DPMI issue and the above, I moved on.

Tricking GCC

What I finally settled on is a neat hack that involves “tricking” GCC into producing real mode DOS COM files, so long as it can target 80386 (as is usually the case). The 80386 was released in 1985 and was the first 32-bit x86 microprocessor. GCC still targets this instruction set today, even in the x86-64 toolchain. Unfortunately, GCC cannot actually produce 16-bit code, so my main goal of targeting 8086 would not be achievable. This doesn’t matter, though, since DOSBox, my intended platform, is an 80386 emulator.

In theory this should even work unchanged with MinGW, but there’s a long-standing MinGW bug that prevents it from working right (“cannot perform PE operations on non PE output file”). It’s still do-able, and I did it myself, but you’ll need to drop the OUTPUT_FORMAT directive and add an extra objcopy step (objcopy -O binary).

Hello World in DOS

To demonstrate how to do all this, let’s make a DOS “Hello, World” COM program using GCC on Linux.

There’s a significant burden with this technique: there will be no standard library. It’s basically like writing an operating system from scratch, except for the few services DOS provides. This means no printf() or anything of the sort. Instead we’ll ask DOS to print a string to the terminal. Making a request to DOS means firing an interrupt, which means inline assembly!

DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2F. The big one, and the one we’re interested in, is 0x21, function 0x09 (print string). Between DOS and BIOS, there are thousands of functions called this way. I’m not going to try to explain x86 assembly, but in short the function number is stuffed into register ah and interrupt 0x21 is fired. Function 0x09 also takes an argument, the pointer to the string to be printed, which is passed in registers dx and ds.

Here’s the GCC inline assembly print() function. Strings passed to this function must be terminated with a $. Why? Because DOS.

static void print(char *string)
{
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");
}

The assembly is declared volatile because it has a side effect (printing the string). To GCC, the assembly is an opaque hunk, and the optimizer relies on the output/input/clobber constraints (the last three lines). For DOS programs like this, all inline assembly will have side effects. This is because it’s not being written for optimization but to access hardware and DOS, things not accessible to plain C.

Care must also be taken by the caller, because GCC doesn’t know that the memory pointed to by string is ever read. It’s likely the array that backs the string needs to be declared volatile too. This is all foreshadowing into what’s to come: doing anything in this environment is an endless struggle against the optimizer. Not all of these battles can be won.

Now for the main function. The name of this function shouldn’t matter, but I’m avoiding calling it main() since MinGW has funny ideas about mangling this particular symbol, even when it’s asked not to.

int dosmain(void)
{
    print("Hello, World!\n$");
    return 0;
}

COM files are limited to 65,279 bytes in size. This is because an x86 memory segment is 64kB and COM files are simply loaded by DOS to 0x0100 in the segment and executed. There are no headers, it’s just a raw binary. Since a COM program can never be of any significant size, and no real linking needs to occur (freestanding), the entire thing will be compiled as one translation unit. It will be one call to GCC with a bunch of options.

Compiler Options

Here are the essential compiler options.

-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding

Since no standard libraries are in use, the only difference between gnu99 and c99 is that trigraphs are disabled (as they should be) and inline assembly can be written as asm instead of __asm__. It’s a no brainer. This project will be so closely tied to GCC that I don’t care about using GCC extensions anyway.

I’m using -Os to keep the compiled output as small as possible. It will also make the program run faster. This is important when targeting DOSBox because, by default, it will deliberately run as slow as a machine from the 1980’s. I want to be able to fit in that constraint. If the optimizer is causing problems, you may need to temporarily make this -O0 to determine if the problem is your fault or the optimizer’s fault.

You see, the optimizer doesn’t understand that the program will be running in real mode, and under its addressing constraints. It will perform all sorts of invalid optimizations that break your perfectly valid programs. It’s not a GCC bug since we’re doing crazy stuff here. I had to rework my code a number of times to stop the optimizer from breaking my program. For example, I had to avoid returning complex structs from functions because they’d sometimes be filled with garbage. The real danger here is that a future version of GCC will be more clever and will break more stuff. In this battle, volatile is your friend.

The next option is -nostdlib, since there are no valid libraries for us to link against, even statically.

The options -m32 -march=i386 set the compiler to produce 80386 code. If I was writing a bootloader for a modern computer, targeting 80686 would be fine, too, but DOSBox is 80386.

The -ffreestanding argument requires that GCC not emit code that calls built-in standard library helper functions. Sometimes instead of emitting code to do something, it emits code that calls a built-in function to do it, especially with math operators. This was one of the main problems I had with bcc, where this behavior couldn’t be disabled. This is most commonly used in writing bootloaders and kernels. And now DOS COM files.

Linker Options

The -Wl option is used to pass arguments to the linker (ld). We need it since we’re doing all this in one call to GCC.

-Wl,--nmagic,--script=com.ld

The --nmagic turns off page alignment of sections. One, we don’t need this. Two, that would waste precious space. In my tests it doesn’t appear to be necessary, but I’m including it just in case.

The --script option tells the linker that we want to use a custom linker script. This allows us to precisely lay out the sections (text, data, bss, rodata) of our program. Here’s the com.ld script.

OUTPUT_FORMAT(binary)
SECTIONS
{
    . = 0x0100;
    .text :
    {
        *(.text);
    }
    .data :
    {
        *(.data);
        *(.bss);
        *(.rodata);
    }
    _heap = ALIGN(4);
}

The OUTPUT_FORMAT(binary) says not to put this into an ELF (or PE, etc.) file. The linker should just dump the raw code. A COM file is just raw code, so this means the linker will produce a COM file!

I had said that COM files are loaded to 0x0100. The fourth line offsets the binary to this location. The first byte of the COM file will still be the first byte of code, but it will be designed to run from that offset in memory.

What follows is all the sections, text (program), data (static data), bss (zero-initialized data), rodata (strings). Finally I mark the end of the binary with the symbol _heap. This will come in handy later for writing sbrk(), after we’re done with “Hello, World.” I’ve asked for the _heap position to be 4-byte aligned.

We’re almost there.

Program Startup

The linker is usually aware of our entry point (main) and sets that up for us. But since we asked for “binary” output, we’re on our own. If the print() function is emitted first, our program’s execution will begin with executing that function, which is invalid. Our program needs a little header stanza to get things started.

The linker script has a STARTUP option for handling this, but to keep it simple we’ll put that right in the program. This is usually called crt0.o or Boot.o, in case those names ever come up in your own reading. This inline assembly must be the very first thing in our code, before any includes and such. DOS will do most of the setup for us; we really just have to jump to the entry point.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C, %ah\n"
     "int   $0x21\n");

The .code16gcc tells the assembler that we’re going to be running in real mode, so that it makes the proper adjustment. Despite the name, this will not make it produce 16-bit code! First it calls dosmain, the function we wrote above. Then it informs DOS, using function 0x4C (terminate with return code), that we’re done, passing the exit code along in the 1-byte register al (already set by dosmain). This inline assembly is automatically volatile because it has no inputs or outputs.

Everything at Once

Here’s the entire C program.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C, %ah\n"
     "int   $0x21\n");

static void print(char *string)
{
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");
}

int dosmain(void)
{
    print("Hello, World!\n$");
    return 0;
}

I won’t repeat com.ld. Here’s the call to GCC.

gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
    -o hello.com -Wl,--nmagic,--script=com.ld hello.c

And testing it in DOSBox:

From here if you want fancy graphics, it’s just a matter of making an interrupt and writing to VGA memory. If you want sound you can perform an interrupt for the PC speaker. I haven’t sorted out how to call Sound Blaster yet. It was from this point that I grew DOS Defender.

Memory Allocation

To cover one more thing, remember that _heap symbol? We can use it to implement sbrk() for dynamic memory allocation within the main program segment. This is real mode, and there’s no virtual memory, so we’re free to write to any memory we can address at any time. Some of this is reserved (i.e. low and high memory) for hardware. So using sbrk() specifically isn’t really necessary, but it’s interesting to implement ourselves.

As is normal on x86, your text and data segments are at a low address (0x0100 in this case) and the stack is at a high address (around 0xffff in this case). On Unix-like systems, the memory returned by malloc() comes from two places: sbrk() and mmap(). What sbrk() does is allocate memory just above the text/data segments, growing “up” towards the stack. Each call to sbrk() will grow this space (or leave it exactly the same). That memory would then be managed by malloc() and friends.

Here’s how we can get sbrk() in a COM program. Notice I have to define my own size_t, since we don’t have a standard library.

typedef unsigned short  size_t;

extern char _heap;
static char *hbreak = &_heap;

static void *sbrk(size_t size)
{
    char *ptr = hbreak;
    hbreak += size;
    return ptr;
}

It just sets a pointer to _heap and grows it as needed. A slightly smarter sbrk() would be careful about alignment as well.
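Here is a sketch of what an alignment-aware version might look like. So that it can be built and tried on an ordinary hosted system, a static array stands in for the region that starts at _heap. That array, the 1kB limit, and the name sbrk_aligned() are all assumptions for illustration. In the COM program the upper bound would instead be wherever you decide the stack begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the region that begins at the linker's _heap symbol.
 * Declared as 32-bit words so the start is 4-byte aligned. */
static uint32_t arena[256];
static char *hbreak = (char *)arena;

/* Like the sbrk() above, but every returned pointer is 4-byte aligned
 * and requests that would run past the end of the region return NULL. */
static void *
sbrk_aligned(size_t size)
{
    size_t rounded = (size + 3) & ~(size_t)3;   /* round up to 4 */
    size_t avail = (size_t)((char *)arena + sizeof(arena) - hbreak);
    assert(((uintptr_t)hbreak & 3) == 0);       /* invariant */
    if (rounded < size || rounded > avail)
        return NULL;
    char *ptr = hbreak;
    hbreak += rounded;
    return ptr;
}
```

Rounding the size up, rather than the pointer, keeps the break itself aligned at all times, so each call only needs a single addition.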

In the making of DOS Defender an interesting thing happened. I was (incorrectly) counting on the memory returned by my sbrk() being zeroed. This was the case the first time the game ran. However, DOS doesn’t zero this memory between programs. When I would run my game again, it would pick right up where it left off, because the same data structures with the same contents were loaded back into place. A pretty cool accident! It’s part of what makes this a fun embedded platform.

C Object Oriented Programming

2014-10-21T03:52:43Z

~~Object oriented programming, polymorphism in particular, is essential to nearly any large, complex software system. Without it, decoupling different system components is difficult.~~ (Update in 2017: I no longer agree with this statement.) C doesn’t come with object oriented capabilities, so large C programs tend to grow their own out of C’s primitives. This includes huge C projects like the Linux kernel, BSD kernels, and SQLite.

Starting Simple

Suppose you’re writing a function pass_match() that takes an input stream, an output stream, and a pattern. It works sort of like grep. It passes to the output each line of input that matches the pattern. The pattern string contains a shell glob pattern to be handled by POSIX fnmatch(). Here’s what the interface looks like.

void pass_match(FILE *in, FILE *out, const char *pattern);

Glob patterns are simple enough that pre-compilation, as would be done for a regular expression, is unnecessary. The bare string is enough.

Some time later the customer wants the program to support regular expressions in addition to shell-style glob patterns. For efficiency’s sake, regular expressions need to be pre-compiled and so will not be passed to the function as a string. It will instead be a POSIX regex_t object. A quick-and-dirty approach might be to accept both and match whichever one isn’t NULL.

void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);

Bleh. This is ugly and won’t scale well. What happens when more kinds of filters are needed? It would be much better to accept a single object that covers both cases, and possibly even another kind of filter in the future.

A Generalized Filter

One of the most common ways to customize the behavior of a function in C is to pass a function pointer. For example, the final argument to qsort() is a comparator that determines how objects get sorted.

For pass_match(), this function would accept a string and return a boolean value deciding if the string should be passed to the output stream. It gets called once on each line of input.

void pass_match(FILE *in, FILE *out, bool (*match)(const char *));

However, this has one of the same problems as qsort(): the passed function lacks context. It needs a pattern string or regex_t object to operate on. In other languages these would be attached to the function as a closure, but C doesn’t have closures. It would need to be smuggled in via a global variable, which is not good.

static regex_t regex;  // BAD!!!

bool regex_match(const char *string)
{
    return regexec(&regex, string, 0, NULL, 0) == 0;
}

Because of the global variable, in practice pass_match() would be neither reentrant nor thread-safe. We could take a lesson from GNU’s qsort_r() and accept a context to be passed to the filter function. This simulates a closure.

void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context);

The provided context pointer would be passed to the filter function as the second argument, and no global variables are needed. This would probably be good enough for most purposes and it’s about as simple as possible. The interface to pass_match() would cover any kind of filter.
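To make the context-passing interface concrete, here is a minimal sketch of one possible pass_match() body together with a glob callback whose context is simply the pattern string. The 4096-byte line buffer and the `glob_match` name are assumptions for illustration; the article does not show its own implementation.

```c
#define _POSIX_C_SOURCE 200809L  /* for fnmatch() */
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Filter callback: the context is the glob pattern itself. */
static bool glob_match(const char *line, void *context)
{
    return fnmatch(context, line, 0) == 0;
}

/* Copy each matching line from in to out. Assumes lines fit in the
 * buffer; longer lines would be tested in pieces. */
void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context)
{
    char line[4096];
    assert(in && out && match);
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\n")] = 0;  /* strip the newline before matching */
        if (match(line, context)) {
            fputs(line, out);
            fputc('\n', out);
        }
    }
}
```

A regex callback would look the same, except that its context would point at a compiled regex_t instead of a string.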

But wouldn’t it be nice to package the function and context together as one object?

More Abstraction

How about putting the context on a struct and making an interface out of that? Here’s a tagged union that behaves as one or the other.

enum filter_type { GLOB, REGEX };

struct filter {
    enum filter_type type;
    union {
        const char *pattern;
        regex_t regex;
    } context;
};

There’s one function for interacting with this struct: filter_match(). It checks the type member and calls the correct function with the correct context.

bool filter_match(struct filter *filter, const char *string)
{
    switch (filter->type) {
    case GLOB:
        return fnmatch(filter->context.pattern, string, 0) == 0;
    case REGEX:
        return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
    }
    abort(); // programmer error
}

And the pass_match() API now looks like this. This will be the final change to pass_match(), both in implementation and interface.

void pass_match(FILE *input, FILE *output, struct filter *filter);

It still doesn’t care how the filter works, so it’s good enough to cover all future cases. It just calls filter_match() on the pointer it was given. However, the switch and tagged union aren’t friendly to extension. Really, it’s outright hostile. We finally have some degree of polymorphism, but it’s crude. It’s like building duct tape into a design. Adding new behavior means adding another switch case. This is a step backwards. We can do better.

Methods

With the switch we’re no longer taking advantage of function pointers. So what about putting a function pointer on the struct?

struct filter {
    bool (*match)(struct filter *, const char *);
};

The filter itself is passed as the first argument, providing context. In object oriented languages, that’s the implicit this argument. To avoid requiring the caller to worry about this detail, we’ll hide it in a new switch-free version of filter_match().

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}

Notice we’re still lacking the actual context, the pattern string or the regex object. Those will be different structs that embed the filter struct.

struct filter_regex {
    struct filter filter;
    regex_t regex;
};

struct filter_glob {
    struct filter filter;
    const char *pattern;
};

For both the original filter struct is the first member. This is critical. We’re going to be using a trick called type punning. The first member is guaranteed to be positioned at the beginning of the struct, so a pointer to a struct filter_glob is also a pointer to a struct filter. Notice any resemblance to inheritance?

Each type, glob and regex, needs its own match method.

static bool
method_match_regex(struct filter *filter, const char *string)
{
    struct filter_regex *regex = (struct filter_regex *) filter;
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static bool
method_match_glob(struct filter *filter, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) filter;
    return fnmatch(glob->pattern, string, 0) == 0;
}

I’ve prefixed them with method_ to indicate their intended usage. I declared these static because they’re completely private. Other parts of the program will only be accessing them through a function pointer on the struct. This means we need some constructors in order to set up those function pointers. (For simplicity, I’m not error checking.)

struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.match = method_match_regex;
    return &regex->filter;
}

struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = pattern;
    glob->filter.match = method_match_glob;
    return &glob->filter;
}

Now this is real polymorphism. It’s really simple from the user’s perspective. They call the correct constructor and get a filter object that has the desired behavior. This object can be passed around trivially, and no other part of the program worries about how it’s implemented. Best of all, since each method is a separate function rather than a switch case, new kinds of filter subtypes can be defined independently. Users can create their own filter types that work just as well as the two “built-in” filters.
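As a sketch of what a user-defined filter type might look like, here is a prefix filter built on the same single-method struct. The `filter_prefix` type and its functions are hypothetical examples, not part of the article's program, and the assertion in filter_match() is an addition for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct filter {
    bool (*match)(struct filter *, const char *);
};

bool filter_match(struct filter *filter, const char *string)
{
    assert(filter && filter->match);
    return filter->match(filter, string);
}

/* Hypothetical user-defined subtype: matches strings starting with
 * a given prefix. */
struct filter_prefix {
    struct filter filter;   /* must be first for the cast below */
    const char *prefix;
    size_t length;
};

static bool
method_match_prefix(struct filter *filter, const char *string)
{
    struct filter_prefix *p = (struct filter_prefix *) filter;
    return strncmp(string, p->prefix, p->length) == 0;
}

struct filter *filter_prefix_create(const char *prefix)
{
    struct filter_prefix *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    p->prefix = prefix;
    p->length = strlen(prefix);
    p->filter.match = method_match_prefix;
    return &p->filter;
}
```

Nothing in filter_match() or pass_match() has to change to accommodate the new type, which is the point of dispatching through the function pointer.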

Cleaning Up

Oops, the regex filter needs to be cleaned up when it’s done, but the user, by design, won’t know how to do it. Let’s add a free() method.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

void filter_free(struct filter *filter)
{
    filter->free(filter);
}

And the methods for each. These would also be assigned in the constructor.

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = (struct filter_regex *) f;
    regfree(&regex->regex);
    free(f);
}

static void
method_free_glob(struct filter *f)
{
    free(f);
}

The glob constructor should perhaps strdup() its pattern as a private copy, in which case it would be freed here.
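A minimal sketch of that variant, assuming POSIX strdup() is available: the constructor takes a private copy of the pattern and the free method releases it. The error checks are additions for this sketch; the article's constructors omit them for simplicity.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() and fnmatch() */
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

struct filter_glob {
    struct filter filter;
    char *pattern;  /* private copy owned by the filter */
};

static bool
method_match_glob(struct filter *f, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    return fnmatch(glob->pattern, string, 0) == 0;
}

static void
method_free_glob(struct filter *f)
{
    struct filter_glob *glob = (struct filter_glob *) f;
    free(glob->pattern);  /* release the private copy first */
    free(glob);
}

struct filter *filter_glob_create(const char *pattern)
{
    assert(pattern != NULL);
    struct filter_glob *glob = malloc(sizeof(*glob));
    if (glob == NULL)
        return NULL;
    glob->pattern = strdup(pattern);
    if (glob->pattern == NULL) {
        free(glob);
        return NULL;
    }
    glob->filter.match = method_match_glob;
    glob->filter.free = method_free_glob;
    return &glob->filter;
}
```

With the copy in place, the caller is free to modify or release its own pattern buffer after constructing the filter.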

Object Composition

A good rule of thumb is to prefer composition over inheritance. Having tidy filter objects opens up some interesting possibilities for composition. Here’s an AND filter that composes two arbitrary filter objects. It only matches when both its subfilters match. It supports short circuiting, so put the faster, or most discriminating, filter first in the constructor (user’s responsibility).

struct filter_and {
    struct filter filter;
    struct filter *sub[2];
};

static bool
method_match_and(struct filter *f, const char *s)
{
    struct filter_and *and = (struct filter_and *) f;
    return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}

static void
method_free_and(struct filter *f)
{
    struct filter_and *and = (struct filter_and *) f;
    filter_free(and->sub[0]);
    filter_free(and->sub[1]);
    free(f);
}

struct filter *filter_and(struct filter *a, struct filter *b)
{
    struct filter_and *and = malloc(sizeof(*and));
    and->sub[0] = a;
    and->sub[1] = b;
    and->filter.match = method_match_and;
    and->filter.free = method_free_and;
    return &and->filter;
}

It can combine a regex filter and a glob filter, or two regex filters, or two glob filters, or even other AND filters. It doesn’t care what the subfilters are. Also, the free() method here frees its subfilters. This means that the user doesn’t need to keep hold of every filter created, just the “top” one in the composition.

To make composition filters easier to use, here are two “constant” filters. These are statically allocated, shared, and are never actually freed.

static bool
method_match_any(struct filter *f, const char *string)
{
    return true;
}

static bool
method_match_none(struct filter *f, const char *string)
{
    return false;
}

static void
method_free_noop(struct filter *f)
{
}

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

The FILTER_NONE filter will generally be used with a (theoretical) filter_or() and FILTER_ANY will generally be used with the previously defined filter_and().
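For completeness, here is a sketch of how the theoretical filter_or() might look, mirroring filter_and(). It is an assumption about a function the article only mentions; the constant filters are repeated in condensed form so the block stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

bool filter_match(struct filter *f, const char *s) { return f->match(f, s); }
void filter_free(struct filter *f) { f->free(f); }

/* The constant filters from above, condensed. */
static bool method_match_any(struct filter *f, const char *s)
{
    (void)f; (void)s;
    return true;
}

static bool method_match_none(struct filter *f, const char *s)
{
    (void)f; (void)s;
    return false;
}

static void method_free_noop(struct filter *f) { (void)f; }

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

struct filter_or {
    struct filter filter;
    struct filter *sub[2];
};

/* Matches when either subfilter matches; short-circuits on the first. */
static bool method_match_or(struct filter *f, const char *s)
{
    struct filter_or *self = (struct filter_or *) f;
    return filter_match(self->sub[0], s) || filter_match(self->sub[1], s);
}

/* Frees both subfilters, so only the top filter needs to be kept. */
static void method_free_or(struct filter *f)
{
    struct filter_or *self = (struct filter_or *) f;
    filter_free(self->sub[0]);
    filter_free(self->sub[1]);
    free(self);
}

struct filter *filter_or(struct filter *a, struct filter *b)
{
    assert(a && b);
    struct filter_or *self = malloc(sizeof(*self));
    if (self == NULL)
        return NULL;
    self->sub[0] = a;
    self->sub[1] = b;
    self->filter.match = method_match_or;
    self->filter.free = method_free_or;
    return &self->filter;
}
```

FILTER_NONE is the identity for OR in the same way FILTER_ANY is the identity for AND, so a loop that folds many filters together can start from it.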

Here’s a simple program that composes multiple glob filters into a single filter, one for each program argument.

int main(int argc, char **argv)
{
    struct filter *filter = &FILTER_ANY;
    for (char **p = argv + 1; *p; p++)
        filter = filter_and(filter_glob_create(*p), filter);
    pass_match(stdin, stdout, filter);
    filter_free(filter);
    return 0;
}

Notice only one call to filter_free() is needed to clean up the entire filter.

Multiple Inheritance

As I mentioned before, the filter struct must be the first member of filter subtype structs in order for type punning to work. If we want to “inherit” from two different types like this, they would both need to be in this position: a contradiction.

Fortunately type punning can be generalized such that the first-member constraint isn’t necessary. This is commonly done through a container_of() macro. Here’s a C99-conforming definition.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Given a pointer to a member of a struct, the container_of() macro allows us to back out to the containing struct. Suppose the regex struct was defined differently, so that the regex_t member came first.

struct filter_regex {
    regex_t regex;
    struct filter filter;
};

The constructor remains unchanged. The casts in the methods change to the macro.

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);  /* not f: filter is no longer the first member */
}

It’s a constant, compile-time computed offset, so there should be no practical performance impact. The filter can now participate freely in other intrusive data structures, like linked lists and such. It’s analogous to multiple inheritance.
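A small sketch of that idea: one struct embeds both a filter and an intrusive list link, and container_of() recovers the containing struct from either. The `list_node`, `filter_exact`, and `any_match` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct filter {
    bool (*match)(struct filter *, const char *);
};

/* Hypothetical intrusive singly-linked list link. */
struct list_node {
    struct list_node *next;
};

/* "Inherits" from both; neither base is the first member. */
struct filter_exact {
    const char *text;
    struct list_node link;
    struct filter filter;
};

static bool
method_match_exact(struct filter *f, const char *string)
{
    struct filter_exact *e = container_of(f, struct filter_exact, filter);
    assert(e->text != NULL);
    return strcmp(e->text, string) == 0;
}

/* Walk the list, recovering each filter from its embedded link. */
static bool any_match(struct list_node *head, const char *string)
{
    for (struct list_node *n = head; n; n = n->next) {
        struct filter_exact *e = container_of(n, struct filter_exact, link);
        if (e->filter.match(&e->filter, string))
            return true;
    }
    return false;
}
```

The list code knows nothing about filters and the filter code knows nothing about lists; only the containing struct ties them together.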

Vtables

Say we want to add a third method, clone(), to the filter API, to make an independent copy of a filter, one that will need to be separately freed. It will be like the copy constructor in C++. Each kind of filter will need to define an appropriate “method” for it. As long as new methods like this are added at the end, this doesn’t break the API, but it does break the ABI regardless.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

The filter object is starting to get big. It’s got three pointers — 24 bytes on modern systems — and these pointers are the same between all instances of the same type. That’s a lot of redundancy. Instead, these pointers could be shared between instances in a common table called a virtual method table, commonly known as a vtable.

Here’s a vtable version of the filter API. The overhead is now only one pointer regardless of the number of methods in the interface.

struct filter {
    struct filter_vtable *vtable;
};

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

Each type creates its own vtable and links to it in the constructor. Here’s the regex filter re-written for the new vtable API and clone method. This is all the tricks in one basket for a big object oriented C finale!

struct filter *filter_regex_create(const char *pattern);

struct filter_regex {
    regex_t regex;
    const char *pattern;
    struct filter filter;
};

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);  /* not f: filter is not the first member */
}

static struct filter *
method_clone_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return filter_regex_create(regex->pattern);
}

/* vtable */
struct filter_vtable filter_regex_vtable = {
    method_match_regex, method_free_regex, method_clone_regex
};

/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regex->pattern = pattern;
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.vtable = &filter_regex_vtable;
    return &regex->filter;
}

This is almost exactly what’s going on behind the scenes in C++. When a method/function is declared virtual, and therefore dispatches based on the run-time type of its left-most argument, it’s listed in the vtables for classes that implement it. Otherwise it’s just a normal function. This is why functions need to be declared virtual ahead of time in C++.
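The vtable listing stops short of the dispatch wrappers, so here is a sketch of how filter_match(), filter_free(), and a new filter_clone() might route through the table. The `filter_exact` subtype is a hypothetical stand-in chosen to keep the block short, and marking the vtable const is a choice made for this sketch, not something the article does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct filter;

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

struct filter {
    const struct filter_vtable *vtable;
};

/* Dispatch wrappers: callers never touch the vtable directly. */
bool filter_match(struct filter *f, const char *s)
{
    return f->vtable->match(f, s);
}

void filter_free(struct filter *f)
{
    f->vtable->free(f);
}

struct filter *filter_clone(struct filter *f)
{
    return f->vtable->clone(f);
}

/* Hypothetical subtype: matches one exact string. */
struct filter_exact {
    const char *text;
    struct filter filter;
};

struct filter *filter_exact_create(const char *text);

static bool method_match_exact(struct filter *f, const char *s)
{
    struct filter_exact *e = container_of(f, struct filter_exact, filter);
    return strcmp(e->text, s) == 0;
}

static void method_free_exact(struct filter *f)
{
    /* Free the containing struct, not the embedded member. */
    free(container_of(f, struct filter_exact, filter));
}

static struct filter *method_clone_exact(struct filter *f)
{
    struct filter_exact *e = container_of(f, struct filter_exact, filter);
    return filter_exact_create(e->text);
}

/* One table shared by every instance of the type. */
static const struct filter_vtable filter_exact_vtable = {
    method_match_exact, method_free_exact, method_clone_exact
};

struct filter *filter_exact_create(const char *text)
{
    assert(text != NULL);
    struct filter_exact *e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->text = text;
    e->filter.vtable = &filter_exact_vtable;
    return &e->filter;
}
```

A clone is independent of its source: freeing one leaves the other usable, and each must be freed separately.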

In conclusion, it’s relatively easy to get the core benefits of object oriented programming in plain old C. It doesn’t require heavy use of macros, nor do users of these systems need to know that underneath it’s an object system, unless they want to extend it for themselves.

Here’s the whole example program all at once if you’re interested in poking:

https://gist.github.com/skeeto/5faa131b19673549d8ca

C11 Lock-free Stack

2014-09-02T03:10:01Z

C11, the latest C standard revision, hasn’t received anywhere near the same amount of fanfare as C++11. I’m not sure why this is. Some of the updates to each language are very similar, such as formal support for threading and atomic object access. Three years have passed and some parts of C11 still haven’t been implemented by any compiler or standard library. Since there’s not yet a lot of discussion online about C11, I’m basing much of this article on my own understanding of the C11 draft. I may be under-using the _Atomic type specifier and not paying enough attention to memory ordering constraints.

Still, this is a good opportunity to break new ground with a demonstration of C11. I’m going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you’ll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It’s in Debian unstable but not in Wheezy.

If you want to take a look before going further, here’s the source. The test code in the repository uses plain old pthreads because C11 threads haven’t been implemented by anyone yet.

https://github.com/skeeto/lstack

I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won’t run just anywhere. I’ll discuss all this below. So sorry for the long article. It’s just a lot more complex a topic than I had anticipated!

Lock-free

A lock-free data structure doesn’t require the use of mutex locks. More generally, it’s a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations — transformations that cannot be interrupted. Lock-free data structures will generally provide better throughput than mutex locks. And it’s usually safer, because there’s no risk of getting stuck on a lock that will never be freed, such as a deadlock situation. On the other hand there’s additional risk of starvation (livelock), where a thread is unable to make progress.

As a demonstration, I’ll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it’s going to be implemented as a linked-list, so pushing and popping is O(1) time, just a matter of consing a new element on the head of the list. It also means there’s only one value to be updated when pushing and popping: the pointer to the head of the list.

Here’s what the API will look like. I’ll define lstack_t shortly. I’m making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.

int     lstack_init(lstack_t *lstack, size_t max_size);
void    lstack_free(lstack_t *lstack);
size_t  lstack_size(lstack_t *lstack);
int     lstack_push(lstack_t *lstack, void *value);
void   *lstack_pop (lstack_t *lstack);

Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.

Notice there’s a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc(). An important secondary benefit to pre-allocating the stack’s memory is that this implementation doesn’t require the use of hazard pointers, which would be far more complicated than the stack itself.

The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack and, before the node can be freed for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being “full.”

The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.

Structures

Here’s the definition for a node in the stack. Neither field needs to be accessed atomically, so they’re not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it’s effectively immutable (outside of reuse). Users never need to touch this structure.

struct lstack_node {
    void *value;
    struct lstack_node *next;
};

Internally a lstack_t is composed of two stacks: the value stack (head) and the free node stack (free). These will be handled identically by the atomic functions, so it’s really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here’s what an internal stack looks like.

struct lstack_head {
    uintptr_t aba;
    struct lstack_node *node;
};

There’s still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I’ll go over it shortly. It’s declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 — it’s only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger — but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that’s not true for any reason, this code will not link. Users will never directly access or handle this struct either.
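If a clearer diagnostic than a link failure is wanted, the size assumption could also be checked at compile time. This is an optional sketch using C11's static_assert, not something the lstack source is known to do.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

struct lstack_node;

struct lstack_head {
    uintptr_t aba;
    struct lstack_node *node;
};

/* Fail the build early, with a readable message, if the layout
 * assumptions behind the double-width CAS do not hold. */
static_assert(sizeof(uintptr_t) == sizeof(void *),
              "uintptr_t must be pointer-sized");
static_assert(sizeof(struct lstack_head) == 2 * sizeof(void *),
              "lstack_head must be exactly two pointers wide");
```

The checks cost nothing at run time, since both conditions are evaluated entirely by the compiler.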

Finally, here’s the actual stack structure.

typedef struct {
    struct lstack_node *node_buffer;
    _Atomic struct lstack_head head, free;
    _Atomic size_t size;
} lstack_t;

Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, even just for reading (use atomic_load()).

The size field is for convenience to check the number of elements on the stack. It’s accessed separately from the stack nodes themselves, so it’s not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there’s no way to lock the lock-free stack, there’s otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop().

There’s no reason to use volatile here. That’s a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.

Stack Functions

As stated before, all nodes are initially placed on the internal free stack. During initialization they’re allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT, which deals with memory access ordering concerns. The aba counters don’t actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.

int
lstack_init(lstack_t *lstack, size_t max_size)
{
    struct lstack_head head_init = {0, NULL};
    lstack->head = ATOMIC_VAR_INIT(head_init);
    lstack->size = ATOMIC_VAR_INIT(0);

    /* Pre-allocate all nodes. */
    lstack->node_buffer = malloc(max_size * sizeof(struct lstack_node));
    if (lstack->node_buffer == NULL)
        return ENOMEM;
    for (size_t i = 0; i < max_size - 1; i++)
        lstack->node_buffer[i].next = lstack->node_buffer + i + 1;
    lstack->node_buffer[max_size - 1].next = NULL;
    struct lstack_head free_init = {0, lstack->node_buffer};
    lstack->free = ATOMIC_VAR_INIT(free_init);
    return 0;
}

The free nodes will not necessarily be used in the same order that they’re placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked list is still necessary even though allocation is contiguous.

The reverse of lstack_init() is simple, and it’s assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.

static inline void
lstack_free(lstack_t *lstack)
{
    free(lstack->node_buffer);
}

To read an atomic value we need to use atomic_load(). Give it a pointer to an atomic value, it dereferences the pointer and returns the value. This is used in another inline function for reading the size of the stack.

static inline size_t
lstack_size(lstack_t *lstack)
{
    return atomic_load(&lstack->size);
}

Push and Pop

For operating on the two stacks there will be two internal, static functions, push and pop. These deal directly in nodes, accepting and returning them, so they’re not suitable to expose in the API (users aren’t meant to be aware of nodes). This is the most complex part of lock-free stacks. Here’s pop().

static struct lstack_node *
pop(_Atomic struct lstack_head *head)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        if (orig.node == NULL)
            return NULL;  // empty stack
        next.aba = orig.aba + 1;
        next.node = orig.node->next;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
    return orig.node;
}

It’s centered around the new C11 stdatomic.h function atomic_compare_exchange_weak(). This is an atomic operation more generally called compare-and-swap (CAS). On x86 there’s an instruction specifically for this, cmpxchg. Give it a pointer to the atomic value to be updated (head), a pointer to the value it’s expected to be (orig), and a desired new value (next). If the expected and actual values match, it’s updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.

The “weak” part means it will sometimes spuriously fail where the “strong” version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast and the strong version when it’s slow (when trying again is expensive), or if you don’t need a loop at all. You usually want to use weak.
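The same retry pattern works on plain integers, which makes for a smaller example than the stack. Here is a sketch of an atomic "maximum" built on the weak CAS; `atomic_max` is a hypothetical helper, not part of stdatomic.h.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically raise *target to at least value. Returns the value that
 * was observed just before the update (or before giving up). */
static int atomic_max(_Atomic int *target, int value)
{
    int seen = atomic_load(target);
    while (seen < value &&
           !atomic_compare_exchange_weak(target, &seen, value)) {
        /* On failure, real or spurious, seen now holds the latest
         * value, so the loop condition re-checks against it. */
    }
    return seen;
}
```

The loop body is empty and cheap to retry, which is the situation where the weak version is the natural choice.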

The alternative to CAS is load-link/store-conditional. It’s a stronger primitive that doesn’t suffer from the ABA problem described next, but it’s also not available on x86-64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).

The ABA Problem

The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.

Suppose aba didn’t exist and it was just a pointer being swapped. Say we have two threads, A and B.

Thread A copies the current head into orig, enters the loop body to update next.node to orig.node->next, then gets preempted before the CAS. The scheduler pauses the thread.
Thread B comes along and performs a pop(), changing the value pointed to by head. At this point A’s CAS will fail, which is fine. It would reconstruct a new updated value and try again. While A is still asleep, B puts the popped node back on the free node stack.
Some time passes with A still paused. The freed node gets re-used and pushed back on top of the stack, which is likely given that nodes are allocated LIFO (the free stack hands out the most recently freed node first). Now head has its original value again, but the head->node->next pointer is pointing somewhere completely new! This is very bad because A’s CAS will now succeed despite next.node having the wrong value.
A wakes up and its CAS succeeds. At least one stack value has been lost and at least one node struct was leaked (it will be on neither stack, nor currently being held by a thread). This is the ABA problem.

The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. They’ll work with append-only data structures, where pointers are never recycled, but that’s it.
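The interleaving can be replayed deterministically on a single thread, which makes the failure easier to see. The sketch below uses a pointer-only stack with no aba counter and a slight variation of the scenario: "thread B" pops two nodes and pushes the first one back. The `node` type and `aba_demo` function are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Pointer-only push and pop: no counter guards the CAS. */
static void push(_Atomic(struct node *) *head, struct node *n)
{
    struct node *orig = atomic_load(head);
    do {
        n->next = orig;
    } while (!atomic_compare_exchange_weak(head, &orig, n));
}

static struct node *pop(_Atomic(struct node *) *head)
{
    struct node *orig = atomic_load(head);
    do {
        if (orig == NULL)
            return NULL;
    } while (!atomic_compare_exchange_weak(head, &orig, orig->next));
    return orig;
}

/* Replays the bad interleaving. Returns true if the stale CAS
 * succeeded and left the stack corrupted. */
static bool aba_demo(void)
{
    struct node n1 = {1, NULL}, n2 = {2, NULL}, n3 = {3, NULL};
    _Atomic(struct node *) head = NULL;
    push(&head, &n3);
    push(&head, &n2);
    push(&head, &n1);                        /* n1 -> n2 -> n3 */

    /* "Thread A": first half of pop(), then preempted. */
    struct node *orig = atomic_load(&head);  /* n1 */
    struct node *next = orig->next;          /* n2 */

    /* "Thread B": pops n1 and n2, then pushes n1 back. */
    pop(&head);
    pop(&head);
    push(&head, &n1);                        /* n1 -> n3 */

    /* "Thread A" resumes: head is n1 again, so the CAS succeeds. */
    bool swapped = atomic_compare_exchange_strong(&head, &orig, next);

    /* head is now n2, a node B already popped, and n1 is lost. */
    return swapped && atomic_load(&head) == &n2;
}
```

With a counter compared alongside the pointer, the final CAS would see a different value than the one thread A read and would fail as it should.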

The aba field solves the problem because it’s incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and they all have to match for it to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There’s a special double-width CAS instruction specifically for this purpose, cmpxchg16b. This is generally called DWCAS. It’s available on most x86-64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16.

If it’s not available at compile-time this program won’t link. The function that wraps cmpxchg16b won’t be there. You can tell GCC to assume it’s there with the -mcx16 flag. The same rule here applies to C++11’s new std::atomic.

There’s still a tiny, tiny possibility of the ABA problem cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There’s nothing we can do about this, but if you witness this in the wild you need to immediately stop what you’re doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.

Hazard Pointers and Garbage Collection

Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior — a segmentation fault, or worse.

There are three ways to deal with this.

Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won’t be a problem. However, if we’re interacting with a garbage collector we’re not really lock-free.
Hazard pointers. Each thread keeps track of what nodes it’s currently accessing and other threads aren’t allowed to free nodes on this list. This is messy and complicated.
Never free nodes. This implementation recycles nodes, but they’re never truly freed until lstack_free(). It’s always safe to dereference a node pointer because there’s always a node behind it. It may point to a node that’s on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.

Reference counting on the node won’t work here because we can’t get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a DWCAS, but we’re already using those bytes for aba.

Push

Push is a lot like pop.

static void
push(_Atomic struct lstack_head *head, struct lstack_node *node)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
}

It’s counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn’t take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.
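Here is a rough sketch of what such a backoff could look like, shown on a plain counter rather than the stack to keep it short. It assumes POSIX nanosleep(); the `backoff` and `increment_with_backoff` names and the 1 to 1024 microsecond range are arbitrary choices for illustration, and whether this helps in practice would need measuring.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Sleep 1us, 2us, 4us, ... doubling per attempt, capped at 1024us. */
static void backoff(unsigned attempt)
{
    unsigned shift = attempt < 10 ? attempt : 10;
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000L << shift };
    nanosleep(&ts, NULL);
}

/* CAS retry loop that sleeps a little longer after each failure.
 * Returns the value this call stored. */
static unsigned increment_with_backoff(_Atomic unsigned *counter)
{
    unsigned attempt = 0;
    unsigned orig = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &orig, orig + 1))
        backoff(attempt++);
    return orig + 1;
}
```

One caveat with this shape: a spurious failure of the weak CAS also triggers a sleep, so the strong version may be the better fit once retries become this expensive.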

API Push and Pop

The API push and pop functions are built on these internal atomic functions.

int
lstack_push(lstack_t *lstack, void *value)
{
    struct lstack_node *node = pop(&lstack->free);
    if (node == NULL)
        return ENOMEM;
    node->value = value;
    push(&lstack->head, node);
    atomic_fetch_add(&lstack->size, 1);
    return 0;
}

Push removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes it onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there’s an instant where the stack size is listed as one shorter than it actually is. However, since there’s no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.

Popping is the same thing in reverse.

void *
lstack_pop(lstack_t *lstack)
{
    struct lstack_node *node = pop(&lstack->head);
    if (node == NULL)
        return NULL;
    atomic_fetch_sub(&lstack->size, 1);
    void *value = node->value;
    push(&lstack->free, node);
    return value;
}

Remove the top node, subtract the size estimate atomically, put the node on the free list, and return the pointer. It’s really simple with the primitive push and pop.

SHA1 Demo

The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same lock-free stack. It’s not really exercising the library thoroughly because there are no contended pops, but I couldn’t think of a better example at the time.

The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more generally useful than a stack, particularly for common consumer-producer scenarios.

Digispark and Debian

2014-05-14T17:57:31Z

Following Brian’s lead, I recently picked up a couple of Digispark USB development boards. It’s a cheap, tiny, Arduino-like microcontroller. There are a couple of interesting project ideas that I have in mind for these. It’s been over 6 years since I last hacked on a microcontroller.

Unfortunately, support for the Digispark on Linux is spotty. Just as with any hardware project, the details are irreversibly messy. It can’t make use of the standard Arduino software for programming the board, so you have to download a customized toolchain. This download includes files that have the incorrect vendor ID, requiring a manual fix. Worse, the fix listed in their documentation is incomplete, at least for Debian and Debian-derived systems.

The main problem is that Linux will not automatically create a /dev/ttyACM0 device like it normally does for Arduino devices. Instead it gets a long, hidden, unpredictable device name. The fix is to ask udev to give it a predictable name by appending the following to the first line in the provided udev rules file (49-micronucleus.rules),

SYMLINK+="ttyACM%n"

The whole uncommented portion of the rules file should look like this:

49-micronucleus.rules (pastebin since it’s a long line)

The == is a conditional operator, indicating that the rule only applies when the condition is met. The := and += are assignment operators, evaluated when all of the conditions are met. The SYMLINK part tells udev to put a symlink to the device in /dev under a predictable name.

Publishing My Private Keys

2012-06-24T00:00:00Z

Update March 2017: I no longer use PGP. Also, there’s a bug in GnuPG that silently discards these security settings, and it’s unlikely to ever get fixed. You’ll need to find/build an old version of GnuPG if you want to properly protect your secret keys.

Update August 2019: I’ve got a PGP key again, but I’m using my own tool, passphrase2pgp, to manage it. This tool allows for a particular workflow that GnuPG has never and will never provide. It doesn’t rely on S2K as described below.

One of the items in my dotfiles repository is my PGP keys, both private and public. I believe this is a unique approach that hasn’t been done before — a public experiment. It may seem dangerous, but I’ve given it careful thought and I’m only using the tools already available from GnuPG. It ensures my keys are well backed-up (via the Torvalds method) and available wherever I should need them.

In your GnuPG directory there are two core files: secring.gpg and pubring.gpg. The first contains your secret keys and the second contains public keys. secring.gpg is not itself encrypted. You can (should) have different passphrases for each key, after all. These files (or any PGP file) can be inspected with --list-packets. Notice it won’t prompt for a passphrase in order to get this data,

$ gpg --list-packets ~/.gnupg/secring.gpg
:secret key packet:
    version 4, algo 1, created 1298734547, expires 0
    skey[0]: [2048 bits]
    skey[1]: [17 bits]
    iter+salt S2K, algo: 9, SHA1 protection, hash: 10, salt: ...
    protect count: 10485760 (212)
    protect IV:  a6 61 4a 95 44 1e 7e 90 88 c3 01 70 8d 56 2e 11
    encrypted stuff follows
:user ID packet: "Christopher Wellons <...>"
:signature packet: algo 1, keyid 613382C548B2B841
... and so on ...

Each key is encrypted individually within this file with a passphrase. If you try to use the key, GPG will attempt to decrypt it by asking for the passphrase. If someone were to somehow gain access to your secring.gpg, they’d still need to get your passphrase, so pick a strong one. The official documentation advises you to keep your secring.gpg well-guarded and only rely on the passphrase as a cautionary measure. I’m ignoring that part.

If you’re using GPG’s defaults, your secret key is encrypted with CAST5, a symmetric block cipher. The encryption key is your passphrase salted (mixed with a non-secret random number) and hashed with SHA-1 65,536 times. Using the hash function over and over is called key stretching. It greatly increases the amount of required work for a brute-force attack, making your passphrase more effective. All of these settings can be adjusted to better protect the secret key at the cost of less portability. Since I’ve chosen to publish my secring.gpg in my dotfiles repository I cranked up the settings as far as I can.

I changed the cipher to AES256, which is more modern, more trusted, and more widely used than CAST5. For the passphrase digest, I selected SHA-512. There are better passphrase digest algorithms out there but this is the longest, slowest one that GPG offers. The PGP spec supports between 1024 and 65,011,712 digest iterations, so I picked one of the largest. 65 million iterations takes my laptop over a second to process — absolutely brutal for someone attempting a brute-force attack. Here’s the command to change to this configuration on an existing key,

gpg --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --s2k-mode 3 \
    --s2k-count 65000000 --edit-key 

When the edit key prompt comes up, enter passwd to change your passphrase. You can enter the same passphrase again and it will re-use it with the new configuration.
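
The iteration count is stored in the key as a single coded byte, which is the number in parentheses in the “protect count: 10485760 (212)” line of the listing above. Here’s a small sketch of the decoding formula from RFC 4880 (section 3.7.1.3). The function names are my own, and the rounding-up behavior in s2k_encode_count is an assumption about how a request that isn’t exactly encodable, like 65000000, would be stored.

```c
#include <stdint.h>

/* Decode the one-byte iteration count used by OpenPGP's iterated and
 * salted S2K (RFC 4880, section 3.7.1.3). */
static uint32_t
s2k_decode_count(uint8_t c)
{
    return (UINT32_C(16) + (c & 15)) << ((c >> 4) + 6);
}

/* Find the smallest coded byte whose count is at least `want`.
 * Requests beyond the maximum get the largest code, 255. */
static uint8_t
s2k_encode_count(uint32_t want)
{
    for (int c = 0; c < 255; c++) {
        if (s2k_decode_count((uint8_t)c) >= want)
            return (uint8_t)c;
    }
    return 255;
}
```

Code 0 decodes to 1024 and code 255 to 65,011,712, which is where the limits of the range come from. Code 96 decodes to the default 65,536, and code 212 to the 10,485,760 shown in the listing.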

I’m feeling quite secure with my secret key, despite publishing my secring.gpg. Before now, I was much more at risk of losing it to disk failure than having it exposed. I challenge anyone who doubts my security to crack my secret key. I’d rather learn that I’m wrong sooner than later!

With this established in my dotfiles repository, I can more easily include private dotfiles. Rather than use a symmetric cipher with an individual passphrase on each file, I encrypt the private dotfiles to myself. All my private dotfiles are managed with one key: my PGP key. This also plays better with Emacs. While it supports transparent encryption, it doesn’t even attempt to manage your passphrase (with good reason). If the file is encrypted with a symmetric cipher, Emacs will prompt for a passphrase on each save. If I encrypt them with my public key, I only need the passphrase when I first open the file.

How it works right now is any dotfile that ends with .priv.pgp will be decrypted into place — not symlinked, unfortunately, since this is impossible. The install script has a -p switch to disable private dotfiles, such as when I’m using an untrusted computer. gpg-agent ensures that I only need to enter my passphrase once during the install process no matter how many private dotfiles there are.

Making Your Own GIF Image Macros

2012-04-10T00:00:00Z

This tutorial is very similar to my video editing tutorial. That’s because the process is the same up until the encoding stage, where I encode to GIF rather than WebM.

So you want to make your own animated GIFs from a video clip? Well, it’s a pretty easy process that can be done almost entirely from the command line. I’m going to show you how to turn the clip into a GIF and add an image macro overlay. Like this,

The key tool here is going to be Gifsicle, a very excellent command-line tool for creating and manipulating GIF images. So, the full list of tools is MPlayer, ImageMagick, GIMP, and Gifsicle.

Here’s the source video for the tutorial. It’s an awkward video my wife took of our confused cats, Calvin and Rocc.

My goal is to cut after Calvin looks at the camera, before he looks away. From roughly 3 seconds to 23 seconds. I’ll have mplayer give me the frames as JPEG images.

mplayer -vo jpeg -ss 3 -endpos 23 -benchmark calvin-dummy.webm

This tells mplayer to output JPEG frames between 3 and 23 seconds, doing it as fast as it can (-benchmark). This output almost 800 images. Next I look through the frames and delete the extra images at the beginning and end that I don’t want to keep. I’m also going to throw away the even numbered frames, since GIFs can’t have such a high framerate in practice.

rm *[02468].jpg

There’s also dead space around the cats in the image that I want to crop. Looking at one of the frames in GIMP, I’ve determined this is a 450 by 340 box, with the top-left corner at (136, 70). We’ll need this information for ImageMagick.

Gifsicle only knows how to work with GIFs, so we need to batch convert these frames with ImageMagick’s convert. This is where we need the crop dimensions from above, which are given in ImageMagick’s notation.

ls *.jpg | xargs -I{} -P4 \
    convert {} -crop 450x340+136+70 +repage -resize 300 {}.gif

This will do four images at a time in parallel. The +repage is necessary because ImageMagick keeps track of the original image “canvas”, and it will simply drop the section of the image we don’t want rather than completely crop it away. The repage forces it to resize the canvas as well. I’m also scaling it down slightly to save on the final file size.

We have our GIF frames, so we’re almost there! Next, we ask Gifsicle to compile an animated GIF.

gifsicle --loop --delay 5 --dither --colors 32 -O2 *.gif > ../out.gif

I’ve found that using 32 colors and dithering the image gives very nice results at a reasonable file size. Dithering adds noise to the image to remove the banding that occurs with small color palettes. I’ve also instructed it to optimize the GIF as fully as it can (-O2). If you’re just experimenting and want Gifsicle to go faster, turning off dithering goes a long way, followed by disabling optimization.

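The delay is given in hundredths of a second, so a delay of 5 is 50 ms per frame, or nominally 20 frames-per-second. That’s a little faster than the 15 frames-per-second left over after cutting half the frames from a 30 frames-per-second source video, but close enough. We also want to loop indefinitely.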

The result is this 6.7 MB GIF. A little large, but good enough. It’s basically what I was going for. Next we add some macro text.

In GIMP, make a new image with the same dimensions of the GIF frames, with a transparent background.

Add your macro text in white, in the Impact Condensed font.

Right click the text layer and select “Alpha to Selection,” then under Select, grow the selection by a few pixels — 3 in this case.

Select the background layer and fill the selection with black, giving a black border to the text.

Save this image as text.png, for our text overlay.

Time to go back and redo the frames, overlaying the text this time. This is called compositing and ImageMagick can do it without breaking a sweat. To composite two images is simple.

convert base.png top.png -composite out.png

List the image to go on top, then use the -composite flag, and it’s placed over top of the base image. In my case, I actually don’t want the text to appear until Calvin, the orange cat, faces the camera. This happens quite conveniently at just about frame 500, so I’m only going to redo those frames.

ls 000005*.jpg | xargs -I{} -P4 \
    convert {} -crop 450x340+136+70 +repage \
               -resize 300 text.png -composite {}.gif

Run Gifsicle again and this 6.2 MB image is the result. The text overlay compresses better, so it’s a tiny bit smaller.

Now it’s time to post it on reddit and reap that tasty, tasty karma. (Over 400,000 views!)

Poor Man's Video Editing

2011-11-28T00:00:00Z

I’ve done all my video editing in a very old-school, unix-style way. I actually have no experience with real video editing software, which may explain why I tolerate the manual process. Instead, I use several open source tools, none of which are designed specifically for video editing.

MPlayer
ImageMagick (or any batch image editing tool)
ppmtoy4m
The WebM encoder (or your preferred encoder)

The first three are usually available from your Linux distribution repositories, making them trivial to obtain. The last one is easy to obtain and compile.

~~If you’re using a modern browser, you should have noticed my portrait on the left-hand side changed recently~~ (update: it’s been removed). That’s an HTML5 WebM video — currently with Ogg Theora fallback due to a GitHub issue. To cut the video down to that portrait size, I used the above four tools on the original video.

WebM seems to be becoming the standard HTML5 video format. Google is pushing it and it’s supported by all the major browsers, except Safari. So, unless something big happens, I plan on going with WebM for web video in the future.

To begin, as I’ve done before, split the video into its individual frames,

mplayer -vo jpeg -ao dummy -benchmark video_file

The -benchmark option hints for mplayer to go as fast as possible, rather than normal playback speed.

Next look through the output frames and delete any unwanted frames, such as the first and last few seconds of video. With the desired frames remaining, use ImageMagick, or any batch image editing software, to crop out the relevant section of the images. This can be done in parallel with xargs’ -P option to take advantage of multiple cores, provided disk I/O isn’t the bottleneck.

ls *.jpg | xargs -I{} -P5 convert {} -crop 312x459+177+22 {}.ppm

That crops out a 312 by 459 section of the image, with the top-left corner at (177, 22). Any other convert filters can be stuck in there too. Notice the output format is the portable pixmap (ppm), which is significant because it won’t introduce any additional loss and, most importantly, it is required by the next tool.

If I’m happy with the result, I use ppmtoy4m to pipe the new frames to the encoder,

cat *.ppm | ppmtoy4m | vpxenc --best -o output.webm -

As the name implies, ppmtoy4m converts a series of portable pixmap files into a YUV4MPEG2 (y4m) video stream. YUV4MPEG2 is the bitmap of the video world: gigantic, lossless, uncompressed video. It’s exactly the kind of thing you want to hand to a video encoder. If you need to specify any video-specific parameters, ppmtoy4m is the tool that needs to know it. For example, to set the framerate to 10 FPS,

... | ppmtoy4m -F 10:1 | ...
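
To give a feel for what travels down that pipe, here’s a rough sketch of the y4m framing as I understand it: a single plain-text header line, then each frame as a “FRAME” line followed by raw planes. The function names are hypothetical, and the Ip, A1:1, and C420jpeg fields are illustrative choices on my part, not necessarily what ppmtoy4m emits by default.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the plain-text stream header for progressive 4:2:0 video.
 * Returns the number of characters written, as snprintf() does. */
static int
y4m_header(char *buf, size_t len, int width, int height,
           int fps_num, int fps_den)
{
    return snprintf(buf, len, "YUV4MPEG2 W%d H%d F%d:%d Ip A1:1 C420jpeg\n",
                    width, height, fps_num, fps_den);
}

/* Bytes of raw data following each "FRAME\n" marker in 4:2:0: one full
 * resolution luma plane plus two chroma planes at half resolution in
 * each dimension (rounded up). */
static size_t
y4m_frame_size(int width, int height)
{
    size_t luma   = (size_t)width * height;
    size_t chroma = (size_t)((width + 1) / 2) * ((height + 1) / 2);
    return luma + 2 * chroma;
}
```

For the 312 by 459 crop at 10 FPS, the frame rate appears in the header as F10:1. Each frame is 214,968 bytes under these assumptions, so a second of video is a little over 2 MB before the encoder gets to it.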

ppmtoy4m is a classically-trained unix tool: stdin to stdout. No need to dump that raw video to disk, just pipe it right into the WebM encoder. If you choose a different encoder, it might not support reading from stdin, especially if you do multiple passes. A possible workaround would be a named pipe,

mkfifo video.y4m
cat *.ppm | ppmtoy4m > video.y4m &
otherencoder video.y4m

For WebM encoding, I like to use the --best option, telling the encoder to take its time to do a good job. To do two passes and get even more quality per byte (--passes=2) a pipe cannot be used and you’ll need to write the entire raw video onto the disk. If you try to pipe it anyway, vpxenc will simply crash rather than give an error message (as of this writing). This had me confused for a while.

To produce Ogg Theora instead of WebM, ffmpeg2theora is a great tool. It’s well-behaved on the command line and can be dropped in place of vpxenc.

To do audio, encode your audio stream with your favorite audio encoder (Vorbis, Lame, etc.) then merge them together into your preferred container. For example, to add audio to a WebM video (i.e. Matroska), use mkvmerge from MKVToolNix,

mkvmerge --webm -o combined.webm video.webm audio.ogg

Extra notes update: There’s a bug in imlib2 where it can’t read PPM files that have no initial comment, so some tools, including GIMP and QIV, can’t read PPM files produced by ImageMagick. Fortunately ppmtoy4m is unaffected. However, there is a bug in ppmtoy4m where it can’t read PPM files with a depth other than 8 bits. Fix this by giving the option -depth 8 to ImageMagick’s convert.

Try Out My Java With Emacs Workflow Within Minutes

2011-11-19T00:00:00Z

Update January 2013: I’ve learned more about Java dependency management and no longer use my old .ant repository. As a result, I have deleted it, so ignore any references to it below. The only thing I keep in $HOME/.ant/lib these days is an up-to-date ivy.jar.

Last month I started managing my entire Emacs configuration in Git, which has already paid for itself by saving me time. I found out a few other people have been using it (including Brian), so I also wrote up a README file describing my specific changes.

With Emacs being a breeze to synchronize between my computers, I noticed a new bottleneck emerged: my .ant directory. Apache Ant puts everything in $ANT_HOME/lib and $HOME/.ant/lib into its classpath. So, for example, if you wanted to use JUnit with Ant, you’d toss junit.jar in either of those directories. $ANT_HOME tends to be a system directory, and I prefer to only modify system directories indirectly through apt, so I put everything in $HOME/.ant/lib. Unfortunately, that’s another directory to keep track of on my own. Fortunately, I already know how to deal with that. It’s now another Git repository,

https://github.com/skeeto/.ant (README)

With that in place, settling into a new computer for development is almost as simple as cloning those two repositories. Yesterday I took the step to eliminate the only significant step that remained: setting up java-docs. Before, you could only really take advantage of my Java extension if you had a Javadoc directory scanned by Emacs. The results of that scan not only provided an easy way to jump into documentation, but also provided the lists for class name completion. Now java-docs automatically loads up the core Java Javadoc, linking to the official website, if the user never sets it up.

So if you want to see exactly how my Emacs workflow with Java operates, it’s just a few small steps away. This should work for any operating system suitable for Java development.

Let’s start by getting Java set up. First, install a JDK and Apache Ant. This is trivial to do on Debian-based systems,

sudo apt-get install openjdk-6-jdk ant

On Windows, the JDK is easy, but Ant needs some help. You probably need to set ANT_HOME to point to the install location, and you definitely need to add it to your PATH.

Next install Git. This should be straightforward; just make sure it’s in your PATH (so Emacs can find it).

Clone my .ant repository in your home directory.

cd
git clone https://github.com/skeeto/.ant.git

Except for Emacs, that’s really all I need to develop with Java. This setup should allow you to compile and hack on just about any of my Java projects. To test it out, anywhere you like clone one of my projects, such as my example project.

git clone https://github.com/skeeto/sample-java-project.git

You should be able to build and run it now,

cd sample-java-project
ant run

If that works, you’re ready to set up Emacs. First, install Emacs. If you’re not familiar with Emacs, now would be the time to go through the tutorial to pick up the basics. Fire it up and type CTRL + h and then t (in Emacs’ terms: C-h t), or select the tutorial from the menu.

Move any existing configuration out of the way,

mv .emacs .old.emacs
mv .emacs.d .old.emacs.d

Clone my configuration,

git clone https://github.com/skeeto/.emacs.d.git

Then run Emacs. You should be greeted with a plain, gray window: the wombat theme. No menu bar, no toolbar, just a minibuffer, mode line, and wide open window. Anything else is a waste of screen real estate. This initial empty buffer has a great aesthetic, don’t you think?

Now to go for a test drive: open up that Java project you cloned, with M-x open-java-project. That will prompt you for the root directory of the project. The only thing this does is pre-opens all of the source files for you, exposing their contents to dabbrev-expand and makes jumping to other source files as easy as changing buffers — so it’s not strictly necessary.

Switch to a buffer with a source file, such as SampleJavaProject.java if you used my example project. Change whatever you like, such as the printed string. You can add import statements at any time with C-x I (note: capital I), where java-docs will present you with a huge list of classes from which to pick. The import will be added at the top of the buffer in the correct position in the import listing.

Without needing to save, hit C-x r to run the program from Emacs. A *compilation-1* buffer will pop up with all of the output from Ant and the program. If you just want to compile without running it, type C-x c instead. If there were any errors, Ant will report them in the compilation buffer. You can jump directly to these with C-x ` (that’s a backtick).

Now open a new source file in the same package (same directory) as the source file you just edited. Type cls and hit tab. The boilerplate, including package statement, will be filled out for you by YASnippet. There are a bunch of completion snippets available. Try jal for example, which completes with information from java-docs.

When I’m developing a library, I don’t have a main function, so there’s nothing to “run”. Instead, I drive things from unit tests, which can be run with C-x t, which runs the “test” target if there is one.

To see your changes, type C-x g to bring up Magit and type M-s in the Magit buffer (to show a full diff). From here you can make commits, push, pull, merge, switch branches, reset, and so on. To learn how to do all this, see the Magit manual. You can type q to exit the Magit window, or use S- to move to an adjacent buffer in any direction.

And that’s basically my workflow. Developing in C is a very similar process, but without the java-docs part.

Sample Java Project

2010-10-04T00:00:00Z

Here's a little on-going project I put together recently. It's mostly for my own future reference, but perhaps someone else may find it useful.

git clone git://github.com/skeeto/sample-java-project.git

If you couldn't guess already, I'm strongly against tying a project's development to a particular IDE. It happens too much: someone starts the project by firing up their favorite IDE, clicking "Create new project", and checks in whatever it spits out. It usually creates a build system integrated tightly into that particular IDE. At work I've seen it happen on two different large Java projects. There are some ways around it, like maintaining two build systems side-by-side, but it's not very pretty. Sometimes the Java IDE can spit out some Ant build files for the sake of continuous integration, but it remains a second-class citizen for development.

I prefer the other direction: start with a standalone build system, then stick your own development environment on top of that. Each developer picks and is responsible for whatever IDE or editor they want, with the standalone build system providing the canonical build (and, in my experience, if you must use an IDE, NetBeans has the smoothest integration with Ant). So in the case of Java, this means setting up an Ant-based build.

I've said before that I like the Java platform, I just find the primary language disappointing. Similarly, I like Ant, I just find the build script language disappointing (XML). It seems other people like it too, at least for Java development, because I haven't been able to find any serious criticisms of it outside of hating the XML (notice the first result in that search is written by someone who is Doing It All Wrong). I love that it works on filesets and not files. It's like getting atomic commits for my build system. If I add a new source file to my project I don't need to adjust the Ant build script in any way.

One downside of Ant is that, while it's commonly used in a very standard way, it doesn't guide you in that direction or provide special shortcuts to make the common cases easier. It's typical to have a src/ directory containing all your source and a build/ directory, created by Ant, that contains all the built and generated files. With Ant you basically say, "Compile these sources to here, then jar that directory up." Ant alone doesn't make this very obvious. Give it to someone stranded on a desert island and I bet they won't derive the same best practice as the rest of the world.

Take make, for example. Because building object files from source is so common, (depending on the implementation) it has built-in rules for it. This is all you need to say, and make knows how to do the rest.

file.o : file.c

Same for linking, it's so common you don't have to type anything more than necessary.

program : main.o common.o file.o

It guides you in creating good Makefiles. If you want to learn the best practice for Ant, you have to either buy a book on Ant or look at what lots of other people are doing. And so I provide my sample-java-project for this exact purpose.

You can use that as a skeleton when creating your own project, and you'll barely have to customize the build file. It's a big mass of boilerplate, the kind of stuff that Ant should have built-in by default. I'll be expanding it over time as I learn more about how to effectively use Ant.

So far, I included two things that you normally won't see: a target to run a Java indenter (AStyle) on your code, and a target to run the bureaucratic Checkstyle on your code.

Identifying Files

2010-05-20T00:00:00Z

At work I currently spend about a third of my time doing data reduction, and it's become one of my favorite tasks. (I've done it on my own too). Data come in from various organizations and sponsors in all sorts of strange formats. We have a bunch of fancy analysis tools to work on the data, but they aren't any good if they can't read the format. So I'm tasked with writing tools to convert incoming data into a more useful format.

If the source file is a text-based file it's usually just a matter of writing a parser — possibly including a grammar — after carefully studying the textual structure. Binary files are trickier. Fortunately, there are a few tools that come in handy for identifying the format of a strange binary file.

The first is the standard utility found on any unix-like system: file. I have no idea if it has an official website because it's a term that's impossible to search for. It tries to identify a file based on the magic numbers and other tests, none based on the actual file name. I've never been so lucky as to have file recognize a strange format at work. But silence speaks volumes: it means the data are not packed into something common, like a simple zip archive.

Next, I take a look at the file with ent, a pseudo-random number sequence test program. This will reveal how compressed (or even encrypted) data are. If ent says the data are very dense, say 7 bits per byte or more, the format is employing a good compression algorithm. The next step would be tackling that so I can start over on the uncompressed contents. If it's something like 4 bits per byte there's no compression. If it's in between then it might be employing a weak, custom compression algorithm. I've always seen the latter two.
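
For the curious, the bits-per-byte figure is a Shannon entropy estimate over the byte histogram. Here's a minimal sketch of that calculation. It's not ent's code, the function names are my own, and I've hand-rolled the logarithm only so the example doesn't depend on a math library; log2() from math.h would do the same job.

```c
#include <stddef.h>

/* Base-2 logarithm for x > 0 using only arithmetic, so that the sketch
 * needs no math library. */
static double
log2_arith(double x)
{
    double result = 0.0;
    while (x < 1.0)  { x *= 2.0; result -= 1.0; }
    while (x >= 2.0) { x /= 2.0; result += 1.0; }
    /* x is now in [1, 2): extract fractional bits by repeated squaring */
    double bit = 0.5;
    for (int i = 0; i < 52; i++) {
        x *= x;
        if (x >= 2.0) {
            x /= 2.0;
            result += bit;
        }
        bit /= 2.0;
    }
    return result;
}

/* Shannon entropy of a buffer in bits per byte, from 0.0 to 8.0. */
static double
entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t counts[256] = {0};
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;

    double entropy = 0.0;
    for (int i = 0; i < 256; i++) {
        if (counts[i]) {
            double p = (double)counts[i] / (double)len;
            entropy -= p * log2_arith(p);
        }
    }
    return entropy;
}
```

A buffer using only 16 distinct byte values equally often comes out at 4 bits per byte, and one where all 256 values are equally likely comes out at 8.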

Next I dive in with a hex editor. I use a combination of Emacs' hexl-mode and the standard BSD tool hexdump (for something more static). One of the first things I like to identify is byte order, and in a hex dump it's often obvious.

In general, better designed formats use big endian, also known as network order. That's the standard ordering used in communication, regardless of the native byte ordering of the network clients. The amateur, home-brew formats are generally less thoughtful and dump out whatever the native format is, usually little endian because that's what x86 is. Worse, they'll also generate data on architectures that are big endian, so you can get it both ways without any warning. In that case your conversion tool has to be sensitive to byte order and find some way to identify which ordering a file is using. A time-stamp field is very useful here, because a 64-bit time-stamp read with the wrong byte order will give a very unreasonable date.
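
Here's a rough sketch of the time-stamp trick, assuming the field holds a 64-bit count of seconds since the unix epoch. The function names and the plausibility window (1980 through 2100) are arbitrary choices of mine and would be tuned to the data at hand.

```c
#include <stdint.h>

static uint64_t
load_u64le(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = v << 8 | p[i];
    return v;
}

static uint64_t
load_u64be(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = v << 8 | p[i];
    return v;
}

/* Guess byte order from a 64-bit unix timestamp field. Returns 'L' or
 * 'B' when exactly one interpretation falls inside the plausible
 * window, otherwise '?'. */
static char
guess_byte_order(const unsigned char *field)
{
    const uint64_t lo = UINT64_C(315532800);    /* 1980-01-01 */
    const uint64_t hi = UINT64_C(4102444800);   /* 2100-01-01 */
    uint64_t le = load_u64le(field);
    uint64_t be = load_u64be(field);
    int le_ok = le >= lo && le <= hi;
    int be_ok = be >= lo && be <= hi;
    if (le_ok && !be_ok) return 'L';
    if (be_ok && !le_ok) return 'B';
    return '?';
}
```

The bytes 80 7b f4 4b 00 00 00 00 read as little endian give 1274313600, which is 2010-05-20. Read as big endian they give a number far outside the window, so the guess is 'L'.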

For example, here's something I see often.

eb 03 00 00 35 00 00 00 66 1e 00 00

That's most likely 3 4-byte values, in little endian byte order. The zeros make the integers stand out.

eb 03 00 00 35 00 00 00 66 1e 00 00

We can tell it's little endian because the non-zero digits are on the left. This information will be useful in identifying more bytes in the file.
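
As a quick check on that reading, here's a small sketch that decodes those 12 bytes both ways. The loader names are my own.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes, least significant first. */
static uint32_t
load_u32le(const unsigned char *p)
{
    return (uint32_t)p[0]       | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* The same four bytes read most significant first, for comparison. */
static uint32_t
load_u32be(const unsigned char *p)
{
    return (uint32_t)p[3]       | (uint32_t)p[2] <<  8 |
           (uint32_t)p[1] << 16 | (uint32_t)p[0] << 24;
}
```

Read as little endian, the three values are 1003, 53, and 7782: small, believable integers. Read as big endian, the first one alone is 0xeb030000, nearly four billion, which is much less likely to be what the format's author meant.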

Next I'd look for headers, common strings of bytes, so that I can identify larger structures in the data. I've never had to reverse engineer a format ... yet. I'm not sure if I could. Once I got this far I've always been able to research the format further and find either source code or documentation, revealing everything to me.

If the file contains strings I'll dump them out with strings. I haven't found this too useful at work, but it's been useful at home.

And there's something still useful beyond these. Something I made myself at home for a completely different purpose, but I've exploited its side effects: my PNG Archiver. The original purpose of the tool is to store a file in an image, as images are easier to share with others. The side effect is that by viewing the image I get to see the structure of the file. For example, here's my laptop's /bin/ls, very roughly labeled.

It's easy to spot the different segments of the ELF format. Higher entropy sections are more brightly colored. Strings, being composed of ASCII-like text, have their MSB's unset, which is why they're darker. Any non-compressed format will have an interesting profile like this. Here's a Word doc, an infamously horrible format,

And here's some Emacs bytecode. You can tell the code vectors apart from the constants section below it.

If you find yourself having to inspect strange files, keep these tools around to make the job easier.

The Emacs Calculator

2009-06-23T00:00:00Z

Did you know that Emacs comes with a calculator? Woop-dee-doo! Call the presses! Wow, a whole calculator! Sounds a bit lame, right?

Actually, it's much more than just a simple calculator. It's a computer algebra system! It is officially called a calculator, which isn't fair. It's an understatement, and I am sure has caused many people to overlook it. I finally ran into it during a thorough (re)reading of the Emacs manuals and almost skipped over it myself.

Ever see that demonstration by Will Wright for the game Spore several years ago? The player starts as a single-cell organism and evolves into a civilization with interstellar presence. When he started the demo he showed a cell through what looked like a microscope. No one had any idea yet what the game was about, so every time he increased the scope, from bacteria to animal, animal to civilization, civilization to space travel, interplanetary travel to interstellar travel, there was a huge reaction from the audience. It was like those infomercials: "But that's not all!!!"

As I made my way through the Emacs calc manual I was continually amazed by its power, with a similar constant increase in scope. Each new page was almost saying, "But that's not all!!!"

Like an infomercial I'm going to run through some of its features. See the calc manual for a real thorough introduction. It has practice exercises that show some gotchas and interesting feature interactions.

Fire it up with C-x * c or M-x calc. There will be two new windows (Emacs windows, that is), one with the calculator and the other with usage history (the "trail").

First of all, the calculator operates on a stack and so its basic use is done with RPN. The stack builds vertically, downwards. Type in numbers and hit enter to push them onto the stack. Operators can be typed right after the number, so no need to hit enter all the time. Because negative (-) is reserved for subtraction an underscore _ is used to type a negative number. An example stack with 3, 4, and 10,

3:  3
2:  4
1:  10
    .

10 is at the "top" of the stack (indicated by the "1:"), so if we type a * the top two elements are multiplied. Like so,

2:  3
1:  40
    .

The calculator has no limitations on the size of integers, so you can work with large numbers without losing precision. For example, we'll take 2^200.

2:  2
1:  200
    .

Apply the ^ operator,

1:  1606938044258990275541962092341162602522202993782792835301376
    .

But that's not all!!! It has a complex number type, which is entered in pairs (real, imaginary) with parentheses. They can be operated on like any other number. Take -1 + 2i minus 4 + 2i,

2:  (-1, 2)
1:  (4, 2)
    .

Subtract with -,

1:  -5
    .

Then take the square root of that using Q, the square root function.

1:  (0., 2.2360679775)
    .

We can set the calculator's precision with p. The default is 12 places, showing here 1 / 7.

1:  0.142857142857
    .

If we adjust the precision to 50 and do it again,

2:  0.142857142857
1:  0.14285714285714285714285714285714285714285714285714
    .
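
For comparison, here's a minimal sketch of the same precision experiment outside of Emacs, using Python's standard decimal module. The helper name one_seventh is made up for this illustration, and the context precision only plays a role similar to calc's p setting; it is not the same mechanism.

```python
from decimal import Decimal, localcontext


def one_seventh(digits):
    """Compute 1/7 to the given number of significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits  # significant digits, roughly like calc's precision
        return Decimal(1) / Decimal(7)


print(one_seventh(12))
print(one_seventh(50))
```

The two printed values line up with the two stack entries above.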

Numbers can be displayed in various notations, too, like fixed-point, scientific notation, and engineering notation. It will switch between these without losing any information (the stored form is separate from the displayed form).

But that's not all!!! We can represent rational numbers precisely with ratios. These are entered with a :. Push on 1/7, 3/14, and 17/29,

3:  1:7
2:  3:14
1:  17:29
    .

And multiply them all together, which displays in the lowest form,

1:  51:2842
    .

There is a mode for working in these automatically.
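
The product is easy to double check with any exact rational type. Here's a small sketch using Python's fractions module (nothing to do with how calc stores ratios internally):

```python
from fractions import Fraction

# 1/7 * 3/14 * 17/29, kept exact and reduced to lowest terms automatically
product = Fraction(1, 7) * Fraction(3, 14) * Fraction(17, 29)
print(product)
```

The result is already in lowest terms because 51 = 3 * 17 shares no factor with 2842 = 2 * 7 * 7 * 29.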

But that's not all!!! We can change the radix. To enter a number with a different radix, we prefix it with the radix and a #. Here is how we enter 29 in base-2,

2#11101

We can change the display radix with d r. With 29 on the stack, here's base-4,

1:  4#131
    .

Base-16,

1:  16#1D
    .

Base-36,

1:  36#T
    .
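
The conversion itself is just repeated division by the radix. Here's a rough sketch that produces the same radix#digits notation for non-negative integers. The function name to_radix is hypothetical, and this is only an illustration, not calc's implementation.

```python
import string

# 0-9 followed by A-Z gives enough digit symbols for radix 2 through 36
DIGITS = string.digits + string.ascii_uppercase


def to_radix(n, base):
    """Format a non-negative integer in radix#digits notation."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if n == 0:
        return f"{base}#0"
    digits = []
    while n:
        n, remainder = divmod(n, base)  # peel off the least significant digit
        digits.append(DIGITS[remainder])
    return f"{base}#{''.join(reversed(digits))}"


for base in (2, 4, 16, 36):
    print(to_radix(29, base))
```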

But that's not all!!! We can enter algebraic expressions onto the stack with apostrophe, '. Symbols can be entered as part of the expression. Note: these expressions are not entered in RPN.

1:  a^3 + a^2 b / c d - a / b
    .

There is a "big" mode (d B) for easier reading,

          2
     3   a  b   a
1:  a  + ---- - -
         c d    b

    .

We can assign values to variables to have the expression evaluated. If we assign a to 10 and use the "evaluates-to" operator,

          2
     3   a  b   a             100 b   10
1:  a  + ---- - -  =>  1000 + ----- - --
         c d    b              c d    b

    .

But that's not all!!! There is a vector type for working with vectors and matrices and doing linear algebra. They are entered with brackets, [].

2:  [4, 1, 5]
1:  [ [ 1, 2, 3 ]
      [ 4, 5, 6 ]
      [ 6, 7, 8 ] ]
    .

And take the dot product, then take the cross product of this vector and matrix,

2:  [38, 48, 58]
1:  [ [ -14, -18, -22 ]
      [ -19, -18, -17 ]
      [ 15,  18,  21  ] ]
    .

Any matrix and vector operator you could probably think of is available, including map and reduce (and you can define your own expression to apply).

We can use this to solve a linear system. Find x and y in terms of a and b,

x + a y = 6
x + b y = 10

Enter it (note we are using symbols),

2:  [6, 10]
1:  [ [ 1, a ]
      [ 1, b ] ]
    .

And divide,

          4 a     4
1:  [6 + -----, -----]
         a - b  b - a

    .

But that's not all!!! We can create graphs if gnuplot is installed. We can give it two vectors, or an algebraic expression. This plot of sin(x) and x cos(x) was made with just a few keystrokes,

But that's not all!!! There is an HMS type for handling times and angles. For 2 hours, 30 minutes, and 4 seconds, and some others,

3:  2@ 30' 4"
2:  4@ 22' 13"
1:  1@ 2' 56"
    .

Of course, the normal operators work as expected. We can add them all up,

1:  7@ 55' 13"
    .

We can convert between this and radians, and degrees, and so on.
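
Under the hood an HMS sum is just base-60 carrying. Here's a minimal sketch that adds the three values above by going through total seconds. The helper names are made up for this example, and it only handles whole seconds.

```python
def hms_to_seconds(hours, minutes, seconds):
    """Collapse an hours/minutes/seconds triple into a count of seconds."""
    return (hours * 60 + minutes) * 60 + seconds


def seconds_to_hms(total):
    """Split a count of seconds back into (hours, minutes, seconds)."""
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds


values = [(2, 30, 4), (4, 22, 13), (1, 2, 56)]
total = seconds_to_hms(sum(hms_to_seconds(*v) for v in values))
print("%d@ %d' %d\"" % total)
```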

But that's not all!!! The calculator also has a date type, entered inside angled brackets, <> (in algebra entry mode). It is really flexible on input dates. We can insert the current date with t N.

1:  <6:59:34pm Tue Jun 23, 2009>
    .

If we add numbers they are treated as days. Add 4,

1:  <6:59:34pm Sat Jun 27, 2009>
    .

It works with the HMS format from before too. Subtract 2@ 3' 15".

1:  <4:56:19pm Sat Jun 27, 2009>
    .

But that's not all!!! There is a modulo form for performing modular arithmetic. For example, 17 mod 24,

1:  17 mod 24
    .

Add 10,

1:  3 mod 24
    .

This is most useful for forms such as n^p mod M, which this will handle efficiently. For example, 3^100000 mod 24. The naive way would be to find 3^100000 first, then take the modulus. This involves a computationally expensive middle step of calculating 3^100000, a huge number. The modulo form does it smarter.
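
I haven't checked how calc does this internally, but the standard trick is square-and-multiply: reduce modulo M after every multiplication so the intermediate values never grow beyond M squared. Here's a sketch of that idea (the function name mod_pow is made up; Python's built-in three-argument pow does the same job):

```python
def mod_pow(base, exponent, modulus):
    """Compute base**exponent % modulus without building the huge power."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:                   # this bit of the exponent is set
            result = (result * base) % modulus
        base = (base * base) % modulus     # square for the next bit
        exponent >>= 1
    return result


print(mod_pow(3, 100000, 24))
```

The loop runs once per bit of the exponent, so 3^100000 takes 17 iterations rather than 100000 multiplications.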

But that's not all!!! The calculator can do unit conversions. The version of Emacs (22.3.1) I am typing in right now knows about 159 different units. For example, I push 65 mph onto the stack,

1:  65 mph
    .

Convert to meters per second with u c,

1:  29.0576 m / s
    .

It is flexible about mixing types of units. For example, I enter 3 cubic meters,

1:  3 m^3
    .

I can convert to gallons,

1:  792.516157074 gal
    .

I work in a lab without Internet access during the day, so when I need to do various conversions Emacs is indispensable.

The speed of light is also a unit. I can enter 1 c and convert to meters per second,

1:  299792458 m / s
    .

But that's not all!!! As I said, it's a computer algebra system so it understands symbolic math. Remember those algebraic expressions from before? I can operate on those. Let's push some expressions onto the stack,

3:  ln(x)

       2   a x
2:  a x  + --- + c
            b

1:  y + c

    .

Multiply the top two, then add the third,

                2   a x
1:  ln(x) + (a x  + --- + c) (y + c)
                     b

    .

Expand with a x, then simplify with a s,

                 2   a x y              2   a c x    2
1:  ln(x) + a y x  + ----- + c y + a c x  + ----- + c
                       b                      b

    .

Now, one of the coolest features: calculus. Differentiate with respect to x, with a d,

    1             a y             a c
1:  - + 2 a y x + --- + 2 a c x + ---
    x              b               b

    .

Or undo that and integrate it,

                       3      2                  3        2
                  a y x    a x  y           a c x    a c x       2
1:  x ln(x) - x + ------ + ------ + c x y + ------ + ------ + x c
                    3       2 b               3       2 b

    .

That's just awesome! That's a text editor ... doing calculus!

So, that was most of the main features. It was kind of exhausting going through all of that, and I am only scratching the surface of what the calculator can do.

Naturally, it can be extended with some elisp. It provides a defmath macro specifically for this.

I bet (hope?) someday it will have functions for doing Laplace and Fourier transforms.

Linear Spatial Filters with GNU Octave

2008-02-22T00:00:00Z

I have gotten several e-mails lately about using GNU Octave. One specifically was about blurring images in Octave. In response, I am writing this in-depth post to cover spatial filters, and how to use them in GNU Octave (a free implementation of the Matlab programming language). This should be the sort of information you would find near the beginning of an introductory digital image processing textbook, but written out more simply. In the future, I will probably be writing a post covering non-linear spatial and/or frequency domain filters in Octave.

If you want to follow along in Octave, I strongly recommend that you upgrade to the new Octave 3.0. It is considered stable, but differs significantly from Octave 2.1, which many people may be used to. You will also need to install the image processing package from Octave-Forge. To get help with any Octave function, just type help followed by the function's name.

The most common linear spatial image filtering involves convolving a filter mask, sometimes called a convolution kernel, over an image, which is a two-dimensional matrix. In the case of an RGB color image, the image is actually composed of three two-dimensional grayscale images, each representing a single color, where each is convolved with the filter mask separately.

Convolution is sliding a mask over an image. The new value at the mask's position is the sum of the value of each element of the mask multiplied by the value of the image at that position. For an example, let's start with 1-dimensional convolution. Define a mask,

5 3 2 4 8

The 2 is the anchor for the mask. Define an image,

0 0 1 2 1 0 0

As we convolve, the mask will extend beyond the image at the edges. One way to handle this is to pad the image with 0's. We start by placing the mask at the left edge. (The two leading 0's in the image below are the zero-padding.)

Mask:   5 3 2 4 8
Image:  0 0 0 0 1 2 1 0 0

The first output value is 8, as every other element of the mask is multiplied by zero.

Output: 8 x x x x x x

Now, slide the mask over by one position,

Mask:   5 3 2 4 8
Image:  0 0 0 1 2 1 0 0

The output here is 20, because 8*2 + 4*1 = 20,

Output: 8 20 x x x x x

If we continue sliding the mask along, the output becomes,

Output: 8 20 18 11 13 13 5

Here is the correlation done in Octave interactively, (filter2() is the correlation function).

octave> filter2([5 3 2 4 8], [0 0 1 2 1 0 0])
ans =

    8   20   18   11   13   13    5

The same thing happens in two-dimensional convolution, with the mask moving in the vertical direction as well, so that each element in the image is covered.

Sometimes you will hear this described as correlation (Octave's filter2) or convolution (Octave's conv2). The only difference between these operations is that in convolution the filter mask is rotated 180 degrees. Whoop-dee-doo. Most of the time your filter is probably symmetrical anyway. So, don't worry much about the difference between these two. Especially in Octave, where rotating a matrix is easy (see rot90()).
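
In one dimension, rotating the mask 180 degrees is just reversing it. Here's a small self-contained sketch (function names are hypothetical) showing convolution as correlation with a reversed mask, using the same mask and image as the worked example:

```python
def correlate_1d(mask, image):
    """Zero-padded correlation with the anchor at the middle of the mask."""
    anchor = len(mask) // 2
    return [
        sum(
            weight * image[i + k - anchor]
            for k, weight in enumerate(mask)
            if 0 <= i + k - anchor < len(image)
        )
        for i in range(len(image))
    ]


def convolve_1d(mask, image):
    """Convolution is correlation with the mask flipped end for end."""
    return correlate_1d(mask[::-1], image)


mask = [5, 3, 2, 4, 8]
image = [0, 0, 1, 2, 1, 0, 0]
print(correlate_1d(mask, image))
print(convolve_1d(mask, image))
```

Because this particular image is symmetric, the convolution output is the correlation output reversed. With a symmetric mask the two would be identical.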

Now that we know convolution, let's introduce the sample image we will be using. I carefully put this together in Inkscape, which should give us a nice scalable test image. When converting to a raster format, there is a bit of unwanted anti-aliasing going on (couldn't find a way to turn that off), but it is minimal.

Save that image (the PNG file, not the linked SVG file) where you can get to it in Octave. Now, let's load the image into Octave using imread().

m = imread("image-test.png");

The image is a grayscale image, so it has only one layer. The size of m should be 300x300. You can check this like so (note the lack of semicolon so we can see the output),

size(m)

You can view the image stored in m with imshow. It doesn't care about the image dimensions or size, so until you resize the plot window, it will probably be stretched.

imshow(m);

Now, let's make an extremely simple 5x5 filter mask.

f = ones(5) * 1/25

Octave will show us what this matrix looks like.

f =

   0.040000   0.040000   0.040000   0.040000   0.040000
   0.040000   0.040000   0.040000   0.040000   0.040000
   0.040000   0.040000   0.040000   0.040000   0.040000
   0.040000   0.040000   0.040000   0.040000   0.040000
   0.040000   0.040000   0.040000   0.040000   0.040000

This filter mask is called an averaging filter. Each output pixel is simply the average of the 5x5 neighborhood of pixels around it (think about how this works out in the convolution). The effect will be to blur the image. It is important to note here that the sum of the elements is 1 (or 100% if you are thinking of averages). You can check it like so,

sum(f(:))

Now, to convolve the image with the filter mask using filter2().

ave_m = filter2(f, m);

You can view the filtered image again with imshow() except that we need to first convert the image matrix to a matrix of 8-bit unsigned integers. It is kind of annoying that we need this, but this is the way it is as of this writing.

ave_m = uint8(ave_m);
imshow(ave_m);

Or, we can save this image to a file using imwrite(). Just like with imshow(), you will first need to convert the image to uint8.

imwrite("averaged.png", ave_m);

There are a few things to notice about this image. First there is a black border around the outside of the filtered image. This is due to the zero-padding (black border) done by filter2(). The pixels along the border of the image had 0's averaged into them. Second, some parts of the blurred image are "noisy". Here are some selected parts at 4x zoom.

Notice how the circle, and the "a" seem a little bit boxy? This is due to the shape of our filter. Also notice that the blurring isn't as smooth as it could be. This is because the filter itself isn't very smooth. We'll fix both these problems with a new filter later.
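
The dark border is easy to reproduce on a tiny image. Here's a sketch in plain Python that averages a uniformly bright 4x4 image with a 3x3 box mask (smaller than the 5x5 above, just to keep the output readable) and zero-padding. The function name is made up, and exact fractions are used so the numbers come out clean.

```python
from fractions import Fraction


def correlate_2d_zero_padded(mask, image):
    """Correlate a 2D mask over a 2D image; pixels outside the image are 0."""
    rows, cols = len(image), len(image[0])
    anchor_r, anchor_c = len(mask) // 2, len(mask[0]) // 2
    output = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for i, mask_row in enumerate(mask):
                for j, weight in enumerate(mask_row):
                    rr, cc = r + i - anchor_r, c + j - anchor_c
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += weight * image[rr][cc]
            output[r][c] = total
    return output


box = [[Fraction(1, 9)] * 3 for _ in range(3)]   # 3x3 averaging mask
bright = [[90] * 4 for _ in range(4)]            # uniform 4x4 image
blurred = correlate_2d_zero_padded(box, bright)
for row in blurred:
    print([int(value) for value in row])
```

Interior pixels keep their value of 90, edge pixels drop to 60, and corners drop to 40, because part of each border average is made of padded zeros.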

First, here is how we can fix the border problem: we pad the image with itself. Octave provides us three easy ways to do this. The first is replicate padding: the padding outside the image is the same as the nearest border pixel in the image. Circular padding: the padding comes from the opposite side of the image, as if it was wrapped. This would be a good choice for a periodic image. Last, and probably the most useful is symmetric: the padding is a mirror reflection of the image itself.

To apply symmetric padding, we use the padarray() function. We only want to pad the image by the amount that the mask will "hang off". Let's pad the original image for a 9x9 filter, which will hang off by 4 pixels each way,

mpad = padarray(m, [4 4], "symmetric");
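
To see what the three padding modes actually produce, here's a 1-dimensional sketch in plain Python. The function pad_1d is hypothetical and only meant to mimic the behavior described above on a single row, not padarray() itself.

```python
def pad_1d(values, amount, mode):
    """Pad a list on both sides using replicate, circular, or symmetric mode."""
    n = len(values)

    def source_index(i):
        if mode == "replicate":
            return min(max(i, 0), n - 1)     # clamp to the nearest border
        if mode == "circular":
            return i % n                     # wrap around to the other side
        if mode == "symmetric":
            i %= 2 * n                       # mirror, repeating the border
            return i if i < n else 2 * n - 1 - i
        raise ValueError("unknown padding mode: " + mode)

    return [values[source_index(i)] for i in range(-amount, n + amount)]


row = [1, 2, 3, 4]
for mode in ("replicate", "circular", "symmetric"):
    print(mode, pad_1d(row, 2, mode))
```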

Next, we will replace the averaging filter with a 2D Gaussian distribution. The Gaussian, or normal, distribution has many wonderful and useful properties (as a statistics professor I had once said, anyone who considers themselves to be educated should know about the normal distribution). One property that makes it useful is that if we integrate the Gaussian distribution from minus infinity to infinity, the result is 1. The easiest way to get the curve without having to type in the equation is using fspecial(): a special function for creating image filters.

f_gauss = fspecial("gaussian", 9, 2);

This creates a 9x9 Gaussian filter with a standard deviation of 2 (the last argument is the spread of the curve, which is its standard deviation rather than its variance). The standard deviation controls the effective size of the filter. Increasing the size of the filter from 9 to 99 will have very little impact on the final result. It just needs to be large enough to cover the curve. Three standard deviations on each side of the center cover over 99% of the curve, so for a standard deviation of 2 a 13x13 filter (always make your filters odd in size) captures essentially all of it, and the 9x9 filter used here already covers most of it. A larger filter means a longer convolution time. Here is what the 9x9 filter looks like,

And to filter with the Gaussian,

gauss_m = filter2(f_gauss, mpad, "valid");
gauss_m = uint8(gauss_m);

Notice the extra argument "valid"? Since we padded the image before filtering, we don't want this padding to be part of the image result. filter2() normally returns an image of the same size as the input image, but we only want the part that didn't undergo (additional) zero-padding. The result is now the same size as the original image, but without the messy border,

Also, compare the result to the average filter above. See how much smoother this image is? If you are interested in blurring an image, you will generally want to go with a Gaussian filter like this.
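
For the curious, here's a sketch of how such a mask can be built by hand: sample the 2D Gaussian on a grid centered on the anchor, then divide by the total so the weights sum to 1. This is plain Python with a made-up function name, and it assumes the spread parameter is a standard deviation. I'm not claiming it matches fspecial() digit for digit.

```python
import math


def gaussian_kernel(size, sigma):
    """Build a size x size Gaussian mask, normalized so its weights sum to 1."""
    half = size // 2
    weights = [
        [
            math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


kernel = gaussian_kernel(9, 2)
print(sum(sum(row) for row in kernel))   # very close to 1
print(kernel[4][4], kernel[0][0])        # center weight vs. corner weight
```

The center weight is the largest and the weights fall off smoothly toward the corners, which is why the result looks smoother than the flat averaging mask.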

Now I will let you in on a little shortcut. In Matlab, there is a function called imfilter which does the padding and filtering in one step. As of this writing, the Octave-Forge image package doesn't officially include this function, but it is there in the source repository now, meaning that it will probably appear in the next version of that package. I actually wrote my own before I found this one. You can grab the official one here: imfilter.m

With this new function, we can filter with the Gaussian and save like this. Notice the flipping of the first two arguments from filter2, as well as the lack of converting to uint8.

gauss_m = imfilter(m, f_gauss, "symmetric");
imwrite("gauss.png", gauss_m);

imfilter() will also handle the 3-layer color images seamlessly. Without it, you would need to run filter2() on each layer separately.

So that is just about all there is. fspecial() has many more filters available including motion blur, unsharp, and edge detection. For example, the Sobel edge detector,

octave:25> fspecial("sobel")
ans =

   1   2   1
   0   0   0
  -1  -2  -1

It is good at detecting edges in one direction. We can rotate this each way to detect edges all over the image.

mf = uint8(zeros(size(m)));
for i = 0:3
  mf += imfilter(m, rot90(fspecial("sobel"), i));
end
imshow(mf)
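
Here's a small plain-Python sketch of why the rotation matters. It rotates the Sobel mask the same way rot90() does (counterclockwise) and measures the response of each orientation on a 3x3 patch containing a vertical edge. The helper names are made up for this illustration.

```python
def rot90(matrix, times=1):
    """Rotate a matrix counterclockwise by 90 degrees, the given number of times."""
    for _ in range(times % 4):
        matrix = [list(row) for row in zip(*matrix)][::-1]
    return matrix


def respond(mask, patch):
    """Correlation response of a mask laid directly over a same-sized patch."""
    return sum(
        weight * pixel
        for mask_row, patch_row in zip(mask, patch)
        for weight, pixel in zip(mask_row, patch_row)
    )


SOBEL = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
vertical_edge = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]   # dark on the left, bright on the right

responses = [respond(rot90(SOBEL, k), vertical_edge) for k in range(4)]
print(responses)
print(sum(max(0, r) for r in responses))   # negatives clipped, as unsigned storage would do
```

The unrotated mask gives no response at all on this edge. Only one of the four orientations gives a positive response, and I'd expect a negative response to saturate to 0 when the result is stored as uint8, so each edge direction needs its own rotation.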

Happy Hacking with Octave!

Unsharp Masking

2007-12-19T00:00:00Z

While studying for my digital image processing final exam yesterday, I came back across unsharp masking, a technique that has been used by the publishing and printing industry for years. When I first saw this, I thought it was really neat. This time around, I took the hands-on approach and tried it myself in Octave.

Unsharp masking is a method of sharpening an image. The idea is this,

1. Blur the original image.
2. Subtract the blurred image from the original, creating a mask.
3. Add the mask to the original image.

Here is an example using a 1-dimensional signal. I blurred the signal with a 1x5 averaging filter: [1 1 1 1 1] * 1/5. Then I subtracted the blurred signal from the original to create a mask. Finally, I added the unsharp mask to the original signal. For images, we do this in 2-dimensions, as an image is simply a 2-dimensional signal.
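
Here's a minimal sketch of those three steps on a 1-dimensional step edge, in plain Python. The signal values and the replicate padding at the ends are assumptions chosen for this illustration, and the function names are made up.

```python
def blur5(signal):
    """Blur with a 1x5 averaging filter, replicating the end samples as padding."""
    n = len(signal)
    blurred = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - 2, i + 3)]
        blurred.append(sum(window) / 5)
    return blurred


def unsharp(signal):
    """Step 1: blur. Step 2: subtract to get the mask. Step 3: add the mask back."""
    blurred = blur5(signal)
    mask = [s - b for s, b in zip(signal, blurred)]
    sharpened = [s + m for s, m in zip(signal, mask)]
    return blurred, mask, sharpened


step = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]
blurred, mask, sharpened = unsharp(step)
print(blurred)
print(mask)
print(sharpened)
```

The sharpened signal undershoots just before the edge and overshoots just after it. That exaggerated contrast around edges is what makes the result look sharper.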

When it comes to image processing, we can create the mask in one easy step! This is done by performing a 2-dimensional convolution with a Laplacian kernel. It does steps 1 and 2 at the same time. This is the Laplacian I used in the example at the beginning,

So, to do it in Octave, this is all you need,

octave> i = imread("moon.png");
octave> m = conv2(i, [0 -1 0; -1 4 -1; 0 -1 0], "same");
octave> imwrite("moon-sharp.png", i + 2 * uint8(m))

i is the image and m is the mask. The mask created in step 2 looks like this,

You could take the above Octave code and drop it into a little she-bang script to create a simple image sharpening program. I leave this as an exercise for the reader.
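
One way to convince yourself that the Laplacian kernel really folds steps 1 and 2 together: it is a scaled version of "identity minus blur", where the blur is a 5-point, cross-shaped average. This is my own reading of the kernel, sketched in plain Python with integer arithmetic:

```python
LAPLACIAN = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]

IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # leaves the image unchanged
CROSS = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]      # divided by 5, a cross-shaped average

# 5 * (identity - cross / 5), kept in integers as 5 * identity - cross
derived = [
    [5 * IDENTITY[r][c] - CROSS[r][c] for c in range(3)]
    for r in range(3)
]
print(derived == LAPLACIAN)
print(sum(sum(row) for row in LAPLACIAN))
```

The weights sum to zero, so flat regions of the image produce an empty mask and only the edges get boosted.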