null program

NASM x86 Assembly Major Mode for Emacs

Last weekend I created a new Emacs mode, nasm-mode, for editing Netwide Assembler (NASM) x86 assembly programs. Over the past week I tweaked it until it felt comfortable enough to share on MELPA. It's got what you'd expect from a standard Emacs programming language mode: syntax highlighting, automatic indentation, and imenu support. It's not a full parser, but it knows all of NASM's instructions and directives.

Until recently I didn't really have preferences about x86 assemblers (GAS, NASM, YASM, FASM, MASM, etc.) or syntax (Intel, AT&T). I stuck to the GNU Assembler (GAS) since it's already there with all the other GNU development tools I know and love, and it's required for inline assembly in GCC. However, nasm-mode now marks my commitment to NASM as my primary x86 assembler.

Why NASM?

I need an assembler that can assemble 16-bit code (8086, 8088, 80186, 80286), because real mode is fun. Despite its .code16gcc directive, GAS is not suitable for this purpose. That directive is just enough to get the CPU into protected mode -- as needed when writing an operating system with GCC -- and no more. A different assembler is required for serious 16-bit programming.

GAS syntax has problems. I'm not talking about the argument order (source first or destination first), since there's no right answer to that one. The linked article covers a number of problems, with these being the big ones for me:

Being a portable assembler, GAS is the jack of all instruction sets, master of none. If I'm going to write a lot of x86 assembly, I want a tool specialized for the job.

YASM

I also looked at YASM, a rewrite of NASM. It supports 16-bit assembly and mostly uses NASM syntax. In my research I found that NASM used to lag behind in features due to slower development, which is what spawned YASM. In recent years this seems to have flipped around, with YASM lagging behind. If you're using YASM, nasm-mode should work pretty well for you, since it's still very similar.

YASM optionally supports GAS syntax, but this reintroduces almost all of GAS's problems. Even YASM's improvements (e.g. its ORG directive) break when switching to GAS syntax.

FASM

FASM is the "flat assembler," an assembler written in assembly language. This means it's only available on x86 platforms. While I don't really plan on developing x86 assembly on a Raspberry Pi, I'd rather not limit my options! I already regard 16-bit DOS programming as a form of embedded programming, and this may very well extend to the rest of x86 someday.

Also, it hasn't made its way into the various Linux distribution package repositories, including Debian, so it's already at a disadvantage for me.

MASM

This is Microsoft's assembler that comes with Visual Studio. Windows only and not open source, this is in no way a serious consideration. But since NASM's syntax was originally derived from MASM, it's worth mentioning. NASM takes the good parts of MASM and fixes the mistakes (such as the offset operator). It's different enough that nasm-mode would not work well with MASM.

NASM

It's not perfect, but it has an excellent manual, it's a solid program that does exactly what it says it will do, it has a powerful macro system and great 16-bit support, it's highly portable and easy to build, and its semantics and syntax have been carefully considered. It also comes with a simple, pure binary disassembler (ndisasm). In retrospect it seems like an obvious choice!

My one complaint would be that it's too flexible about labels. The colon on labels is optional, which can lead to subtle bugs. NASM will warn about this under some conditions (orphan-labels). Combined with the preprocessor, the difference between a macro and a label is ambiguous, short of re-implementing the entire preprocessor in Emacs Lisp.
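For example (a contrived snippet of mine, not from the NASM manual): misspell a zero-operand instruction and it silently becomes a label, since the colon is optional; the orphan-labels warning class is what flags it.

```nasm
        mov  si, msg
lodab               ; typo for lodsb: NASM parses this as a label
                    ; named "lodab" (flagged by -w+orphan-labels)
        ret
msg:    db  'hello', 0
```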

Why nasm-mode?

Emacs comes with an asm-mode for editing assembly code for various architectures. Unfortunately it's another jack-of-all-trades that's not very good. Worse, it doesn't follow Emacs' normal editing conventions, with unusual automatic indentation and self-insertion behaviors. It's what prompted me to write nasm-mode.

To be fair, I don't think it's possible to write a single major mode that covers many different instruction set architectures. Each architecture has its own quirks and oddities that essentially give it a unique language. This is especially true of x86, which, over its 37-year tenure in the hands of many different vendors, has developed a number of incompatible flavors. Each assembler/architecture pair needs its own major mode. I hope I just wrote NASM's.

One area where I'm still stuck is that I can't find an x86 style guide. It's easy to find half a dozen style guides of varying authority for any programming language that's more than 10 years old ... except x86. There's no obvious answer when it comes to automatic indentation. How are comments formatted and indented? How are instructions aligned? Should labels be on the same line as the instruction? Should labels require a colon? (I've decided this is "yes.") What about long label names? How are function prototypes/signatures documented? (The mode could take advantage of such a standard, a la ElDoc.) It seems everyone uses their own style. This is another conundrum for a generic asm-mode.

There are a couple of other nasm-modes floating around with different levels of completeness. Mine should supersede these, and will be much easier to maintain into the future as NASM evolves.

tags: [ emacs x86 ]

A Basic Just-In-Time Compiler

Monday's /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It's given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it's an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0, then the first few terms are 1, 4, 13, 40, and so on.

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86_64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I'll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren't allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there's less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will cause a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

To keep things simple here, I'm just assuming the page size is 4kB. In a real program, we'd use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86_64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.
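For reference, that run-time query is a one-liner (POSIX; a Windows build would call GetSystemInfo() instead; the function name here is mine):

```c
#include <unistd.h>

/* Discover the page size at run time rather than assuming 4kB. */
long
page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```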

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It's anonymous because it's not backed by a file.

struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}

Windows doesn't have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here's the equivalent in Win32.

struct asmbuf *
asmbuf_create(void)
{
    return VirtualAlloc(NULL, PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE);
}

Anyone reading closely should notice that I haven't actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: "write xor execute." That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

Or on Win32 (that last parameter is not allowed to be NULL),

void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}

Finally, instead of free() it gets unmapped.

void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}

And on Win32,

void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, PAGE_SIZE, MEM_RELEASE);
}

I won't list the definitions here, but there are two "methods" for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
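One plausible pair of definitions (my sketch, reconstructed from how the calls are used below, not necessarily the original code): asmbuf_ins() emits the instruction bytes most-significant first, matching the order objdump prints them, while asmbuf_immediate() copies the operand verbatim, since it's already little-endian in memory on x86.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Emit the low `size` bytes of `ins`, most-significant byte first,
 * so that 0x4889f8 lands in memory as 48 89 f8. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Immediates are already little-endian in memory, so copy as-is. */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```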

Calling Conventions

We're only going to be concerned with three of x86_64's many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we'll be able to call from C like a normal function. Here's what its prototype will look like: it takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that's where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   %rdi, %rax

There's a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.

retq

So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you go download the x86_64 documentation, look up the instructions we're using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60's.

Fortunately there's a much easier way. We'll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to as (GAS). It will produce a.out, which we'll disassemble with objdump -d.

$ as peek.s
$ objdump -d a.out

a.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   c3                      retq

That's straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   %rdi, %rax
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // retq

For each operation, we'll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator's instruction if it's small (32-bits or fewer), but I'm keeping it simple. To sneakily capture the "template" for this instruction I'm going to use 0x0123456789abcdef as the operand.

mov   $0x0123456789abcdef, %rdi

Which disassembled with objdump -d is,

0:  48 bf ef cd ab 89 67    movabs $0x123456789abcdef,%rdi
7:  45 23 01

Notice the operand is listed in little-endian order immediately after the instruction opcode. That's also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   operand, %rdi
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
case '+':
    asmbuf_ins(buf, 3, 0x4801f8);   // add   %rdi, %rax
    break;
case '-':
    asmbuf_ins(buf, 3, 0x4829f8);   // sub   %rdi, %rax
    break;
case '*':
    asmbuf_ins(buf, 4, 0x480fafc7); // imul  %rdi, %rax
    break;
case '/':
    asmbuf_ins(buf, 3, 0x4831d2);   // xor   %rdx, %rdx
    asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  %rdi
    break;
}

As an exercise, try adding support for modulus operator (%), XOR (^), and bit shifts (<, >). With the addition of these operators, you could define a decent PRNG as a recurrence relation. It will also eliminate the closed form solution to this problem so that we actually have a reason to do all this! Or, alternatively, switch it all to floating point.

Calling the Generated Code

Once we're all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That's pretty cool if you ask me! Now this was an extremely simplified situation. There's no branching, no intermediate values, no function calls, and I didn't even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it's a 1:1 translation.

I'd like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that's more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

tags: [ c tutorial netsec x86 ]

Goblin-COM 7DRL 2015

Yesterday I completed my third entry to the annual Seven Day Roguelike (7DRL) challenge (previously: 2013 and 2014). This year's entry is called Goblin-COM.

As with previous years, the ideas behind the game are not all that original. The goal was to be a fantasy version of classic X-COM with an ANSI terminal interface. You are the ruler of a fledgling human nation that is under attack by invading goblins. You hire heroes, operate squads, construct buildings, and manage resource income.

The inspiration this year came from watching BattleBunny play OpenXCOM, an open source clone of the original X-COM. It had its major 1.0 release last year. Like the early days of OpenTTD, it currently depends on the original game assets. But also like OpenTTD, it surpasses the original game in every way, so there's no reason to bother running the original anymore. I've also recently been watching One F Jef play Silent Storm, which is another turn-based squad game with a similar combat simulation.

As in X-COM, the game is broken into two modes of play: the geoscape (strategic) and the battlescape (tactical). Unfortunately I ran out of time and didn't get to the battlescape part, though I'd like to add it in the future. What's left is a sort-of city-builder with some squad management. You can hire heroes and send them out in squads to eliminate goblins, but rather than dropping to the battlescape, battles always auto-resolve in your favor. Despite this, the game still has a story, a win state, and a lose state. I won't say what they are, so you have to play it for yourself!

Terminal Emulator Layer

My previous entries were HTML5 games, but this entry is a plain old standalone application. C has been my preferred language for the past few months, so that's what I used. Both UTF-8-capable ANSI terminals and the Windows console are supported, so it should be perfectly playable on any modern machine. Note, though, that some of the poorer-quality terminal emulators that you'll find in your Linux distribution's repositories (rxvt and its derivatives) are not Unicode-capable, which means they won't work with G-COM.

I didn't make use of ncurses, instead opting to write my own terminal graphics engine. That's because I wanted a single, small binary that was easy to build, and I didn't want to mess around with PDCurses. I've also been studying the Win32 API lately, so writing my own terminal platform layer was rather easy to do anyway.

I experimented with a number of terminal emulators -- LXTerminal, Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator -- but the least capable "terminal" by far is the Windows console, so it was the one to dictate the capabilities of the graphics engine. Some ANSI terminals are capable of 256 colors, bold, underline, and strikethrough fonts, but a highly portable API is basically limited to 16 colors (RGBCMYKW with two levels of intensity) for each of the foreground and background, and no other special text properties.

ANSI terminals also have a concept of a default foreground color and a default background color. Most applications that output color (git, grep, ls) leave the background color alone and are careful to choose neutral foreground colors. G-COM always sets the background color, so that the game looks the same no matter what the default colors are. Also, the Windows console doesn't really have default colors anyway, even if I wanted to use them.

I put in partial support for Unicode because I wanted to use interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported Unicode for a long time now, but because it adopted it so early, it's locked into the outdated UTF-16. For me this wasn't too bad, because few computers, Linux included, are equipped to render characters outside the Basic Multilingual Plane anyway, so there's no need to deal with surrogate pairs. This is especially true of the Windows console, which can only render a very small set of characters: another limit on my graphics engine. Internally, individual codepoints are handled as uint16_t and strings are handled as UTF-8.

I said partial support because, in addition to the above, it has no support for combining characters, or any other situation where a codepoint takes up something other than one space in the terminal. Handling that requires lookup tables and dealing with pitfalls, but since I get to control exactly which characters are used, I didn't need any of it.

In spite of the limitations, I'm really happy with the graphical results. The waves are animated continuously, even while the game is paused, and it looks great. Here's GNOME Terminal's rendering, which I think looked the best by default.

I'll talk about how G-COM actually communicates with the terminal in another article. The interface between the game and the graphics engine is really clean (device.h), so it would be an interesting project to write a back end that renders the game to a regular window, no terminal needed.

Color Directive

I came up with a format directive to help me colorize everything. It runs in addition to the standard printf directives. Here's an example,

panel_printf(&panel, 1, 1, "Really save and quit? (Rk{y}/Rk{n})");

The color is specified by two characters, and the text it applies to is wrapped in curly brackets. There are eight colors to pick from: RGBCMYKW. That covers all the binary values for red, green, and blue. To specify an "intense" (bright) color, capitalize it. That means the Rk{...} above makes the wrapped text bright red.

Nested directives are also supported. (And, yes, that K means "high-intensity black," a.k.a. dark gray. A w means "low-intensity white," a.k.a. light gray.)

panel_printf(p, x, y++, "Kk{♦}    wk{Rk{B}uild}     Kk{♦}");

And it mixes with the normal printf directives:

panel_printf(p, 1, y++, "(Rk{m}) Yk{Mine} [%s]", cost);
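As a guess at the plumbing (this helper is hypothetical, not G-COM's actual code): each color letter maps naturally onto an intensity bit plus an ANSI color number 0-7, with uppercase selecting the bright variant.

```c
#include <ctype.h>
#include <string.h>

/* Map one color letter (of krgybmcw / KRGYBMCW) to a 4-bit value:
 * the low 3 bits are the ANSI color number, bit 3 is intensity.
 * Returns -1 for an unknown letter. Hypothetical helper. */
int
color_bits(int c)
{
    static const char names[] = "krgybmcw"; /* ANSI order 0..7 */
    int intense = isupper(c) ? 8 : 0;
    const char *p = strchr(names, tolower(c));
    return p ? intense | (int)(p - names) : -1;
}
```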

Single Binary

The GNU linker has a really nice feature for linking arbitrary binary data into your application. I used this to embed my assets into a single binary so that the user doesn't need to worry about any sort of data directory or anything like that. Here's what the make rule would look like:

$(LD) -r -b binary -o $@ $^

The -r specifies that output should be relocatable -- i.e. it can be fed back into the linker later when linking the final binary. The -b binary says that the input is just an opaque binary file ("plain" text included). The linker will create three symbols for each input file, named after the input filename: _binary_filename_txt_start, _binary_filename_txt_end, and _binary_filename_txt_size.

You can then access these from your C program like so:

extern const char _binary_filename_txt_start[];

I used this to embed the story texts, and I've used it in the past to embed images and textures. If you were to link zlib, you could easily compress these assets, too. I'm surprised this sort of thing isn't done more often!
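A quick way to see the generated symbols for yourself (the file name story.txt is made up; GNU binutils assumed):

```shell
printf 'Once upon a time...' > story.txt
ld -r -b binary -o story.o story.txt
nm story.o    # lists _binary_story_txt_start, _end, and _size
```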

Dumb Game Saves

To save time, and because it doesn't really matter, saves are just memory dumps. Taking another page from Handmade Hero, everything is allocated in a single, contiguous block of memory. With one exception, there are no pointers, so the entire block is relocatable. When references are needed, they're expressed as integer indexes into the embedded arrays. This allows the block to be cleanly reloaded in another process later. As a side effect, it also means there are no dynamic allocations (malloc()) while the game is running. Here's roughly what it looks like.

typedef struct game {
    uint64_t map_seed;
    map_t *map;
    long time;
    float wood, gold, food;
    long population;
    float goblin_spawn_rate;
    invader_t invaders[16];
    squad_t squads[16];
    hero_t heroes[128];
    game_event_t events[16];
} game_t;

The map pointer is that one exception, but that's because it's generated fresh after loading from the map_seed. Saving and loading is trivial (error checking omitted) and very fast.

void
game_save(game_t *game, FILE *out)
{
    fwrite(game, sizeof(*game), 1, out);
}

game_t *
game_load(FILE *in)
{
    game_t *game = malloc(sizeof(*game));
    fread(game, sizeof(*game), 1, in);
    game->map = map_generate(game->map_seed);
    return game;
}

The data isn't important enough to bother with rename+fsync durability. I'll risk the data if it makes savescumming that much harder!

The downside to this technique is that saves are generally not portable across architectures (particularly where endianness differs), and may not even be portable between different platforms on the same architecture. I only need to persist a single game state on the same machine, so this isn't a problem.

Final Results

I'm definitely going to be reusing some of this code in future projects. The G-COM terminal graphics layer is nifty, and I already like it better than ncurses, whose API I've always thought was kind of ugly and old-fashioned. I like writing terminal applications.

Just like the last couple of years, the final game is a lot simpler than I had planned at the beginning of the week. Most things take longer to code than I initially expect. I'm still enjoying playing it, which is a really good sign. When I play, I'm having enough fun to deliberately delay the end of the game so that I can sprawl my nation out over the island and generate crazy income.

tags: [ game media ]

Generic C Reference Counting

As a result of making regular use of object-oriented programming in C, I've discovered a useful reference counting technique for the occasional dynamically allocated structs that need it. The situation arises when the same struct instance is shared between an arbitrary number of other data structures and I need to keep track of it all.

It's incredibly simple and lives entirely in a header file, so without further ado (ref.h):

#pragma once

struct ref {
    void (*free)(const struct ref *);
    int count;
};

static inline void
ref_inc(const struct ref *ref)
{
    ((struct ref *)ref)->count++;
}

static inline void
ref_dec(const struct ref *ref)
{
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}

It has only two fields: the reference count and a "method" that knows how to free the object once the reference count hits 0. Structs using this reference counter will know how to free themselves, so callers will never call a specific *_destroy()/*_free() function. Instead they call ref_dec() to decrement the reference counter and let it happen on its own.

I decided to go with a signed count because it allows for better error checking. It may be worth putting an assert() in ref_inc() and ref_dec() to ensure the count is always non-negative. I chose an int because it's fast, and anything smaller will be padded out to at least that size anyway. On x86_64, struct ref is 16 bytes.

This is basically all there is to C++'s shared_ptr, which leverages C++ destructors to perform all of the increment/decrement work automatically.

Thread Safety

Those increments and decrements aren't thread safe, so this won't work as-is when data structures are shared between threads. If you're sure that you're using GCC on a capable platform, you can make use of its atomic builtins, making the reference counter completely thread safe.

static inline void
ref_inc(const struct ref *ref)
{
    __sync_add_and_fetch((int *)&ref->count, 1);
}

static inline void
ref_dec(const struct ref *ref)
{
    if (__sync_sub_and_fetch((int *)&ref->count, 1) == 0)
        ref->free(ref);
}

Or if you're using C11, make use of the new stdatomic.h. (Strictly speaking, the count field should be declared _Atomic int for this; the casts below assume plain int and _Atomic int share a representation, which holds on mainstream platforms.)

static inline void
ref_inc(const struct ref *ref)
{
    atomic_fetch_add((_Atomic int *)&ref->count, 1);
}

static inline void
ref_dec(const struct ref *ref)
{
    if (atomic_fetch_sub((_Atomic int *)&ref->count, 1) == 1)
        ref->free(ref);
}

What's That Const?

There's a very deliberate decision to make all of the function arguments const, for both reference counting functions and the free() method. This may seem wrong because these functions are specifically intended to modify the reference count. There are dangerous-looking casts in each case to remove the const.

The reason for this is that it's likely for someone holding a const pointer to one of these objects to want to keep their own reference. Their promise not to modify the object doesn't really apply to the reference count, which is merely embedded metadata. Without the const, they would need to cast it away before being permitted to call ref_inc() and ref_dec(). Rather than litter the program with dangerous casts, the casts are all kept in one place -- in the reference counting functions -- where they're strictly limited to mutating the reference counting fields.

On a related note, the stdlib.h free() function doesn't take a const pointer, so the free() method taking a const pointer is a slight departure from the norm. Taking a non-const pointer was a mistake in the C standard library. The free() function mutates the pointer itself -- including all other pointers to that object -- making it invalid. Semantically, it doesn't mutate the memory behind the pointer, so it's not actually violating the const. To compare, the Linux kernel kfree() takes a const void *.

Just as users may need to increment and decrement the counters on const objects, they'll also need to be able to free() them, so it's also a const.

Usage Example

So how does one use this generic reference counter? Embed a struct ref in your own structure and use our old friend: the container_of() macro. For anyone who's forgotten, this macro is not part of standard C, but you can define it with offsetof().

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Here's a dumb linked list example where each node is individually reference counted. Adding an extra 16 bytes to each of your linked list nodes isn't normally going to help with much, but if the tail of the linked list is being shared between different data structures (such as other lists), reference counting makes things a lot simpler.

struct node {
    char id[64];
    float value;
    struct node *next;
    struct ref refcount;
};

I put refcount at the end so that we'll have to use container_of() in this example. It conveniently casts away the const for us.

static void
node_free(const struct ref *ref)
{
    struct node *node = container_of(ref, struct node, refcount);
    struct node *child = node->next;
    free(node);
    if (child)
        ref_dec(&child->refcount);
}

Notice that it recursively decrements its child's reference count afterwards (intentionally tail recursive). A whole list will clean itself up when the head is freed and no part of the list is shared.

The allocation function sets up the free() function pointer and initializes the count to 1.

struct node *
node_create(char *id, float value)
{
    struct node *node = malloc(sizeof(*node));
    snprintf(node->id, sizeof(node->id), "%s", id);
    node->value = value;
    node->next = NULL;
    node->refcount = (struct ref){node_free, 1};
    return node;
}

(Side note: I used snprintf() because strncpy() is broken and strlcpy() is non-standard, so it's the most straightforward way to do this in standard C.)

And to start making some use of the reference counter, here's push and pop.

void
node_push(struct node **nodes, char *id, float value)
{
    struct node *node = node_create(id, value);
    node->next = *nodes;
    *nodes = node;
}

struct node *
node_pop(struct node **nodes)
{
    struct node *node = *nodes;
    *nodes = (*nodes)->next;
    if (*nodes)
        ref_inc(&(*nodes)->refcount);
    return node;
}

Notice node_pop() increments the reference count of the new head node before returning. That's because the node now has an additional reference: from *nodes and from the node that was just popped. It's up to the caller to free the returned node, which would decrement the count of the new head node, but not free it. Alternatively node_pop() could set next on the returned node to NULL rather than increment the counter, which would also prevent the returned node from freeing the new head when it gets freed. But it's probably more useful for the returned node to keep functioning as a list. That's what the reference counting is for, after all.

Finally, a simple program to exercise it all. It reads ID/value pairs from standard input.

void
node_print(struct node *node)
{
    for (; node; node = node->next)
        printf("%s = %f\n", node->id, node->value);
}

int main(void)
{
    struct node *nodes = NULL;
    char id[64];
    float value;
    while (scanf(" %63s %f", id, &value) == 2)
        node_push(&nodes, id, value);
    if (nodes != NULL) {
        node_print(nodes);
        struct node *old = node_pop(&nodes);
        node_push(&nodes, "foobar", 0.0f);
        node_print(nodes);
        ref_dec(&old->refcount);
        ref_dec(&nodes->refcount);
    }
    return 0;
}

I've used this technique several times over the past few months. It's trivial to remember, so I just code it up from scratch each time I need it.

tags: [ c ]

Interactive Programming in C

I'm a huge fan of interactive programming (see: JavaScript, Java, Lisp, Clojure). That is, modifying and extending a program while it's running. For certain kinds of non-batch applications, it takes much of the tedium out of testing and tweaking during development. Until last week I didn't know how to apply interactive programming to C. How does one go about redefining functions in a running C program?

Last week in Handmade Hero (days 21-25), Casey Muratori added interactive programming to the game engine. This is especially useful in game development, where the developer might want to tweak, say, a boss fight without having to restart the entire game after each tweak. Now that I've seen it done, it seems so obvious. The secret is to build almost the entire application as a shared library.

This puts a serious constraint on the design of the program: it cannot keep any state in global or static variables, though this should be avoided anyway. Global state will be lost each time the shared library is reloaded. In some situations, this can also restrict use of the C standard library, including functions like malloc(), depending on how these functions are implemented or linked. For example, if the C standard library is statically linked, functions with global state may introduce global state into the shared library. It's difficult to know what's safe to use. This works fine in Handmade Hero because the core game, the part loaded as a shared library, makes no use of external libraries, including the standard library.

Additionally, the shared library must be careful with its use of function pointers. The functions being pointed at will no longer exist after a reload. This is a real issue when combining interactive programming with object oriented C.

An example with the Game of Life

To demonstrate how this works, let's go through an example. I wrote a simple ncurses Game of Life demo that's easy to modify. You can get the entire source here if you'd like to play around with it yourself on a Unix-like system.

Quick start:

  1. In a terminal run make then ./main. Press r to randomize and q to quit.
  2. Edit game.c to change the Game of Life rules, add colors, etc.
  3. In a second terminal run make. Your changes will be reflected immediately in the original program!

As of this writing, Handmade Hero is being written on Windows, so Casey is using a DLL and the Win32 API, but the same technique can be applied on Linux, or any other Unix-like system, using libdl. That's what I'll be using here.

The program will be broken into two parts: the Game of Life shared library ("game") and a wrapper ("main") whose job is only to load the shared library, reload it when it updates, and call it at a regular interval. The wrapper is agnostic about the operation of the "game" portion, so it could be re-used almost untouched in another project.

To avoid maintaining a whole bunch of function pointer assignments in several places, the API to the "game" is enclosed in a struct. This also eliminates warnings from the C compiler about mixing data and function pointers. The layout and contents of the game_state struct are private to the game itself. The wrapper will only handle a pointer to this struct.

struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

In the demo the API is made of 5 functions. The first 4 are primarily concerned with loading and unloading.

The library will provide a filled out API struct as a global variable, GAME_API. This is the only exported symbol in the entire shared library! All functions will be declared static, including the ones referenced by the structure.

const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step
};
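
For concreteness, here's the game side in miniature (the contents of game_state are hypothetical here; the real demo holds the Game of Life grid and ncurses state). Every function is static, and GAME_API is the lone exported symbol:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Defined here, opaque to the wrapper. Contents are illustrative. */
struct game_state {
    long tick;
};

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

static struct game_state *game_init(void)
{
    return calloc(1, sizeof(struct game_state));
}

static void game_finalize(struct game_state *state)
{
    free(state);
}

static void game_reload(struct game_state *state)
{
    (void)state;  /* e.g. reinitialize anything tied to this .so's code */
}

static void game_unload(struct game_state *state)
{
    (void)state;  /* e.g. tear down before the .so is closed */
}

static bool game_step(struct game_state *state)
{
    state->tick++;   /* advance the simulation one step */
    return true;     /* false would mean "quit" */
}

const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step,
};
```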

dlopen, dlsym, and dlclose

The wrapper is focused on calling dlopen(), dlsym(), and dlclose() in the right order at the right time. The game will be compiled to the file libgame.so, so that's what will be loaded. It's written in the source with a ./ to force the name to be used as a filename. The wrapper keeps track of everything in a game struct.

const char *GAME_LIBRARY = "./libgame.so";

struct game {
    void *handle;
    ino_t id;
    struct game_api api;
    struct game_state *state;
};

The handle is the value returned by dlopen(). The id is the inode of the shared library, as returned by stat(). The rest is defined above. Why the inode? We could use a timestamp instead, but that's indirect. What we really care about is whether the shared object file is actually a different file than the one that was loaded. The file will never be updated in place; it will be replaced by the compiler/linker, so the timestamp isn't what's important.

Using the inode is a much simpler situation than in Handmade Hero. Due to Windows' broken file locking behavior, the game DLL can't be replaced while it's being used. To work around this limitation, the build system and the loader have to rely on randomly-generated filenames.

void game_load(struct game *game)

The purpose of the game_load() function is to load the game API into a game struct, but only if either it hasn't been loaded yet or if it's been updated. Since it has several independent failure conditions, let's examine it in parts.

struct stat attr;
if ((stat(GAME_LIBRARY, &attr) == 0) && (game->id != attr.st_ino)) {

First, use stat() to determine if the library's inode is different than the one that's already loaded. The id field will be 0 initially, so as long as stat() succeeds, this will load the library the first time.

    if (game->handle) {
        game->api.unload(game->state);
        dlclose(game->handle);
    }

If a library is already loaded, unload it first, being sure to call unload() to inform the library that it's being updated. It's critically important that dlclose() happens before dlopen(). On my system, dlopen() looks only at the string it's given, not the file behind it. Even though the file has been replaced on the filesystem, dlopen() will see that the string matches a library already opened and return a pointer to the old library. (Is this a bug?) The handles are reference counted internally by libdl.

    void *handle = dlopen(GAME_LIBRARY, RTLD_NOW);

Finally load the game library. There's a race condition here that cannot be helped due to limitations of dlopen(). The library may have been updated again since the call to stat(). Since we can't ask dlopen() about the inode of the library it opened, we can't know. Since this is only used during development, not in production, it's not a big deal.

    if (handle) {
        game->handle = handle;
        game->id = attr.st_ino;
        /* ... more below ... */
    } else {
        game->handle = NULL;
        game->id = 0;
    }

If dlopen() fails, it will return NULL. In the case of ELF, this will happen if the compiler/linker is still in the process of writing out the shared library. Since the unload was already done, this means no game will be loaded when game_load() returns. The user of the struct needs to be prepared for this eventuality and try loading again later (e.g. after a few milliseconds). It may be worth filling the API with stub functions when no library is loaded.

    const struct game_api *api = dlsym(game->handle, "GAME_API");
    if (api != NULL) {
        game->api = *api;
        if (game->state == NULL)
            game->state = game->api.init();
        game->api.reload(game->state);
    } else {
        dlclose(game->handle);
        game->handle = NULL;
        game->id = 0;
    }

When the library loads without error, look up the GAME_API struct that was mentioned before and copy it into the local struct. Copying rather than using the pointer avoids one more layer of indirection when making function calls. The game state is initialized if it hasn't been already, and the reload() function is called to inform the game it's just been reloaded.

If looking up the GAME_API fails, close the handle and consider it a failure.

The main loop calls game_load() each time around. And that's it!

int main(void)
{
    struct game game = {0};
    for (;;) {
        game_load(&game);
        if (game.handle)
            if (!game.api.step(game.state))
                break;
        usleep(100000);
    }
    game_unload(&game);
    return 0;
}
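
As for the stub-function suggestion from earlier, a sketch might look like this (the stub names are mine). When loading fails, point the API at no-ops so the main loop can keep calling through it unconditionally:

```c
#include <stdbool.h>
#include <stddef.h>

/* Restated from above so this sketch stands alone. */
struct game_state;

struct game_api {
    struct game_state *(*init)(void);
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);
};

/* Hypothetical no-op stubs, installed whenever no library is loaded. */
static struct game_state *stub_init(void) { return NULL; }
static void stub_noop(struct game_state *state) { (void)state; }
static bool stub_step(struct game_state *state) { (void)state; return true; }

static const struct game_api STUB_API = {
    .init     = stub_init,
    .finalize = stub_noop,
    .reload   = stub_noop,
    .unload   = stub_noop,
    .step     = stub_step,
};
```

game_load() would copy STUB_API into game->api on failure, sparing the caller the NULL-handle check.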

Now that I have this technique in my toolbelt, it has me itching to develop a proper, full game in C with OpenGL and all, perhaps in another Ludum Dare. The ability to develop interactively is very appealing.

tags: [ c ]

How to build DOS COM files with GCC

This past weekend I participated in Ludum Dare #31. Before the theme was even announced, due to recent fascination I wanted to make an old school DOS game. DOSBox would be the target platform since it's the most practical way to run DOS applications anymore, despite modern x86 CPUs still being fully backwards compatible all the way back to the 16-bit 8086.

I successfully created and submitted a DOS game called DOS Defender. It's a 32-bit 80386 real mode DOS COM program. All assets are embedded in the executable and there are no external dependencies, so the entire game is packed into that 10kB binary.

You'll need a joystick/gamepad in order to play. I included mouse support in the Ludum Dare release in order to make it easier to review, but this was removed because it doesn't work well.

The most technically interesting part is that I didn't need any DOS development tools to create this! I only used my everyday Linux C compiler (gcc). It's not actually possible to build DOS Defender in DOS. Instead, I'm treating DOS as an embedded platform, which is the only form in which DOS still exists today. Along with DOSBox and DOSEMU, this is a pretty comfortable toolchain.

If all you care about is how to do this yourself, skip to the "Tricking GCC" section, where we'll write a "Hello, World" DOS COM program with Linux's GCC.

Finding the right tools

I didn't have GCC in mind when I started this project. What really triggered all of this was that I had noticed Debian's bcc package, Bruce's C Compiler, that builds 16-bit 8086 binaries. It's kept around for compiling x86 bootloaders and such, but it can also be used to compile DOS COM files, which was the part that interested me.

For some background: the Intel 8086 was a 16-bit microprocessor released in 1978. It had none of the fancy features of today's CPUs: no memory protection, no floating point instructions, and only up to 1MB of RAM addressable. All modern x86 desktops and laptops can still pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same limited addressing and all. That's some serious backwards compatibility. This feature is called real mode. It's the mode in which all x86 computers boot. Modern operating systems switch to protected mode as soon as possible, which provides virtual addressing and safe multi-tasking. DOS is not one of these operating systems.

Unfortunately, bcc is not an ANSI C compiler. It supports a subset of K&R C, along with inline x86 assembly. Unlike other 8086 C compilers, it has no notion of "far" or "long" pointers, so inline assembly is required to access other memory segments (VGA, clock, etc.). Side note: the remnants of these 8086 "long pointers" still exist today in the Win32 API: LPSTR, LPWORD, LPDWORD, etc. The inline assembly isn't anywhere near as nice as GCC's inline assembly. The assembly code has to manually load variables from the stack, and since bcc supports two different calling conventions, the assembly ends up being hard-coded to one calling convention or the other.

Given all its limitations, I went looking for alternatives.

DJGPP

DJGPP is the DOS port of GCC. It's a very impressive project, bringing almost all of POSIX to DOS. The DOS ports of many programs are built with DJGPP. In order to achieve this, it only produces 32-bit protected mode programs. If a protected mode program needs to manipulate hardware (e.g. VGA), it must make requests to a DOS Protected Mode Interface (DPMI) service. If I used DJGPP, I couldn't make a single, standalone binary as I had wanted, since I'd need to include a DPMI server. There's also a performance penalty for making DPMI requests.

Getting a DJGPP toolchain working can be difficult, to put it kindly. Fortunately I found a useful project, build-djgpp, that makes it easy, at least on Linux.

Either there's a serious bug or the official DJGPP binaries have become infected again, because in my testing I kept getting the "Not COFF: check for viruses" error message when running my programs in DOSBox. To double check that it's not an infection on my own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as a clean room. It's impossible for this ARM-based device to get infected with an x86 virus. It still had the same problem, and all the binary hashes matched up between the machines, so it's not my fault.

So given the DPMI issue and the above, I moved on.

Tricking GCC

What I finally settled on is a neat hack that involves "tricking" GCC into producing real mode DOS COM files, so long as it can target 80386 (as is usually the case). The 80386 was released in 1985 and was the first 32-bit x86 microprocessor. GCC still targets this instruction set today, even in the x86_64 toolchain. Unfortunately, GCC cannot actually produce 16-bit code, so my main goal of targeting 8086 would not be achievable. This doesn't matter, though, since DOSBox, my intended platform, is an 80386 emulator.

In theory this should even work unchanged with MinGW, but there's a long-standing MinGW bug that prevents it from working right ("cannot perform PE operations on non PE output file"). It's still do-able, and I did it myself, but you'll need to drop the OUTPUT_FORMAT directive and add an extra objdump step.

Hello World in DOS

To demonstrate how to do all this, let's make a DOS "Hello, World" COM program using GCC on Linux.

There's a significant burden with this technique: there will be no standard library. It's basically like writing an operating system from scratch, except for the few services DOS provides. This means no printf() or anything of the sort. Instead we'll ask DOS to print a string to the terminal. Making a request to DOS means firing an interrupt, which means inline assembly!

DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2F. The big one, and the one we're interested in, is 0x21, function 0x09 (print string). Between DOS and BIOS, there are thousands of functions called this way. I'm not going to try to explain x86 assembly, but in short the function number is stuffed into register ah and interrupt 0x21 is fired. Function 0x09 also takes an argument, the pointer to the string to be printed, which is passed in ds:dx.

Here's the GCC inline assembly print() function. Strings passed to this function must be terminated with a $. Why? Because DOS.

static void print(char *string)
{
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");
}

The assembly is declared volatile because it has a side effect (printing the string). To GCC, the assembly is an opaque hunk, and the optimizer relies on the output/input/clobber constraints (the last three lines). For DOS programs like this, all inline assembly will have side effects. This is because it's not being written for optimization but to access hardware and DOS, things not accessible to plain C.

Care must also be taken by the caller, because GCC doesn't know that the memory pointed to by string is ever read. It's likely the array that backs the string needs to be declared volatile too. This is all foreshadowing of what's to come: doing anything in this environment is an endless struggle against the optimizer. Not all of these battles can be won.

Now for the main function. The name of this function shouldn't matter, but I'm avoiding calling it main() since MinGW has funny ideas about mangling this particular symbol, even when it's asked not to.

int dosmain(void)
{
    print("Hello, World!\n$");
    return 0;
}

COM files are limited to 65,279 bytes in size. This is because an x86 memory segment is 64kB and COM files are simply loaded by DOS to 0x0100 in the segment and executed. There are no headers, it's just a raw binary. Since a COM program can never be of any significant size, and no real linking needs to occur (freestanding), the entire thing will be compiled as one translation unit. It will be one call to GCC with a bunch of options.

Compiler Options

Here are the essential compiler options.

-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding

Since no standard libraries are in use, the only difference between gnu99 and c99 is that trigraphs are disabled (as they should be) and inline assembly can be written as asm instead of __asm__. It's a no brainer. This project will be so closely tied to GCC that I don't care about using GCC extensions anyway.

I'm using -Os to keep the compiled output as small as possible. It will also make the program run faster. This is important when targeting DOSBox because, by default, it deliberately runs as slow as a machine from the 1980s. I want to be able to fit within that constraint. If the optimizer is causing problems, you may need to temporarily switch to -O0 to determine whether the problem is your fault or the optimizer's fault.

You see, the optimizer doesn't understand that the program will be running in real mode, and under its addressing constraints. It will perform all sorts of invalid optimizations that break your perfectly valid programs. It's not a GCC bug since we're doing crazy stuff here. I had to rework my code a number of times to stop the optimizer from breaking my program. For example, I had to avoid returning complex structs from functions because they'd sometimes be filled with garbage. The real danger here is that a future version of GCC will be more clever and will break more stuff. In this battle, volatile is your friend.

The next option is -nostdlib, since there are no valid libraries for us to link against, even statically.

The options -m32 -march=i386 set the compiler to produce 80386 code. If I was writing a bootloader for a modern computer, targeting 80686 would be fine, too, but DOSBox is 80386.

The -ffreestanding argument requires that GCC not emit code that calls built-in standard library helper functions. Sometimes instead of emitting code to do something, it emits code that calls a built-in function to do it, especially with math operators. This was one of the main problems I had with bcc, where this behavior couldn't be disabled. This is most commonly used in writing bootloaders and kernels. And now DOS COM files.

Linker Options

The -Wl option is used to pass arguments to the linker (ld). We need it since we're doing all this in one call to GCC.

-Wl,--nmagic,--script=com.ld

The --nmagic turns off page alignment of sections. One, we don't need this. Two, that would waste precious space. In my tests it doesn't appear to be necessary, but I'm including it just in case.

The --script option tells the linker that we want to use a custom linker script. This allows us to precisely lay out the sections (text, data, bss, rodata) of our program. Here's the com.ld script.

OUTPUT_FORMAT(binary)
SECTIONS
{
    . = 0x0100;
    .text :
    {
        *(.text);
    }
    .data :
    {
        *(.data);
        *(.bss);
        *(.rodata);
    }
    _heap = ALIGN(4);
}

The OUTPUT_FORMAT(binary) says not to put this into an ELF (or PE, etc.) file. The linker should just dump the raw code. A COM file is just raw code, so this means the linker will produce a COM file!

I had said that COM files are loaded to 0x0100. The fourth line offsets the binary to this location. The first byte of the COM file will still be the first byte of code, but it will be designed to run from that offset in memory.

What follows is all the sections, text (program), data (static data), bss (zero-initialized data), rodata (strings). Finally I mark the end of the binary with the symbol _heap. This will come in handy later for writing sbrk(), after we're done with "Hello, World." I've asked for the _heap position to be 4-byte aligned.

We're almost there.

Program Startup

The linker is usually aware of our entry point (main) and sets that up for us. But since we asked for "binary" output, we're on our own. If the print() function is emitted first, our program's execution will begin by executing that function, which is invalid. Our program needs a little header stanza to get things started.

The linker script has a STARTUP option for handling this, but to keep it simple we'll put that right in the program. This is usually called crt0.o or Boot.o, in case those names ever come up in your own reading. This inline assembly must be the very first thing in our code, before any includes and such. DOS will do most of the setup for us; we really just have to jump to the entry point.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C, %ah\n"
     "int   $0x21\n");

The .code16gcc tells the assembler that we're going to be running in real mode, so that it makes the proper adjustment. Despite the name, this will not make it produce 16-bit code! First it calls dosmain, the function we wrote above. Then it informs DOS, using function 0x4C (terminate with return code), that we're done, passing the exit code along in the 1-byte register al (already set by dosmain). This inline assembly is automatically volatile because it has no inputs or outputs.

Everything at Once

Here's the entire C program.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C,%ah\n"
     "int   $0x21\n");

static void print(char *string)
{
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");
}

int dosmain(void)
{
    print("Hello, World!\n$");
    return 0;
}

I won't repeat com.ld. Here's the call to GCC.

gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
    -o hello.com -Wl,--nmagic,--script=com.ld hello.c

And testing it in DOSBox:

From here if you want fancy graphics, it's just a matter of making an interrupt and writing to VGA memory. If you want sound you can perform an interrupt for the PC speaker. I haven't sorted out how to call Sound Blaster yet. It was from this point that I grew DOS Defender.

Memory Allocation

To cover one more thing, remember that _heap symbol? We can use it to implement sbrk() for dynamic memory allocation within the main program segment. This is real mode, and there's no virtual memory, so we're free to write to any memory we can address at any time. Some of this is reserved (i.e. low and high memory) for hardware. So using sbrk() specifically isn't really necessary, but it's interesting to implement ourselves.

As is normal on x86, your text and data segments are at a low address (0x0100 in this case) and the stack is at a high address (around 0xffff in this case). On Unix-like systems, the memory returned by malloc() comes from two places: sbrk() and mmap(). What sbrk() does is allocate memory just above the text/data segments, growing "up" towards the stack. Each call to sbrk() will grow this space (or leave it exactly the same). That memory is then managed by malloc() and friends.

Here's how we can get sbrk() in a COM program. Notice I have to define my own size_t, since we don't have a standard library.

typedef unsigned short  size_t;

extern char _heap;
static char *hbreak = &_heap;

static void *sbrk(size_t size)
{
    char *ptr = hbreak;
    hbreak += size;
    return ptr;
}

It just sets a pointer to _heap and grows it as needed. A slightly smarter sbrk() would be careful about alignment as well.
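
That smarter version might look like this (a sketch; for the sake of a standalone example, a static array stands in for the _heap linker symbol, and the hosted size_t is used). It rounds the break up to a 4-byte boundary before handing out each allocation:

```c
#include <stddef.h>
#include <stdint.h>

static long heap[1024];             /* stand-in for _heap, suitably aligned */
static char *hbreak = (char *)heap;

static void *sbrk_aligned(size_t size)
{
    /* round the current break up to the next 4-byte boundary */
    uintptr_t addr = (uintptr_t)hbreak;
    hbreak = (char *)((addr + 3) & ~(uintptr_t)3);
    char *ptr = hbreak;
    hbreak += size;
    return ptr;
}
```

An odd-sized allocation no longer misaligns everything that follows: after allocating 5 bytes, the next pointer handed out lands 8 bytes in.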

In the making of DOS Defender an interesting thing happened. I was (incorrectly) counting on the memory returned by my sbrk() being zeroed. This was the case the first time the game ran. However, DOS doesn't zero this memory between programs. When I would run my game again, it would pick right up where it left off, because the same data structures with the same contents were loaded back into place. A pretty cool accident! It's part of what makes this a fun embedded platform.

tags: [ c debian tutorial game ]

LZSS Quine Puzzle

When I was a kid I spent some time playing a top-down, 2D, puzzle/action, 1993, MS-DOS game called God of Thunder. It came on a shareware CD, now long lost, called Games People Play. A couple decades later I was recently reminded of the game and decided to dig it up and play it again. It's not quite as exciting as I remember it -- nostalgia really warps perception -- but it's still an interesting game nonetheless.

That got me thinking about how difficult it might be to modify ("mod") the game to add my own levels and puzzles. It's a tiny game, so there aren't many assets to reverse engineer. Unpacked, the game just barely fits on a 1.44 MB high density floppy disk. That was probably one of the game's primary design constraints. It also means it's almost certainly employing some sort of home-brew compression algorithm in order to fit more content. I find these sorts of things absolutely interesting and delightful.

You see, back in those old days, compression wasn't really a "solved" problem like it is today. They had to design and implement their own algorithms, with varying degrees of success. Today if you need compression for a project, you just grab zlib. Released in 1995, it implements the most widely used compression algorithm today, DEFLATE, with a tidy, in-memory API. zlib is well-tested, thoroughly optimized, and sits in a nearly-perfect sweet spot between compression ratio and performance. There's even an embeddable version. Since spinning platters are so slow compared to CPUs, compression is likely to speed up an application simply because fewer bytes need to go to and from the disk. Today it's less about saving storage space and more about reducing input/output demands.

Fortunately for me, someone has already reverse engineered most of the God of Thunder assets. It uses its own flavor of Lempel-Ziv-Storer-Szymanski (LZSS), which itself is derived from LZ77, one of the algorithms used in DEFLATE. The original LZSS paper focuses purely on the math, describing the algorithm in terms of symbols with no concern for how it's actually serialized into bits. Those specific details were decided by the game's developers, and that's what I'll be describing below.

As an adult I'm finding the God of Thunder asset formats to be more interesting than the game itself. It's a better puzzle! I really enjoy studying the file formats of various applications, especially older ones that didn't have modern standards to build on. Usually lots of thought and engineering goes into the design of these formats -- and, too often, not enough thought goes into it. The format's specifics reveal insights into the internal workings of the application, sometimes exposing unanticipated failure modes. Prying apart odd, proprietary formats (i.e. "data reduction") is probably my favorite kind of work at my day job, and it comes up fairly often.

God of Thunder LZSS Definition

An LZSS compression stream is made up of two kinds of chunks: literals and back references. A literal chunk is passed through to the output buffer unchanged. A reference chunk is a pair of numbers: a length and an offset backwards into the output buffer. Only a single bit is needed for each chunk to identify its type.

To avoid any sort of complicated and slow bit wrangling, the God of Thunder developers (or whoever inspired them) came up with the smart idea to stage 8 of these bits up at once as a single byte, a "control" byte. Since literal chunks are 1 byte and reference chunks are 2 bytes, everything falls onto clean byte boundaries. Every group of 8 chunks is prefixed with one of these control bytes, and so every LZSS compression stream begins with a control byte. The least significant bit controls the 1st chunk in the group and the most significant bit controls the 8th chunk. A 1 denotes a literal and a 0 denotes a reference.

So, for example, a control byte of 0xff means to pass through unchanged the next 8 bytes of the compression stream. This would be the least efficient compression scenario, because the "compressed" stream is 112.5% (9/8) the size of the uncompressed stream. Gains come entirely from the back references.

A back reference is two bytes little endian (this was in MS-DOS running on x86), the lower 12 bits are the offset and the upper 4 bits are the length, minus 2. That is, you read the 4 length bits and add 2. This is because it doesn't make any sense to reference a length shorter than 2: a literal chunk would be shorter. The offset doesn't have anything added to it. This was a design mistake since an offset of 0 doesn't make any sense. It refers to a byte just outside the output buffer. It should have been stored as the offset minus 1.

A 12-bit offset means up to a 4kB sliding window of output may be referenced at any time. A 4-bit length, plus two, means up to 17 bytes may be copied in a single back reference. Compared to other compression algorithms, this is rather short.

It's important to note that the length is allowed to extend beyond the output buffer (offset < length). The bytes are, in effect, copied one at a time into the output and may potentially be reused within the same operation (like the opposite of memmove). An offset of 1 and a length of 10 means "repeat the last output byte 10 times."

That's the entire format! It's extremely simple but reasonably effective for the game's assets.
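
In fact, the whole decoder fits in a few lines of C. This sketch (the function name and signature are mine) halts once the known decompressed size has been reached, which is how the game itself detects end of stream:

```c
#include <stddef.h>

/* Decode the God of Thunder LZSS stream in[0..inlen) into out,
 * stopping after outlen bytes have been emitted. Returns the number
 * of bytes written. */
static size_t lzss_decompress(const unsigned char *in, size_t inlen,
                              unsigned char *out, size_t outlen)
{
    size_t i = 0, o = 0;
    while (i < inlen && o < outlen) {
        unsigned control = in[i++];  /* 8 chunk-type bits, LSB first */
        for (int bit = 0; bit < 8 && i < inlen && o < outlen; bit++) {
            if (control & (1u << bit)) {
                out[o++] = in[i++];  /* literal: pass one byte through */
            } else {
                /* back reference: 2 bytes little endian,
                 * low 12 bits = offset, high 4 bits = length - 2 */
                unsigned ref = in[i] | (unsigned)in[i + 1] << 8;
                i += 2;
                unsigned offset = ref & 0x0fff;
                unsigned length = (ref >> 12) + 2;
                /* copy one byte at a time so offset < length works */
                for (unsigned n = 0; n < length && o < outlen; n++, o++)
                    out[o] = out[o - offset];
            }
        }
    }
    return o;
}
```

For example, the stream {0x07, 'a', 'b', 'c', 0x03, 0x40} is three literals followed by a reference of offset 3, length 6, decoding to "abcabcabc".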

Worst Case and Best Case

In the worst case, such as compressing random data, the compressed stream will be at most 112.5% (9/8) the size of the uncompressed stream.

In the best case, such as a long string of zeros, the compressed stream will be, at minimum, 12.5% (1/8) the size of the decompressed stream. Think about it this way: imagine every chunk is a reference of maximum length. That's 1 control byte plus 16 (8 * 2) reference bytes, for a total of 17 compressed bytes. Those 17 bytes emit 8 * 17 decompressed bytes: 8 chunks, each copying the maximum length of 17. Conveniently the two 17s cancel, leaving a factor of 8 for the best case.

LZSS End of Stream

If you're paying really close attention, you may have noticed that by grouping 8 control bits at a time, the number of chunks in a stream is, strictly speaking, constrained to multiples of 8. What if, during compression, the input comes up short of exactly 8 chunks? As is, there's no way to communicate a premature end to the stream. There are three ways around this using a small amount of metadata, each differing in robustness.

  1. Keep track of the size of the decompressed data. When that many bytes have been emitted, halt. This is how God of Thunder handles it. A small validation check can be performed here: the output stream should always end between chunks, not in the middle of a chunk (i.e. in the middle of copying a back reference). Some of the bits in the final control byte may contain arbitrary data that doesn't affect the output, which is a concern when hashing compressed data. My suggestion: require the unused control bits to be 0, which enables one more check: the output stream should never end just short of a literal chunk.

  2. Keep track of the size of the compressed data. Halt when no more chunks remain. A similar but weaker validation check can be performed here: the input stream should never stop between the two bytes of a reference. It's weaker because it's sensitive to less of the stream, so corruption is harder to detect. The same unused control bit padding situation applies here.

  3. Use an out-of-band end marker (EOF). This is very similar to keeping track of the input size (the filesystem does it for you), but has the weakest validation of all: the stream could be accidentally truncated at any point between chunks, and that would be undetectable.

An LZSS Quine

After spending some time playing around with this format, I thought about what it would take to make an LZSS quine. That is, find an LZSS compressed stream that decompresses to itself. It's been done for DEFLATE, which I imagine is a much harder problem. There are zip files containing exact copies of themselves, recursively. I'm pretty confident it's never been done for this exact compression format, simply because it's so specific to this old MS-DOS game.

I haven't figured it out yet, so you won't find the solution here. This, dear readers, is my challenge to you! Using the format described above, craft an LZSS quine. LZSS doesn't have no-op chunks (i.e. length = 0), which makes this harder than it would otherwise be. It may not even be possible, in which case your challenge is to prove it!

So far I've determined that it begins with at least 4kB of 0xff. Why is this? First, as I mentioned, all compression streams begin with a control byte. Second, no references can be made until at least one literal byte has been emitted, so the first bit (LSB) of the first byte is a 1, and the second byte is exactly the same as the first byte. So the first two bytes are xxxxxxx1, with the xs being "don't care (yet)."

If the next chunk is a back reference, those first two bytes become xxxxxx01. It could only reference that one byte (so offset = 1), and the length would need to be at least two, ensuring at least the first three bytes of output all have that same pattern. However, on the most significant byte of the reference chunk, this conflicts with having an offset of 1 because the 9th bit of the offset is set to 1, forcing the offset to an invalid 257 bytes. Therefore, the second chunk must be a literal.

This pattern continues until the first eight chunks are all literals, which means the quine begins with at least nine 0xff bytes. Going on, this also means the first back reference is going to be 0xffff (offset = 4095, length = 17), so the sliding window needs to be filled enough to make that offset valid. References would then be used to "catch up" with the compression stream, then some magic is needed to finish off the stream.

That's where I'm stuck.

tags: [ compression ]

C Object Oriented Programming

Object oriented programming, polymorphism in particular, is essential to nearly any large, complex software system. Without it, decoupling different system components is difficult. C doesn't come with object oriented capabilities, so large C programs tend to grow their own out of C's primitives. This includes huge C projects like the Linux kernel, BSD kernels, and SQLite.

Starting Simple

Suppose you're writing a function pass_match() that takes an input stream, an output stream, and a pattern. It works sort of like grep. It passes to the output each line of input that matches the pattern. The pattern string contains a shell glob pattern to be handled by POSIX fnmatch(). Here's what the interface looks like.

void pass_match(FILE *in, FILE *out, const char *pattern);

Glob patterns are simple enough that pre-compilation, as would be done for a regular expression, is unnecessary. The bare string is enough.

Some time later the customer wants the program to support regular expressions in addition to shell-style glob patterns. For efficiency's sake, regular expressions need to be pre-compiled and so will not be passed to the function as a string. It will instead be a POSIX regex_t object. A quick-and-dirty approach might be to accept both and match whichever one isn't NULL.

void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);

Bleh. This is ugly and won't scale well. What happens when more kinds of filters are needed? It would be much better to accept a single object that covers both cases, and possibly even another kind of filter in the future.

A Generalized Filter

One of the most common ways to customize the behavior of a function in C is to pass a function pointer. For example, the final argument to qsort() is a comparator that determines how objects get sorted.

For pass_match(), this function would accept a string and return a boolean value deciding if the string should be passed to the output stream. It gets called once on each line of input.

void pass_match(FILE *in, FILE *out, bool (*match)(const char *));

However, this has one of the same problems as qsort(): the passed function lacks context. It needs a pattern string or regex_t object to operate on. In other languages these would be attached to the function as a closure, but C doesn't have closures. It would need to be smuggled in via a global variable, which is not good.

static regex_t regex;  // BAD!!!

bool regex_match(const char *string)
{
    return regexec(&regex, string, 0, NULL, 0) == 0;
}

Because of the global variable, in practice pass_match() would be neither reentrant nor thread-safe. We could take a lesson from GNU's qsort_r() and accept a context to be passed to the filter function. This simulates a closure.

void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context);

The provided context pointer would be passed to the filter function as the second argument, and no global variables are needed. This would probably be good enough for most purposes and it's about as simple as possible. The interface to pass_match() would cover any kind of filter.
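For example, here's what a glob filter might look like under this interface, with the pattern smuggled in through the context pointer instead of a global. The glob_match name is just for illustration.

```c
#include <stdbool.h>
#include <fnmatch.h>

/* Matches when the string fits the glob pattern passed as context. */
static bool glob_match(const char *string, void *context)
{
    const char *pattern = context;  /* the "closed-over" pattern */
    return fnmatch(pattern, string, 0) == 0;
}
```

A caller would write pass_match(in, out, glob_match, "*.c"), and a regex version would pass a regex_t pointer through the same context argument.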

But wouldn't it be nice to package the function and context together as one object?

More Abstraction

How about putting the context on a struct and making an interface out of that? Here's a tagged union that behaves as one or the other.

enum filter_type { GLOB, REGEX };

struct filter {
    enum filter_type type;
    union {
        const char *pattern;
        regex_t regex;
    } context;
};

There's one function for interacting with this struct: filter_match(). It checks the type member and calls the correct function with the correct context.

bool filter_match(struct filter *filter, const char *string)
{
    switch (filter->type) {
    case GLOB:
        return fnmatch(filter->context.pattern, string, 0) == 0;
    case REGEX:
        return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
    }
    abort(); // programmer error
}

And the pass_match() API now looks like this. This will be the final change to pass_match(), both in implementation and interface.

void pass_match(FILE *input, FILE *output, struct filter *filter);

It still doesn't care how the filter works, so it's good enough to cover all future cases. It just calls filter_match() on the pointer it was given. However, the switch and tagged union aren't friendly to extension. Really, it's outright hostile. We finally have some degree of polymorphism, but it's crude. It's like building duct tape into a design. Adding new behavior means adding another switch case. This is a step backwards. We can do better.

Methods

With the switch we're no longer taking advantage of function pointers. So what about putting a function pointer on the struct?

struct filter {
    bool (*match)(struct filter *, const char *);
};

The filter itself is passed as the first argument, providing context. In object oriented languages, that's the implicit this argument. To avoid requiring the caller to worry about this detail, we'll hide it in a new switch-free version of filter_match().

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}

Notice we're still lacking the actual context, the pattern string or the regex object. Those will be different structs that embed the filter struct.

struct filter_regex {
    struct filter filter;
    regex_t regex;
};

struct filter_glob {
    struct filter filter;
    const char *pattern;
};

For both, the original filter struct is the first member. This is critical. We're going to be using a trick called type punning. The first member is guaranteed to be positioned at the beginning of the struct, so a pointer to a struct filter_glob is also a pointer to a struct filter. Notice any resemblance to inheritance?

Each type, glob and regex, needs its own match method.

static bool
method_match_regex(struct filter *filter, const char *string)
{
    struct filter_regex *regex = (struct filter_regex *) filter;
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static bool
method_match_glob(struct filter *filter, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) filter;
    return fnmatch(glob->pattern, string, 0) == 0;
}

I've prefixed them with method_ to indicate their intended usage. I declared these static because they're completely private. Other parts of the program will only be accessing them through a function pointer on the struct. This means we need some constructors in order to set up those function pointers. (For simplicity, I'm not error checking.)

struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.match = method_match_regex;
    return &regex->filter;
}

struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = pattern;
    glob->filter.match = method_match_glob;
    return &glob->filter;
}

Now this is real polymorphism. It's really simple from the user's perspective. They call the correct constructor and get a filter object that has the desired behavior. This object can be passed around trivially, and no other part of the program worries about how it's implemented. Best of all, since each method is a separate function rather than a switch case, new kinds of filter subtypes can be defined independently. Users can create their own filter types that work just as well as the two "built-in" filters.

Cleaning Up

Oops, the regex filter needs to be cleaned up when it's done, but the user, by design, won't know how to do it. Let's add a free() method.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

void filter_free(struct filter *filter)
{
    filter->free(filter);
}

And the methods for each. These would also be assigned in the constructor.

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = (struct filter_regex *) f;
    regfree(&regex->regex);
    free(f);
}

static void
method_free_glob(struct filter *f)
{
    free(f);
}

The glob constructor should perhaps strdup() its pattern as a private copy, in which case it would be freed here.

Object Composition

A good rule of thumb is to prefer composition over inheritance. Having tidy filter objects opens up some interesting possibilities for composition. Here's an AND filter that composes two arbitrary filter objects. It only matches when both its subfilters match. It supports short circuiting, so put the faster, or most discriminating, filter first in the constructor (user's responsibility).

struct filter_and {
    struct filter filter;
    struct filter *sub[2];
};

static bool
method_match_and(struct filter *f, const char *s)
{
    struct filter_and *and = (struct filter_and *) f;
    return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}

static void
method_free_and(struct filter *f)
{
    struct filter_and *and = (struct filter_and *) f;
    filter_free(and->sub[0]);
    filter_free(and->sub[1]);
    free(f);
}

struct filter *filter_and(struct filter *a, struct filter *b)
{
    struct filter_and *and = malloc(sizeof(*and));
    and->sub[0] = a;
    and->sub[1] = b;
    and->filter.match = method_match_and;
    and->filter.free = method_free_and;
    return &and->filter;
}

It can combine a regex filter and a glob filter, or two regex filters, or two glob filters, or even other AND filters. It doesn't care what the subfilters are. Also, the free() method here frees its subfilters. This means that the user doesn't need to keep hold of every filter created, just the "top" one in the composition.

To make composition filters easier to use, here are two "constant" filters. These are statically allocated, shared, and are never actually freed.

static bool
method_match_any(struct filter *f, const char *string)
{
    return true;
}

static bool
method_match_none(struct filter *f, const char *string)
{
    return false;
}

static void
method_free_noop(struct filter *f)
{
}

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

The FILTER_NONE filter will generally be used with a (theoretical) filter_or() and FILTER_ANY will generally be used with the previously defined filter_and().

Here's a simple program that composes multiple glob filters into a single filter, one for each program argument.

int main(int argc, char **argv)
{
    struct filter *filter = &FILTER_ANY;
    for (char **p = argv + 1; *p; p++)
        filter = filter_and(filter_glob_create(*p), filter);
    pass_match(stdin, stdout, filter);
    filter_free(filter);
    return 0;
}

Notice only one call to filter_free() is needed to clean up the entire filter.

Multiple Inheritance

As I mentioned before, the filter struct must be the first member of filter subtype structs in order for type punning to work. If we want to "inherit" from two different types like this, they would both need to be in this position: a contradiction.

Fortunately type punning can be generalized so that the first-member constraint isn't necessary. This is commonly done through a container_of() macro. Here's a C99-conforming definition.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Given a pointer to a member of a struct, the container_of() macro allows us to back out to the containing struct. Suppose the regex struct was defined differently, so that the regex_t member came first.

struct filter_regex {
    regex_t regex;
    struct filter filter;
};

The constructor remains unchanged. The casts in the methods change to the macro.

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(f);
}

It's a constant, compile-time computed offset, so there should be no practical performance impact. The filter can now participate freely in other intrusive data structures, like linked lists and such. It's analogous to multiple inheritance.
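To illustrate, here's a hypothetical intrusive linked list node embedded in the middle of a struct; container_of() recovers the containing struct no matter where the member sits. The names here are mine, not part of the filter example.

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct node { struct node *next; };

struct employee {
    const char *name;
    struct node node;  /* intrusive link, deliberately not first */
    int id;
};

/* Recover the employee from a pointer to its embedded list node. */
static struct employee *employee_of(struct node *n)
{
    return container_of(n, struct employee, node);
}
```

A list traversal only ever sees struct node pointers, yet each one can be backed out to its full employee.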

Vtables

Say we want to add a third method, clone(), to the filter API to make an independent copy of a filter, one that will need to be separately freed. It will be like a copy constructor in C++. Each kind of filter will need to define an appropriate "method" for it. As long as new methods like this are added at the end of the struct, this doesn't break the API, though it does break the ABI regardless.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

The filter object is starting to get big. It's got three pointers -- 24 bytes on modern systems -- and these pointers are the same between all instances of the same type. That's a lot of redundancy. Instead, these pointers could be shared between instances in a common table called a virtual method table, commonly known as a vtable.

Here's a vtable version of the filter API. The overhead is now only one pointer regardless of the number of methods in the interface.

struct filter {
    struct filter_vtable *vtable;
};

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

Each type creates its own vtable and links to it in the constructor. Here's the regex filter re-written for the new vtable API and clone method. This is all the tricks in one basket for a big object oriented C finale!

struct filter *filter_regex_create(const char *pattern);

struct filter_regex {
    regex_t regex;
    const char *pattern;
    struct filter filter;
};

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(f);
}

static struct filter *
method_clone_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return filter_regex_create(regex->pattern);
}

/* vtable */
struct filter_vtable filter_regex_vtable = {
    method_match_regex, method_free_regex, method_clone_regex
};

/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regex->pattern = pattern;
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.vtable = &filter_regex_vtable;
    return &regex->filter;
}

This is almost exactly what's going on behind the scenes in C++. When a method/function is declared virtual, and therefore dispatches based on the run-time type of its left-most argument, it's listed in the vtables for classes that implement it. Otherwise it's just a normal function. This is why functions need to be declared virtual ahead of time in C++.

In conclusion, it's relatively easy to get the core benefits of object oriented programming in plain old C. It doesn't require heavy use of macros, nor do users of these systems need to know that underneath it's an object system, unless they want to extend it for themselves.

Here's the whole example program if you're interested in poking around:

tags: [ c cpp ]

Emacs Autotetris Mode

For more than a decade now, Emacs has come with a built-in Tetris clone, originally written by XEmacs' Glynn Clements. Just run M-x tetris any time you want to play. For anyone too busy to waste time playing Tetris, earlier this year I wrote an autotetris-mode that will play the Emacs game automatically.

Load the source, autotetris-mode.el, and run M-x autotetris. It will start the built-in Tetris but make all the moves itself. It works best when byte compiled.

At the time I had read an article and was interested in trying my hand at my own Tetris AI. Like most things Emacs, the built-in Tetris game is very hackable. It's also pretty simple and easy to understand. Rather than write my own I chose to build upon this one.

Heuristics

It's not a particularly strong AI. It doesn't pay attention to the next piece in the queue, it doesn't know the game's basic shapes, and it doesn't try to maximize the score (clearing multiple rows at once). The goal is simply to keep running for as long as possible. It reliably survives until the game becomes so fast that the AI can't move pieces into place in time (it's rate limited like a human player), which is good enough.

When a new piece appears at the top of the screen, the AI, in memory, tries placing it in all possible positions and all possible orientations. For each of these positions it runs a heuristic on the resulting game state, summing five metrics. Each metric is scaled by a hand-tuned weight to adjust its relative priority. Smaller is better, so the position with the lowest score is selected.

Number of Holes

A hole is any open space that has a solid block above it, even if that hole is accessible without passing through a solid block. Count these holes.

Maximum Height

Add the height of the tallest column. Column height includes any holes in the column. The game ends when a column touches the top of the screen (or something like that), so this should be kept in check.

Mean Height

Add the mean height of all columns. The higher this is, the closer we are to losing the game. Since each row will have at least one hole, this will be a similar measure to the hole count.

Height Disparity

Add the difference between the shortest column height and the tallest column height. If this number is large it means we're not making effective use of the playing area. It also discourages the AI from getting into that annoying situation we all remember: when you really need a 4x1 piece that never seems to come. Those are the brief moments when I truly believe the version I'm playing has to be rigged.

Surface Roughness

Take the root mean square of the column heights. A rougher surface leaves fewer options when placing pieces. This measure will be similar to the disparity measurement.
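Sketched in C rather than Emacs Lisp, the combined heuristic might look something like this. The weights are arbitrary placeholders, not the hand-tuned values the mode actually uses, and the mean square stands in for the RMS (the ordering of candidate positions is what matters, and it avoids a libm dependency in this sketch).

```c
#define WIDTH  10
#define HEIGHT 20

/* board[y][x] is nonzero for a filled cell; y = 0 is the top row.
   Returns the heuristic score: smaller is better. */
double score(int board[HEIGHT][WIDTH])
{
    int holes = 0, max = 0, min = HEIGHT, sum = 0;
    double sq = 0;
    for (int x = 0; x < WIDTH; x++) {
        int height = 0;
        for (int y = 0; y < HEIGHT; y++) {
            if (board[y][x]) {
                if (!height)
                    height = HEIGHT - y;  /* first filled cell sets height */
            } else if (height) {
                holes++;                  /* open cell below a filled one */
            }
        }
        if (height > max) max = height;
        if (height < min) min = height;
        sum += height;
        sq += height * height;
    }
    double mean = (double)sum / WIDTH;
    double rough = sq / WIDTH;            /* mean square of heights */
    return 8.0 * holes                    /* placeholder weights */
         + 2.0 * max
         + 1.0 * mean
         + 1.0 * (max - min)              /* height disparity */
         + 1.0 * rough;
}
```

The AI would compute this for every candidate placement of the current piece and pick the minimum.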

Emacs-specific Details

With a position selected, the AI sends player inputs at a limited rate to the game itself, moving the piece into place. This is done by calling tetris-move-right, tetris-move-left, and tetris-rotate-next, which, in the normal game, are bound to the arrow keys.

The built-in tetris-mode isn't quite designed for this kind of extension, so it needs a little bit of help. I defined two pieces of advice to create hooks. These hooks alert my AI to two specific events in the game: the game start and a fresh, new piece.

(defadvice tetris-new-shape (after autotetris-new-shape-hook activate)
  (run-hooks 'autotetris-new-shape-hook))

(defadvice tetris-start-game (after autotetris-start-game-hook activate)
  (run-hooks 'autotetris-start-game-hook))

I talked before about the problems with global state. Fortunately, tetris-mode doesn't store any game state in global variables. It stores everything in buffer-local variables, which can be exploited for use in the AI. To perform the "in memory" heuristic checks, it creates a copy of the game state and manipulates the copy. The copy is made by way of clone-buffer on the *Tetris* buffer. The tetris-mode functions all work equally as well on the clone, so I can use the existing game rules to properly place the next piece in each available position. The game's own rules take care of clearing rows and checking for collisions for me. I wrote an autotetris-save-excursion function to handle the messy details.

(defmacro autotetris-save-excursion (&rest body)
  "Restore tetris game state after BODY completes."
  (declare (indent defun))
  `(with-current-buffer tetris-buffer-name
     (let ((autotetris-saved (clone-buffer "*Tetris-saved*")))
       (unwind-protect
           (with-current-buffer autotetris-saved
             (kill-local-variable 'kill-buffer-hook)
             ,@body)
         (kill-buffer autotetris-saved)))))

The kill-buffer-hook variable is also cloned, but I don't want tetris-mode to respond to the clone being killed, so I clear out the hook.

That's basically all there is to it! Watching it play, it feels like it's making dumb mistakes by not placing pieces in optimal positions, but it recovers well from these situations almost every time, so it must know what it's doing. Currently it's a better player than I am, which is my rule of thumb for calling an AI successful.

tags: [ emacs interactive ]

Global State: A Tale of Two Bad C APIs

Mutable global variables are evil. You've almost certainly heard that before, but it's worth repeating. It makes programs, and libraries especially, harder to understand, harder to optimize, more fragile, more error prone, and less useful. If you're using global state in a way that's visible to users of your API, and it's not essential to the domain, you're almost certainly doing something wrong.

In this article I'm going to use two well-established C APIs to demonstrate why global state is bad for APIs: BSD regular expressions and POSIX Getopt.

BSD Regular Expressions

The BSD regular expression API dates back to 4.3BSD, released in 1986. It's just a pair of functions: one compiles the regex, the other executes it on a string.

char *re_comp(const char *regex);
int   re_exec(const char *string);

It's immediately obvious that there's hidden internal state. Where else would the resulting compiled regex object be? Also notice there's no re_free(), or similar, for releasing resources held by the compiled result. That's because, due to its limited design, it doesn't hold any: the compiled regex lives entirely in static memory, which means there's some upper limit on the complexity of the regex given to this API. Suppose an implementation did use dynamically allocated memory instead. With only one compiled regex allowed at a time this might seem harmless, but that memory could never be released, producing leak warnings in Valgrind and making the API a nuisance during bug hunting.

This API is not thread-safe. Only one thread can use it at a time. It's not reentrant. While using a regex, calling another function that might use a regex means you have to recompile when it returns, just in case. The global state being entirely hidden, there's no way to tell if another part of the program used it.

Fixing BSD Regular Expressions

This API has been deprecated for some time now, so hopefully no one's using it anymore. Six years after 4.3BSD, POSIX.2 (1992) standardized a much better API. It operates on an opaque regex_t object, on which all state is stored. There's no global state.

int    regcomp(regex_t *preg, const char *regex, int cflags);
int    regexec(const regex_t *preg, const char *string, ...);
size_t regerror(int errcode, const regex_t *preg, ...);
void   regfree(regex_t *preg);

This is what a good API looks like.

Getopt

POSIX defines a C API called Getopt for parsing command line arguments. It's a single function that operates on the argc and argv values provided to main(). An option string specifies which options are valid and whether or not they require an argument. Typical use looks like this,

int main(int argc, char **argv)
{
    int option;
    while ((option = getopt(argc, argv, "ab:c:d")) != -1) {
        switch (option) {
            case 'a':
            /* ... */
        }
    }
    /* ... */
    return 0;
}

The b and c options require an argument, indicated by the colons. When encountered, this argument is passed through a global variable optarg. There are four external global variables in total.

extern char *optarg;
extern int optind, opterr, optopt;

If an invalid option is found, getopt() will automatically print a locale-specific error message and return '?'. The opterr variable can be used to disable this message, and the optopt variable holds the actual invalid option character.

The optind variable keeps track of Getopt's progress. It slides along argv as each option is processed. In a minimal, strictly POSIX-compliant Getopt, this is all the global state required.

The argc value in main(), and therefore the same parameter in getopt(), is completely redundant and serves no real purpose. Just like the C strings it points to, the argv vector is guaranteed to be NULL-terminated. At best it's a premature optimization.

Threading and Reentrancy

The most immediate problem is that the entire program can only parse one argument vector at a time. It's not thread-safe. This leaves out the possibility of parsing argument vectors in other threads. For example, if the program is a server that exposes a shell-like interface to remote users, and multiple threads are used to handle those requests, it won't be able to take advantage of Getopt.

The second problem is that, even in a single-threaded application, the program can't pause parsing one argument vector to parse another, then pick up where it left off. It's not reentrant. For example, suppose one of the arguments to the program is a string containing more arguments to be parsed for some subsystem.

#  -s    Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"

In theory, the value of optind could be saved and restored. However, this isn't portable. POSIX doesn't explicitly declare that the entire parser state is captured by optind, nor is it required to be. Implementations are allowed to keep internal, hidden global state. This has implications for resetting Getopt.

Resetting Getopt

In a minimal, strict Getopt, resetting the parser for another argument vector is just a matter of setting optind back to its original value of 1. However, this idiom isn't portable, and POSIX provides no portable method for resetting the global parser state.

Real implementations of Getopt go beyond POSIX. Probably the most popular extra feature is option grouping. Typically, multiple options can be grouped into a single argument, so long as only the final option requires an argument.

$ myprogram -adb foo

After processing a, optind cannot be incremented, because Getopt is still working on the first argument. This means there's another internal counter for stepping across the group. In glibc this is called nextchar. Setting optind to 1 will not reset this internal counter, nor could Getopt detect the reset if optind were already 1. The glibc way to reset Getopt is to set optind to 0, which is otherwise an invalid value. Some other Getopt implementations follow this idiom, but it's not entirely portable.

Not only does Getopt have nasty global state, the user has no way to reliably control it!

Error Printing

I mentioned that Getopt will automatically print an error message unless disabled with opterr. There's no way to get at this error message, should you want to redirect it somewhere else. It's more hidden, internal state. You could write your own message, but you'd lose out on the automatic locale support.

Fixing Getopt

The way Getopt should have been designed was to accept a context argument and store all state on that context. Following other POSIX APIs (pthreads, regex), the context itself would be an opaque object. In typical use it would have automatic (i.e. stack) duration. The context would either be zero initialized or a function would be provided to initialize it. It might look something like this (in the zero-initialized case).

int getopt(getopt_t *ctx, char **argv, const char *optstring);

Instead of optarg and optopt global variables, these values would be obtained by interrogating the context. The same applies for optind and the diagnostic message.

const char *getopt_optarg(getopt_t *ctx);
int         getopt_optopt(getopt_t *ctx);
int         getopt_optind(getopt_t *ctx);
const char *getopt_opterr(getopt_t *ctx);

Alternatively, instead of getopt_optind() the API could have a function that continues processing, but returns non-option arguments instead of options. It would return NULL when no more arguments are left. This is the API I'd prefer, because it would allow for argument permutation (allow options to come after non-options, per GNU Getopt) without actually modifying the argument vector. This common extension to Getopt could be added cleanly. The real Getopt isn't designed well for extension.

const char *getopt_next_arg(getopt_t *ctx);

This API eliminates the global state and, as a result, solves all of the problems listed above. It's essentially the same API defined by Popt and my own embeddable Optparse. They're much better options if the limitations of POSIX-style Getopt are an issue.

tags: [ c ]