null program

Mandelbrot Set with SIMD Intrinsics

When I started this blog 8 years ago, my first post was about the Mandelbrot set. Since then, both technology and my own skills have improved (or so I like to believe!), so I'm going to take another look at it, this time using three different Single Instruction, Multiple Data (SIMD) instruction sets: SSE2, AVX, and NEON. The latter two didn't exist when the last article was published. In this article I demonstrate SIMD bringing a 5.8x speedup to a fractal renderer.

If you want to take a look at my code before reading further:

Having multiple CPU cores allows different instructions to operation on (usually) different data independently. In contrast, under SIMD a specific operation (single instruction) acts upon several values (multiple data) at once. It's another form of parallelization. For example, with image processing -- perhaps the most common use case -- this means multiple pixels could be computed within the same number of cycles it would normally take to compute just one. SIMD is generally implemented on CPUs through wide registers: 64, 128, 256, and even 512 bits wide. Values are packed into the register like an array and are operated on independently, generally with saturation arithmetic (clamped, non-wrapping).

Rather than hand-code all this in assembly, I'm using yet another technique I picked up from the always-educational Handmade Hero: compiler intrinsics. The code is all C, but in place of C's operators are pseudo-function calls operating on special SIMD types. These aren't actual function calls, they're intrinsics. The compiler will emit a specific assembly instruction for each intrinsic, sort of like an inline function. This is more flexible for mixing with other C code, the compiler will manage all the registers, and the compiler will attempt to re-order and interleave instructions to maximize throughput. It's a big win!

Some SIMD History

The first widely consumer available SIMD hardware was probably the MMX instruction set, introduced to 32-bit x86 in 1997. This provided 8 64-bit mm0 - mm7, registers aliasing the older x87 floating pointer registers, which operated on packed integer values. This was extended by AMD with its 3DNow! instruction set, adding floating point instructions.

However, you don't need to worry about any of that because these both were superseded by Streaming SIMD Extensions (SSE) in 1999. SSE has 128-bit registers -- confusingly named xmm0 - xmm7 -- and a much richer instruction set. SSE has been extended with SSE2 (2001), SSE3 (2004), SSSE3 (2006), SSE4.1 (2007), and SSE4.2 (2008). x86_64 doesn't have SSE2 as an extension but instead as a core component of the architecture (adding xmm8- xmm15), baking it into its ABI.

In 2009, ARM introduced the NEON instruction set as part of ARMv6. Like SSE, it has 128-bit registers, but its instruction set is more consistent and uniform. One of its most visible features over SSE is a stride load parameter making it flexible for a wider variety data arrangements. NEON is available on your Raspberry Pi, which is why I'm using it here.

In 2011, Intel and AMD introduced the Advanced Vector Extensions (AVX) instruction set. Essentially it's SSE with 256-bit registers, named ymm0 - ymm15. That means operating on 8 single-precision floats at once! As of this writing, this extensions is just starting to become commonplace on desktops and laptops. It also has extensions: AVX2 (2013) and AVX-512 (2015).

Starting with C

Moving on to the code, in mandel.c you'll find mandel_basic, a straight C implementation that produces a monochrome image. Normally I would post the code here within the article, but it's 30 lines long and most of it isn't of any particular interest.

I didn't use C99's complex number support because -- continuing to follow the approach Handmade Hero -- I intended to port this code directly into SIMD intrinsics. It's much easier to work from a straight non-SIMD implementation towards one with compiler intrinsics than coding with compiler intrinsics right away. In fact, I'd say it's almost trivial, since I got it right the first attempt on all three.

There's just one unusual part:

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < s->height; y++) {
   /* ... */

This is an Open Multi-Processing (OpenMP) pragma. It's a higher-level threading API than POSIX or Win32 threads. OpenMP takes care of all thread creation, work scheduling, and cleanup. In this case, the for loop is parallelized such that each row of the image will be scheduled individually to a thread, with one thread spawned for each CPU core. This one line saves all the trouble of managing a work queue and such. I also use it in my SIMD implementations, composing both forms of parallelization for maximum performance.

I did it in single precision because I really want to exploit SIMD. Obviously, being half as wide as double precision, twice an many single precision operands can fit in a SIMD register.

On my wife's i7-4770 (8 logical cores), it takes 29.9ms to render one image using the defaults (1440x1080, real{-2.5, 1.5}, imag{-1.5, 1.5}, 256 iterations). I'll use the same machine for both the SSE2 and AVX benchmarks.

SSE2 Mandelbrot Set

The first translation I did was SSE2 (mandel_sse2.c). As with just about any optimization, it's more complex and harder to read than the straight version. Again, I won't post the code here, especially when this one has doubled to 60 lines long.

Porting to SSE2 (and SIMD in general) is simply a matter of converting all assignments and arithmetic operators to their equivalent intrinsics. The Intel Intrinsics Guide is a godsend for this step. It's easy to search for specific operations and it tells you what headers they come from. Notice that there are no C arithmetic operators until the very end, after the results have been extracted from SSE and pixels are being written.

There are two new types present in this version, __m128 and __m128i. These will be mapped to SSE registers by the compiler, sort of like the old (outdated) C register keyword. One big difference is that it's legal to take the address of these values with &, and the compiler will worry about the store/load operations. The first type is for floating point values and the second is for integer values. At first it's annoying for these to be separate types (the CPU doesn't care), but it becomes a set of compiler-checked rails for avoiding mistakes.

Here's how assignment was written in the straight C version:

float iter_scale = 1.0f / s->iterations;

And here's the SSE version. SSE intrinsics are prefixed with _mm, and the "ps" stands for "packed single-precision."

__m128 iter_scale = _mm_set_ps1(1.0f / s->iterations);

This sets all four lanes of the register to the same value (a broadcast). Lanes can also be assigned individually, such as at the beginning of the innermost loop.

__m128 mx = _mm_set_ps(x + 3, x + 2, x + 1, x + 0);

This next part shows why the SSE2 version is longer. Here's the straight C version:

float zr1 = zr * zr - zi * zi + cr;
float zi1 = zr * zi + zr * zi + ci;
zr = zr1;
zi = zi1;

To make it easier to read in the absence of operator syntax, I broke out the intermediate values. Here's the same operation across four different complex values simultaneously. The purpose of these intrinsics should be easy to guess from their names.

__m128 zr2 = _mm_mul_ps(zr, zr);
__m128 zi2 = _mm_mul_ps(zi, zi);
__m128 zrzi = _mm_mul_ps(zr, zi);
zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci);

There are a bunch of swizzle instructions added in SSSE3 and beyond for re-arranging bytes within registers. With those I could eliminate that last bit of non-SIMD code at the end of the function for packing pixels. In an earlier version I used them, but since pixel packing isn't a hot spot in this code (it's outside the tight, innermost loop), it didn't impact the final performance, so I took it out for the sake of simplicity.

The running time is now 8.56ms per image, a 3.5x speedup. That's close to the theoretical 4x speedup from moving to 4-lane SIMD. That's fast enough to render fullscreen at 60FPS.

AVX Mandelbrot Set

With SSE2 explained, there's not much to say about AVX (mandel_avx.c). The only difference is the use of __m256, __m256i, the _mm256 intrinsic prefix, and that this operates on 8 points on the complex plane instead of 4.

It's interesting that the AVX naming conventions are subtly improved over SSE. For example, here are the SSE broadcast intrinsics.

Notice the oddball at the end? That's discrimination against sufferers of obsessive-compulsive personality disorder. This was fixed in AVX's broadcast intrinsics:

The running time here is 5.20ms per image, a 1.6x speedup from SSE2. That's not too far from the theoretical 2x speedup from using twice as many lanes. We can render at 60FPS and spend most of the time waiting around for the next vsync.

NEON Mandelbrot Set

NEON is ARM's take on SIMD. It's what you'd find on your phone and tablet rather than desktop or laptop. NEON behaves much like a co-processor: NEON instructions are (cheaply) dispatched asynchronously to their own instruction pipeline, but transferring data back out of NEON is expensive and will stall the ARM pipeline until the NEON pipeline catches up.

Going beyond __m128 and __m256, NEON intrinsics have a type for each of the possible packings. On x86, the old stack-oriented x87 floating-point instructions are replaced with SSE single-value ("ss", "sd") instructions. On ARM, there's no reason to use NEON to operate on single values, so these "packings" don't exist. Instead there are half-wide packings. Note the lack of double-precision support.

Again, the CPU doesn't really care about any of these types. It's all to help the compiler help us. For example, we don't want to multiply a float32x4_t and a float32x2_t since it wouldn't have a meaningful result.

Otherwise everything is similar (mandel_neon.c). NEON intrinsics are (less-cautiously) prefixed with v and suffixed with a type (_f32, _u32, etc.).

The performance on my model Raspberry Pi 2 (900 MHz quad-core ARM Cortex-A7) is 545ms per frame without NEON and 232ms with NEON, a 2.3x speedup. This isn't nearly as impressive as SSE2, also at 4 lanes. My implementation almost certainly needs more work, especially since I know less about ARM than x86.

Compiling with Intrinsics

For the x86 build, I wanted the same binary to have AVX, SSE2, and plain C versions, selected by a command line switch and feature availability, so that I could easily compare benchmarks. Without any special options, gcc and clang will make conservative assumptions about the CPU features of the target machine. In order to build using AVX intrinsics, I need the compiler to assume the target has AVX. The -mavx argument does this.

mandel_avx.o : mandel_avx.c
    $(CC) -c $(CFLAGS) -mavx -o $@ $^

mandel_sse2.o : mandel_sse2.c
    $(CC) -c $(CFLAGS) -msse2 -o $@ $^

mandel_neon.o : mandel_neon.c
    $(CC) -c $(CFLAGS) -mfpu=neon -o $@ $^

All x86_64 CPUs have SSE2 but I included it anyway for clarity. But it should also enable it for 32-bit x86 builds.

It's absolutely critical that each is done in a separate translation unit. Suppose I compiled like so in one big translation unit,

gcc -msse2 -mavx mandel.c mandel_sse2.c mandel_avx.c

The compiler will likely use some AVX instructions outside of the explicit intrinsics, meaning it's going to crash on machine without AVX ("illegal instruction"). The main program needs to be compiled with AVX disabled. That's where it will test for AVX before executing any special instructions.

Feature Testing

Intrinsics are well-supported across different compilers (surprisingly, even including the late-to-the-party Microsoft). Unfortunately testing for CPU features differs across compilers. Intel advertises a _may_i_use_cpu_feature intrinsic, but it's not supported in either gcc or clang. gcc has a __builtin_cpu_supports built-in, but it's only supported by gcc.

The most portable solution I came up with is cpuid.h (x86 specific). It's supported by at least gcc and clang. The clang version of the header is much better documented, so if you want to read up on how this works, read that one.

#include <cpuid.h>

static inline int
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return ecx & bit_AVX ? 1 : 0;

And in use:

if (use_avx && is_avx_supported())
    mandel_avx(image, &spec);
else if (use_sse2)
    mandel_sse2(image, &spec);
    mandel_basic(image, &spec);

I don't know how to test for NEON, nor do I have the necessary hardware to test it, so on ARM assume it's always available.


Using SIMD intrinsics for the Mandelbrot set was just an exercise to learn how to use them. Unlike in Handmade Hero, where it makes a 1080p 60FPS software renderer feasible, I don't have an immediate, practical use for CPU SIMD, but, like so many similar techniques, I like having it ready in my toolbelt for the next time an opportunity arises.

tags: [ c ]

Minimal OpenGL 3.3 Core Profile Demo

When I was first attempting to learn OpenGL years ago, what I really wanted was a complete, minimal example program. OpenGL has enormous flexibility and I wanted to fully understand the fundamentals in isolation before moving on to more advanced features. I had been advised to specifically learn core profile, which drops nearly all the legacy parts of the API.

However, since much of the OpenGL-related content to be found online, even today, is outdated -- and, worse, it's not marked as such -- good, modern core profile examples have been hard to come by. The relevant examples I could find at the time were more complicated than necessary, due to the common problem that full 3D graphics are too closely conflated with OpenGL. The examples would include matrix libraries, texture loading, etc. This is a big reason I ended up settling on WebGL: a clean slate in a completely different community. (The good news is that this situation has already improved dramatically over the last few years!)

Until recently, all of my OpenGL experience had been WebGL. Wanting to break out of that, earlier this year I set up a minimal OpenGL 3.3 core profile demo in C, using GLFW and gl3w. You can find it here:

No 3D graphics, no matrix library, no textures. It's just a spinning red square.

It supports both Linux and Windows. The Windows' build is static, so it compiles to a single, easily distributable, standalone binary. With some minor tweaking it would probably support the BSDs as well. For simplicity's sake, the shaders are baked right into the source as strings, but if you're extending the demo for your own use, you may want to move them out into their own source files.

Why OpenGL 3.3?

I chose OpenGL 3.3 in particular for three reasons:

As far as "desktop" OpenGL goes, 3.3 is currently the prime target.


Until EGL someday fills this role, the process for obtaining an OpenGL context is specific to each operating system, where it's generally a pain in the butt. GLUT, the OpenGL Utility Toolkit, was a library to make this process uniform across the different platforms. It also normalized user input (keyboard and mouse) and provided some basic (and outdated) utility functions.

The original GLUT isn't quite open source (licensing issues) and it's no longer maintained. The open source replacement for GLUT is FreeGLUT. It's what you'd typically find on a Linux system in place of the original GLUT.

I just need a portable library that creates a window, handles keyboard and mouse events in that window, and gives me an OpenGL 3.3 core profile context. FreeGLUT does this well, but we can do better. One problem is that it includes a whole bunch of legacy cruft from GLUT: immediate mode rendering utilities, menus, spaceball support, lots of global state, and only one OpenGL context per process.

One of the biggest problems is that FreeGLUT doesn't have a swap interval function. This is used to lock the application's redraw rate to the system's screen refresh rate, preventing screen tearing and excessive resource consumption. I originally used FreeGLUT for the demo, and, as a workaround, had added my own macro work around this by finding the system's swap interval function, but it was a total hack.

The demo was initially written with FreeGLUT, but I switched over to GLFW since it's smaller, simpler, cleaner, and more modern. GLFW also has portable joystick handling. With the plethora of modern context+window creation libraries out there, it seems there's not much reason to use FreeGLUT anymore.

SDL 2.0 would also be an excellent choice. It goes beyond GLFW with threading, audio, networking, image loading, and timers: basically all the stuff you'd need when writing a game.

I'm sure there are some other good alternatives, especially when you're not sticking to plain C, but these are the libraries I'm familiar with at the time of this article.

Why gl3w?

If you didn't think the interface between OpenGL and the operating system was messy enough, I have good news for you. Neither the operating system nor the video card drivers are going to provide any of the correct headers, nor will you have anything meaningful to link against! For these, you're on your own.

The OpenGL Extension Wrangler Library (GLEW) was invented solve this problem. It dynamically loads the system's OpenGL libraries and finds all the relevant functions at run time. That way your application avoids linking to anything too specific. At compile time, it provides the headers defining all of the OpenGL functions.

Over the years, GLEW has become outdated, to this day having no support for core profile. So instead I used a replacement called gl3w. It's just like GLEW, but, as the name suggests, oriented around core profile ... exactly what I needed. Unlike GLEW, it is generated directly from Kronos' documentation by a script. In practice, you drop the generated code directly into your project (embedded) rather than rely on the system to provide it as a library.

A great (and probably better) alternative to gl3w is glLoadgen. It's the same idea -- an automatically generated OpenGL loader -- but allows for full customization of the output, such as the inclusion of select OpenGL extensions.


While I hope it serves an educational resources for others, I primarily have it for my own record-keeping, pedagogical, and reference purposes, born out of a weekend's worth of research. It's a starting point for future projects, and it's somewhere easy to start when I want to experiment with an idea.

Plus, someday I want to write a sweet, standalone game with fancy OpenGL graphics.

tags: [ opengl c ]

Raw Linux Threads via System Calls

Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main "thread's" parent process. They're manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It's elegant in the same way file descriptors are elegant.

Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent's execution context with the child.

It's so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I'll demonstrate how to do this on x86_64. All of the code with be written in NASM syntax since, IMHO, it's by far the best (see: nasm-mode).

I've put the complete demo here if you want to see it all at once:

An x86_64 Primer

I want you to be able to follow along even if you aren't familiar with x86_64 assembly, so here's a short primer of the relevant pieces. If you already know x86_64 assembly, feel free to skip to the next section.

x86_64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won't need them for threading.

The "r" prefix indicates that they're 64-bit registers. It won't be relevant in this article, but the same name prefixed with "e" indicates the lower 32-bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32-bits, then to 64-bits. Historically each of of these registers had a specific, unique purpose, but on x86_64 they're almost completely interchangeable.

There's also a "rip" instruction pointer register that conceptually walks along the machine instructions as they're being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.

The Stack

The rsp register points to the "top" of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put "top" in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we'll be allocating our own stacks.

The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86_64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function function, integer/pointer arguments are passed in these registers in this order:

So, for example, to perform a function call like foo(1, 2, 3), store 1, 2 and 3 in rdi, rsi, and rdx, then call the function. The mov instruction stores the source (second) operand in its destination (first) operand. The call instruction pushes the current value of rip onto the stack, then sets rip (jumps) to the address of the target function. When the callee is ready to return, it uses the ret instruction to pop the original rip value off the stack and back into rip, returning control to the callee.

    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    call foo

Called functions must preserve the contents of these registers (the same value must be stored when the function returns):

System Calls

When making a system call, the argument registers are slightly different. Notice rcx has been changed to r10.

Each system call has an integer identifying it. This number is different on each platform, but, in Linux's case, it will never change. Instead of call, rax is set to the number of the desired system call and the syscall instruction makes the request to the OS kernel. Prior to x86_64, this was done with an old-fashioned interrupt. Because interrupts are slow, a special, statically-positioned "vsyscall" page (now deprecated as a security hazard), later vDSO, is provided to allow certain system calls to be made as function calls. We'll only need the syscall instruction in this article.

So, for example, the write() system call has this C prototype.

ssize_t write(int fd, const void *buf, size_t count);

On x86_64, the write() system call is at the top of the system call table as call 1 (read() is 0). Standard output is file descriptor 1 by default (standard input is 0). The following bit of code will write 10 bytes of data from the memory address buffer (a symbol defined elsewhere in the assembly program) to standard output. The number of bytes written, or -1 for error, will be returned in rax.

    mov rdi, 1        ; fd
    mov rsi, buffer
    mov rdx, 10       ; 10 bytes
    mov rax, 1        ; SYS_write

Effective Addresses

There's one last thing you need to know: registers often hold a memory address (i.e. a pointer), and you need a way to read the data behind that address. In NASM syntax, wrap the register in brackets (e.g. [rax]), which, if you're familiar with C, would be the same as dereferencing the pointer.

These bracket expressions, called an effective address, may be limited mathematical expressions to offset that base address entirely within a single instruction. This expression can include another register (index), a power-of-two scalar (bit shift), and an immediate signed offset. For example, [rax + rdx*8 + 12]. If rax is a pointer to a struct, and rdx is an array index to an element in array on that struct, only a single instruction is needed to read that element. NASM is smart enough to allow the assembly programmer to break this mold a little bit with more complex expressions, so long as it can reduce it to the [base + index*2^exp + offset] form.

The details of addressing aren't important this for this article, so don't worry too much about it if that didn't make sense.

Allocating a Stack

Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it's not essential, I won't cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. The leaves the stack, and before we can span a new thread, we need to allocate a stack, which is nothing more than a memory buffer.

The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.

You can't just read from and write to arbitrary addresses in virtual memory, you first have to ask the kernel to allocate pages. There are two system calls this on Linux to do this:

On x86_64, mmap() is system call 9. I'll define a function to allocate a stack with this C prototype.

void *stack_create(void);

The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.

void *mmap(void *addr, size_t length, int prot, int flags);

For flags, we'll choose a private, anonymous mapping that, being a stack, grows downward. Even with that last flag, the system call will still return the bottom address of the mapping, which will be important to remember later. It's just a simple matter of setting the arguments in the registers and making the system call.

%define SYS_mmap    9
%define STACK_SIZE  (4096 * 1024)   ; 4 MB

    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov rax, SYS_mmap

Now we can allocate new stacks (or stack-sized buffers) as needed.

Spawning a Thread

Spawning a thread is so simple that it doesn't even require a branch instruction! It's a call to clone() with two arguments: clone flags and a pointer to the new thread's stack. It's important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we're using, it takes two arguments.

long sys_clone(unsigned long flags, void *child_stack);

Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.

long thread_create(void (*)(void));

The function pointer argument is passed via rdi, per the ABI. Store this for safekeeping on the stack (push) in preparation for calling stack_create(). When it returns, the address of the low end of stack will be in rax.

    push rdi
    call stack_create
    lea rsi, [rax + STACK_SIZE - 8]
    pop qword [rsi]
    mov rax, SYS_clone

The second argument to clone() is a pointer to the high address of the stack (specifically, just above the stack). So we need to add STACK_SIZE to rax to get the high end. This is done with the lea instruction: load effective address. Despite the brackets, it doesn't actually read memory at that address, but instead stores the address in the destination register (rsi). I've moved it back by 8 bytes because I'm going to place the thread function pointer at the "top" of the new stack in the next instruction. You'll see why in a moment.

Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.

As you can see, it takes a lot of flags to create a thread with clone(). Most things aren't shared with the callee by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).

Now here's the really cool part, and the reason branching isn't needed. There's no reason to check rax to determine if we are the original thread (in which case we return to the caller) or if we're the new thread (in which case we jump to the thread function). Remember how we seeded the new stack with the thread function? When the new thread returns (ret), it will jump to the thread function with a completely empty stack. The original thread, using the original stack, will return to the caller.

The value returned by thread_create() is the process ID of the new thread, which is essentially the thread object (e.g. Pthread's pthread_t).

Cleaning Up

The thread function has to be careful not to return (ret) since there's nowhere to return. It will fall off the stack and terminate the program with a segmentation fault. Remember that threads are just processes? It must use the exit() syscall to terminate. This won't terminate the other threads.

%define SYS_exit    60

    mov rax, SYS_exit

Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.

More Exploration

If you found this interesting, be sure to check out the full demo link at the top of this article. Now with the ability to spawn threads, it's a great opportunity to explore and experiment with x86's synchronization primitives, such as the lock instruction prefix, xadd, and compare-and-exchange (cmpxchg). I'll discuss these in a future article.

tags: [ x86 linux c tutorial ]

NASM x86 Assembly Major Mode for Emacs

Last weekend I created a new Emacs mode, nasm-mode, for editing Netwide Assembler (NASM) x86 assembly programs. Over the past week I tweaked it until it felt comfortable enough to share on MELPA. It's got what you'd expect from a standard Emacs programming language mode: syntax highlighting, automatic indentation, and imenu support. It's not a full parser, but it knows all of NASM's instructions and directives.

Until recently I didn't really have preferences about x86 assemblers (GAS, NASM, YASM, FASM, MASM, etc.) or syntax (Intel, AT&T). I stuck to the GNU Assembler (GAS) since it's already there with all the other GNU development tools I know and love, and it's required for inline assembly in GCC. However, nasm-mode now marks my commitment to NASM as my primary x86 assembler.


I need an assembler that can assemble 16-bit code (8086, 8088, 80186, 80286), because real mode is fun. Despite its .code16gcc directive, GAS is not suitable for this purpose. It's just enough to get the CPU into protected mode -- as needed when writing an operating system with GCC -- and that's it. A different assembler is required for serious 16-bit programming.

GAS syntax has problems. I'm not talking about the argument order (source first or destination first), since there's no right answer to that one. The linked article covers a number of problems, with these being the big ones for me:

Being a portable assembler, GAS is the jack of all instruction sets, master of none. If I'm going to write a lot of x86 assembly, I want a tool specialized for the job.


I also looked at YASM, a rewrite of NASM. It supports 16-bit assembly and mostly uses NASM syntax. In my research I found that NASM used to lag behind in features due to slower development, which is what spawned YASM. In recent years this seems to have flipped around, with YASM lagging behind. If you're using YASM, nasm-mode should work pretty well for you, since it's still very similar.

YASM optionally supports GAS syntax, but this reintroduces almost all of GAS's problems. Even YASM's improvements (i.e. its ORG directive) become broken when switching to GAS syntax.


FASM is the "flat assembler," an assembler written in assembly language. This means it's only available on x86 platforms. While I don't really plan on developing x86 assembly on a Raspberry Pi, I'd rather not limit my options! I already regard 16-bit DOS programming as a form of embedded programming, and this may very well extend to the rest of x86 someday.

Also, it hasn't made its way into the various Linux distribution package repositories, including Debian, so it's already at a disadvantage for me.


This is Microsoft's assembler that comes with Visual Studio. Windows only and not open source, this is in no way a serious consideration. But since NASM's syntax was originally derived from MASM, it's worth mentioning. NASM takes the good parts of MASM and fixes the mistakes (such as the offset operator). It's different enough that nasm-mode would not work well with MASM.


It's not perfect, but it's got an excellent manual, it's a solid program that does exactly what it says it will do, has a powerful macro system, great 16-bit support, highly portable, easy to build, and its semantics and syntax has been carefully considered. It also comes with a simple, pure binary disassembler (ndisasm). In retrospect it seems like an obvious choice!

My one complaint would be that it's that it's too flexible about labels. The colon on labels is optional, which can lead to subtle bugs. NASM will warn about this under some conditions (orphan-labels). Combined with the preprocessor, the difference between a macro and a label is ambiguous, short of re-implementing the entire preprocessor in Emacs Lisp.

Why nasm-mode?

Emacs comes with an asm-mode for editing assembly code for various architectures. Unfortunately it's another jack-of-all-trades that's not very good. More so, it doesn't follow Emacs' normal editing conventions, having unusual automatic indentation and self-insertion behaviors. It's what prompted me to make nasm-mode.

To be fair, I don't think it's possible to write a major mode that covers many different instruction set architectures. Each architecture has its own quirks and oddities that essentially makes gives it a unique language. This is especially true with x86, which, from its 37 year tenure touched by so many different vendors, comes in a number of incompatible flavors. Each assembler/architecture pair needs its own major mode. I hope I just wrote NASM's.

One area where I'm still stuck is that I can't find an x86 style guide. It's easy to find half a dozen style guides of varying authority for any programming language that's more than 10 years old ... except x86. There's no obvious answer when it comes to automatic indentation. How are comments formatted and indented? How are instructions aligned? Should labels be on the same line as the instruction? Should labels require a colon? (I've decided this is "yes.") What about long label names? How are function prototypes/signatures documented? (The mode could take advantage of such a standard, a la ElDoc.) It seems everyone uses their own style. This is another conundrum for a generic asm-mode.

There are a couple of other nasm-modes floating around with different levels of completeness. Mine should supersede these, and will be much easier to maintain into the future as NASM evolves.

tags: [ emacs x86 ]

A Basic Just-In-Time Compiler

Monday's /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It's given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it's an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0 then,

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86_64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I'll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren't allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there's less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will causes a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;

To keep things simple here, I'm just assuming the page size is 4kB. In a real program, we'd use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86_64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It's anonymous because it's not backed by a file.

struct asmbuf *
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);

Windows doesn't have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here's the equivalent in Win32.

struct asmbuf *

Anyone reading closely should notice that I haven't actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: "write xor execute." That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

asmbuf_finalize(struct asmbuf *buf)
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);

Or on Win32 (that last parameter is not allowed to be NULL),

asmbuf_finalize(struct asmbuf *buf)
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);

Finally, instead of free() it gets unmapped.

asmbuf_free(struct asmbuf *buf)
    munmap(buf, PAGE_SIZE);

And on Win32,

asmbuf_free(struct asmbuf *buf)
    VirtualFree(buf, 0, MEM_RELEASE);

I won't list the definitions here, but there are two "methods" for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);

Calling Conventions

We're only going to be concerned with three of x86_64's many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we'll be able to call from C like a normal function. Here's what it's prototype will look like. It takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that's where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   %rdi, %rax

There's a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.


So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you go download the x86_64 documentation, look up the instructions we're using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60's.

Fortunately there's a much easier way. We'll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to as (GAS). It will produce a.out, which we'll disassemble with objdump -d.

$ as peek.s
$ objdump -d a.out

a.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   c3                      retq

That's straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   %rdi, %rax
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // retq

For each operation, we'll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator's instruction if it's small (32-bits or fewer), but I'm keeping it simple. To sneakily capture the "template" for this instruction I'm going to use 0x0123456789abcdef as the operand.

mov   $0x0123456789abcdef, %rdi

Which disassembled with objdump -d is,

0:  48 bf ef cd ab 89 67    movabs $0x123456789abcdef,%rdi
7:  45 23 01

Notice the operand listed little endian immediately after the instruction. That's also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   operand, %rdi
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
case '+':
    asmbuf_ins(buf, 3, 0x4801f8);   // add   %rdi, %rax
case '-':
    asmbuf_ins(buf, 3, 0x4829f8);   // sub   %rdi, %rax
case '*':
    asmbuf_ins(buf, 4, 0x480fafc7); // imul  %rdi, %rax
case '/':
    asmbuf_ins(buf, 3, 0x4831d2);   // xor   %rdx, %rdx
    asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  %rdi

As an exercise, try adding support for modulus operator (%), XOR (^), and bit shifts (<, >). With the addition of these operators, you could define a decent PRNG as a recurrence relation. It will also eliminate the closed form solution to this problem so that we actually have a reason to do all this! Or, alternatively, switch it all to floating point.

Calling the Generated Code

Once we're all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That's pretty cool if you ask me! Now this was an extremely simplified situation. There's no branching, no intermediate values, no function calls, and I didn't even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it's a 1:1 translation.

I'd like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that's more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

tags: [ c tutorial netsec x86 ]

Goblin-COM 7DRL 2015

Yesterday I completed my third entry to the annual Seven Day Roguelike (7DRL) challenge (previously: 2013 and 2014). This year's entry is called Goblin-COM.

As with previous years, the ideas behind the game are not all that original. The goal was to be a fantasy version of classic X-COM with an ANSI terminal interface. You are the ruler of a fledgling human nation that is under attack by invading goblins. You hire heroes, operate squads, construct buildings, and manage resource income.

The inspiration this year came from watching BattleBunny play OpenXCOM, an open source clone of the original X-COM. It had its major 1.0 release last year. Like the early days of OpenTTD, it currently depends on the original game assets. But also like OpenTTD, it surpasses the original game in every way, so there's no reason to bother running the original anymore. I've also recently been watching One F Jef play Silent Storm, which is another turn-based squad game with a similar combat simulation.

As in X-COM, the game is broken into two modes of play: the geoscape (strategic) and the battlescape (tactical). Unfortunately I ran out of time and didn't get to the battlescape part, though I'd like to add it in the future. What's left is a sort-of city-builder with some squad management. You can hire heroes and send them out in squads to eliminate goblins, but rather than dropping to the battlescape, battles always auto-resolve in your favor. Despite this, the game still has a story, a win state, and a lose state. I won't say what they are, so you have to play it for yourself!

Terminal Emulator Layer

My previous entries were HTML5 games, but this entry is a plain old standalone application. C has been my preferred language for the past few months, so that's what I used. Both UTF-8-capable ANSI terminals and the Windows console are supported, so it should be perfectly playable on any modern machine. Note, though, that some of the poorer-quality terminal emulators that you'll find in your Linux distribution's repositories (rxvt and its derivatives) are not Unicode-capable, which means they won't work with G-COM.

I didn't make use of ncurses, instead opting to write my own terminal graphics engine. That's because I wanted a single, small binary that was easy to build, and I didn't want to mess around with PDCurses. I've also been studying the Win32 API lately, so writing my own terminal platform layer would rather easy to do anyway.

I experimented with a number of terminal emulators -- LXTerminal, Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator -- but the least capable "terminal" by far is the Windows console, so it was the one to dictate the capabilities of the graphics engine. Some ANSI terminals are capable of 256 colors, bold, underline, and strikethrough fonts, but a highly portable API is basically limited to 16 colors (RGBCMYKW with two levels of intensity) for each of the foreground and background, and no other special text properties.

ANSI terminals also have a concept of a default foreground color and a default background color. Most applications that output color (git, grep, ls) leave the background color alone and are careful to choose neutral foreground colors. G-COM always sets the background color, so that the game looks the same no matter what the default colors are. Also, the Windows console doesn't really have default colors anyway, even if I wanted to use them.

I put in partial support for Unicode because I wanted to use interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported Unicode for a long time now, but since they added it too early, they're locked into the outdated UTF-16. For me this wasn't too bad, because few computers, Linux included, are equipped to render characters outside of the Basic Multilingual Plane anyway, so there's no need to deal with surrogate pairs. This is especially true for the Windows console, which can only render a very small set of characters: another limit on my graphics engine. Internally individual codepoints are handled as uint16_t and strings are handled as UTF-8.

I said partial support because, in addition to the above, it has no support for combining characters, or any other situation where a codepoint takes up something other than one space in the terminal. This requires lookup tables and dealing with pitfalls, but since I get to control exactly which characters were going to be used I didn't need any of that.

In spite of the limitations, I'm really happy with the graphical results. The waves are animated continuously, even while the game is paused, and it looks great. Here's GNOME Terminal's rendering, which I think looked the best by default.

I'll talk about how G-COM actually communicates with the terminal in another article. The interface between the game and the graphics engine is really clean (device.h), so it would be an interesting project to write a back end that renders the game to a regular window, no terminal needed.

Color Directive

I came up with a format directive to help me colorize everything. It runs in addition to the standard printf directives. Here's an example,

panel_printf(&panel, 1, 1, "Really save and quit? (Rk{y}/Rk{n})");

The color is specified by two characters, and the text it applies to is wrapped in curly brackets. There are eight colors to pick from: RGBCMYKW. That covers all the binary values for red, green, and blue. To specify an "intense" (bright) color, capitalize it. That means the Rk{...} above makes the wrapped text bright red.

Nested directives are also supported. (And, yes, that K means "high intense black," a.k.a. dark gray. A w means "low intensity white," a.k.a. light gray.)

panel_printf(p, x, y++, "Kk{♦}    wk{Rk{B}uild}     Kk{♦}");

And it mixes with the normal printf directives:

panel_printf(p, 1, y++, "(Rk{m}) Yk{Mine} [%s]", cost);

Single Binary

The GNU linker has a really nice feature for linking arbitrary binary data into your application. I used this to embed my assets into a single binary so that the user doesn't need to worry about any sort of data directory or anything like that. Here's what the make rule would look like:

$(LD) -r -b binary -o $@ $^

The -r specifies that output should be relocatable -- i.e. it can be fed back into the linker later when linking the final binary. The -b binary says that the input is just an opaque binary file ("plain" text included). The linker will create three symbols for each input file:

When then you can access from your C program like so:

extern const char _binary_filename_txt_start[];

I used this to embed the story texts, and I've used it in the past to embed images and textures. If you were to link zlib, you could easily compress these assets, too. I'm surprised this sort of thing isn't done more often!

Dumb Game Saves

To save time, and because it doesn't really matter, saves are just memory dumps. I took another page from Handmade Hero and allocate everything in a single, contiguous block of memory. With one exception, there are no pointers, so the entire block is relocatable. When references are needed, it's done via integers into the embedded arrays. This allows it to be cleanly reloaded in another process later. As a side effect, it also means there are no dynamic allocations (malloc()) while the game is running. Here's roughly what it looks like.

typedef struct game {
    uint64_t map_seed;
    map_t *map;
    long time;
    float wood, gold, food;
    long population;
    float goblin_spawn_rate;
    invader_t invaders[16];
    squad_t squads[16];
    hero_t heroes[128];
    game_event_t events[16];
} game_t;

The map pointer is that one exception, but that's because it's generated fresh after loading from the map_seed. Saving and loading is trivial (error checking omitted) and very fast.

game_save(game_t *game, FILE *out)
    fwrite(game, sizeof(*game), 1, out);

game_t *
game_load(FILE *in)
    game_t *game = malloc(sizeof(*game));
    fread(game, sizeof(*game), 1, in);
    game->map = map_generate(game->map_seed);
    return game;

The data isn't important enough to bother with rename+fsync durability. I'll risk the data if it makes savescumming that much harder!

The downside to this technique is that saves are generally not portable across architectures (particularly where endianness differs), and may not even portable between different platforms on the same architecture. I only needed to persist a single game state on the same machine, so this wouldn't be a problem.

Final Results

I'm definitely going to be reusing some of this code in future projects. The G-COM terminal graphics layer is nifty, and I already like it better than ncurses, whose API I've always thought was kind of ugly and old-fashioned. I like writing terminal applications.

Just like the last couple of years, the final game is a lot simpler than I had planned at the beginning of the week. Most things take longer to code than I initially expect. I'm still enjoying playing it, which is a really good sign. When I play, I'm having enough fun to deliberately delay the end of the game so that I can sprawl my nation out over the island and generate crazy income.

tags: [ game media ]

Generic C Reference Counting

As a result of making regular use of object-oriented programming in C, I've discovered a useful reference counting technique for the occasional dynamically allocated structs that need it. The situation arises when the same struct instance is shared between an arbitrary number of other data structures and I need to keep track of it all.

It's incredibly simple and lives entirely in a header file, so without further ado (ref.h):

#pragma once

struct ref {
    void (*free)(const struct ref *);
    int count;

static inline void
ref_inc(const struct ref *ref)
    ((struct ref *)ref)->count++;

static inline void
ref_dec(const struct ref *ref)
    if (--((struct ref *)ref)->count == 0)

It has only two fields: the reference count and a "method" that knows how to free the object once the reference count hits 0. Structs using this reference counter will know how to free themselves, so callers will never call a specific *_destroy()/*_free() function. Instead they call ref_dec() to decrement the reference counter and let it happen on its own.

I decided to go with a signed count because it allows for better error checking. It may be worth putting an assert() in ref_inc() and ref_dec() to ensure the count is always non-negative. I chose an int because it's fast, and anything smaller will be padded out to at least that size anyway. On x86_64, struct ref is 16 bytes.

This is basically all there is to a C++ shared_ptr, leveraging C++'s destructors and performing all increment/decrement work automatically.

Thread Safety

Those increments and decrements aren't thread safe, so this won't work as-is when data structures are shared between threads. If you're sure that you're using GCC on a capable platform, you can make use of its atomic builtins, making the reference counter completely thread safe.

static inline void
ref_inc(const struct ref *ref)
    __sync_add_and_fetch((int *)&ref->count, 1);

static inline void
ref_dec(const struct ref *ref)
    if (__sync_sub_and_fetch((int *)&ref->count, 1) == 0)

Or if you're using C11, make use of the new stdatomic.h.

static inline void
ref_inc(const struct ref *ref)
    atomic_fetch_add((int *)&ref->count, 1);

static inline void
ref_dec(const struct ref *ref)
    if (atomic_fetch_sub((int *)&ref->count, 1) == 1)

What's That Const?

There's a very deliberate decision to make all of the function arguments const, for both reference counting functions and the free() method. This may seem wrong because these functions are specifically intended to modify the reference count. There are dangerous-looking casts in each case to remove the const.

The reason for this is that's it's likely for someone holding a const pointer to one of these objects to want to keep their own reference. Their promise not to modify the object doesn't really apply to the reference count, which is merely embedded metadata. They would need to cast the const away before being permitted to call ref_inc() and ref_dec(). Rather than litter the program with dangerous casts, the casts are all kept in one place -- in the reference counting functions -- where they're strictly limited to mutating the reference counting fields.

On a related note, the stdlib.h free() function doesn't take a const pointer, so the free() method taking a const pointer is a slight departure from the norm. Taking a non-const pointer was a mistake in the C standard library. The free() function mutates the pointer itself -- including all other pointers to that object -- making it invalid. Semantically, it doesn't mutate the memory behind the pointer, so it's not actually violating the const. To compare, the Linux kernel kfree() takes a const void *.

Just as users may need to increment and decrement the counters on const objects, they'll also need to be able to free() them, so it's also a const.

Usage Example

So how does one use this generic reference counter? Embed a struct ref in your own structure and use our old friend: the container_of() macro. For anyone who's forgotten, this macro not part of standard C, but you can define it with offsetof().

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Here's a dumb linked list example where each node is individually reference counted. Adding an extra 16 bytes to each of your linked list nodes isn't normally going to help with much, but if the tail of the linked list is being shared between different data structures (such as other lists), reference counting makes things a lot simpler.

struct node {
    char id[64];
    float value;
    struct node *next;
    struct ref refcount;

I put refcount at the end so that we'll have to use container_of() in this example. It conveniently casts away the const for us.

static void
node_free(const struct ref *ref)
    struct node *node = container_of(ref, struct node, refcount);
    struct node *child = node->next;
    if (child)

Notice that it recursively decrements its child's reference count afterwards (intentionally tail recursive). A whole list will clean itself up when the head is freed and no part of the list is shared.

The allocation function sets up the free() function pointer and initializes the count to 1.

struct node *
node_create(char *id, float value)
    struct node *node = malloc(sizeof(*node));
    snprintf(node->id, sizeof(node->id), "%s", id);
    node->value = value;
    node->next = NULL;
    node->refcount = (struct ref){node_free, 1};
    return node;

(Side note: I used snprintf() because strncpy() is broken and strlcpy() is non-standard, so it's the most straightforward way to do this in standard C.);

And to start making some use of the reference counter, here's push and pop.

node_push(struct node **nodes, char *id, float value)
    struct node *node = node_create(id, value);
    node->next = *nodes;
    *nodes = node;

struct node *
node_pop(struct node **nodes)
    struct node *node = *nodes;
    *nodes = (*nodes)->next;
    if (*nodes)
    return node;

Notice node_pop() increments the reference count of the new head node before returning. That's because the node now has an additional reference: from *nodes and from the node that was just popped. It's up to the caller to free the returned node, which would decrement the count of the new head node, but not free it. Alternatively node_pop() could set next on the returned node to NULL rather than increment the counter, which would also prevent the returned node from freeing the new head when it gets freed. But it's probably more useful for the returned node to keep functioning as a list. That's what the reference counting is for, after all.

Finally, a simple program to exercise it all. It reads ID/value pairs from standard input.

node_print(struct node *node)
    for (; node; node = node->next)
        printf("%s = %f\n", node->id, node->value);

int main(void)
    struct node *nodes = NULL;
    char id[64];
    float value;
    while (scanf(" %63s %f", id, &value) == 2)
        node_push(&nodes, id, value);
    if (nodes != NULL) {
        struct node *old = node_pop(&nodes);
        node_push(&nodes, "foobar", 0.0f);
    return 0;

I've used this technique several times over the past few months. It's trivial to remember, so I just code it up from scratch each time I need it.

tags: [ c ]

Interactive Programming in C

I'm a huge fan of interactive programming (see: JavaScript, Java, Lisp, Clojure). That is, modifying and extending a program while it's running. For certain kinds of non-batch applications, it takes much of the tedium out of testing and tweaking during development. Until last week I didn't know how to apply interactive programming to C. How does one go about redefining functions in a running C program?

Last week in Handmade Hero (days 21-25), Casey Muratori added interactive programming to the game engine. This is especially useful in game development, where the developer might want to tweak, say, a boss fight without having to restart the entire game after each tweak. Now that I've seen it done, it seems so obvious. The secret is to build almost the entire application as a shared library.

This puts a serious constraint on the design of the program: it cannot keep any state in global or static variables, though this should be avoided anyway. Global state will be lost each time the shared library is reloaded. In some situations, this can also restrict use of the C standard library, including functions like malloc(), depending on how these functions are implemented or linked. For example, if the C standard library is statically linked, functions with global state may introduce global state into the shared library. It's difficult to know what's safe to use. This works fine in Handmade Hero because the core game, the part loaded as a shared library, makes no use of external libraries, including the standard library.

Additionally, the shared library must be careful with its use of function pointers. The functions being pointed at will no longer exist after a reload. This is a real issue when combining interactive programming with object oriented C.

An example with the Game of Life

To demonstrate how this works, let's go through an example. I wrote a simple ncurses Game of Life demo that's easy to modify. You can get the entire source here if you'd like to play around with it yourself on a Unix-like system.

Quick start:

  1. In a terminal run make then ./main. Press r randomize and q to quit.
  2. Edit game.c to change the Game of Life rules, add colors, etc.
  3. In a second terminal run make. Your changes will be reflected immediately in the original program!

As of this writing, Handmade Hero is being written on Windows, so Casey is using a DLL and the Win32 API, but the same technique can be applied on Linux, or any other Unix-like system, using libdl. That's what I'll be using here.

The program will be broken into two parts: the Game of Life shared library ("game") and a wrapper ("main") whose job is only to load the shared library, reload it when it updates, and call it at a regular interval. The wrapper is agnostic about the operation of the "game" portion, so it could be re-used almost untouched in another project.

To avoid maintaining a whole bunch of function pointer assignments in several places, the API to the "game" is enclosed in a struct. This also eliminates warnings from the C compiler about mixing data and function pointers. The layout and contents of the game_state struct is private to the game itself. The wrapper will only handle a pointer to this struct.

struct game_state;

struct game_api {
    struct game_state *(*init)();
    void (*finalize)(struct game_state *state);
    void (*reload)(struct game_state *state);
    void (*unload)(struct game_state *state);
    bool (*step)(struct game_state *state);

In the demo the API is made of 5 functions. The first 4 are primarily concerned with loading and unloading.

The library will provide a filled out API struct as a global variable, GAME_API. This is the only exported symbol in the entire shared library! All functions will be declared static, including the ones referenced by the structure.

const struct game_api GAME_API = {
    .init     = game_init,
    .finalize = game_finalize,
    .reload   = game_reload,
    .unload   = game_unload,
    .step     = game_step

dlopen, dlsym, and dlclose

The wrapper is focused on calling dlopen(), dlsym(), and dlclose() in the right order at the right time. The game will be compiled to the file, so that's what will be loaded. It's written in the source with a ./ to force the name to be used as a filename. The wrapper keeps track of everything in a game struct.

const char *GAME_LIBRARY = "./";

struct game {
    void *handle;
    ino_t id;
    struct game_api api;
    struct game_state *state;

The handle is the value returned by dlopen(). The id is the inode of the shared library, as returned by stat(). The rest is defined above. Why the inode? We could use a timestamp instead, but that's indirect. What we really care about is if the shared object file is actually a different file than the one that was loaded. The file will never be updated in place, it will be replaced by the compiler/linker, so the timestamp isn't what's important.

Using the inode is a much simpler situation than in Handmade Hero. Due to Windows' broken file locking behavior, the game DLL can't be replaced while it's being used. To work around this limitation, the build system and the loader have to rely on randomly-generated filenames.

void game_load(struct game *game)

The purpose of the game_load() function is to load the game API into a game struct, but only if either it hasn't been loaded yet or if it's been updated. Since it has several independent failure conditions, let's examine it in parts.

struct stat attr;
if ((stat(GAME_LIBRARY, &attr) == 0) && (game->id != attr.st_ino)) {

First, use stat() to determine if the library's inode is different than the one that's already loaded. The id field will be 0 initially, so as long as stat() succeeds, this will load the library the first time.

    if (game->handle) {

If a library is already loaded, unload it first, being sure to call unload() to inform the library that it's being updated. It's critically important that dlclose() happens before dlopen(). On my system, dlopen() looks only at the string it's given, not the file behind it. Even though the file has been replaced on the filesystem, dlopen() will see that the string matches a library already opened and return a pointer to the old library. (Is this a bug?) The handles are reference counted internally by libdl.

    void *handle = dlopen(GAME_LIBRARY, RTLD_NOW);

Finally load the game library. There's a race condition here that cannot be helped due to limitations of dlopen(). The library may have been updated again since the call to stat(). Since we can't ask dlopen() about the inode of the library it opened, we can't know. But as this is only used during development, not in production, it's not a big deal.

    if (handle) {
        game->handle = handle;
        game->id = attr.st_ino;
        /* ... more below ... */
    } else {
        game->handle = NULL;
        game->id = 0;

If dlopen() fails, it will return NULL. In the case of ELF, this will happen if the compiler/linker is still in the process of writing out the shared library. Since the unload was already done, this means no game will be loaded when game_load returns. The user of the struct needs to be prepared for this eventuality. It will need to try loading again later (i.e. a few milliseconds). It may be worth filling the API with stub functions when no library is loaded.

    const struct game_api *api = dlsym(game->handle, "GAME_API");
    if (api != NULL) {
        game->api = *api;
        if (game->state == NULL)
            game->state = game->api.init();
    } else {
        game->handle = NULL;
        game->id = 0;

When the library loads without error, look up the GAME_API struct that was mentioned before and copy it into the local struct. Copying rather than using the pointer avoids one more layer of redirection when making function calls. The game state is initialized if it hasn't been already, and the reload() function is called to inform the game it's just been reloaded.

If looking up the GAME_API fails, close the handle and consider it a failure.

The main loop calls game_load() each time around. And that's it!

int main(void)
    struct game game = {0};
    for (;;) {
        if (game.handle)
            if (!game.api.step(game.state))
    return 0;

Now that I have this technique in by toolbelt, it has me itching to develop a proper, full game in C with OpenGL and all, perhaps in another Ludum Dare. The ability to develop interactively is very appealing.

tags: [ c tutorial ]

How to build DOS COM files with GCC

This past weekend I participated in Ludum Dare #31. Before the theme was even announced, due to recent fascination I wanted to make an old school DOS game. DOSBox would be the target platform since it's the most practical way to run DOS applications anymore, despite modern x86 CPUs still being fully backwards compatible all the way back to the 16-bit 8086.

I successfully created and submitted a DOS game called DOS Defender. It's a 32-bit 80386 real mode DOS COM program. All assets are embedded in the executable and there are no external dependencies, so the entire game is packed into that 10kB binary.

You'll need a joystick/gamepad in order to play. I included mouse support in the Ludum Dare release in order to make it easier to review, but this was removed because it doesn't work well.

The most technically interesting part is that I didn't need any DOS development tools to create this! I only used my every day Linux C compiler (gcc). It's not actually possible to build DOS Defender in DOS. Instead, I'm treating DOS as an embedded platform, which is the only form in which DOS still exists today. Along with DOSBox and DOSEMU, this is a pretty comfortable toolchain.

If all you care about is how to do this yourself, skip to the "Tricking GCC" section, where we'll write a "Hello, World" DOS COM program with Linux's GCC.

Finding the right tools

I didn't have GCC in mind when I started this project. What really triggered all of this was that I had noticed Debian's bcc package, Bruce's C Compiler, that builds 16-bit 8086 binaries. It's kept around for compiling x86 bootloaders and such, but it can also be used to compile DOS COM files, which was the part that interested me.

For some background: the Intel 8086 was a 16-bit microprocessor released in 1978. It had none of the fancy features of today's CPU: no memory protection, no floating point instructions, and only up to 1MB of RAM addressable. All modern x86 desktops and laptops can still pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same limited addressing and all. That's some serious backwards compatibility. This feature is called real mode. It's the mode in which all x86 computers boot. Modern operating systems switch to protected mode as soon as possible, which provides virtual addressing and safe multi-tasking. DOS is not one of these operating systems.

Unfortunately, bcc is not an ANSI C compiler. It supports a subset of K&R C, along with inline x86 assembly. Unlike other 8086 C compilers, it has no notion of "far" or "long" pointers, so inline assembly is required to access other memory segments (VGA, clock, etc.). Side note: the remnants of these 8086 "long pointers" still exists today in the Win32 API: LPSTR, LPWORD, LPDWORD, etc. The inline assembly isn't anywhere near as nice as GCC's inline assembly. The assembly code has to manually load variables from the stack so, since bcc supports two different calling conventions, the assembly ends up being hard-coded to one calling convention or the other.

Given all its limitations, I went looking for alternatives.


DJGPP is the DOS port of GCC. It's a very impressive project, bringing almost all of POSIX to DOS. The DOS ports of many programs are built with DJGPP. In order to achieve this, it only produces 32-bit protected mode programs. If a protected mode program needs to manipulate hardware (i.e. VGA), it must make requests to a DOS Protected Mode Interface (DPMI) service. If I used DJGPP, I couldn't make a single, standalone binary as I had wanted, since I'd need to include a DPMI server. There's also a performance penalty for making DPMI requests.

Getting a DJGPP toolchain working can be difficult, to put it kindly. Fortunately I found a useful project, build-djgpp, that makes it easy, at least on Linux.

Either there's a serious bug or the official DJGPP binaries have become infected again, because in my testing I kept getting the "Not COFF: check for viruses" error message when running my programs in DOSBox. To double check that it's not an infection on my own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as a clean room. It's impossible for this ARM-based device to get infected with an x86 virus. It still had the same problem, and all the binary hashes matched up between the machines, so it's not my fault.

So given the DPMI issue and the above, I moved on.

Tricking GCC

What I finally settled on is a neat hack that involves "tricking" GCC into producing real mode DOS COM files, so long as it can target 80386 (as is usually the case). The 80386 was released in 1985 and was the first 32-bit x86 microprocessor. GCC still targets this instruction set today, even in the x86_64 toolchain. Unfortunately, GCC cannot actually produce 16-bit code, so my main goal of targeting 8086 would not be achievable. This doesn't matter, though, since DOSBox, my intended platform, is an 80386 emulator.

In theory this should even work unchanged with MinGW, but there's a long-standing MinGW bug that prevents it from working right ("cannot perform PE operations on non PE output file"). It's still do-able, and I did it myself, but you'll need to drop the OUTPUT_FORMAT directive and add an extra objcopy step (objcopy -O binary).

Hello World in DOS

To demonstrate how to do all this, let's make a DOS "Hello, World" COM program using GCC on Linux.

There's a significant burden with this technique: there will be no standard library. It's basically like writing an operating system from scratch, except for the few services DOS provides. This means no printf() or anything of the sort. Instead we'll ask DOS to print a string to the terminal. Making a request to DOS means firing an interrupt, which means inline assembly!

DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2F. The big one, and the one we're interested in, is 0x21, function 0x09 (print string). Between DOS and BIOS, there are thousands of functions called this way. I'm not going to try to explain x86 assembly, but in short the function number is stuffed into register ah and interrupt 0x21 is fired. Function 0x09 also takes an argument, the pointer to the string to be printed, which is passed in registers dx and ds.

Here's the GCC inline assembly print() function. Strings passed to this function must be terminated with a $. Why? Because DOS.

static void print(char *string)
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");

The assembly is declared volatile because it has a side effect (printing the string). To GCC, the assembly is an opaque hunk, and the optimizer relies in the output/input/clobber constraints (the last three lines). For DOS programs like this, all inline assembly will have side effects. This is because it's not being written for optimization but to access hardware and DOS, things not accessible to plain C.

Care must also be taken by the caller, because GCC doesn't know that the memory pointed to by string is ever read. It's likely the array that backs the string needs to be declared volatile too. This is all foreshadowing into what's to come: doing anything in this environment is an endless struggle against the optimizer. Not all of these battles can be won.

Now for the main function. The name of this function shouldn't matter, but I'm avoiding calling it main() since MinGW has a funny ideas about mangling this particular symbol, even when it's asked not to.

int dosmain(void)
    print("Hello, World!\n$");
    return 0;

COM files are limited to 65,279 bytes in side. This is because an x86 memory segment is 64kB and COM files are simply loaded by DOS to 0x0100 in the segment and executed. There are no headers, it's just a raw binary. Since a COM program can never be of any significant size, and no real linking needs to occur (freestanding), the entire thing will be compiled as one translation unit. It will be one call to GCC with a bunch of options.

Compiler Options

Here are the essential compiler options.

-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding

Since no standard libraries are in use, the only difference between gnu99 and c99 is that trigraphs are disabled (as they should be) and inline assembly can be written as asm instead of __asm__. It's a no brainer. This project will be so closely tied to GCC that I don't care about using GCC extensions anyway.

I'm using -Os to keep the compiled output as small as possible. It will also make the program run faster. This is important when targeting DOSBox because, by default, it will deliberately run as slow as a machine from the 1980's. I want to be able to fit in that constraint. If the optimizer is causing problems, you may need to temporarily make this -O0 to determine if the problem is your fault or the optimizer's fault.

You see, the optimizer doesn't understand that the program will be running in real mode, and under its addressing constraints. It will perform all sorts of invalid optimizations that break your perfectly valid programs. It's not a GCC bug since we're doing crazy stuff here. I had to rework my code a number of times to stop the optimizer from breaking my program. For example, I had to avoid returning complex structs from functions because they'd sometimes be filled with garbage. The real danger here is that a future version of GCC will be more clever and will break more stuff. In this battle, volatile is your friend.

Th next option is -nostdlib, since there are no valid libraries for us to link against, even statically.

The options -m32 -march=i386 set the compiler to produce 80386 code. If I was writing a bootloader for a modern computer, targeting 80686 would be fine, too, but DOSBox is 80386.

The -ffreestanding argument requires that GCC not emit code that calls built-in standard library helper functions. Sometimes instead of emitting code to do something, it emits code that calls a built-in function to do it, especially with math operators. This was one of the main problems I had with bcc, where this behavior couldn't be disabled. This is most commonly used in writing bootloaders and kernels. And now DOS COM files.

Linker Options

The -Wl option is used to pass arguments to the linker (ld). We need it since we're doing all this in one call to GCC.


The --nmagic turns off page alignment of sections. One, we don't need this. Two, that would waste precious space. In my tests it doesn't appear to be necessary, but I'm including it just in case.

The --script option tells the linker that we want to use a custom linker script. This allows us to precisely lay out the sections (text, data, bss, rodata) of our program. Here's the com.ld script.

    . = 0x0100;
    .text :
    .data :
    _heap = ALIGN(4);

The OUTPUT_FORMAT(binary) says not to put this into an ELF (or PE, etc.) file. The linker should just dump the raw code. A COM file is just raw code, so this means the linker will produce a COM file!

I had said that COM files are loaded to 0x0100. The fourth line offsets the binary to this location. The first byte of the COM file will still be the first byte of code, but it will be designed to run from that offset in memory.

What follows is all the sections, text (program), data (static data), bss (zero-initialized data), rodata (strings). Finally I mark the end of the binary with the symbol _heap. This will come in handy later for writing sbrk(), after we're done with "Hello, World." I've asked for the _heap position to be 4-byte aligned.

We're almost there.

Program Startup

The linker is usually aware of our entry point (main) and sets that up for us. But since we asked for "binary" output, we're on our own. If the print() function is emitted first, our program's execution will begin with executing that function, which is invalid. Our program needs a little header stanza to get things started.

The linker script has a STARTUP option for handling this, but to keep it simple we'll put that right in the program. This is usually called crt0.o or Boot.o, in case those names every come up in your own reading. This inline assembly must be the very first thing in our code, before any includes and such. DOS will so most of the setup for us, we really just have to jump to the entry point.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C, %ah\n"
     "int   $0x21\n");

The .code16gcc tells the assembler that we're going to be running in real mode, so that it makes the proper adjustment. Despite the name, this will not make it produce 16-bit code! First it calls dosmain, the function we wrote above. Then it informs DOS, using function 0x4C (terminate with return code), that we're done, passing the exit code along in the 1-byte register al (already set by dosmain). This inline assembly is automatically volatile because it has no inputs or outputs.

Everything at Once

Here's the entire C program.

asm (".code16gcc\n"
     "call  dosmain\n"
     "mov   $0x4C,%ah\n"
     "int   $0x21\n");

static void print(char *string)
    asm volatile ("mov   $0x09, %%ah\n"
                  "int   $0x21\n"
                  : /* no output */
                  : "d"(string)
                  : "ah");

int dosmain(void)
    print("Hello, World!\n$");
    return 0;

I won't repeat com.ld. Here's the call to GCC.

gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
    -o -Wl,--nmagic,--script=com.ld hello.c

And testing it in DOSBox:

From here if you want fancy graphics, it's just a matter of making an interrupt and writing to VGA memory. If you want sound you can perform an interrupt for the PC speaker. I haven't sorted out how to call Sound Blaster yet. It was from this point that I grew DOS Defender.

Memory Allocation

To cover one more thing, remember that _heap symbol? We can use it to implement sbrk() for dynamic memory allocation within the main program segment. This is real mode, and there's no virtual memory, so we're free to write to any memory we can address at any time. Some of this is reserved (i.e. low and high memory) for hardware. So using sbrk() specifically isn't really necessary, but it's interesting to implement ourselves.

As is normal on x86, your text and segments are at a low address (0x0100 in this case) and the stack is at a high address (around 0xffff in this case). On Unix-like systems, the memory returned by malloc() comes from two places: sbrk() and mmap(). What sbrk() does is allocates memory just above the text/data segments, growing "up" towards the stack. Each call to sbrk() will grow this space (or leave it exactly the same). That memory would then managed by malloc() and friends.

Here's how we can get sbrk() in a COM program. Notice I have to define my own size_t, since we don't have a standard library.

typedef unsigned short  size_t;

extern char _heap;
static char *hbreak = &_heap;

static void *sbrk(size_t size)
    char *ptr = hbreak;
    hbreak += size;
    return ptr;

It just sets a pointer to _heap and grows it as needed. A slightly smarter sbrk() would be careful about alignment as well.

In the making of DOS Defender an interesting thing happened. I was (incorrectly) counting on the memory return by my sbrk() being zeroed. This was the case the first time the game ran. However, DOS doesn't zero this memory between programs. When I would run my game again, it would pick right up where it left off, because the same data structures with the same contents were loaded back into place. A pretty cool accident! It's part of what makes this a fun embedded platform.

tags: [ c debian tutorial game ]

LZSS Quine Puzzle

When I was a kid I spent some time playing a top-down, 2D, puzzle/action, 1993, MS-DOS game called God of Thunder. It came on a shareware CD, now long lost, called Games People Play. A couple decades later I was recently reminded of the game and decided to dig it up and play it again. It's not quite as exciting as I remember it -- nostalgia really warps perception -- but it's still an interesting game nonetheless.

That got me thinking about how difficult it might be to modify ("mod") the game to add my own levels and puzzles. It's a tiny game, so there aren't many assets to reverse engineer. Unpacked, the game just barely fits on a 1.44 MB high density floppy disk. That was probably one of the game's primary design constraints. It also means it's almost certainly employing some sort of home-brew compression algorithm in order to fit more content. I find these sorts of things absolutely interesting and delightful.

You see, back in those old days, compression wasn't really a "solved" problem like it is today. They had to design and implement their own algorithms, with varying degrees of success. Today if you need compression for a project, you just grab zlib. Released in 1995, it implements the most widely used compression algorithm today, DEFLATE, with a tidy, in-memory API. zlib is well-tested, thoroughly optimized, and sits in a nearly-perfect sweet spot between compression ratio and performance. There's even an embeddable version. Since spinning platters are so slow compared to CPUs, compression is likely to speed up an application simply because fewer bytes need to go to and from the disk. Today it's less about saving storage space and more about reducing input/output demands.

Fortunately for me, someone has already reversed engineered most of the God of Thunder assets. It uses its own flavor of Lempel-Ziv-Storer-Szymanski (LZSS), which itself is derived from LZ77, one of the algorithms used in DEFLATE. The original LZSS paper focuses purely on the math, describing the algorithm in terms of symbols with no concern for how it's actually serialized into bits. Those specific details were decided by the game's developers, and that's what I'll be describing below.

As an adult I'm finding the God of Thunder asset formats to be more interesting than the game itself. It's a better puzzle! I really enjoy studying the file formats of various applications, especially older ones that didn't have modern standards to build on. Usually lots of thought and engineering goes into the design these formats -- and, too often, not enough thought goes into it. The format's specifics reveal insights into the internal workings of the application, sometimes exposing unanticipated failure modes. Prying apart odd, proprietary formats (i.e. "data reduction") is probably my favorite kind of work at my day job, and it comes up fairly often.

God of Thunder LZSS Definition

An LZSS compression stream is made up of two kinds of chunks: literals and back references. A literal chunk is passed through to the output buffer unchanged. A reference chunk is a pair of numbers: a length and an offset backwards into the output buffer. Only a single bit is needed for each chunk to identify its type.

To avoid any sort of complicated and slow bit wrangling, the God of Thunder developers (or whoever inspired them) came up with the smart idea to stage 8 of these bits up at once as a single byte, a "control" byte. Since literal chunks are 1 byte and reference chunks are 2 bytes, everything falls onto clean byte boundaries. Every group of 8 chunks is prefixed with one of these control bytes, and so every LZSS compression stream begins with a control byte. The least significant bit controls the 1st chunk in the group and the most significant bit controls the 8th chunk. A 1 denotes a literal and a 0 denotes a reference.

So, for example, a control byte of 0xff means to pass through unchanged the next 8 bytes of the compression stream. This would be the least efficient compression scenario, because the "compressed" stream is 112.5% (9/8) bigger than the uncompressed stream. Gains come entirely from the back references.

A back reference is two bytes little endian (this was in MS-DOS running on x86), the lower 12 bits are the offset and the upper 4 bits are the length, minus 2. That is, you read the 4 length bits and add 2. This is because it doesn't make any sense to reference a length shorter than 2: a literal chunk would be shorter. The offset doesn't have anything added to it. This was a design mistake since an offset of 0 doesn't make any sense. It refers to a byte just outside the output buffer. It should have been stored as the offset minus 1.

A 12-bit offset means up to a 4kB sliding window of output may be referenced at any time. A 4-bit length, plus two, means up to 17 bytes may be copied in a single back reference. Compared to other compression algorithms, this is rather short.

It's important to note that the length is allowed to extend beyond the output buffer (offset < length). The bytes are, in effect, copied one at a time into the output and may potentially be reused within the same operation (like the opposite of memmove). An offset of 1 and a length of 10 means "repeat the last output byte 10 times."

That's the entire format! It's extremely simple but reasonably effective for the game's assets.

Worst Case and Best Case

In the worst case, such as compressing random data, the compression stream will be at most 112.5% (9/8) bigger than the uncompressed stream.

In the best case, such as a long string of zeros, the compressed stream will be, at minimum, 12.5% (1/8) the size of the decompressed stream. Think about it this way: imagine every chunk is a reference of maximum length. That's 1 control byte plus 16 (8 * 2) reference bytes, for a total of 17 compressed bytes. This emits 17 * 8 decompressed bytes, 17 being the maximum length from 8 chunks. Conveniently those two 17s cancel, leaving a factor of 8 for the best case.

LZSS End of Stream

If you're paying really close attention, you may have noticed that by grouping 8 control bits at a time, the length of the input stream is, strictly speaking, constrained to certain lengths. What if, during compression, the input stream stream comes up short of exactly those 8 chunks? As is, there's no way to communicate a premature end to the stream. There are three ways around this using a small amount of metadata, each differing in robustness.

  1. Keep track of the size of the decompressed data. When that many bytes have been emitted, halt. This is how God of Thunder handles it. A small validation check could be performed here. The output stream should always end between chunks, not in the middle of a chunk (i.e. in the middle of copying a back reference). Some of the bits in the control byte may contain arbitrary data that doesn't effect the output, which is a concern when hashing compressed data. My suggestion: require the unused control bits to be 0, which allows for an additional validation check. The output stream should never end just short of a literal chunk.

  2. Keep track of the size of the compressed data. Halt when no more chunks are encountered. A similar, weaker validation check can be performed here: the input stream should never stop between two bytes of a reference. It's weaker because it's less sensitive to corruption, making it harder to detect. The same unused control bit padding situation applies here.

  3. Use an out-of-band end marker (EOF). This is very similar to keeping track of the input size (the filesystem is doing it), but has the weakest validation of all. The stream could be accidentally truncated at any point between chunks, which is undetectable. This makes it the least sensitive to corruption.

An LZSS Quine

After spending some time playing around with this format, I thought about what it would take to make an LZSS quine. That is, find an LZSS compressed stream that decompresses to itself. It's been done for DEFLATE, which I imagine is a much harder problem. There are zip files containing exact copies of themselves, recursively. I'm pretty confident it's never been done for this exact compression format, simply because it's so specific to this old MS-DOS game.

I haven't figured it out yet, so you won't find the solution here. This, dear readers, is my challenge to you! Using the format described above, craft an LZSS quine. LZSS doesn't have no-op chunks (i.e. length = 0), which makes this harder than it would otherwise be. It may not even be possible, which, in that case, your challenge is to prove it!

So far I've determined that it begins with at least 4kB of 0xff. Why is this? First, as I mentioned, all compression streams begin with a control byte. Second, no references can be made until at least one literal byte has been passed, so the first bit (LSB) of the first byte is a 1, and the second byte is exactly the same as the first byte. So the first two bytes are xxxxxx1, with the x being "don't care (yet)."

If the next chunk is a back reference, those first two bytes become xxxxxx01. It could only reference that one byte (so offset = 1), and the length would need to be at least two, ensuring at least the first three bytes of output all have that same pattern. However, on the most significant byte of the reference chunk, this conflicts with having an offset of 1 because the 9th bit of the offset is set to 1, forcing the offset to an invalid 257 bytes. Therefore, the second chunk must be a literal.

This pattern continues until the first eight chunks are all literals, which means the quine begins with at least 9 0xff bytes. Going on, this also means the first back reference is going to be 0xffff (offset = 4095, length = 17), so the sliding window needs to be filled enough to make that a offset valid. References would then be used to "catch up" with the compression stream, then some magic is needed to finish off the stream.

That's where I'm stuck.

tags: [ compression ]