null program

LZSS Quine Puzzle

When I was a kid I spent some time playing a top-down 2D puzzle/action MS-DOS game from 1993 called God of Thunder. It came on a shareware CD, now long lost, called Games People Play. A couple of decades later I was recently reminded of the game and decided to dig it up and play it again. It's not quite as exciting as I remember it -- nostalgia really warps perception -- but it's still an interesting game nonetheless.

That got me thinking about how difficult it might be to modify ("mod") the game to add my own levels and puzzles. It's a tiny game, so there aren't many assets to reverse engineer. Unpacked, the game just barely fits on a 1.44 MB high density floppy disk. That was probably one of the game's primary design constraints. It also means it's almost certainly employing some sort of home-brew compression algorithm in order to fit more content. I find these sorts of things absolutely interesting and delightful.

You see, back in those old days, compression wasn't really a "solved" problem like it is today. They had to design and implement their own algorithms, with varying degrees of success. Today if you need compression for a project, you just grab zlib. Released in 1995, it implements the most widely used compression algorithm today, DEFLATE, with a tidy, in-memory API. zlib is well-tested, thoroughly optimized, and sits in a nearly-perfect sweet spot between compression ratio and performance. There's even an embeddable version. Since spinning platters are so slow compared to CPUs, compression is likely to speed up an application simply because fewer bytes need to go to and from the disk. Today it's less about saving storage space and more about reducing input/output demands.

Fortunately for me, someone has already reverse engineered most of the God of Thunder assets. It uses its own flavor of Lempel-Ziv-Storer-Szymanski (LZSS), which itself is derived from LZ77, one of the algorithms used in DEFLATE. The original LZSS paper focuses purely on the math, describing the algorithm in terms of symbols with no concern for how it's actually serialized into bits. Those specific details were decided by the game's developers, and that's what I'll be describing below.

As an adult I'm finding the God of Thunder asset formats to be more interesting than the game itself. It's a better puzzle! I really enjoy studying the file formats of various applications, especially older ones that didn't have modern standards to build on. Usually lots of thought and engineering go into the design of these formats -- though, too often, not enough. The format's specifics reveal insights into the internal workings of the application, sometimes exposing unanticipated failure modes. Prying apart odd, proprietary formats (i.e. "data reduction") is probably my favorite kind of work at my day job, and it comes up fairly often.

God of Thunder LZSS Definition

An LZSS compression stream is made up of two kinds of chunks: literals and back references. A literal chunk is passed through to the output buffer unchanged. A reference chunk is a pair of numbers: a length and an offset backwards into the output buffer. Only a single bit is needed for each chunk to identify its type.

To avoid any sort of complicated and slow bit wrangling, the God of Thunder developers (or whoever inspired them) came up with the smart idea to stage 8 of these bits up at once as a single byte, a "control" byte. Since literal chunks are 1 byte and reference chunks are 2 bytes, everything falls onto clean byte boundaries. Every group of 8 chunks is prefixed with one of these control bytes, and so every LZSS compression stream begins with a control byte. The least significant bit controls the 1st chunk in the group and the most significant bit controls the 8th chunk. A 1 denotes a literal and a 0 denotes a reference.

So, for example, a control byte of 0xff means to pass through unchanged the next 8 bytes of the compression stream. This would be the least efficient compression scenario, because the "compressed" stream is 12.5% (9/8) larger than the uncompressed stream. Gains come entirely from the back references.

A back reference is two bytes, stored little endian (this was MS-DOS running on x86): the lower 12 bits are the offset and the upper 4 bits are the length, minus 2. That is, you read the 4 length bits and add 2. This is because it doesn't make any sense to reference a length shorter than 2: a literal chunk would be shorter. The offset doesn't have anything added to it. This was a design mistake, since an offset of 0 doesn't make any sense: it refers to a byte just outside the output buffer. It should have been stored as the offset minus 1.

A 12-bit offset means up to a 4kB sliding window of output may be referenced at any time. A 4-bit length, plus two, means up to 17 bytes may be copied in a single back reference. Compared to other compression algorithms, this is rather short.

It's important to note that the length is allowed to extend beyond the output buffer (offset < length). The bytes are, in effect, copied one at a time into the output and may potentially be reused within the same operation (like the opposite of memmove). An offset of 1 and a length of 10 means "repeat the last output byte 10 times."

That's the entire format! It's extremely simple but reasonably effective for the game's assets.
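As a concrete illustration, here's a minimal decompressor sketch in C for the format as described. This is my own reconstruction, not the game's code: it assumes whole in-memory buffers and that the caller already knows the decompressed size.

```c
#include <stddef.h>

/* Decompress a God of Thunder LZSS stream. A sketch, not the game's
 * code: operates on whole buffers, with no validation of malformed
 * input; outlen is the known size of the decompressed data. */
size_t
lzss_decompress(const unsigned char *in, size_t inlen,
                unsigned char *out, size_t outlen)
{
    size_t i = 0, o = 0;
    while (i < inlen && o < outlen) {
        unsigned control = in[i++];  /* 8 chunk-type bits, LSB first */
        for (int bit = 0; bit < 8 && i < inlen && o < outlen; bit++) {
            if (control & (1u << bit)) {
                out[o++] = in[i++];  /* literal: pass through */
            } else {
                /* reference: 16 bits little endian,
                 * low 12 = offset, high 4 = length - 2 */
                unsigned word = in[i] | (unsigned)in[i + 1] << 8;
                i += 2;
                unsigned offset = word & 0x0fff;
                unsigned length = (word >> 12) + 2;
                /* copy one byte at a time so the copy may overlap
                 * itself, as the format requires */
                for (unsigned n = 0; n < length && o < outlen; n++, o++)
                    out[o] = out[o - offset];
            }
        }
    }
    return o;
}
```

For example, the 5-byte stream 03 61 62 02 20 decodes to "ababab": two literal chunks ('a', 'b') followed by one reference with offset 2 and length 4.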

Worst Case and Best Case

In the worst case, such as compressing random data, the compression stream will be at most 12.5% (9/8) bigger than the uncompressed stream.

In the best case, such as a long string of zeros, the compressed stream will be, at minimum, 12.5% (1/8) the size of the decompressed stream. Think about it this way: imagine every chunk is a reference of maximum length. That's 1 control byte plus 16 (8 * 2) reference bytes, for a total of 17 compressed bytes. Those 17 bytes emit 8 * 17 decompressed bytes, since each of the 8 chunks has the maximum length of 17. Conveniently the two 17s cancel, leaving a factor of 8 for the best case.

LZSS End of Stream

If you're paying really close attention, you may have noticed that by grouping 8 control bits at a time, the number of chunks in the stream is, strictly speaking, constrained to multiples of 8. What if, during compression, the input stream comes up short of exactly those 8 chunks? As is, there's no way to communicate a premature end to the stream. There are three ways around this using a small amount of metadata, each differing in robustness.

  1. Keep track of the size of the decompressed data. When that many bytes have been emitted, halt. This is how God of Thunder handles it. A small validation check could be performed here. The output stream should always end between chunks, not in the middle of a chunk (i.e. in the middle of copying a back reference). Some of the bits in the control byte may contain arbitrary data that doesn't affect the output, which is a concern when hashing compressed data. My suggestion: require the unused control bits to be 0, which allows for an additional validation check. The output stream should never end just short of a literal chunk.

  2. Keep track of the size of the compressed data. Halt when no more chunks are encountered. A similar but weaker validation check can be performed here: the input stream should never stop between the two bytes of a reference. It's weaker because it's less sensitive to corruption, making corruption harder to detect. The same unused control bit padding situation applies here.

  3. Use an out-of-band end marker (EOF). This is very similar to keeping track of the input size (the filesystem is doing it), but has the weakest validation of all. The stream could be accidentally truncated at any point between chunks, which is undetectable. This makes it the least robust against corruption.

An LZSS Quine

After spending some time playing around with this format, I thought about what it would take to make an LZSS quine. That is, find an LZSS compressed stream that decompresses to itself. It's been done for DEFLATE, which I imagine is a much harder problem. There are zip files containing exact copies of themselves, recursively. I'm pretty confident it's never been done for this exact compression format, simply because it's so specific to this old MS-DOS game.

I haven't figured it out yet, so you won't find the solution here. This, dear readers, is my challenge to you! Using the format described above, craft an LZSS quine. LZSS doesn't have no-op chunks (i.e. length = 0), which makes this harder than it would otherwise be. It may not even be possible, in which case your challenge is to prove it!

So far I've determined that it begins with at least 4kB of 0xff. Why is this? First, as I mentioned, all compression streams begin with a control byte. Second, no references can be made until at least one literal byte has been passed, so the first bit (LSB) of the first byte is a 1, and the second byte is exactly the same as the first byte. So the first two bytes are xxxxxxx1, with the x bits being "don't care (yet)."

If the next chunk is a back reference, those first two bytes become xxxxxx01. It could only reference that one byte (so offset = 1), and the length would need to be at least two, ensuring at least the first three bytes of output all have that same pattern. However, on the most significant byte of the reference chunk, this conflicts with having an offset of 1 because the 9th bit of the offset is set to 1, forcing the offset to an invalid 257 bytes. Therefore, the second chunk must be a literal.

This pattern continues until the first eight chunks are all literals, which means the quine begins with at least 9 0xff bytes. Going on, this also means the first back reference is going to be 0xffff (offset = 4095, length = 17), so the sliding window needs to be filled enough to make that offset valid. References would then be used to "catch up" with the compression stream, then some magic is needed to finish off the stream.

That's where I'm stuck.

tags: [ compression ]

C Object Oriented Programming

Object oriented programming, polymorphism in particular, is essential to nearly any large, complex software system. Without it, decoupling different system components is difficult. C doesn't come with object oriented capabilities, so large C programs tend to grow their own out of C's primitives. This includes huge C projects like the Linux kernel, BSD kernels, and SQLite.

Starting Simple

Suppose you're writing a function pass_match() that takes an input stream, an output stream, and a pattern. It works sort of like grep. It passes to the output each line of input that matches the pattern. The pattern string contains a shell glob pattern to be handled by POSIX fnmatch(). Here's what the interface looks like.

void pass_match(FILE *in, FILE *out, const char *pattern);

Glob patterns are simple enough that pre-compilation, as would be done for a regular expression, is unnecessary. The bare string is enough.
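For concreteness, a plausible pass_match() under this first interface might look like the following. This is my sketch, not the article's implementation; it assumes lines fit in a fixed buffer and strips the newline before matching.

```c
#include <fnmatch.h>
#include <stdio.h>
#include <string.h>

/* A sketch of pass_match(): emit each input line matching the glob.
 * Assumes lines shorter than the buffer; not the original code. */
void pass_match(FILE *in, FILE *out, const char *pattern)
{
    char line[4096];
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\n")] = '\0';  /* strip trailing newline */
        if (fnmatch(pattern, line, 0) == 0) {
            fputs(line, out);
            fputc('\n', out);
        }
    }
}
```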

Some time later the customer wants the program to support regular expressions in addition to shell-style glob patterns. For efficiency's sake, regular expressions need to be pre-compiled and so will not be passed to the function as a string. It will instead be a POSIX regex_t object. A quick-and-dirty approach might be to accept both and match whichever one isn't NULL.

void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);

Bleh. This is ugly and won't scale well. What happens when more kinds of filters are needed? It would be much better to accept a single object that covers both cases, and possibly even another kind of filter in the future.

A Generalized Filter

One of the most common ways to customize the behavior of a function in C is to pass a function pointer. For example, the final argument to qsort() is a comparator that determines how objects get sorted.

For pass_match(), this function would accept a string and return a boolean value deciding if the string should be passed to the output stream. It gets called once on each line of input.

void pass_match(FILE *in, FILE *out, bool (*match)(const char *));

However, this has one of the same problems as qsort(): the passed function lacks context. It needs a pattern string or regex_t object to operate on. In other languages these would be attached to the function as a closure, but C doesn't have closures. It would need to be smuggled in via a global variable, which is not good.

static regex_t regex;  // BAD!!!

bool regex_match(const char *string)
{
    return regexec(&regex, string, 0, NULL, 0) == 0;
}

Because of the global variable, in practice pass_match() would be neither reentrant nor thread-safe. We could take a lesson from GNU's qsort_r() and accept a context to be passed to the filter function. This simulates a closure.

void pass_match(FILE *in, FILE *out,
                bool (*match)(const char *, void *), void *context);

The provided context pointer would be passed to the filter function as the second argument, and no global variables are needed. This would probably be good enough for most purposes and it's about as simple as possible. The interface to pass_match() would cover any kind of filter.
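As an example of this API in use, a glob filter needs no globals at all; the pattern rides along in the context pointer. (A sketch; match_glob is my name, not from the article.)

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Filter callback for the context-pointer API: the context smuggles
 * in the glob pattern, so no global variable is needed. */
static bool match_glob(const char *string, void *context)
{
    const char *pattern = context;
    return fnmatch(pattern, string, 0) == 0;
}

/* usage: pass_match(stdin, stdout, match_glob, (void *)"*.c"); */
```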

But wouldn't it be nice to package the function and context together as one object?

More Abstraction

How about putting the context on a struct and making an interface out of that? Here's a tagged union that behaves as one or the other.

enum filter_type { GLOB, REGEX };

struct filter {
    enum filter_type type;
    union {
        const char *pattern;
        regex_t regex;
    } context;
};

There's one function for interacting with this struct: filter_match(). It checks the type member and calls the correct function with the correct context.

bool filter_match(struct filter *filter, const char *string)
{
    switch (filter->type) {
    case GLOB:
        return fnmatch(filter->context.pattern, string, 0) == 0;
    case REGEX:
        return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
    }
    abort(); // programmer error
}
And the pass_match() API now looks like this. This will be the final change to pass_match(), both in implementation and interface.

void pass_match(FILE *input, FILE *output, struct filter *filter);

It still doesn't care how the filter works, so it's good enough to cover all future cases. It just calls filter_match() on the pointer it was given. However, the switch and tagged union aren't friendly to extension. Really, it's outright hostile. We finally have some degree of polymorphism, but it's crude. It's like building duct tape into a design. Adding new behavior means adding another switch case. This is a step backwards. We can do better.


With the switch we're no longer taking advantage of function pointers. So what about putting a function pointer on the struct?

struct filter {
    bool (*match)(struct filter *, const char *);
};

The filter itself is passed as the first argument, providing context. In object oriented languages, that's the implicit this argument. To avoid requiring the caller to worry about this detail, we'll hide it in a new switch-free version of filter_match().

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}

Notice we're still lacking the actual context, the pattern string or the regex object. Those will be different structs that embed the filter struct.

struct filter_regex {
    struct filter filter;
    regex_t regex;
};

struct filter_glob {
    struct filter filter;
    const char *pattern;
};

For both, the original filter struct is the first member. This is critical. We're going to be using a trick called type punning. The first member is guaranteed to be positioned at the beginning of the struct, so a pointer to a struct filter_glob is also a pointer to a struct filter. Notice any resemblance to inheritance?

Each type, glob and regex, needs its own match method.

static bool
method_match_regex(struct filter *filter, const char *string)
{
    struct filter_regex *regex = (struct filter_regex *) filter;
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static bool
method_match_glob(struct filter *filter, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) filter;
    return fnmatch(glob->pattern, string, 0) == 0;
}

I've prefixed them with method_ to indicate their intended usage. I declared these static because they're completely private. Other parts of the program will only be accessing them through a function pointer on the struct. This means we need some constructors in order to set up those function pointers. (For simplicity, I'm not error checking.)

struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.match = method_match_regex;
    return &regex->filter;
}

struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = pattern;
    glob->filter.match = method_match_glob;
    return &glob->filter;
}

Now this is real polymorphism. It's really simple from the user's perspective. They call the correct constructor and get a filter object that has the desired behavior. This object can be passed around trivially, and no other part of the program worries about how it's implemented. Best of all, since each method is a separate function rather than a switch case, new kinds of filter subtypes can be defined independently. Users can create their own filter types that work just as well as the two "built-in" filters.
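Putting the pieces together, here's the glob half of the system as one self-contained sketch (condensed from the snippets above; error handling still omitted):

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stdlib.h>

/* The interface: one function pointer provides the behavior. */
struct filter {
    bool (*match)(struct filter *, const char *);
};

bool filter_match(struct filter *filter, const char *string)
{
    return filter->match(filter, string);
}

/* The glob subtype embeds the interface as its first member. */
struct filter_glob {
    struct filter filter;
    const char *pattern;
};

static bool
method_match_glob(struct filter *filter, const char *string)
{
    struct filter_glob *glob = (struct filter_glob *) filter;
    return fnmatch(glob->pattern, string, 0) == 0;
}

/* Constructor wires up the method; the pattern is borrowed. */
struct filter *filter_glob_create(const char *pattern)
{
    struct filter_glob *glob = malloc(sizeof(*glob));
    glob->pattern = pattern;
    glob->filter.match = method_match_glob;
    return &glob->filter;
}
```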

Cleaning Up

Oops, the regex filter needs to be cleaned up when it's done, but the user, by design, won't know how to do it. Let's add a free() method.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
};

void filter_free(struct filter *filter)
{
    filter->free(filter);
}

And the methods for each. These would also be assigned in the constructor.

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = (struct filter_regex *) f;
    regfree(&regex->regex);
    free(regex);
}

static void
method_free_glob(struct filter *f)
{
    free(f);
}

The glob constructor should perhaps strdup() its pattern as a private copy, in which case it would be freed here.

Object Composition

A good rule of thumb is to prefer composition over inheritance. Having tidy filter objects opens up some interesting possibilities for composition. Here's an AND filter that composes two arbitrary filter objects. It only matches when both its subfilters match. It supports short circuiting, so put the faster, or most discriminating, filter first in the constructor (user's responsibility).

struct filter_and {
    struct filter filter;
    struct filter *sub[2];
};

static bool
method_match_and(struct filter *f, const char *s)
{
    struct filter_and *and = (struct filter_and *) f;
    return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}

static void
method_free_and(struct filter *f)
{
    struct filter_and *and = (struct filter_and *) f;
    filter_free(and->sub[0]);
    filter_free(and->sub[1]);
    free(and);
}

struct filter *filter_and(struct filter *a, struct filter *b)
{
    struct filter_and *and = malloc(sizeof(*and));
    and->sub[0] = a;
    and->sub[1] = b;
    and->filter.match = method_match_and;
    and->filter.free = method_free_and;
    return &and->filter;
}

It can combine a regex filter and a glob filter, or two regex filters, or two glob filters, or even other AND filters. It doesn't care what the subfilters are. Also, the free() method here frees its subfilters. This means that the user doesn't need to keep hold of every filter created, just the "top" one in the composition.

To make composition filters easier to use, here are two "constant" filters. These are statically allocated, shared, and are never actually freed.

static bool
method_match_any(struct filter *f, const char *string)
{
    return true;
}

static bool
method_match_none(struct filter *f, const char *string)
{
    return false;
}

static void
method_free_noop(struct filter *f)
{
}

struct filter FILTER_ANY  = { method_match_any,  method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };

The FILTER_NONE filter will generally be used with a (theoretical) filter_or() and FILTER_ANY will generally be used with the previously defined filter_and().

Here's a simple program that composes multiple glob filters into a single filter, one for each program argument.

int main(int argc, char **argv)
{
    struct filter *filter = &FILTER_ANY;
    for (char **p = argv + 1; *p; p++)
        filter = filter_and(filter_glob_create(*p), filter);
    pass_match(stdin, stdout, filter);
    filter_free(filter);
    return 0;
}

Notice only one call to filter_free() is needed to clean up the entire filter.

Multiple Inheritance

As I mentioned before, the filter struct must be the first member of filter subtype structs in order for type punning to work. If we want to "inherit" from two different types like this, they would both need to be in this position: a contradiction.

Fortunately type punning can be generalized such that the first-member constraint isn't necessary. This is commonly done through a container_of() macro. Here's a C99-conforming definition.

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Given a pointer to a member of a struct, the container_of() macro allows us to back out to the containing struct. Suppose the regex struct was defined differently, so that the regex_t member came first.

struct filter_regex {
    regex_t regex;
    struct filter filter;
};

The constructor remains unchanged. The casts in the methods change to the macro.

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);
}


It's a constant, compile-time computed offset, so there should be no practical performance impact. The filter can now participate freely in other intrusive data structures, like linked lists and such. It's analogous to multiple inheritance.
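To see the macro in action outside the filter example, here's a tiny self-contained demonstration with a made-up struct, recovering the container from a pointer to a member that isn't first:

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Illustrative struct; names are made up for this demo. */
struct pair {
    int first;
    int second;
};

/* Given a pointer to the second member, back out to the pair. */
struct pair *pair_from_second(int *second)
{
    return container_of(second, struct pair, second);
}
```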


Say we want to add a third method, clone(), to the filter API, to make an independent copy of a filter, one that will need to be separately freed. It will be like the copy assignment operator in C++. Each kind of filter will need to define an appropriate "method" for it. As long as new methods like this are added at the end, this doesn't break the API, but it does break the ABI regardless.

struct filter {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

The filter object is starting to get big. It's got three pointers -- 24 bytes on modern systems -- and these pointers are the same between all instances of the same type. That's a lot of redundancy. Instead, these pointers could be shared between instances in a common table called a virtual method table, commonly known as a vtable.

Here's a vtable version of the filter API. The overhead is now only one pointer regardless of the number of methods in the interface.

struct filter {
    struct filter_vtable *vtable;
};

struct filter_vtable {
    bool (*match)(struct filter *, const char *);
    void (*free)(struct filter *);
    struct filter *(*clone)(struct filter *);
};

Each type creates its own vtable and links to it in the constructor. Here's the regex filter re-written for the new vtable API and clone method. This is all the tricks in one basket for a big object oriented C finale!

struct filter *filter_regex_create(const char *pattern);

struct filter_regex {
    regex_t regex;
    const char *pattern;
    struct filter filter;
};

static bool
method_match_regex(struct filter *f, const char *string)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}

static void
method_free_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    regfree(&regex->regex);
    free(regex);
}

static struct filter *
method_clone_regex(struct filter *f)
{
    struct filter_regex *regex = container_of(f, struct filter_regex, filter);
    return filter_regex_create(regex->pattern);
}

/* vtable */
struct filter_vtable filter_regex_vtable = {
    method_match_regex, method_free_regex, method_clone_regex
};

/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
    struct filter_regex *regex = malloc(sizeof(*regex));
    regex->pattern = pattern;
    regcomp(&regex->regex, pattern, REG_EXTENDED);
    regex->filter.vtable = &filter_regex_vtable;
    return &regex->filter;
}

This is almost exactly what's going on behind the scenes in C++. When a method/function is declared virtual, and therefore dispatches based on the run-time type of its left-most argument, it's listed in the vtables for classes that implement it. Otherwise it's just a normal function. This is why functions need to be declared virtual ahead of time in C++.

In conclusion, it's relatively easy to get the core benefits of object oriented programming in plain old C. It doesn't require heavy use of macros, nor do users of these systems need to know that underneath it's an object system, unless they want to extend it for themselves.

Here's the whole example program in one file if you're interested in poking at it:

tags: [ c cpp ]

Emacs Autotetris Mode

For more than a decade now, Emacs has come with a built-in Tetris clone, originally written by XEmacs' Glynn Clements. Just run M-x tetris any time you want to play. For anyone too busy to waste time playing Tetris, earlier this year I wrote an autotetris-mode that will play the Emacs game automatically.

Load the source, autotetris-mode.el and M-x autotetris. It will start the built-in Tetris but make all the moves itself. It works best when byte compiled.

At the time I had read an article and was interested in trying my hand at my own Tetris AI. Like most things Emacs, the built-in Tetris game is very hackable. It's also pretty simple and easy to understand. Rather than write my own I chose to build upon this one.


It's not a particularly strong AI. It doesn't pay attention to the next piece in queue, it doesn't know the game's basic shapes, and it doesn't try to maximize the score (clearing multiple rows at once). The goal is simply to keep running for as long as possible. But it plays well enough to reach the point where the game becomes so fast that the AI can no longer move pieces into place in time (it's rate limited like a human player), and that's good enough for me.

When a new piece appears at the top of the screen, the AI, in memory, tries placing it in all possible positions and all possible orientations. For each of these positions it runs a heuristic on the resulting game state, summing five metrics. Each metric is scaled by a hand-tuned weight to adjust its relative priority. Smaller is better, so the position with the lowest score is selected.

Number of Holes

A hole is any open space that has a solid block above it, even if that hole is accessible without passing through a solid block. Count these holes.

Maximum Height

Add the height of the tallest column. Column height includes any holes in the column. The game ends when a column touches the top of the screen (or something like that), so this should be kept in check.

Mean Height

Add the mean height of all columns. The higher this is, the closer we are to losing the game. Since each row will have at least one hole, this will be a similar measure to the hole count.

Height Disparity

Add the difference between the shortest column height and the tallest column height. If this number is large it means we're not making effective use of the playing area. It also discourages the AI from getting into that annoying situation we all remember: when you really need a 4x1 piece that never seems to come. Those are the brief moments when I truly believe the version I'm playing has to be rigged.

Surface Roughness

Take the root mean square of the column heights. A rougher surface leaves fewer options when placing pieces. This measure will be similar to the disparity measurement.

Emacs-specific Details

With a position selected, the AI sends player inputs at a limited rate to the game itself, moving the piece into place. This is done by calling tetris-move-right, tetris-move-left, and tetris-rotate-next, which, in the normal game, are bound to the arrow keys.

The built-in tetris-mode isn't quite designed for this kind of extension, so it needs a little bit of help. I defined two pieces of advice to create hooks. These hooks alert my AI to two specific events in the game: the game start and a fresh, new piece.

(defadvice tetris-new-shape (after autotetris-new-shape-hook activate)
  (run-hooks 'autotetris-new-shape-hook))

(defadvice tetris-start-game (after autotetris-start-game-hook activate)
  (run-hooks 'autotetris-start-game-hook))

I talked before about the problems with global state. Fortunately, tetris-mode doesn't store any game state in global variables. It stores everything in buffer-local variables, which can be exploited for use in the AI. To perform the "in memory" heuristic checks, it creates a copy of the game state and manipulates the copy. The copy is made by way of clone-buffer on the *Tetris* buffer. The tetris-mode functions all work equally well on the clone, so I can use the existing game rules to properly place the next piece in each available position. The game's own rules take care of clearing rows and checking for collisions for me. I wrote an autotetris-save-excursion macro to handle the messy details.

(defmacro autotetris-save-excursion (&rest body)
  "Restore tetris game state after BODY completes."
  (declare (indent defun))
  `(with-current-buffer tetris-buffer-name
     (let ((autotetris-saved (clone-buffer "*Tetris-saved*")))
       (unwind-protect
           (with-current-buffer autotetris-saved
             (kill-local-variable 'kill-buffer-hook)
             ,@body)
         (kill-buffer autotetris-saved)))))

The kill-buffer-hook variable is also cloned, but I don't want tetris-mode to respond to the clone being killed, so I clear out the hook.

That's basically all there is to it! While watching it, it feels like it's making dumb mistakes by not placing pieces in optimal positions, but it recovers well from these situations almost every time, so it must know what it's doing. Currently it's a better player than me, which is my rule-of-thumb for calling an AI successful.

tags: [ emacs interactive ]

Global State: A Tale of Two Bad C APIs

Mutable global variables are evil. You've almost certainly heard that before, but it's worth repeating. It makes programs, and libraries especially, harder to understand, harder to optimize, more fragile, more error prone, and less useful. If you're using global state in a way that's visible to users of your API, and it's not essential to the domain, you're almost certainly doing something wrong.

In this article I'm going to use two well-established C APIs to demonstrate why global state is bad for APIs: BSD regular expressions and POSIX Getopt.

BSD Regular Expressions

The BSD regular expression API dates back to 4.3BSD, released in 1986. It's just a pair of functions: one compiles the regex, the other executes it on a string.

char *re_comp(const char *regex);
int   re_exec(const char *string);

It's immediately obvious that there's hidden internal state. Where else would the resulting compiled regex object be? Also notice there's no re_free(), or similar, for releasing resources held by the compiled result. That's because, due to its limited design, it doesn't hold any. It's entirely in static memory, which means there's some upper limit on the complexity of the regex given to this API. Suppose an implementation does use dynamically allocated memory. It seems this might not matter when only one compiled regex is allowed. However, this would create warnings in Valgrind and make it harder to use for bug testing.

This API is not thread-safe: only one thread can use it at a time. It's not reentrant either: while using a regex, calling any other function that might use a regex means you have to recompile when it returns, just in case. Since the global state is entirely hidden, there's no way to tell whether another part of the program has used it.

Fixing BSD Regular Expressions

This API has been deprecated for some time now, so hopefully no one's using it anymore. 15 years after the BSD regex API came out, POSIX standardized a much better API. It operates on an opaque regex_t object, on which all state is stored. There's no global state.

int    regcomp(regex_t *preg, const char *regex, int cflags);
int    regexec(const regex_t *preg, const char *string, ...);
size_t regerror(int errcode, const regex_t *preg, ...);
void   regfree(regex_t *preg);

This is what a good API looks like.

POSIX Getopt

POSIX defines a C API called Getopt for parsing command line arguments. It's a single function that operates on the argc and argv values provided to main(). An option string specifies which options are valid and whether or not they require an argument. Typical use looks like this,

int main(int argc, char **argv)
{
    int option;
    while ((option = getopt(argc, argv, "ab:c:d")) != -1) {
        switch (option) {
        case 'a':
            /* ... */
        }
    }
    /* ... */
    return 0;
}

The b and c options require an argument, indicated by the colons. When encountered, this argument is passed through a global variable optarg. There are four external global variables in total.

extern char *optarg;
extern int optind, opterr, optopt;

If an invalid option is found, getopt() will automatically print a locale-specific error message and return '?'. The opterr variable can be used to disable this message, and the optopt variable holds the actual invalid option character.

The optind variable keeps track of Getopt's progress. It slides along argv as each option is processed. In a minimal, strictly POSIX-compliant Getopt, this is all the global state required.

The argc value in main(), and therefore the same parameter in getopt(), is completely redundant and serves no real purpose. Just like the C strings it points to, the argv vector is guaranteed to be NULL-terminated. At best it's a premature optimization.

Threading and Reentrancy

The most immediate problem is that the entire program can only parse one argument vector at a time. It's not thread-safe. This leaves out the possibility of parsing argument vectors in other threads. For example, if the program is a server that exposes a shell-like interface to remote users, and multiple threads are used to handle those requests, it won't be able to take advantage of Getopt.

The second problem is that, even in a single-threaded application, the program can't pause parsing one argument vector to parse a different one, then resume. It's not reentrant. For example, suppose one of the arguments to the program is a string containing more arguments to be parsed for some subsystem.

#  -s    Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"

In theory, the value of optind could be saved and restored. However, this isn't portable. POSIX doesn't explicitly declare that the entire state is captured by optind, nor is it required to be. Implementations are allowed to have internal, hidden global state. This has implications for resetting Getopt.

Resetting Getopt

In a minimal, strict Getopt, resetting Getopt for parsing another argument vector is just a matter of setting optind back to its original value of 1. However, this idiom isn't portable, and POSIX provides no portable method for resetting the global parser state.

Real implementations of Getopt go beyond POSIX. Probably the most popular extra feature is option grouping. Typically, multiple options can be grouped into a single argument, so long as only the final option requires an argument.

$ myprogram -adb foo

After processing a, optind cannot be incremented, because it's still working on the first argument. This means there's another internal counter for stepping across the group. In glibc this is called nextchar. Setting optind to 1 will not reset this internal counter, nor would it be detectable by Getopt if it was already set to 1. The glibc way to reset Getopt is to set optind to 0, which is otherwise an invalid value. Some other Getopt implementations follow this idiom, but it's not entirely portable.

Not only does Getopt have nasty global state, the user has no way to reliably control it!

Error Printing

I mentioned that Getopt will automatically print an error message unless disabled with opterr. There's no way to get at this error message, should you want to redirect it somewhere else. It's more hidden, internal state. You could write your own message, but you'd lose out on the automatic locale support.

Fixing Getopt

The way Getopt should have been designed was to accept a context argument and store all state on that context. Following other POSIX APIs (pthreads, regex), the context itself would be an opaque object. In typical use it would have automatic (i.e. stack) duration. The context would either be zero initialized or a function would be provided to initialize it. It might look something like this (in the zero-initialized case).

int getopt(getopt_t *ctx, char **argv, const char *optstring);

Instead of optarg and optopt global variables, these values would be obtained by interrogating the context. The same applies for optind and the diagnostic message.

const char *getopt_optarg(getopt_t *ctx);
int         getopt_optopt(getopt_t *ctx);
int         getopt_optind(getopt_t *ctx);
const char *getopt_opterr(getopt_t *ctx);

Alternatively, instead of getopt_optind() the API could have a function that continues processing, but returns non-option arguments instead of options. It would return NULL when no more arguments are left. This is the API I'd prefer, because it would allow for argument permutation (allow options to come after non-options, per GNU Getopt) without actually modifying the argument vector. This common extension to Getopt could be added cleanly. The real Getopt isn't designed well for extension.

const char *getopt_next_arg(getopt_t *ctx);

This API eliminates the global state and, as a result, solves all of the problems listed above. It's essentially the same API defined by Popt and my own embeddable Optparse. They're much better options if the limitations of POSIX-style Getopt are an issue.

tags: [ c ]

The Billion Pi Challenge

The challenge: As quickly as possible, find all occurrences of a given sequence of digits in the first one billion digits of pi. You don't have to compute pi yourself for this challenge. For example, "141592653" appears 4 times: at positions 1, 427,238,911, 570,434,346, and 678,096,434.

This turned out to be harder than I expected. A straightforward scan with Boyer-Moore-Horspool across the entire text file is already pretty fast: on modern, COTS hardware it takes about 6 seconds. Comparing bytes is cheap, so it's largely an I/O-bound problem, which means building fancy indexes tends to make things slower, not faster, because they demand even more I/O.

The challenge was inspired by The Pi-Search Page, which offers a search on the first 200 million digits. There's also a little write-up about how their pi search works. I wanted to try to invent my own solution. I did eventually come up with something that worked, which can be found here. It's written in plain old C.

You might want to give the challenge a shot on your own before continuing!

SQLite

The first thing I tried was SQLite. I thought an index (B-tree) over fixed-length substrings would be efficient. A LIKE condition with a right-hand wildcard is sargable and would work well with the index. Here's the schema I tried.

CREATE TABLE digits (
    position INTEGER PRIMARY KEY,
    sequence TEXT
);

There will be 1 row for each position, i.e. 1 billion rows. Using INTEGER PRIMARY KEY means position will be used directly for row IDs, saving some database space.

After the data has been inserted by sliding a window along pi, I build an index. It's better to build an index after data is in the database than before.

CREATE INDEX sequence_index ON digits (sequence, position)

This takes several hours to complete. When it's done the database is a whopping 60GB! Remember I said that this is very much an I/O-bound problem? I wasn't kidding. This doesn't work well at all. Here's the search for the example sequence.

SELECT position, sequence FROM digits
WHERE sequence LIKE '141592653%'

You get your answers after about 15 minutes of hammering on the disk.

Sometime later I realized that sequences of up to 18 digits could be encoded into a 64-bit integer, so that TEXT column could be a much simpler INTEGER. Unfortunately this doesn't really improve anything. I also tried this in PostgreSQL but it was even worse. I gave up after 24 hours of waiting on it. These databases are not built for such long, skinny tables, at least not without beefy hardware.

Offset DB

A couple weeks later I had another idea. A query is just a sequence of digits, so it can be trivially converted into a unique number. As before, pick a fixed length for sequences (n) for the index and an appropriate stride. The database would be one big file. To look up a sequence, treat that sequence as an offset into the database and seek into the database file to that offset times the stride. The total size of the database is 10^n * stride.

In this quick and dirty illustration, n=4 and stride=4 (far too small for that n).

For example, if the fixed-length for sequences is 6 and the stride is 4,000 bytes, looking up "141592" is just a matter of seeking to byte 141,592 * 4,000 and reading in positions until some sort of sentinel. The stride must be long enough to store all the positions for any indexed sequence.

For this purpose, the digits of pi are practically random numbers. The good news is that it means a fixed stride will work well. Any particular sequence appears just as often as any other. The chance a specific n-length sequence begins at a specific position is 1 / 10^n. There are 1 billion positions, so a particular sequence will have 1e9 / 10^n positions associated with it, which is a good place to start for picking a stride.

The bad news is that building the index means jumping around the database essentially at random for each write. This will break any sort of cache between the program and the hard drive. It's incredibly slow, even mmap()ed. The workaround is to either do it entirely in RAM (needs at least 6GB of RAM for 1 billion digits!) or to build it up over many passes. I didn't try it on an SSD but maybe the random access is more tolerable there.

Adding an Index

Doing all the work in memory makes it easier to improve the database format anyway. It can be broken into an index section and a tables section. Instead of a fixed stride for the data, front-load the database with a similar index that points to the section (table) of the database file that holds that sequence's pi positions. Each of the 10^n positions gets a single integer in the index at the front of the file. Looking up the positions for a sequence means parsing the sequence as a number, seeking to that offset into the beginning of the database, reading in another offset integer, and then seeking to that new offset. Now the database is compact and there are no concerns about stride.

No sentinel mark is needed either. The tables are concatenated in order in the table part of the database. To determine where to stop, take a peek at the next sequence's start offset in the index. Its table immediately follows, so this doubles as an end offset. For convenience, one final integer in the index will point just beyond the end of the database, so the last sequence (99999...) doesn't require special handling.

Searching Shorter and Longer

If the database is built for fixed-length sequences, how is a sequence of a different length searched? The two cases, shorter and longer, are handled differently.

If the sequence is shorter, fill in the remaining digits, ...000 to ...999, and look up each sequence. For example, if n=6 and we're searching for "1415", get all the positions for "141500", "141501", "141502", ..., "141599" and concatenate them. Fortunately the database already has them stored this way! Look up the offsets for "141500" and "141600" and grab everything in between. The downside is that the pi positions are only partially sorted, so they may require sorting before presenting to the user.

If the sequence is longer, the original digits file will be needed. Get the table for the sequence's fixed-length prefix, then seek into the digits file, checking each of the pi positions for a full match. This requires lots of extra seeking, but a long sequence will naturally have fewer positions to test. For example, if n=7 and we're looking for "141592653", look up the "1415926" table in the database and check each of its 106 positions.

With this database, a search takes only a few milliseconds (though it's very much subject to cache misses). Here's my program in action, from the repository linked above.

$ time ./pipattern 141592653
1: 14159265358979323
427238911: 14159265303126685
570434346: 14159265337906537
678096434: 14159265360713718

real    0m0.004s
user    0m0.000s
sys 0m0.000s

I call that challenge completed!

tags: [ c math ]

C11 Lock-free Stack

C11, the latest C standard revision, hasn't received anywhere near the same amount of fanfare as C++11. I'm not sure why this is. Some of the updates to each language are very similar, such as formal support for threading and atomic object access. Three years have passed and some parts of C11 still haven't been implemented by any compilers or standard libraries yet. Since there's not yet a lot of discussion online about C11, I'm basing much of this article on my own understanding of the C11 draft. I may be under-using the _Atomic type specifier and not paying enough attention to memory ordering constraints.

Still, this is a good opportunity to break new ground with a demonstration of C11. I'm going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you'll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It's in Debian unstable but not in Wheezy.

If you want to take a look before going further, here's the source. The test code in the repository uses plain old pthreads because C11 threads haven't been implemented by anyone yet.

I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won't run just anywhere. I'll discuss all this below. So sorry for the long article. It's just a lot more complex a topic than I had anticipated!


A lock-free data structure doesn't require the use of mutex locks. More generally, it's a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations -- transformations that cannot be interrupted. Lock-free data structures will generally provide better throughput than mutex locks. And it's usually safer, because there's no risk of getting stuck on a lock that will never be freed, such as a deadlock situation. On the other hand there's additional risk of starvation (livelock), where a thread is unable to make progress.

As a demonstration, I'll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it's implemented as a linked list, so pushing and popping are O(1), just a matter of consing a new element onto the head of the list. It also means there's only one value to update when pushing and popping: the pointer to the head of the list.

Here's what the API will look like. I'll define lstack_t shortly. I'm making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.

int     lstack_init(lstack_t *lstack, size_t max_size);
void    lstack_free(lstack_t *lstack);
size_t  lstack_size(lstack_t *lstack);
int     lstack_push(lstack_t *lstack, void *value);
void   *lstack_pop (lstack_t *lstack);

Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.

Notice there's a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc(). An important secondary benefit to pre-allocating the stack's memory is that this implementation doesn't require the use of hazard pointers, which would be far more complicated than the stack itself.

The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack and, before the node can be freed for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being "full."

The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.


Here's the definition for a node in the stack. Neither field needs to be accessed atomically, so they're not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it's effectively immutable (outside of reuse). Users never need to touch this structure.

struct lstack_node {
    void *value;
    struct lstack_node *next;
};

Internally a lstack_t is composed of two stacks: the value stack (head) and the free node stack (free). These will be handled identically by the atomic functions, so it's really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here's what an internal stack looks like.

struct lstack_head {
    uintptr_t aba;
    struct lstack_node *node;
};

There's still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I'll go over it shortly. It's declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 -- it's only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger -- but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that's not true for any reason, this code will not link. Users will never directly access or handle this struct either.

Finally, here's the actual stack structure.

typedef struct {
    struct lstack_node *node_buffer;
    _Atomic struct lstack_head head, free;
    _Atomic size_t size;
} lstack_t;

Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, even just for reading (use atomic_load()).

The size field is for convenience to check the number of elements on the stack. It's accessed separately from the stack nodes themselves, so it's not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there's no way to lock the lock-free stack, there's otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop().

There's no reason to use volatile here. That's a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.

Stack Functions

As stated before, all nodes are initially placed on the internal free stack. During initialization they're allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT, which deals with memory access ordering concerns. The aba counters don't actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.

int lstack_init(lstack_t *lstack, size_t max_size)
{
    lstack->head.aba = ATOMIC_VAR_INIT(0);
    lstack->head.node = ATOMIC_VAR_INIT(NULL);
    lstack->size = ATOMIC_VAR_INIT(0);

    /* Pre-allocate all nodes. */
    lstack->node_buffer = malloc(max_size * sizeof(struct lstack_node));
    if (lstack->node_buffer == NULL)
        return ENOMEM;
    for (size_t i = 0; i < max_size - 1; i++)
        lstack->node_buffer[i].next = lstack->node_buffer + i + 1;
    lstack->node_buffer[max_size - 1].next = NULL;
    lstack->free.aba = ATOMIC_VAR_INIT(0);
    lstack->free.node = ATOMIC_VAR_INIT(lstack->node_buffer);
    return 0;
}

The free nodes will not necessarily be used in the same order that they're placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked list is still necessary even though allocation is contiguous.

The reverse of lstack_init() is simple, and it's assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.

inline void lstack_free(lstack_t *lstack)
{
    free(lstack->node_buffer);
}

To read an atomic value we need to use atomic_load(). Give it a pointer to an atomic value, it dereferences the pointer and returns the value. This is used in another inline function for reading the size of the stack.

inline size_t lstack_size(lstack_t *lstack)
{
    return atomic_load(&lstack->size);
}

Push and Pop

For operating on the two stacks there will be two internal, static functions, push and pop. These deal directly in nodes, accepting and returning them, so they're not suitable to expose in the API (users aren't meant to be aware of nodes). This is the most complex part of lock-free stacks. Here's pop().

static struct lstack_node *pop(_Atomic struct lstack_head *head)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        if (orig.node == NULL)
            return NULL;  // empty stack
        next.aba = orig.aba + 1;
        next.node = orig.node->next;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
    return orig.node;
}

It's centered around the new C11 stdatomic.h function atomic_compare_exchange_weak(). This is an atomic operation more generally called compare-and-swap (CAS). On x86 there's an instruction specifically for this, cmpxchg. Give it a pointer to the atomic value to be updated (head), a pointer to the value it's expected to be (orig), and a desired new value (next). If the expected and actual values match, it's updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.

The "weak" part means it will sometimes spuriously fail where the "strong" version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast and the strong version when it's slow (when trying again is expensive), or if you don't need a loop at all. You usually want to use weak.

The alternative to CAS is load-link/store-conditional. It's a stronger primitive that doesn't suffer from the ABA problem described next, but it's also not available on x86_64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).

The ABA Problem

The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.

Suppose aba didn't exist and it was just a pointer being swapped. Say we have two threads, A and B.

  1. A begins a pop: it loads the current head node, H, and computes next.node as H's successor, N. Before its CAS runs, A is preempted.

  2. B pops H off the stack, pops N as well, then pushes H back on, recycled from the free stack. The head once again points at H, but H's successor is no longer N.

  3. A resumes. Its CAS compares only the pointer, sees the expected H, and succeeds, installing the long-gone N as the new head. The stack is now corrupt.

The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. They'll work with append-only data structures, where pointers are never recycled, but that's it.

The aba field solves the problem because it's incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That's 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and they all have to match for it to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There's a special double-width CAS instruction specifically for this purpose, cmpxchg16b. This is generally called DCAS. It's available on most x86_64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16.

If it's not available at compile time this program won't link: the function that wraps cmpxchg16b won't be there. You can tell GCC to assume it's there with the -mcx16 flag. The same rule applies here to C++11's new std::atomic.

There's still a tiny, tiny possibility of the ABA problem still cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There's nothing we can do about this, but if you witness this in the wild you need to immediately stop what you're doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.

Hazard Pointers and Garbage Collection

Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior -- a segmentation fault, or worse.

There are three ways to deal with this.

  1. Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won't be a problem. However, if we're interacting with a garbage collector we're not really lock-free.

  2. Hazard pointers. Each thread keeps track of what nodes it's currently accessing and other threads aren't allowed to free nodes on this list. This is messy and complicated.

  3. Never free nodes. This implementation recycles nodes, but they're never truly freed until lstack_free(). It's always safe to dereference a node pointer because there's always a node behind it. It may point to a node that's on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.

Reference counting on the node won't work here because we can't get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a DCAS, but we're already using those bytes for aba.


Push is a lot like pop.

static void push(_Atomic struct lstack_head *head, struct lstack_node *node)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
    } while (!atomic_compare_exchange_weak(head, &orig, next));
}

It's counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn't take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.

API Push and Pop

The API push and pop functions are built on these internal atomic functions.

int lstack_push(lstack_t *lstack, void *value)
{
    struct lstack_node *node = pop(&lstack->free);
    if (node == NULL)
        return ENOMEM;
    node->value = value;
    push(&lstack->head, node);
    atomic_fetch_add(&lstack->size, 1);
    return 0;
}

Push removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes it onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there's an instant where the stack size is listed as one shorter than it actually is. However, since there's no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.

Popping is the same thing in reverse.

void *lstack_pop(lstack_t *lstack)
{
    struct lstack_node *node = pop(&lstack->head);
    if (node == NULL)
        return NULL;
    atomic_fetch_sub(&lstack->size, 1);
    void *value = node->value;
    push(&lstack->free, node);
    return value;
}

Remove the top node, subtract the size estimate atomically, put the node on the free list, and return the pointer. It's really simple with the primitive push and pop.

SHA1 Demo

The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same lock-free stack. It's not really exercising the library thoroughly because there are no contended pops, but I couldn't think of a better example at the time.

The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more generally useful than a stack, particularly for common consumer-producer scenarios.

tags: [ c ]

Stabilizing C's Quicksort

The C standard library includes a quicksort function called qsort(). It sorts homogeneous arrays of arbitrary type. The interface is exactly what you'd expect given the constraints of the language.

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));

It takes a pointer to the first element of the array, the number of members, the size of each member, and a comparator function. The comparator has to operate on void * pointers because C doesn't have templates or generics or anything like that. That's two interfaces where type safety is discarded: the arguments passed to qsort() and again when it calls the comparator function.

One of the significant flaws of this interface is the lack of context for the comparator. C doesn't have closures, which in other languages would cover this situation. If the sort function depends on some additional data, such as in Graham scan where points are sorted relative to a selected point, the extra information needs to be smuggled in through a global variable. This is not reentrant and wouldn't be safe in a multi-threaded environment. There's a GNU extension here, qsort_r(), that takes an additional context argument, allowing for reentrant comparators.

Quicksort has some really nice properties. It's in-place, so no temporary memory needs to be allocated. If implemented properly it only consumes O(log n) space, which is the stack growth during recursion. Memory usage is localized, so it plays well with caching.

That being said, qsort() is also a classic example of an API naming mistake. Few implementations actually use straight quicksort. For example, glibc's qsort() is merge sort (in practice), and the other major libc implementations use a hybrid approach. Programs using their language's sort function shouldn't be concerned with how it's implemented. All that matters is the interface and whether or not it's a stable sort. OpenBSD made the exact same mistake when they introduced arc4random(), which no longer uses RC4.

Since quicksort is an unstable sort -- there are multiple possible results when the array contains equivalent elements -- this means qsort() is not guaranteed to be stable, even if internally the C library is using a stable sort like merge sort. The C standard library has no stable sort function.

Comparator Composability

The unfortunate side effect of unstable sorts is that they hurt composability. For example, let's say we have a person struct like this,

struct person {
    const char *first, *last;
    int age;
};

Here are a couple of comparators to sort either by name or by age. As a side note, strcmp() automatically works correctly with UTF-8 so this program isn't limited to old-fashioned ASCII names.

#include <string.h>

int compare_name(const void *a, const void *b) {
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int last = strcmp(pa->last, pb->last);
    return last != 0 ? last : strcmp(pa->first, pb->first);
}

int compare_age(const void *a, const void *b) {
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    return pa->age - pb->age;
}

And since we'll need it later, here's a COUNT_OF macro to get the length of arrays at compile time. There's a less error prone version out there, but I'm keeping it simple.

#define COUNT_OF(x) (sizeof(x) / sizeof(0[x]))

Say we want to sort by age, then by name. When using a stable sort, this is accomplished by sorting on each field separately in reverse order of preference: a composition of individual comparators. Here's an attempt at using quicksort to sort an array of people by age then name.

struct person people[] = {
    {"Joe", "Shmoe", 24},
    {"John", "Doe", 30},
    {"Alan", "Smithee", 42},
    {"Jane", "Doe", 30}
};

qsort(people, COUNT_OF(people), sizeof(struct person), compare_name);
qsort(people, COUNT_OF(people), sizeof(struct person), compare_age);

But this doesn't always work. Jane should come before John, but the original sort was completely lost in the second sort.

Joe Shmoe, 24
John Doe, 30
Jane Doe, 30
Alan Smithee, 42

This could be fixed by defining a new comparator that operates on both fields at once, compare_age_name(), and performing a single sort. But what if later you want to sort by name then age? Now you need compare_name_age(). If a third field was added, there would need to be 6 (3!) different comparator functions to cover all the possibilities. If you had 6 fields, you'd need 720 comparators! Composability has been lost to a combinatorial nightmare.
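For illustration, a sketch of the one-shot compare_age_name() mentioned above (repeating the struct for completeness); every additional field ordering means writing yet another function like it:

```c
#include <string.h>

struct person {
    const char *first, *last;
    int age;
};

/* Single-sort workaround: compare by age, break ties by last then first
 * name. Correct, but each desired field ordering needs its own function. */
int compare_age_name(const void *a, const void *b)
{
    const struct person *pa = a;
    const struct person *pb = b;
    if (pa->age != pb->age)
        return (pa->age > pb->age) - (pa->age < pb->age);
    int last = strcmp(pa->last, pb->last);
    return last != 0 ? last : strcmp(pa->first, pb->first);
}
```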

Pointer Comparison

The GNU libc documentation claims that qsort() can be made stable by using pointer comparison as a fallback. That is, when the relevant fields are equivalent, use their array position to resolve the difference.

If you want the effect of a stable sort, you can get this result by writing the comparison function so that, lacking other reason to distinguish between two elements, it compares them by their addresses.

This is not only false, it's dangerous! Because elements may be sorted in-place, even in glibc, their position will change during the sort. The comparator will be using their current positions, not the starting positions. What makes it dangerous is that the comparator will return different orderings throughout the sort as elements are moved around the array. This could result in an infinite loop, or worse.

Making it Stable

The most direct way to work around the unstable sort is to eliminate any equivalencies. Equivalent elements can be distinguished by adding an intrusive order field which is set after each sort. The comparators will fall back on this field to maintain the original ordering.

struct person {
    const char *first, *last;
    int age;
    size_t order;
};

And the new comparators.

int compare_name_stable(const void *a, const void *b) {
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int last = strcmp(pa->last, pb->last);
    if (last != 0)
        return last;
    int first = strcmp(pa->first, pb->first);
    if (first != 0)
        return first;
    return pa->order - pb->order;
}

int compare_age_stable(const void *a, const void *b) {
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int age = pa->age - pb->age;
    return age != 0 ? age : pa->order - pb->order;
}

The first sort doesn't need to be stable, but there's not much reason to keep around two definitions.

qsort(people, COUNT_OF(people), sizeof(people[0]), compare_name_stable);
for (size_t i = 0; i < COUNT_OF(people); i++)
    people[i].order = i;
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_age_stable);

And the result:

Joe Shmoe, 24
Jane Doe, 30
John Doe, 30
Alan Smithee, 42

Without defining any new comparators I can sort by name then age just by swapping the calls to qsort(). At the cost of an extra bookkeeping field, the number of comparator functions needed as fields are added is O(n) and not O(n!) despite using an unstable sort.

tags: [ c ]

An RC4 Password Hashing Function

There was an interesting /r/DailyProgrammer challenge this week to write a program that properly hashes passwords for storage in an account database. It's a necessary part of any system that needs to securely authenticate users. Storing user passwords in plain text is not just bad practice, it's irresponsible. If the database is compromised, the attacker learns every user's password. Since people are likely to re-use passwords on different websites, the database could be used to infiltrate accounts on other sites.

The solution to this problem is to run the password through a one-way cryptographic hash function before storing it in the database. When the database is compromised, it's more difficult to work backwards to recover the passwords. Examples of one-way hash functions are MD5, SHA-1, and the SHA-2 family. Block ciphers can also be converted into hash functions (e.g. bcrypt from Blowfish), though it must be done carefully since block ciphers were generally designed with different goals in mind.

md5("foobar");
// => "3858f62230ac3c915f300c664312c63f"

sha1("foobar");
// => "8843d7f92416211de9ebb963ff4ce28125932878"

However, these particular functions (SHA-1 and SHA-2) alone are poor password hashing functions: they're much too fast! An offline attacker can mount a rapid brute-force attack on these kinds of hashes. They also don't include a salt, a unique, non-secret, per-hash value used as additional hash function input. Without this, an attacker could prepare the entire attack ahead of time -- a rainbow table. Once the hashes are obtained, reversing them is just a matter of looking them up in the table.

Good password hashing needs to be slow and it needs support for salt. Examples of algorithms well-suited for this purpose are PBKDF2, bcrypt, and scrypt. These are the functions you'd want to use in a real application today. Each of these is also, more generally, a key derivation function: it can strengthen a relatively short, human-memorable passphrase by running it through a long, slow procedure before making use of it as an encryption key. A brute-force attacker would need to perform this slow procedure for each individual guess.

Alternatively, if you're stuck using a fast hash function anyway, it could be slowed down by applying the function thousands or even millions of times recursively to its own output. This is what I did in order to strengthen my GPG passphrase. However, you're still left with the problem of applying the salt. The naive approach would be a plain string concatenation with the password, but this is likely to be vulnerable to a length extension attack. The proper approach would be to use HMAC.
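The stretching loop can be sketched like so. This uses FNV-1a purely as a stand-in hash so the example is self-contained -- FNV is not cryptographic, and a real implementation would iterate SHA-2 or, better, use HMAC as just described. It's also not the RC4 approach developed below.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, standing in for a real cryptographic hash (it is NOT one). */
static uint64_t fnv1a(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Slow the hash down by feeding it back into itself many times. */
static uint64_t stretch(const char *passwd, uint64_t iterations)
{
    uint64_t h = fnv1a(passwd, strlen(passwd));
    for (uint64_t i = 0; i < iterations; i++)
        h = fnv1a(&h, sizeof(h));
    return h;
}
```

An attacker must now pay the full iteration cost for every single guess.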

RC4 Solution

For my solution to the challenge, I wasn't looking for something strong enough to do key derivation. I just needed a slow hash function that properly handles a salt. Another important goal was to keep the solution small enough to post as a reddit comment, and I wanted to do it without using any external crypto libraries. If I'm using a library, I might as well just include/import/require PBKDF2 and make it a 2-liner, but that would be boring. I wanted it to be a reasonably short C program with no external dependencies other than standard libraries. Not counting the porcelain, the final result weighs in at 115 lines of C, so I think I achieved all my goals.

So what's the smallest, modern cryptographic algorithm I'm aware of? That would be RC4, my favorite random generator! Unlike virtually every other cryptographic algorithm, it's easy to commit to memory and to implement from scratch without any reference documentation. Similarly, this password hashing function can be implemented entirely from memory (if you can imagine yourself in some outlandish situation where that would be needed).

Unfortunately, RC4 has had a lot of holes punched in it over the years. The initial output has been proven to be biased, leaking key material, and there's even good reason to believe it may already be broken by nation state actors. Despite this, RC4 remains the most widely used stream cipher today due to its inclusion in TLS. Most importantly here, almost none of RC4's weaknesses apply to this situation -- we're only using a few bytes of output -- so it's still a very strong algorithm. Besides, what I'm developing is a proof of concept, not something to be used in a real application. It would be interesting to see how long it takes for someone to break this (maybe even decades).

Before I dive into the details, I'll link to the source repository. As of this writing there are C and Elisp implementations of the algorithm, and they will properly validate each other's hashes. I call it RC4HASH.

Here are some example hashes for the password "foobar". It's different each time because each has a unique salt. Notice the repeated byte 12 in the 5th byte position of the hash.

$ ./rc4hash -p foobar
$ ./rc4hash -p foobar
$ ./rc4hash -p foobar

Each also validates as correct.

$ ./rc4hash -p foobar -v c56cdbe5...b486c49d
$ ./rc4hash -p foobar -v a1ea954b...03ef3e8d
$ ./rc4hash -p foobar -v 5603de35...bd25f3f4

The Algorithm

RC4 is a stream cipher, which really just means it's a fancy random number generator. How can we turn this into a hash function? The content to be hashed can be fed to the key schedule algorithm in 256-byte chunks, as if it were a key. The key schedule is a cipher initialization stage that shuffles up the cipher state without generating output. To put all this in terms of C, here's what the RC4 struct and initialization looks like: a 256-element byte array initialized with 0-255, and two array indexes.

struct rc4 {
    uint8_t S[256];
    uint8_t i, j;
};

void rc4_init(struct rc4 *k) {
    k->i = k->j = 0;
    for (int i = 0; i < 256; i++) {
        k->S[i] = i;
    }
}

The key schedule shuffles this state according to a given key.

#define SWAP(a, b) if (a ^ b) {a ^= b; b ^= a; a ^= b;}

void rc4_schedule(struct rc4 *k, const uint8_t *key, size_t length) {
    int j = 0;
    for (int i = 0; i < 256; i++) {
        j = (j + k->S[i] + key[i % length]) % 256;
        SWAP(k->S[i], k->S[j]);
    }
}

Notice it doesn't touch the array indexes. It can be called over and over with different key material to keep shuffling the state. This is how I'm going to mix the salt into the password. The key schedule will first be run on the salt, then again on the password.

To produce the hash output, emit the desired number of bytes from the cipher. Ta da! It's now an RC4-based salted hash function.

void rc4_emit(struct rc4 *k, uint8_t *buffer, size_t count) {
    for (size_t b = 0; b < count; b++) {
        k->j += k->S[++k->i];
        SWAP(k->S[k->i], k->S[k->j]);
        buffer[b] = k->S[(k->S[k->i] + k->S[k->j]) & 0xFF];
    }
}

/* Throwaway 64-bit hash example. Assumes strlen(passwd) <= 256. */
uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
    struct rc4 rc4;
    rc4_init(&rc4);
    rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
    rc4_schedule(&rc4, (const uint8_t *)passwd, strlen(passwd));
    uint64_t hash;
    rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
    return hash;
}

Both password and salt are the inputs to the hash function. In order to validate a password against a hash, we need to keep track of the salt. The easiest way to do this is to concatenate it to the hash itself, making it part of the hash. Remember, it's not a secret value, so this is safe. For my solution, I chose to use a 32-bit salt and prepend it to 20 bytes of generator output, just like an initialization vector (IV). To validate a user, all we need is a hash and a password provided by the user attempting to authenticate.
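Putting the pieces together, here's a compact, self-contained sketch of that layout and its validation. The helper names (pwhash, pwverify) are hypothetical, this is the naive version described so far (no padding fix or difficulty factor yet), and a temporary variable replaces the SWAP macro:

```c
#include <stdint.h>
#include <string.h>

struct rc4 { uint8_t S[256], i, j; };

static void rc4_init(struct rc4 *k)
{
    k->i = k->j = 0;
    for (int i = 0; i < 256; i++)
        k->S[i] = i;
}

static void rc4_schedule(struct rc4 *k, const uint8_t *key, size_t len)
{
    for (int i = 0, j = 0; i < 256; i++) {
        j = (j + k->S[i] + key[i % len]) % 256;
        uint8_t t = k->S[i]; k->S[i] = k->S[j]; k->S[j] = t;
    }
}

static void rc4_emit(struct rc4 *k, uint8_t *buf, size_t count)
{
    for (size_t b = 0; b < count; b++) {
        k->j += k->S[++k->i];
        uint8_t t = k->S[k->i]; k->S[k->i] = k->S[k->j]; k->S[k->j] = t;
        buf[b] = k->S[(k->S[k->i] + k->S[k->j]) & 0xFF];
    }
}

#define SALT_LEN 4
#define HASH_LEN (SALT_LEN + 20)   /* salt || 20 bytes of RC4 output */

static void pwhash(const uint8_t salt[SALT_LEN], const char *passwd,
                   uint8_t out[HASH_LEN])
{
    struct rc4 k;
    rc4_init(&k);
    rc4_schedule(&k, salt, SALT_LEN);
    rc4_schedule(&k, (const uint8_t *)passwd, strlen(passwd));
    memcpy(out, salt, SALT_LEN);       /* salt rides along in the clear */
    rc4_emit(&k, out + SALT_LEN, HASH_LEN - SALT_LEN);
}

static int pwverify(const uint8_t hash[HASH_LEN], const char *passwd)
{
    uint8_t check[HASH_LEN];
    pwhash(hash, passwd, check);   /* the hash's first 4 bytes are the salt */
    return memcmp(hash, check, HASH_LEN) == 0;
}
```

Validation is just "recompute with the embedded salt and compare."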

Fixing a Flaw

Right now there's a serious flaw. If you want to find it for yourself, stop reading here. It will need to get fixed before this hash function is any good.

It's trivial to find a collision, which is the death knell for any cryptographic hash function. Certain kinds of passwords will collapse down to the simplest case.

hash("x", "salt", 4);
// => 8622913094354299445
hash("xx", "salt", 4);
// => 8622913094354299445
hash("xxx", "salt", 4);
// => 8622913094354299445
hash("xxxx", "salt", 4);
// => 8622913094354299445

hash("abc", "salt", 4);
// => 8860606953758435703
hash("abcabc", "salt", 4);
// => 8860606953758435703

Notice a pattern? Take a look at the RC4 key schedule function. Using modular arithmetic, the password wraps around, repeating itself over 256 bytes. This means passwords with repeating patterns will mutate the cipher identically regardless of the number of repeats, so they result in the same hash. A password "abcabcabc" will be accepted as "abc".

The fix is to avoid wrapping the password. Instead, the RC4 generator, seeded only by the salt, is used to pad the password out to 256 bytes without repeating.

uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
    struct rc4 rc4;
    rc4_init(&rc4);
    rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
    uint8_t padded[256];
    memcpy(padded, passwd, strlen(passwd));
    rc4_emit(&rc4, padded + strlen(passwd), 256 - strlen(passwd));
    rc4_schedule(&rc4, padded, sizeof(padded));
    uint64_t hash;
    rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
    return hash;
}

This should also help mix the RC4 state up a bit more before generating the output. I'm no cryptanalyst, though, so I don't know if it's worth much.

Slowing It Down

The next problem is that this is way too fast! It shuffles bytes around for a few microseconds and it's done. So far it also doesn't address the problems of biases in RC4's initial output. We'll kill two birds with one stone for this one.

To fix this we'll add an adaptive difficulty factor. It will be a value that determines how much work will be done to compute the hash. It's adaptive because the system administrator can adjust it at any time without affecting previous hashes. To accomplish this, like the salt, the difficulty factor will be appended to the hash output. All the required information will come packaged together in the hash.

The difficulty factor comes into play in two areas. First, it determines how many times the key schedule is run. This is the same modification CipherSaber-2 uses in order to strengthen RC4's weak key schedule. However, rather than run it on the order of 20 times, our hash function will be running it hundreds of thousands of times. Second, the difficulty will also determine how many initial bytes of output are discarded before we start generating the hash.

I decided on an unsigned 8-bit value for the difficulty. The number of key schedules is 1 shifted left by the difficulty (i.e. pow(2, difficulty)). This makes the minimum number of key schedules 1, since any less doesn't make sense. The number of bytes skipped is the same bitshift, minus 1, times 64 ((pow(2, difficulty) - 1) * 64); the multiplication is so that it can skip large swaths of output. Therefore the implementation so far has a difficulty of zero: one key schedule round and zero bytes of output skipped.
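That arithmetic can be sketched directly. These hypothetical helpers only hold for difficulties below 64, since the full 8-bit range overflows any fixed-width integer; a real implementation would loop rather than materialize the counts:

```c
#include <stdint.h>

/* Number of key schedule rounds: 1 << difficulty.
 * Valid here only for difficulty < 64. */
static uint64_t schedule_rounds(int difficulty)
{
    return UINT64_C(1) << difficulty;
}

/* Bytes of initial output discarded: ((1 << difficulty) - 1) * 64. */
static uint64_t skip_bytes(int difficulty)
{
    return ((UINT64_C(1) << difficulty) - 1) * 64;
}
```

At the default difficulty of 18, that's 262,144 key schedules and about 16 MB of discarded output.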

The dynamic range of the difficulty factor (0-255) puts the time needed on a modern computer to compute an RC4HASH anywhere from a few microseconds (0) to billions of years (255). That should be more than sufficient future proofing, especially considering that we're using RC4, which will likely be broken before the difficulty factor ever tops out.

I won't show the code to do this since that's how it's implemented in the final version, so go look at the repository instead. The final hash is 26 bytes long: a 208-bit hash. The first 4 bytes are the salt (grabbed from /dev/urandom in my implementations), the next byte is the difficulty factor, and the final 21 bytes are RC4 output.

In the example hashes above, the 12 constant byte is the difficulty factor. The default difficulty factor is 18 (0x12). I've considered XORing this with some salt-seeded RC4 output just to make the hash look nice, but that just seems like arbitrary complexity for no real gains. With the default difficulty, it takes almost a second for my computers to compute the hash.

I believe RC4HASH should be quite resistant to GPGPU attacks. RC4 is software oriented, involving many random array reads and writes rather than SIMD-style operations. GPUs are really poor at this sort of thing, so they should take a significant performance hit when running RC4HASH.

Break My Hash!

For those interested in breaking RC4HASH, here are a couple of hashes of English language passphrases. Each is about one short sentence in length ([a-zA-Z !.,]+). I'm not keeping track of the sentences I used, so the only way to get them will be to break the hash, even for me.


If you can find a string that validates with these hashes, especially if it's not the original passphrase, you win! I don't have any prizes in mind right now, but perhaps some Bitcoin would be in order if your attack is interesting.

tags: [ c crypto ]

A GPU Approach to Particle Physics

The next project in my GPGPU series is a particle physics engine that computes the entire physics simulation on the GPU. Particles are influenced by gravity and will bounce off scene geometry. This WebGL demo uses a shader feature not strictly required by the OpenGL ES 2.0 specification, so it may not work on some platforms, especially mobile devices. It will be discussed later in the article.

It's interactive. The mouse cursor is a circular obstacle that the particles bounce off of, and clicking will place a permanent obstacle in the simulation. You can paint and draw structures through which the particles will flow.

Here's an HTML5 video of the demo in action, which, out of necessity, is recorded at 60 frames-per-second and a high bitrate, so it's pretty big. Video codecs don't handle all these full-screen particles very well, and lower framerates really don't capture the effect properly. I also added some appropriate sound that you won't hear in the actual demo.

On a modern GPU, it can simulate and draw over 4 million particles at 60 frames per second. Keep in mind that this is a JavaScript application, I haven't really spent time optimizing the shaders, and it's living within the constraints of WebGL rather than something more suitable for general computation, like OpenCL or at least desktop OpenGL.

Encoding Particle State as Color

Just as with the Game of Life and path finding projects, simulation state is stored in pairs of textures and the majority of the work is done by a fragment shader mapped between them pixel-to-pixel. I won't repeat myself with the details of setting this up, so refer to the Game of Life article if you need to see how it works.

For this simulation, there are four of these textures instead of two: a pair of position textures and a pair of velocity textures. Why not just a single pair? A texture has 4 channels, so each of these components (x, y, dx, dy) could be packed into its own color channel. That seems like the simplest solution.

The problem with this scheme is the lack of precision. With the R8G8B8A8 internal texture format, each channel is one byte. That's 256 total possible values. The display area is 800 by 600 pixels, so not even every position on the display would be possible. Fortunately, two bytes, for a total of 65,536 values, is plenty for our purposes.

The next problem is how to encode values across these two channels. It needs to cover negative values (negative velocity) and it should try to take full advantage of dynamic range, i.e. try to spread usage across all of those 65,536 values.

To encode a value, multiply the value by a scalar to stretch it over the encoding's dynamic range. The scalar is selected so that the required highest values (the dimensions of the display) are the highest values of the encoding.

Next, add half the dynamic range to the scaled value. This converts all negative values into positive values with 0 representing the lowest value. This representation is called Excess-K. The downside to this is that clearing the texture (glClearColor) with transparent black no longer sets the decoded values to 0.

Finally, treat each channel as a digit of a base-256 number. The OpenGL ES 2.0 shader language has no bitwise operators, so this is done with plain old division and modulus. I made an encoder and decoder in both JavaScript and GLSL. JavaScript needs it to write the initial values and, for debugging purposes, so that it can read back particle positions.

vec2 encode(float value) {
    value = value * scale + OFFSET;
    float x = mod(value, BASE);
    float y = floor(value / BASE);
    return vec2(x, y) / BASE;
}

float decode(vec2 channels) {
    return (dot(channels, vec2(BASE, BASE * BASE)) - OFFSET) / scale;
}

And JavaScript. Unlike normalized GLSL values above (0.0-1.0), this produces one-byte integers (0-255) for packing into typed arrays.

function encode(value, scale) {
    var b = Particles.BASE;
    value = value * scale + b * b / 2;
    var pair = [
        Math.floor((value % b) / b * 255),
        Math.floor(Math.floor(value / b) / b * 255)
    ];
    return pair;
}

function decode(pair, scale) {
    var b = Particles.BASE;
    return (((pair[0] / 255) * b +
             (pair[1] / 255) * b * b) - b * b / 2) / scale;
}

The fragment shader that updates each particle samples the position and velocity textures at that particle's "index", decodes their values, operates on them, then encodes them back into a color for writing to the output texture. Since I'm using WebGL, which lacks multiple rendering targets (despite having support for gl_FragData), the fragment shader can only output one color. Position is updated in one pass and velocity in another as two separate draws. The buffers are not swapped until after both passes are done, so the velocity shader (intentionally) doesn't use the updated position values.

There's a limit to the maximum texture size, typically 8,192 or 4,096, so rather than lay the particles out in a one-dimensional texture, the texture is kept square. Particles are indexed by two-dimensional coordinates.
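As a sketch of that layout (hypothetical helpers, in C rather than the demo's JavaScript), the texture's side length and a particle's two-dimensional index:

```c
/* Side length of the smallest square texture that holds n particles. */
static int state_size(int n)
{
    int w = 1;
    while (w * w < n)
        w++;
    return w;
}

/* Two-dimensional index of particle i in a w-by-w state texture. */
static void particle_index(int i, int w, int *x, int *y)
{
    *x = i % w;
    *y = i / w;
}
```

Four million particles fit in a 2,000 by 2,000 texture, comfortably under the size limit.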

It's pretty interesting to see the position or velocity textures drawn directly to the screen rather than the normal display. It's another domain through which to view the simulation, and it even helped me identify some issues that were otherwise hard to see. The output is a shimmering array of color, but with definite patterns, revealing a lot about the entropy (or lack thereof) of the system. I'd share a video of it, but it would be even more impractical to encode than the normal display. Here are screenshots instead: position, then velocity. The alpha component is not captured here.

Entropy Conservation

One of the biggest challenges with running a simulation like this on a GPU is the lack of random values. There's no rand() function in the shader language, so the whole thing is deterministic by default. All entropy comes from the initial texture state filled by the CPU. When particles clump up and match state, perhaps from flowing together over an obstacle, it can be difficult to work them back apart since the simulation handles them identically.

To mitigate this problem, the first rule is to conserve entropy whenever possible. When a particle falls out of the bottom of the display, it's "reset" by moving it back to the top. If this is done by setting the particle's Y value to 0, then information is destroyed. This must be avoided! Particles below the bottom edge of the display tend to have slightly different Y values, despite exiting during the same iteration. Instead of resetting to 0, a constant value is added: the height of the display. The Y values remain different, so these particles are more likely to follow different routes when bumping into obstacles.
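The difference between the two reset strategies can be sketched in C (the demo itself does this in GLSL):

```c
/* Two ways to reset a particle that fell below the display of height h.
 * Snapping to the top edge destroys information; translating by the
 * display height keeps distinct particles distinct. */
static float reset_destructive(float y, float h)
{
    (void)y; (void)h;
    return 0.0f;    /* every reset particle lands at exactly 0 */
}

static float reset_conserving(float y, float h)
{
    return y + h;   /* distinct y values stay distinct */
}
```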

The next technique I used is to supply a single fresh random value via a uniform for each iteration. This value is added to the position and velocity of reset particles. The same value is used for all particles for that particular iteration, so this doesn't help with overlapping particles, but it does help to break apart "streams": clearly-visible lines of particles all following the same path. Each exits the bottom of the display on a different iteration, so the random value separates them slightly. Ultimately this stirs a few bits of fresh entropy into the simulation on each iteration.

Alternatively, a texture containing random values could be supplied to the shader. The CPU would have to frequently fill and upload the texture, plus there's the issue of choosing where to sample the texture, itself requiring a random value.

Finally, to deal with particles that have exactly overlapped, the particle's unique two-dimensional index is scaled and added to the position and velocity when resetting, teasing them apart. The random value's sign is multiplied by the index to avoid bias in any particular direction.

To see all this in action in the demo, make a big bowl to capture all the particles, getting them to flow into a single point. This removes all entropy from the system. Now clear the obstacles. They'll all fall down in a single, tight clump. It will still be somewhat clumped when resetting at the top, but you'll see them spraying apart a little bit (particle indexes being added). These will exit the bottom at slightly different times, so the random value plays its part to work them apart even more. After a few rounds, the particles should be pretty evenly spread again.

The last source of entropy is your mouse. When you move it through the scene you disturb particles and introduce some noise to the simulation.

Textures as Vertex Attribute Buffers

This project idea occurred to me while reading the OpenGL ES shader language specification (PDF). I'd been wanting to do a particle system, but I was stuck on the problem of how to draw the particles. The texture data representing positions needs to somehow be fed back into the pipeline as vertices. Normally a buffer texture -- a texture backed by an array buffer -- or a pixel buffer object -- asynchronous texture data copying -- might be used for this, but WebGL has none of these features. Pulling texture data off the GPU and putting it all back on as an array buffer on each frame is out of the question.

However, I came up with a cool technique that's better than both those anyway. The shader function texture2D is used to sample a pixel in a texture. Normally this is used by the fragment shader as part of the process of computing a color for a pixel. But the shader language specification mentions that texture2D is available in vertex shaders, too. That's when it hit me. The vertex shader itself can perform the conversion from texture to vertices.

It works by passing the previously-mentioned two-dimensional particle indexes as the vertex attributes, using them to look up particle positions from within the vertex shader. The shader would run in GL_POINTS mode, emitting point sprites. Here's the abridged version,

attribute vec2 index;

uniform sampler2D positions;
uniform vec2 statesize;
uniform vec2 worldsize;
uniform float size;

// float decode(vec2) { ...

void main() {
    vec4 psample = texture2D(positions, index / statesize);
    vec2 p = vec2(decode(psample.rg), decode(psample.ba));
    gl_Position = vec4(p / worldsize * 2.0 - 1.0, 0, 1);
    gl_PointSize = size;
}

The real version also samples the velocity since it modulates the color (slow moving particles are lighter than fast moving particles).

However, there's a catch: implementations are allowed to limit the number of vertex shader texture bindings to 0 (GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS). So technically vertex shaders must always support texture2D, but they're not required to support actually having textures. It's sort of like food service on an airplane that doesn't carry passengers. These platforms don't support this technique. So far I've only had this problem on some mobile devices.

Outside of the lack of support by some platforms, this allows every part of the simulation to stay on the GPU and paves the way for a pure GPU particle system.


An important observation is that particles do not interact with each other. This is not an n-body simulation. They do, however, interact with the rest of the world: they bounce intuitively off those static circles. This environment is represented by another texture, one that's not updated during normal iteration. I call this the obstacle texture.

The colors on the obstacle texture are surface normals. That is, each pixel has a direction to it, a flow directing particles in some direction. Empty space has a special normal value of (0, 0). This is not normalized (doesn't have a length of 1), so it's an out-of-band value that has no effect on particles.

(I didn't realize until I was done how much this looks like the Greendale Community College flag.)

A particle checks for a collision simply by sampling the obstacle texture. If it finds a normal at its location, it changes its velocity using the shader function reflect. This function is normally used for reflecting light in a 3D scene, but it works equally well for slow-moving particles. The effect is that particles bounce off the circle in a natural way.
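GLSL's reflect(v, n) computes v - 2*dot(v, n)*n for a unit-length normal n. A C translation of the bounce, for illustration:

```c
/* C translation of GLSL reflect(): r = v - 2*dot(v, n)*n,
 * where n is a unit-length surface normal. */
struct vec2 { float x, y; };

static struct vec2 reflect2(struct vec2 v, struct vec2 n)
{
    float d = v.x * n.x + v.y * n.y;
    struct vec2 r = { v.x - 2 * d * n.x, v.y - 2 * d * n.y };
    return r;
}
```

A particle falling with velocity (1, -1) onto a floor with normal (0, 1) bounces back up with velocity (1, 1).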

Sometimes particles end up on/in an obstacle with a low or zero velocity. To dislodge these they're given a little nudge in the direction of the normal, pushing them away from the obstacle. You'll see this on slopes where slow particles jiggle their way down to freedom like jumping beans.

To make the obstacle texture user-friendly, the actual geometry is maintained on the CPU side of things in JavaScript. It keeps a list of these circles and, on updates, redraws the obstacle texture from this list. This happens, for example, every time you move your mouse on the screen, providing a moving obstacle. The texture provides shader-friendly access to the geometry. Two representations for two purposes.

When I started writing this part of the program, I envisioned that shapes other than circles could be placed, too. For example, solid rectangles: the normals would look something like this.

So far these are unimplemented.

Future Ideas

I didn't try it yet, but I wonder if particles could interact with each other by also drawing themselves onto the obstacle texture. Two nearby particles would bounce off each other. Perhaps the entire liquid demo could run on the GPU like this. If I'm imagining it correctly, particles would gain volume and obstacles forming bowl shapes would fill up rather than concentrate particles into a single point.

I think there's still some more to explore with this project.

tags: [ webgl media interactive gpgpu javascript ]

A GPU Approach to Path Finding

Last time I demonstrated how to run Conway's Game of Life entirely on a graphics card. This concept can be generalized to any cellular automaton, including automata with more than two states. In this article I'm going to exploit this to solve the shortest path problem for two-dimensional grids entirely on a GPU. It will be just as fast as traditional searches on a CPU.

The JavaScript side of things is essentially the same as before -- two textures with a fragment shader in between that steps the automaton forward -- so I won't be repeating myself. The only parts that have changed are the cell state encoding (to express all automaton states) and the fragment shader (to encode the new rules).

Included is a pure JavaScript implementation of the cellular automaton (State.js) that I used for debugging and experimentation, but it doesn't actually get used in the demo. A fragment shader (12state.frag) encodes the full automaton rules for the GPU.

Maze-solving Cellular Automaton

There's a dead simple 2-state cellular automaton that can solve any perfect maze of arbitrary dimension. Each cell is either OPEN or a WALL, only 4-connected neighbors are considered, and there's only one rule: if an OPEN cell has only one OPEN neighbor, it becomes a WALL.
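The rule is short enough to state in code. Here's a minimal CPU-side sketch (not the shader itself), representing OPEN as true and WALL as false:

```javascript
// One step of the 2-state automaton: an OPEN cell (true) with exactly one
// OPEN 4-connected neighbor is a dead end and becomes a WALL (false).
// The grid is an array of rows; out-of-bounds counts as WALL.
function step(grid) {
    const h = grid.length, w = grid[0].length;
    const open = (y, x) => y >= 0 && y < h && x >= 0 && x < w && grid[y][x];
    return grid.map((row, y) => row.map((cell, x) => {
        if (!cell) return false; // WALL stays WALL
        const neighbors = open(y - 1, x) + open(y + 1, x) +
                          open(y, x - 1) + open(y, x + 1);
        return neighbors !== 1; // collapse dead ends
    }));
}
```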

On each step the dead ends collapse towards the solution. In the above GIF, in order to keep the start and finish from collapsing, I've added a third state (red) that holds them open. On a GPU, you'd have to do as many draws as the length of the longest dead end.

A perfect maze is a maze with exactly one path between any two points, so there's exactly one solution. This technique doesn't work for mazes with multiple solutions, loops, or open spaces. The extra solutions won't collapse into one, let alone the shortest one.

To fix this we need a more advanced cellular automaton.

Path-solving Cellular Automaton

I came up with a 12-state cellular automaton that can not only solve mazes, but will specifically find the shortest path. Like above, it only considers 4-connected neighbors.

If we wanted to consider 8-connected neighbors, everything would be the same, but it would require 20 states (n, ne, e, se, s, sw, w, nw) instead of 12. The rules are still pretty simple.
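Here's a CPU-side sketch of how such an automaton can work. The state names and exact rule details below are my own reconstruction of the scheme, not necessarily the demo's encoding: the 12 states are WALL, OPEN, BEGIN, END, four FLOW directions, and four ROUTE directions, where a direction points at the neighbor the cell flowed from.

```javascript
// Sketch of a 12-state path-finding automaton (my own reconstruction).
// Each FLOW/ROUTE cell stores the direction of the neighbor it flowed from.
const DIRS = { n: [-1, 0], e: [0, 1], s: [1, 0], w: [0, -1] };

function opposite(dir) {
    return { n: 's', s: 'n', e: 'w', w: 'e' }[dir];
}

function step(grid) {
    const h = grid.length, w = grid[0].length;
    const at = (y, x) =>
        y >= 0 && y < h && x >= 0 && x < w ? grid[y][x] : { kind: 'WALL' };
    return grid.map((row, y) => row.map((cell, x) => {
        if (cell.kind === 'OPEN' || cell.kind === 'END') {
            // Flood: point at the first FLOW/BEGIN neighbor found.
            for (const [dir, [dy, dx]] of Object.entries(DIRS)) {
                const n = at(y + dy, x + dx);
                if (n.kind === 'FLOW' || n.kind === 'BEGIN') {
                    // An END starts the backwards ROUTE trace instead.
                    const kind = cell.kind === 'END' ? 'ROUTE' : 'FLOW';
                    return { kind, dir };
                }
            }
        } else if (cell.kind === 'FLOW' || cell.kind === 'BEGIN') {
            // Trace: become ROUTE when a ROUTE neighbor points back at us.
            for (const [dir, [dy, dx]] of Object.entries(DIRS)) {
                const n = at(y + dy, x + dx);
                if (n.kind === 'ROUTE' && opposite(n.dir) === dir) {
                    return { kind: 'ROUTE', dir: cell.dir };
                }
            }
        }
        return cell; // WALL, untouched OPEN, etc.
    }));
}
```

On a tiny corridor BEGIN–OPEN–END, the flood takes one step to reach the END's neighbor, then the ROUTE trace walks back, so the whole run takes roughly twice the path length in steps.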

This can be generalized for cellular grids of any arbitrary dimension, and it could even run on a GPU for higher dimensions, limited primarily by the number of texture uniform bindings (2D needs 1 texture binding, 3D needs 2 texture bindings, 4D needs 8 texture bindings ... I think). But if you need to find the shortest path along a five-dimensional grid, I'd like to know why!

So what does it look like?

FLOW cells flood the entire maze. Branches of the maze are searched in parallel as they're discovered. As soon as an END cell is touched, a ROUTE is traced backwards along the flow to the BEGIN cell. It requires twice as many steps as the length of the shortest path.

Note that the FLOW cells keep flooding the maze even after the END has been found. It's a cellular automaton, so there's no way to communicate to these other cells that the solution was discovered. However, when running on a GPU this wouldn't matter anyway. There's no bailing out early before all the fragment shaders have run.

What's great about this is that we're not limited to mazes whatsoever. Here's a path through a few connected rooms with open space.

Maze Types

The worst case for this automaton is the longest possible shortest path. There's only one frontier, and running the entire automaton just to push it forward by one cell is inefficient, even for a GPU.

The way a maze is generated plays a large role in how quickly the cellular automaton can solve it. A common maze generation algorithm is a random depth-first search (DFS). The maze starts out entirely walled in and the algorithm wanders around at random plowing down walls, but never breaking into open space. When it comes to a dead end, it unwinds looking for new walls to knock down. This method tends towards long, winding paths with a low branching factor.
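A minimal sketch of that generator (my own version, not the demo's code): cells sit on odd coordinates of an all-wall grid, and carving knocks out the wall between a cell and an unvisited cell two steps away.

```javascript
// Random depth-first-search maze generator. true = OPEN, false = WALL.
// width and height are assumed odd so the border stays intact.
function dfsMaze(width, height) {
    const grid = Array.from({ length: height },
                            () => new Array(width).fill(false));
    const stack = [[1, 1]];
    grid[1][1] = true;
    while (stack.length > 0) {
        const [y, x] = stack[stack.length - 1];
        // Unvisited cells two steps away in each cardinal direction.
        const next = [[y - 2, x], [y + 2, x], [y, x - 2], [y, x + 2]]
            .filter(([ny, nx]) => ny > 0 && ny < height &&
                                  nx > 0 && nx < width && !grid[ny][nx]);
        if (next.length === 0) {
            stack.pop(); // dead end: unwind
        } else {
            const [ny, nx] = next[Math.floor(Math.random() * next.length)];
            grid[(y + ny) / 2][(x + nx) / 2] = true; // knock down the wall
            grid[ny][nx] = true;
            stack.push([ny, nx]);
        }
    }
    return grid;
}
```

Because it only unwinds at dead ends, long stretches get carved before the first branch appears, which is exactly why these mazes are slow for the flood to search.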

The mazes you see in the demo are Kruskal's algorithm mazes. Walls are knocked out at random anywhere in the maze, without breaking the perfect maze rule. It has a much higher branching factor and makes for a much more interesting demo.

Skipping the Route Step

On my computer, with a 1023x1023 Kruskal maze it's about an order of magnitude slower (see update below) than A* (rot.js's version) for the same maze. Not very impressive! I believe this gap will close with time, as GPUs gain parallelism faster than CPUs gain speed. However, there's something important to consider: it's not only solving the shortest path between source and goal, it's finding the shortest path between the source and every other point. At its core it's a breadth-first grid search.

Update: One day after writing this article I realized that glReadPixels was causing a gigantic bottleneck. By only checking for the end conditions once every 500 iterations, this method is now as fast as A* on modern graphics cards, despite taking up to an extra 499 iterations. In just a few more years, this technique should be faster than A*.

Really, there's little use for the ROUTE step. It's a poor fit for the GPU and has no use in any real application. I'm using it here mainly for demonstration purposes. If dropped, the cellular automaton would need only 6 states: OPEN, WALL, and four flavors of FLOW. Seed the source point with a FLOW cell (arbitrary direction) and run the automaton until all of the OPEN cells are gone.

Detecting End State

The ROUTE cells do have a useful purpose, though. How do we know when we're done? We can poll the BEGIN cell to check for when it becomes a ROUTE cell. Then we know we've found the solution. This doesn't necessarily mean all of the FLOW cells have finished propagating, though, especially in the case of a DFS-maze.

In a CPU-based solution, I'd keep a counter and increment it every time an OPEN cell changes state. If the counter doesn't change after an iteration, I'm done. OpenGL 4.2 introduces an atomic counter that could serve this role, but this isn't available in OpenGL ES / WebGL. The only thing left to do is use glReadPixels to pull down the entire thing and check for the end state on the CPU.
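The CPU-side version of that termination test is trivial, sketched here against any grid-stepping function (a generic illustration, not the demo's code):

```javascript
// Run a cellular automaton until an iteration changes nothing.
// `step` takes a grid (array of rows of comparable values) and returns
// a new grid; the change counter plays the role of the atomic counter.
function runUntilStable(grid, step) {
    for (;;) {
        const next = step(grid);
        let changes = 0;
        for (let y = 0; y < grid.length; y++)
            for (let x = 0; x < grid[y].length; x++)
                if (next[y][x] !== grid[y][x]) changes++;
        grid = next;
        if (changes === 0) return grid; // nothing changed: done
    }
}
```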

The original 2-state automaton above also suffers from this problem.

Encoding Cell State

Cells are stored per pixel in a GPU texture. I spent quite some time trying to brainstorm a clever way to encode the twelve cell states into a vec4 color. Perhaps there's some way to exploit blending to update cell states, or make use of some other kind of built-in pixel math. I couldn't think of anything better than a straightforward encoding of 0 to 11 into a single color channel (red in my case).

int state(vec2 offset) {
    vec2 coord = (gl_FragCoord.xy + offset) / scale;
    vec4 color = texture2D(maze, coord);
    return int(color.r * 11.0 + 0.5);
}
This leaves three untouched channels for other useful information. I experimented (uncommitted) with writing distance in the green channel. When an OPEN cell becomes a FLOW cell, it stores its adjacent FLOW cell's distance plus 1. I imagine this could be really useful in a real application: put your map on the GPU, run the cellular automaton a sufficient number of times, pull the map back off (glReadPixels), and for every point you know both the path and total distance to the source point.
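The encoding round trip is worth checking: state n is stored as red = n/11, and the shader decodes with int(red * 11.0 + 0.5). A quick JavaScript check (mirroring the shader arithmetic) confirms every state survives even 8-bit texture quantization:

```javascript
// State n in 0..11 stored in the red channel as n/11.
function encode(state) {
    return state / 11; // red channel value in [0, 1]
}

// Decode with the same rounding as the shader: int(red * 11.0 + 0.5).
function decode(red) {
    return Math.floor(red * 11 + 0.5);
}

// Every state survives the trip through an 8-bit texture channel:
for (let state = 0; state < 12; state++) {
    const byte = Math.round(encode(state) * 255); // texture storage
    console.assert(decode(byte / 255) === state);
}
```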


As mentioned above, I ran the GPU maze-solver against A* to test its performance. I didn't yet try running it against Dijkstra’s algorithm on a CPU over the entire grid (one source, many destinations). If I had to guess, I'd bet the GPU would come out on top for grids with a high branching factor (open spaces, etc.) so that its parallelism is most effectively exploited, but Dijkstra's algorithm would win in all other cases.

Overall this is more of a proof of concept than a practical application. It's proof that we can trick OpenGL into solving mazes for us!

tags: [ ai webgl javascript gpgpu ]