null program

C11 Lock-free Stack

C11, the latest C standard revision, hasn't received anywhere near the same amount of fanfare as C++11. I'm not sure why this is. Some of the updates to each language are very similar, such as formal support for threading and atomic object access. Three years have passed and some parts of C11 still haven't been implemented by any compilers or standard libraries yet. Since there's not yet a lot of discussion online about C11, I'm basing much of this article on my own understanding of the C11 draft. I may be under-using the _Atomic type specifier and not paying enough attention to memory ordering constraints.

Still, this is a good opportunity to break new ground with a demonstration of C11. I'm going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you'll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It's in Debian unstable but not in Wheezy.

If you want to take a look before going further, here's the source. The test code in the repository uses plain old pthreads because C11 threads haven't been implemented by anyone yet.

I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won't run just anywhere. I'll discuss all this below. So sorry for the long article. It's just a lot more complex a topic than I had anticipated!

Lock-free

A lock-free data structure doesn't require the use of mutex locks. More generally, it's a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations -- transformations that cannot be interrupted. Lock-free data structures will generally provide better throughput than mutex locks. And it's usually safer, because there's no risk of getting stuck on a lock that will never be freed, such as a deadlock situation. On the other hand there's additional risk of starvation (livelock), where a thread is unable to make progress.

As a demonstration, I'll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it's going to be implemented as a linked-list, so pushing and popping is O(1) time, just a matter of consing a new element on the head of the list. It also means there's only one value to be updated when pushing and popping: the pointer to the head of the list.

Here's what the API will look like. I'll define lstack_t shortly. I'm making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.

int     lstack_init(lstack_t *lstack, size_t max_size);
void    lstack_free(lstack_t *lstack);
size_t  lstack_size(lstack_t *lstack);
int     lstack_push(lstack_t *lstack, void *value);
void   *lstack_pop (lstack_t *lstack);

Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.

Notice there's a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc(). An important secondary benefit to pre-allocating the stack's memory is that this implementation doesn't require the use of hazard pointers, which would be far more complicated than the stack itself.

The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack and, before that node can be returned for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being "full."

The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.
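
To make the intended usage concrete, here's a rough sketch in the spirit of the repository's pthreads-based test code. The header name, thread count, and item count are my own assumptions, not taken from the repository.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#include "lstack.h"  /* hypothetical header name for the API above */

#define NTHREADS 4
#define NITEMS   1000

static lstack_t stack;

static void *worker(void *arg)
{
    uintptr_t id = (uintptr_t)arg;
    for (uintptr_t i = 0; i < NITEMS; i++) {
        /* lstack_push() returns ENOMEM when the stack is full. */
        while (lstack_push(&stack, (void *)(id * NITEMS + i)) != 0)
            ;  /* retry */
    }
    return NULL;
}

int main(void)
{
    /* Desired capacity plus one slack node per thread, as described above. */
    if (lstack_init(&stack, NTHREADS * NITEMS + NTHREADS) != 0)
        return 1;
    pthread_t threads[NTHREADS];
    for (uintptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("%zu items on the stack\n", lstack_size(&stack));
    lstack_free(&stack);
    return 0;
}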

Structures

Here's the definition for a node in the stack. Neither field needs to be accessed atomically, so they're not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it's effectively immutable (outside of reuse). Users never need to touch this structure.

struct lstack_node {
    void *value;
    struct lstack_node *next;
};

Internally a lstack_t is composed of two stacks: the value stack (head) and the free node stack (free). These will be handled identically by the atomic functions, so it's really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here's what an internal stack looks like.

struct lstack_head {
    uintptr_t aba;
    struct lstack_node *node;
};

There's still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I'll go over it shortly. It's declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 -- it's only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger -- but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that's not true for any reason, this code will not link. Users will never directly access or handle this struct either.
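
If you'd rather catch a violated assumption at compile time than as a mysterious link error, C11's static_assert (the assert.h macro for _Static_assert) can state the two-pointers-wide requirement directly. This is just an illustrative addition, not part of the original code.

#include <assert.h>   /* provides the static_assert convenience macro in C11 */

/* Sketch: confirm the double-width assumption before relying on it. */
static_assert(sizeof(struct lstack_head) == 2 * sizeof(void *),
              "struct lstack_head must be exactly two pointers wide");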

Finally, here's the actual stack structure.

typedef struct {
    struct lstack_node *node_buffer;
    _Atomic struct lstack_head head, free;
    _Atomic size_t size;
} lstack_t;

Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, not even for reading; use atomic_load() instead.

The size field is for convenience to check the number of elements on the stack. It's accessed separately from the stack nodes themselves, so it's not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there's no way to lock the lock-free stack, there's otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop().
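
In practice this means a consumer shouldn't test lstack_size() before popping; the pop itself returning NULL is the only reliable emptiness check. A hypothetical consumer loop, not from the repository, might look like this:

/* Hypothetical consumer: drain whatever happens to be on the stack right
 * now. Checking lstack_size() first would be racy, so pop and test instead. */
static void drain(lstack_t *stack, void (*process)(void *value))
{
    for (void *v = lstack_pop(stack); v != NULL; v = lstack_pop(stack))
        process(v);
}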

There's no reason to use volatile here. That's a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.

Stack Functions

As stated before, all nodes are initially placed on the internal free stack. During initialization they're allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT, which deals with memory access ordering concerns. The aba counters don't actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.

int lstack_init(lstack_t *lstack, size_t max_size)
{
    lstack->head.aba = ATOMIC_VAR_INIT(0);
    lstack->head.node = ATOMIC_VAR_INIT(NULL);
    lstack->size = ATOMIC_VAR_INIT(0);

    /* Pre-allocate all nodes. */
    lstack->node_buffer = malloc(max_size * sizeof(struct lstack_node));
    if (lstack->node_buffer == NULL)
        return ENOMEM;
    for (size_t i = 0; i < max_size - 1; i++)
        lstack->node_buffer[i].next = lstack->node_buffer + i + 1;
    lstack->node_buffer[max_size - 1].next = NULL;
    lstack->free.aba = ATOMIC_VAR_INIT(0);
    lstack->free.node = ATOMIC_VAR_INIT(lstack->node_buffer);
    return 0;
}

The free nodes will not necessarily be used in the same order that they're placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked list is still necessary even though allocation is contiguous.

The reverse of lstack_init() is simple, and it's assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.

inline void lstack_free(lstack_t *lstack)
{
    free(lstack->node_buffer);
}

To read an atomic value we need to use atomic_load(). Given a pointer to an atomic value, it dereferences the pointer and returns the value. This is used in another inline function for reading the size of the stack.

inline size_t lstack_size(lstack_t *lstack)
{
    return atomic_load(&lstack->size);
}

Push and Pop

For operating on the two stacks there will be two internal, static functions, push and pop. These deal directly in nodes, accepting and returning them, so they're not suitable to expose in the API (users aren't meant to be aware of nodes). This is the most complex part of lock-free stacks. Here's pop().

static struct lstack_node *pop(_Atomic struct lstack_head *head)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        if (orig.node == NULL)
            return NULL;  // empty stack
        next.aba = orig.aba + 1;
        next.node = orig.node->next;
    } while(!atomic_compare_exchange_weak(head, &orig, next));
    return orig.node;
}

It's centered around the new C11 stdatomic.h function atomic_compare_exchange_weak(). This is an atomic operation more generally called compare-and-swap (CAS). On x86 there's an instruction specifically for this, cmpxchg. Give it a pointer to the atomic value to be updated (head), a pointer to the value it's expected to be (orig), and a desired new value (next). If the expected and actual values match, it's updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.

The "weak" part means it will sometimes spuriously fail where the "strong" version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast and the strong version when it's slow (when trying again is expensive), or if you don't need a loop at all. You usually want to use weak.
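
As a contrast, here's the kind of one-shot situation where the strong version is the right tool: there's no retry loop to absorb a spurious failure, so a weak CAS could falsely report that someone else won the race. This is a generic sketch of my own, unrelated to the stack code.

#include <stdatomic.h>
#include <stdbool.h>

/* Claim ownership exactly once; returns true only for the single winner.
 * With no surrounding loop, use the strong variant so a spurious failure
 * can't masquerade as losing the race. */
static bool claim(_Atomic int *owner, int id)
{
    int expected = 0;  /* 0 means unclaimed */
    return atomic_compare_exchange_strong(owner, &expected, id);
}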

The alternative to CAS is load-link/store-conditional. It's a stronger primitive that doesn't suffer from the ABA problem described next, but it's also not available on x86_64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).

The ABA Problem

The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.

Suppose aba didn't exist and it was just a pointer being swapped. Say we have two threads, A and B. A starts a pop: it loads the head (currently node X), reads X's next pointer (node Y), and is preempted just before its CAS. Meanwhile B pops X, pops Y, and later pushes X back onto the stack. When A finally resumes, its CAS succeeds -- the head pointer is X again -- but it installs Y as the new head even though Y is no longer on the stack. The stack is now corrupted.

The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer value changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. They'll work with append-only data structures, where pointers are never recycled, but that's it.

The aba field solves the problem because it's incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That's 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and all of it has to match for the operation to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There's a special double-width CAS instruction specifically for this purpose, cmpxchg16b. It's available on most x86_64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16.

If it's not available at compile-time this program won't link. The function that wraps cmpxchg16b won't be there. You can tell GCC to assume it's there with the -mcx16 flag. The same rule applies to C++11's new std::atomic.

There's still a tiny, tiny possibility of the ABA problem cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There's nothing we can do about this, but if you witness this in the wild you need to immediately stop what you're doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.

Hazard Pointers and Garbage Collection

Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior -- a segmentation fault, or worse.

There are three ways to deal with this.

  1. Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won't be a problem. However, if we're interacting with a garbage collector we're not really lock-free.

  2. Hazard pointers. Each thread keeps track of what nodes it's currently accessing and other threads aren't allowed to free nodes on this list. This is messy and complicated.

  3. Never free nodes. This implementation recycles nodes, but they're never truly freed until lstack_free(). It's always safe to dereference a node pointer because there's always a node behind it. It may point to a node that's on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.

Reference counting on the node won't work here because we can't get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a double-wide CAS, but we're already using those bytes for aba.

Push

Push is a lot like pop.

static void push(_Atomic struct lstack_head *head, struct lstack_node *node)
{
    struct lstack_head next, orig = atomic_load(head);
    do {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
    } while(!atomic_compare_exchange_weak(head, &orig, next));
}

It's counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn't take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.
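
Here's roughly what that could look like, bolted onto the internal push. The nanosleep() call, the starting delay, and the cap are all arbitrary choices of mine; none of this is in the repository.

#include <time.h>

/* Hypothetical variant of push() with capped exponential backoff. */
static void push_backoff(_Atomic struct lstack_head *head,
                         struct lstack_node *node)
{
    long delay = 1000;  /* nanoseconds; arbitrary starting point */
    struct lstack_head next, orig = atomic_load(head);
    for (;;) {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
        if (atomic_compare_exchange_weak(head, &orig, next))
            return;
        nanosleep(&(struct timespec){.tv_nsec = delay}, NULL);
        if (delay < 64000)
            delay *= 2;  /* back off further after each failure, up to a cap */
    }
}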

API Push and Pop

The API push and pop functions are built on these internal atomic functions.

int lstack_push(lstack_t *lstack, void *value)
{
    struct lstack_node *node = pop(&lstack->free);
    if (node == NULL)
        return ENOMEM;
    node->value = value;
    push(&lstack->head, node);
    atomic_fetch_add(&lstack->size, 1);
    return 0;
}

lstack_push() removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes the node onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there's an instant where the stack size is one less than it actually is. However, since there's no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.

Popping is the same thing in reverse.

void *lstack_pop(lstack_t *lstack)
{
    struct lstack_node *node = pop(&lstack->head);
    if (node == NULL)
        return NULL;
    atomic_fetch_sub(&lstack->size, 1);
    void *value = node->value;
    push(&lstack->free, node);
    return value;
}

Remove the top node, decrement the size estimate atomically, put the node on the free list, and return the value. It's really simple with the primitive push and pop.

SHA1 Demo

The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same stack. It's not really exercising the library thoroughly because no contended pops occur, but I couldn't think of a better example at the time.

The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more useful than a stack, particularly for common consumer-producer scenarios.

tags: [ c ]

Stabilizing C's Quicksort

The C standard library includes a quicksort function called qsort(). It sorts homogeneous arrays of arbitrary type. The interface is exactly what you'd expect given the constraints of the language.

void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));

It takes a pointer to the first element of the array, the number of members, the size of each member, and a comparator function. The comparator has to operate on void * pointers because C doesn't have templates or generics or anything like that. That's two interfaces where type safety is discarded: the arguments passed to qsort() and again when it calls the comparator function.

One of the significant flaws of this interface is the lack of context for the comparator. C doesn't have closures, which in other languages would cover this situation. If the sort function depends on some additional data, such as in Graham scan where points are sorted relative to a selected point, the extra information needs to be smuggled in through a global variable. This is not reentrant and wouldn't be safe in a multi-threaded environment. There's a GNU extension here, qsort_r(), that takes an additional context argument, allowing for reentrant comparators.
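
For example, with glibc's qsort_r() the comparator receives the context as a third argument (note that BSD's qsort_r() uses a different argument order). Here's a sketch of the Graham scan situation, sorting points relative to a selected pivot without any global variable; the names here are mine, and distance stands in for the angular sort Graham scan actually needs.

#define _GNU_SOURCE
#include <stdlib.h>

struct point { double x, y; };

static int compare_dist(const void *a, const void *b, void *arg)
{
    const struct point *pa = a, *pb = b, *pivot = arg;
    double da = (pa->x - pivot->x) * (pa->x - pivot->x) +
                (pa->y - pivot->y) * (pa->y - pivot->y);
    double db = (pb->x - pivot->x) * (pb->x - pivot->x) +
                (pb->y - pivot->y) * (pb->y - pivot->y);
    return (da > db) - (da < db);
}

void sort_by_distance(struct point *points, size_t n, struct point pivot)
{
    /* The pivot travels through qsort_r() rather than a global variable. */
    qsort_r(points, n, sizeof(points[0]), compare_dist, &pivot);
}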

Quicksort has some really nice properties. It's in-place, so no temporary memory needs to be allocated. If implemented properly it only consumes O(log n) space, which is the stack growth during recursion. Memory usage is localized, so it plays well with caching.

That being said, qsort() is also a classic example of an API naming mistake. Few implementations actually use straight quicksort. For example, glibc's qsort() is merge sort (in practice), and the other major libc implementations use a hybrid approach. Programs using their language's sort function shouldn't be concerned with how it's implemented. All that matters is the interface and whether or not it's a stable sort. OpenBSD made the exact same mistake when they introduced arc4random(), which no longer uses RC4.

Since quicksort is an unstable sort -- there are multiple possible results when the array contains equivalent elements -- this means qsort() is not guaranteed to be stable, even if internally the C library is using a stable sort like merge sort. The C standard library has no stable sort function.

Comparator Composability

The unfortunate side effect of unstable sorts is that they hurt composability. For example, let's say we have a person struct like this,

struct person {
    const char *first, *last;
    int age;
};

Here are a couple of comparators to sort either by name or by age. As a side note, strcmp() automatically works correctly with UTF-8 so this program isn't limited to old-fashioned ASCII names.

#include <string.h>

int compare_name(const void *a, const void *b)
{
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int last = strcmp(pa->last, pb->last);
    return last != 0 ? last : strcmp(pa->first, pb->first);
}

int compare_age(const void *a, const void *b)
{
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    return pa->age - pb->age;
}

And since we'll need it later, here's a COUNT_OF macro to get the length of arrays at compile time. There's a less error prone version out there, but I'm keeping it simple.

#define COUNT_OF(x) (sizeof(x) / sizeof(0[x]))

Say we want the final order to be by age, with ties broken by name. When using a stable sort, this is accomplished by sorting on each field separately in reverse order of preference -- name first, then age: a composition of individual comparators. Here's an attempt at doing the same with quicksort.

struct person people[] = {
    {"Joe", "Shmoe", 24},
    {"John", "Doe", 30},
    {"Alan", "Smithee", 42},
    {"Jane", "Doe", 30}
};

qsort(people, COUNT_OF(people), sizeof(struct person), compare_name);
qsort(people, COUNT_OF(people), sizeof(struct person), compare_age);

But this doesn't always work. Jane should come before John, but the original sort was completely lost in the second sort.

Joe Shmoe, 24
John Doe, 30
Jane Doe, 30
Alan Smithee, 42

This could be fixed by defining a new comparator that operates on both fields at once, compare_age_name(), and performing a single sort. But what if later you want to sort by name then age? Now you need compare_name_age(). If a third field was added, there would need to be 6 (3!) different comparator functions to cover all the possibilities. If you had 6 fields, you'd need 720 comparators! Composability has been lost to a combinatorial nightmare.
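
For illustration, here's what that combined comparator would look like. It produces the desired order in a single sort, but it covers exactly one ordering, and every other ordering needs its own nearly identical function.

int compare_age_name(const void *a, const void *b)
{
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    if (pa->age != pb->age)
        return pa->age - pb->age;
    int last = strcmp(pa->last, pb->last);
    return last != 0 ? last : strcmp(pa->first, pb->first);
}

qsort(people, COUNT_OF(people), sizeof(struct person), compare_age_name);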

Pointer Comparison

The GNU libc documentation claims that qsort() can be made stable by using pointer comparison as a fallback. That is, when the relevant fields are equivalent, use their array position to resolve the difference.

If you want the effect of a stable sort, you can get this result by writing the comparison function so that, lacking other reason distinguish between two elements, it compares them by their addresses.

This is not only false, it's dangerous! Because elements may be sorted in-place, even in glibc, their position will change during the sort. The comparator will be using their current positions, not the starting positions. What makes it dangerous is that the comparator will return different orderings throughout the sort as elements are moved around the array. This could result in an infinite loop, or worse.

Making it Stable

The most direct way to work around the unstable sort is to eliminate any equivalencies. Equivalent elements can be distinguished by adding an intrusive order field which is set after each sort. The comparators will fall back on this field to maintain the previous ordering.

struct person {
    const char *first, *last;
    int age;
    size_t order;
};

And the new comparators.

int compare_name_stable(const void *a, const void *b)
{
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int last = strcmp(pa->last, pb->last);
    if (last != 0)
        return last;
    int first = strcmp(pa->first, pb->first);
    if (first != 0)
        return first;
    /* Compare rather than subtract to avoid unsigned wraparound. */
    return (pa->order > pb->order) - (pa->order < pb->order);
}

int compare_age_stable(const void *a, const void *b)
{
    struct person *pa = (struct person *) a;
    struct person *pb = (struct person *) b;
    int age = pa->age - pb->age;
    return age != 0 ? age : (pa->order > pb->order) - (pa->order < pb->order);
}

The first sort doesn't need to be stable, but there's not much reason to keep around two definitions.

qsort(people, COUNT_OF(people), sizeof(people[0]), compare_name_stable);
for (size_t i = 0; i < COUNT_OF(people); i++)
    people[i].order = i;
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_age_stable);

And the result:

Joe Shmoe, 24
Jane Doe, 30
John Doe, 30
Alan Smithee, 42

Without defining any new comparators I can sort by name then age just by swapping the calls to qsort(). At the cost of an extra bookkeeping field, the number of comparator functions needed as fields are added is O(n) and not O(n!) despite using an unstable sort.

tags: [ c ]

An RC4 Password Hashing Function

There was an interesting /r/DailyProgrammer challenge this week to write a program that properly hashes passwords for storage in an account database. It's a necessary part of any system that needs to securely authenticate users. Storing user passwords in plain text is not just bad practice, it's irresponsible. If the database is compromised, the attacker learns every user's password. Since people are likely to re-use passwords on different websites, the database could be used to infiltrate accounts on other sites.

The solution to this problem is to run the password through a one-way cryptographic hash function before storing it in the database. When the database is compromised, it's more difficult to work backwards to recover the passwords. Examples of one-way hash functions are MD5, SHA-1, and the SHA-2 family. Block ciphers can also be converted into hash functions (e.g. bcrypt from Blowfish), though it must be done carefully since block ciphers were generally designed with different goals in mind.

md5("foobar");
// => "3858f62230ac3c915f300c664312c63f"

sha1("foobar");
// => "8843d7f92416211de9ebb963ff4ce28125932878"

However, these particular functions (SHA-1 and SHA-2) alone are poor password hashing functions: they're much too fast! An offline attacker can mount a rapid brute-force attack on these kinds of hashes. They also don't include a salt, a unique, non-secret, per-hash value used as additional hash function input. Without this, an attacker could prepare the entire attack ahead of time -- a rainbow table. Once the hashes are obtained, reversing them is just a matter of looking them up in the table.

Good password hashing needs to be slow and it needs support for salt. Examples of algorithms well-suited for this purpose are PBKDF2, bcrypt, and scrypt. These are the functions you'd want to use in a real application today. Each of these is also, more generally, a key derivation function. They can strengthen a relatively short human-memorable passphrase by running it through a long, slow procedure before making use of it as an encryption key. A brute-force attacker would need to perform this slow procedure for each individual guess.

Alternatively, if you're stuck using a fast hash function anyway, it could be slowed down by applying the function thousands or even millions of times recursively to its own output. This is what I did in order to strengthen my GPG passphrase. However, you're still left with the problem of applying the salt. The naive approach would be a plain string concatenation with the password, but this is likely to be vulnerable to a length extension attack. The proper approach would be to use HMAC.

RC4 Solution

For my solution to the challenge, I wasn't looking for something strong enough to do key derivation. I just need a slow hash function that properly handles a salt. Another important goal was to keep the solution small enough to post as a reddit comment, and I wanted to do it without using any external crypto libraries. If I'm using a library, I might as well just include/import/require PBKDF2 and make it a 2-liner, but that would be boring. I wanted it to be a reasonably short C program with no external dependencies other than standard libraries. Not counting the porcelain, the final result weighs in at 115 lines of C, so I think I achieved all my goals.

So what's the smallest, modern cryptographic algorithm I'm aware of? That would be RC4, my favorite random generator! Unlike virtually every other cryptographic algorithm, it's easy to commit to memory and to implement from scratch without any reference documentation. Similarly, this password hashing function can be implemented entirely from memory (if you can imagine yourself in some outlandish situation where that would be needed).

Unfortunately, RC4 has had a lot of holes punched in it over the years. The initial output has been proven to be biased, leaking key material, and there's even good reason to believe it may already be broken by nation state actors. Despite this, RC4 remains the most widely used stream cipher today due to its inclusion in TLS. Most importantly here, almost none of RC4's weaknesses apply to this situation -- we're only using a few bytes of output -- so it's still a very strong algorithm. Besides, what I'm developing is a proof of concept, not something to be used in a real application. It would be interesting to see how long it takes for someone to break this (maybe even decades).

Before I dive into the details, I'll link to the source repository. As of this writing there are C and Elisp implementations of the algorithm, and they will properly validate each other's hashes. I call it RC4HASH.

Here are some example hashes for the password "foobar". It's different each time because each has a unique salt. Notice the repeated byte 12 in the 5th byte position of the hash.

$ ./rc4hash -p foobar
c56cdbe512c922a2f9682cc0dfa21259e4924304e9e9b486c49d
$ ./rc4hash -p foobar
a1ea954b1296052a7cd766eb989bfd52915ab267733503ef3e8d
$ ./rc4hash -p foobar
5603de351288547e12b89585171f40cf480001b21dcfbd25f3f4

Each also validates as correct.

$ ./rc4hash -p foobar -v c56cdbe5...b486c49d
valid
$ ./rc4hash -p foobar -v a1ea954b...03ef3e8d
valid
$ ./rc4hash -p foobar -v 5603de35...bd25f3f4
valid

The Algorithm

RC4 is a stream cipher, which really just means it's a fancy random number generator. How can we turn this into a hash function? The content to be hashed can be fed to the key schedule algorithm in 256-byte chunks, as if it were a key. The key schedule is a cipher initialization stage that shuffles up the cipher state without generating output. To put all this in terms of C, here's what the RC4 struct and initialization looks like: a 256-element byte array initialized with 0-255, and two array indexes.

struct rc4 {
    uint8_t S[256];
    uint8_t i, j;
};

void rc4_init(struct rc4 *k) {
    k->i = k->j = 0;
    for (int i = 0; i < 256; i++) {
        k->S[i] = i;
    }
}

The key schedule shuffles this state according to a given key.

#define SWAP(a, b) if (a ^ b) {a ^= b; b ^= a; a ^= b;}

void rc4_schedule(struct rc4 *k, const uint8_t *key, size_t length) {
    int j = 0;
    for (int i = 0; i < 256; i++) {
        j = (j + k->S[i] + key[i % length]) % 256;
        SWAP(k->S[i], k->S[j]);
    }
}

Notice it doesn't touch the array indexes. It can be called over and over with different key material to keep shuffling the state. This is how I'm going to mix the salt into the password. The key schedule will first be run on the salt, then again on the password.

To produce the hash output, emit the desired number of bytes from the cipher. Ta da! It's now an RC4-based salted hash function.

void rc4_emit(struct rc4 *k, uint8_t *buffer, size_t count) {
    for (size_t b = 0; b < count; b++) {
        k->j += k->S[++k->i];
        SWAP(k->S[k->i], k->S[k->j]);
        buffer[b] = k->S[(k->S[k->i] + k->S[k->j]) & 0xFF];
    }
}

/* Throwaway 64-bit hash example. Assumes strlen(passwd) <= 256. */
uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
    struct rc4 rc4;
    rc4_init(&rc4);
    rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
    rc4_schedule(&rc4, (const uint8_t *)passwd, strlen(passwd));
    uint64_t hash;
    rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
    return hash;
}

Both password and salt are the inputs to the hash function. In order to validate a password against a hash, we need to keep track of the salt. The easiest way to do this is to concatenate it to the hash itself, making it part of the hash. Remember, it's not a secret value, so this is safe. For my solution, I chose to use a 32-bit salt and prepend it to 20 bytes of generator output, just like an initialization vector (IV). To validate a user, all we need is the stored hash and the password provided by the user attempting to authenticate.
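
Validation is then just a matter of taking the stored hash apart again: read the salt off the front, recompute the hash from that salt and the supplied password, and compare the remainder. Here's a sketch using the throwaway 64-bit hash() from above rather than the real 26-byte RC4HASH format; a production version would also want a constant-time comparison.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a stored credential: 4 bytes of salt followed by the hash. */
bool validate(const uint8_t *stored, const char *passwd)
{
    uint64_t expected = hash(passwd, (const char *)stored, 4);
    return memcmp(stored + 4, &expected, sizeof(expected)) == 0;
}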

Fixing a Flaw

Right now there's a serious flaw. If you want to find it for yourself, stop reading here. It will need to get fixed before this hash function is any good.

It's trivial to find a collision, which is the death knell for any cryptographic hash function. Certain kinds of passwords will collapse down to the simplest case.

hash("x", "salt", 4);
// => 8622913094354299445
hash("xx", "salt", 4);
// => 8622913094354299445
hash("xxx", "salt", 4);
// => 8622913094354299445
hash("xxxx", "salt", 4);
// => 8622913094354299445

hash("abc", "salt", 4);
// => 8860606953758435703
hash("abcabc", "salt", 4);
// => 8860606953758435703

Notice a pattern? Take a look at the RC4 key schedule function. Using modular arithmetic, the password wraps around repeating itself over 256 bytes. This means passwords with repeating patterns will mutate the cipher identically regardless of the number of repeats, so they result in the same hash. A password "abcabcabc" will be accepted as "abc".

The fix is to avoid wrapping the password. Instead, the RC4 generator, seeded only by the salt, is used to pad the password out to 256 bytes without repeating.

uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
    struct rc4 rc4;
    rc4_init(&rc4);
    rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
    uint8_t padded[256];
    memcpy(padded, passwd, strlen(passwd));
    rc4_emit(&rc4, padded + strlen(passwd), 256 - strlen(passwd));
    rc4_schedule(&rc4, padded, sizeof(padded));
    uint64_t hash;
    rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
    return hash;
}

This should also help mix the RC4 state up a bit more before generating the output. I'm no cryptanalyst, though, so I don't know if it's worth much.

Slowing It Down

The next problem is that this is way too fast! It shuffles bytes around for a few microseconds and it's done. So far it also doesn't address the problems of biases in RC4's initial output. We'll kill two birds with one stone for this one.

To fix this we'll add an adaptive difficulty factor. It will be a value that determines how much work will be done to compute the hash. It's adaptive because the system administrator can adjust it at any time without affecting previous hashes. To accomplish this, like the salt, the difficulty factor will be appended to the hash output. All the required information will come packaged together in the hash.

The difficulty factor comes into play in two areas. First, it determines how many times the key schedule is run. This is the same modification CipherSaber-2 uses in order to strengthen RC4's weak key schedule. However, rather than run it on the order of 20 times, our hash function will be running it hundreds of thousands of times. Second, the difficulty will also determine how many initial bytes of output are discarded before we start generating the hash.

I decided on an unsigned 8-bit value for the difficulty. The number of key schedules will be 1 shifted left by this number of bits (i.e. pow(2, difficulty)). This makes the minimum number of key schedules 1, since any less doesn't make sense. The number of bytes skipped is the same bitshift, minus 1, times 64 ((pow(2, difficulty) - 1) * 64); the multiplication is there so that it can skip large swaths of output. Therefore the implementation so far has a difficulty of zero: one key schedule round and zero bytes of output skipped.
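
Sketching the idea in terms of the earlier primitives (this is my own simplification, not the repository's code; the function name is hypothetical and it assumes the difficulty is small enough that 2^difficulty fits a 64-bit counter):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical core: run the key schedule 2^difficulty times, then throw
 * away (2^difficulty - 1) * 64 bytes of output before emitting the hash. */
void rc4hash_core(struct rc4 *rc4,
                  const uint8_t *salt, size_t salt_len,
                  const uint8_t *padded, size_t padded_len,
                  uint8_t difficulty, uint8_t *out, size_t out_len)
{
    uint64_t rounds = UINT64_C(1) << difficulty;  /* assumes difficulty < 64 */
    rc4_init(rc4);
    rc4_schedule(rc4, salt, salt_len);
    for (uint64_t i = 0; i < rounds; i++)
        rc4_schedule(rc4, padded, padded_len);  /* CipherSaber-2 style repetition */
    uint8_t discard[64];
    for (uint64_t i = 0; i < rounds - 1; i++)
        rc4_emit(rc4, discard, sizeof(discard));  /* skip biased initial output */
    rc4_emit(rc4, out, out_len);                  /* the actual hash output */
}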

The dynamic range of the difficulty factor (0-255) puts the time needed on a modern computer to compute an RC4HASH anywhere from a few microseconds (0) to billions of years (255). That should be a more than sufficient amount of future proofing, especially considering that we're using RC4, which will likely be broken before the difficulty factor ever tops out.

I won't show the code for this here, since it's simply how the final version is implemented, so go look at the repository instead. The final hash is 26 bytes long: a 208-bit hash. The first 4 bytes are the salt (grabbed from /dev/urandom in my implementations), the next byte is the difficulty factor, and the final 21 bytes are RC4 output.

In the example hashes above, the constant byte 12 is the difficulty factor. The default difficulty factor is 18 (0x12). I've considered XORing this with some salt-seeded RC4 output just to make the hash look nice, but that just seems like arbitrary complexity for no real gain. With the default difficulty, it takes almost a second for my computers to compute the hash.

I believe RC4HASH should be quite resistant to GPGPU attacks. RC4 is software oriented, involving many random array reads and writes rather than SIMD-style operations. GPUs are really poor at this sort of thing, so they should take a significant performance hit when running RC4HASH.

Break My Hash!

For those interested in breaking RC4HASH, here are a couple of hashes of English language passphrases. Each is about one short sentence in length ([a-zA-Z !.,]+). I'm not keeping track of the sentences I used, so the only way to get them will be to break the hash, even for me.

f0622dde127f9ab5aaee710aa4bfb17a224f7e6e93745f7ae948
8ee9cdec12feabed5c2fde0a51a2381b522f5d2bd483717d4a96

If you can find a string that validates with these hashes, especially if it's not the original passphrase, you win! I don't have any prizes in mind right now, but perhaps some Bitcoin would be in order if your attack is interesting.

tags: [ c crypto ]

A GPU Approach to Particle Physics

The next project in my GPGPU series is a particle physics engine that computes the entire physics simulation on the GPU. Particles are influenced by gravity and will bounce off scene geometry. This WebGL demo uses a shader feature not strictly required by the OpenGL ES 2.0 specification, so it may not work on some platforms, especially mobile devices. It will be discussed later in the article.

It's interactive. The mouse cursor is a circular obstacle that the particles bounce off of, and clicking will place a permanent obstacle in the simulation. You can paint and draw structures through which the particles will flow.

Here's an HTML5 video of the demo in action, which, out of necessity, is recorded at 60 frames-per-second and a high bitrate, so it's pretty big. Video codecs don't gracefully handle all these full-screen particles very well and lower framerates really don't capture the effect properly. I also added some appropriate sound that you won't hear in the actual demo.

On a modern GPU, it can simulate and draw over 4 million particles at 60 frames per second. Keep in mind that this is a JavaScript application, I haven't really spent time optimizing the shaders, and it's living within the constraints of WebGL rather than something more suitable for general computation, like OpenCL or at least desktop OpenGL.

Encoding Particle State as Color

Just as with the Game of Life and path finding projects, simulation state is stored in pairs of textures and the majority of the work is done by a fragment shader mapped between them pixel-to-pixel. I won't repeat myself with the details of setting this up, so refer to the Game of Life article if you need to see how it works.

For this simulation, there are four of these textures instead of two: a pair of position textures and a pair of velocity textures. Why pairs of textures? There are 4 channels, so every one of these components (x, y, dx, dy) could be packed into its own color channel. This seems like the simplest solution.

The problem with this scheme is the lack of precision. With the R8G8B8A8 internal texture format, each channel is one byte. That's 256 total possible values. The display area is 800 by 600 pixels, so not even every position on the display would be possible. Fortunately, two bytes, for a total of 65,536 values, is plenty for our purposes.

The next problem is how to encode values across these two channels. It needs to cover negative values (negative velocity) and it should try to take full advantage of dynamic range, i.e. try to spread usage across all of those 65,536 values.

To encode a value, multiply the value by a scalar to stretch it over the encoding's dynamic range. The scalar is selected so that the required highest values (the dimensions of the display) are the highest values of the encoding.

Next, add half the dynamic range to the scaled value. This converts all negative values into positive values with 0 representing the lowest value. This representation is called Excess-K. The downside to this is that clearing the texture (glClearColor) with transparent black no longer sets the decoded values to 0.

Finally, treat each channel as a digit of a base-256 number. The OpenGL ES 2.0 shader language has no bitwise operators, so this is done with plain old division and modulus. I made an encoder and decoder in both JavaScript and GLSL. JavaScript needs it to write the initial values and, for debugging purposes, so that it can read back particle positions.

vec2 encode(float value) {
    value = value * scale + OFFSET;
    float x = mod(value, BASE);
    float y = floor(value / BASE);
    return vec2(x, y) / BASE;
}

float decode(vec2 channels) {
    return (dot(channels, vec2(BASE, BASE * BASE)) - OFFSET) / scale;
}

And JavaScript. Unlike normalized GLSL values above (0.0-1.0), this produces one-byte integers (0-255) for packing into typed arrays.

function encode(value, scale) {
    var b = Particles.BASE;
    value = value * scale + b * b / 2;
    var pair = [
        Math.floor((value % b) / b * 255),
        Math.floor(Math.floor(value / b) / b * 255)
    ];
    return pair;
}

function decode(pair, scale) {
    var b = Particles.BASE;
    return (((pair[0] / 255) * b +
             (pair[1] / 255) * b * b) - b * b / 2) / scale;
}

The fragment shader that updates each particle samples the position and velocity textures at that particle's "index", decodes their values, operates on them, then encodes them back into a color for writing to the output texture. Since I'm using WebGL, which lacks multiple rendering targets (despite having support for gl_FragData), the fragment shader can only output one color. Position is updated in one pass and velocity in another as two separate draws. The buffers are not swapped until after both passes are done, so the velocity shader (intentionally) doesn't use the updated position values.

There's a limit to the maximum texture size, typically 8,192 or 4,096, so rather than lay the particles out in a one-dimensional texture, the texture is kept square. Particles are indexed by two-dimensional coordinates.

It's pretty interesting to see the position or velocity textures drawn directly to the screen rather than the normal display. It's another domain through which to view the simulation, and it even helped me identify some issues that were otherwise hard to see. The output is a shimmering array of color, but with definite patterns, revealing a lot about the entropy (or lack thereof) of the system. I'd share a video of it, but it would be even more impractical to encode than the normal display. Here are screenshots instead: position, then velocity. The alpha component is not captured here.

Entropy Conservation

One of the biggest challenges with running a simulation like this on a GPU is the lack of random values. There's no rand() function in the shader language, so the whole thing is deterministic by default. All entropy comes from the initial texture state filled by the CPU. When particles clump up and match state, perhaps from flowing together over an obstacle, it can be difficult to work them back apart since the simulation handles them identically.

To mitigate this problem, the first rule is to conserve entropy whenever possible. When a particle falls out of the bottom of the display, it's "reset" by moving it back to the top. If this is done by setting the particle's Y value to 0, then information is destroyed. This must be avoided! Particles below the bottom edge of the display tend to have slightly different Y values, despite exiting during the same iteration. Instead of resetting to 0, a constant value is added: the height of the display. The Y values remain different, so these particles are more likely to follow different routes when bumping into obstacles.

The next technique I used is to supply a single fresh random value via a uniform for each iteration. This value is added to the position and velocity of reset particles. The same value is used for all particles for that particular iteration, so this doesn't help with overlapping particles, but it does help to break apart "streams". These are clearly-visible lines of particles all following the same path. Each exits the bottom of the display on a different iteration, so the random value separates them slightly. Ultimately this stirs a few bits of fresh entropy into the simulation on each iteration.

Alternatively, a texture containing random values could be supplied to the shader. The CPU would have to frequently fill and upload the texture, plus there's the issue of choosing where to sample the texture, itself requiring a random value.

Finally, to deal with particles that have exactly overlapped, the particle's unique two-dimensional index is scaled and added to the position and velocity when resetting, teasing them apart. The random value's sign is multiplied by the index to avoid bias in any particular direction.

To see all this in action in the demo, make a big bowl to capture all the particles, getting them to flow into a single point. This removes all entropy from the system. Now clear the obstacles. They'll all fall down in a single, tight clump. It will still be somewhat clumped when resetting at the top, but you'll see them spraying apart a little bit (particle indexes being added). These will exit the bottom at slightly different times, so the random value plays its part to work them apart even more. After a few rounds, the particles should be pretty evenly spread again.

The last source of entropy is your mouse. When you move it through the scene you disturb particles and introduce some noise to the simulation.

Textures as Vertex Attribute Buffers

This project idea occurred to me while reading the OpenGL ES shader language specification (PDF). I'd been wanting to do a particle system, but I was stuck on the problem of how to draw the particles. The texture data representing positions needs to somehow be fed back into the pipeline as vertices. Normally a buffer texture -- a texture backed by an array buffer -- or a pixel buffer object -- asynchronous texture data copying -- might be used for this, but WebGL has none of these features. Pulling texture data off the GPU and putting it all back on as an array buffer on each frame is out of the question.

However, I came up with a cool technique that's better than both those anyway. The shader function texture2D is used to sample a pixel in a texture. Normally this is used by the fragment shader as part of the process of computing a color for a pixel. But the shader language specification mentions that texture2D is available in vertex shaders, too. That's when it hit me. The vertex shader itself can perform the conversion from texture to vertices.

It works by passing the previously-mentioned two-dimensional particle indexes as the vertex attributes, using them to look up particle positions from within the vertex shader. The shader would run in GL_POINTS mode, emitting point sprites. Here's the abridged version,

attribute vec2 index;

uniform sampler2D positions;
uniform vec2 statesize;
uniform vec2 worldsize;
uniform float size;

// float decode(vec2) { ...

void main() {
    vec4 psample = texture2D(positions, index / statesize);
    vec2 p = vec2(decode(psample.rg), decode(psample.ba));
    gl_Position = vec4(p / worldsize * 2.0 - 1.0, 0, 1);
    gl_PointSize = size;
}

The real version also samples the velocity since it modulates the color (slow moving particles are lighter than fast moving particles).

However, there's a catch: implementations are allowed to limit the number of vertex shader texture bindings to 0 (GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS). So technically vertex shaders must always support texture2D, but they're not required to support actually having textures. It's sort of like food service on an airplane that doesn't carry passengers. These platforms don't support this technique. So far I've only had this problem on some mobile devices.

Outside of the lack of support by some platforms, this allows every part of the simulation to stay on the GPU and paves the way for a pure GPU particle system.

Obstacles

An important observation is that particles do not interact with each other. This is not an n-body simulation. They do, however, interact with the rest of the world: they bounce intuitively off those static circles. This environment is represented by another texture, one that's not updated during normal iteration. I call this the obstacle texture.

The colors on the obstacle texture are surface normals. That is, each pixel has a direction to it, a flow directing particles in some direction. Empty space has a special normal value of (0, 0). This is not normalized (doesn't have a length of 1), so it's an out-of-band value that has no effect on particles.

(I didn't realize until I was done how much this looks like the Greendale Community College flag.)

A particle checks for a collision simply by sampling the obstacle texture. If it finds a normal at its location, it changes its velocity using the shader function reflect. This function is normally used for reflecting light in a 3D scene, but it works equally well for slow-moving particles. The effect is that particles bounce off the circles in a natural way.

Sometimes particles end up on/in an obstacle with a low or zero velocity. To dislodge these they're given a little nudge in the direction of the normal, pushing them away from the obstacle. You'll see this on slopes where slow particles jiggle their way down to freedom like jumping beans.

To make the obstacle texture user-friendly, the actual geometry is maintained on the CPU side of things in JavaScript. It keeps a list of these circles and, on updates, redraws the obstacle texture from this list. This happens, for example, every time you move your mouse on the screen, providing a moving obstacle. The texture provides shader-friendly access to the geometry. Two representations for two purposes.

When I started writing this part of the program, I envisioned that shapes other than circles could be placed, too. For example, solid rectangles: the normals would look something like this.

So far these are unimplemented.

Future Ideas

I didn't try it yet, but I wonder if particles could interact with each other by also drawing themselves onto the obstacle texture. Two nearby particles would bounce off each other. Perhaps the entire liquid demo could run on the GPU like this. If I'm imagining it correctly, particles would gain volume, and obstacles forming bowl shapes would fill up rather than concentrate particles into a single point.

I think there's still some more to explore with this project.

tags: [ webgl media interactive gpgpu javascript ]

A GPU Approach to Path Finding

Last time I demonstrated how to run Conway's Game of Life entirely on a graphics card. This concept can be generalized to any cellular automaton, including automata with more than two states. In this article I'm going to exploit this to solve the shortest path problem for two-dimensional grids entirely on a GPU. It will be just as fast as traditional searches on a CPU.

The JavaScript side of things is essentially the same as before -- two textures with a fragment shader in between that steps the automaton forward -- so I won't be repeating myself. The only parts that have changed are the cell state encoding (to express all automaton states) and the fragment shader (to code the new rules).

Included is a pure JavaScript implementation of the cellular automaton (State.js) that I used for debugging and experimentation, but it doesn't actually get used in the demo. A fragment shader (12state.frag) encodes the full automaton rules for the GPU.

Maze-solving Cellular Automaton

There's a dead simple 2-state cellular automaton that can solve any perfect maze of arbitrary dimension. Each cell is either OPEN or a WALL, only 4-connected neighbors are considered, and there's only one rule: if an OPEN cell has only one OPEN neighbor, it becomes a WALL.

On each step the dead ends collapse towards the solution. In the above GIF, in order to keep the start and finish from collapsing, I've added a third state (red) that holds them open. On a GPU, you'd have to do as many draws as the length of the longest dead end.

A perfect maze is a maze where there is exactly one solution. This technique doesn't work for mazes with multiple solutions, loops, or open spaces. The extra solutions won't collapse into one, let alone the shortest one.

To fix this we need a more advanced cellular automaton.

Path-solving Cellular Automaton

I came up with a 12-state cellular automaton that can not only solve mazes, but will specifically find the shortest path. The twelve states are OPEN, WALL, BEGIN, END, and four directional flavors each of FLOW and ROUTE. Like above, it only considers 4-connected neighbors.

If we wanted to consider 8-connected neighbors, everything would be the same, but it would require 20 states (n, ne, e, se, s, sw, w, nw) instead of 12. The rules are still pretty simple.

This can be generalized for cellular grids of any arbitrary dimension, and it could even run on a GPU for higher dimensions, limited primarily by the number of texture uniform bindings (2D needs 1 texture binding, 3D needs 2 texture bindings, 4D needs 8 texture bindings ... I think). But if you need to find the shortest path along a five-dimensional grid, I'd like to know why!

So what does it look like?

FLOW cells flood the entire maze. Branches of the maze are searched in parallel as they're discovered. As soon as an END cell is touched, a ROUTE is traced backwards along the flow to the BEGIN cell. It requires twice as many steps as the length of the shortest path.

Note that the FLOW cells keep flooding the maze even after the END was found. It's a cellular automaton, so there's no way to communicate to these other cells that the solution was discovered. However, when running on a GPU this wouldn't matter anyway. There's no bailing out early before all the fragment shaders have run.

What's great about this is that we're not limited to mazes whatsoever. Here's a path through a few connected rooms with open space.

Maze Types

The worst-case solution is the longest possible shortest path. There's only one frontier and running the entire automaton to push it forward by one cell is inefficient, even for a GPU.

The way a maze is generated plays a large role in how quickly the cellular automaton can solve it. A common maze generation algorithm is a random depth-first search (DFS). The entire maze starts out entirely walled in and the algorithm wanders around at random plowing down walls, but never breaking into open space. When it comes to a dead end, it unwinds looking for new walls to knock down. This method tends towards long, winding paths with a low branching factor.

The mazes you see in the demo are Kruskal's algorithm mazes. Walls are knocked out at random anywhere in the maze, without breaking the perfect maze rule. It has a much higher branching factor and makes for a much more interesting demo.

Skipping the Route Step

On my computers, with a 1023x1023 Kruskal maze it's about an order of magnitude slower (see update below) than A* (rot.js's version) for the same maze. Not very impressive! I believe this gap will close with time, as GPUs become parallel faster than CPUs get faster. However, there's something important to consider: it's not only solving the shortest path between source and goal, it's finding the shortest path between the source and any other point. At its core it's a breadth-first grid search.

Update: One day after writing this article I realized that glReadPixels was causing a gigantic bottleneck. By only checking for the end conditions once every 500 iterations, this method is now as fast as A* on modern graphics cards, despite taking up to an extra 499 iterations. In just a few more years, this technique should be faster than A*.

Really, there's little use for the ROUTE step. It's a poor fit for the GPU and has no use in any real application; I'm using it here mainly for demonstration purposes. If dropped, the cellular automaton would need only 6 states: OPEN, WALL, and four flavors of FLOW. Seed the source point with a FLOW cell (arbitrary direction) and run the automaton until all of the OPEN cells are gone.

Detecting End State

The ROUTE cells do have a useful purpose, though. How do we know when we're done? We can poll the BEGIN cell to check for when it becomes a ROUTE cell. Then we know we've found the solution. This doesn't necessarily mean all of the FLOW cells have finished propagating, though, especially in the case of a DFS-maze.
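
Here's roughly what that polling check could look like, with assumed names for the framebuffer, textures, and BEGIN coordinates, and hypothetical numeric codes for the four ROUTE states.

Maze.prototype.done = function() {
    var gl = this.gl;
    gl.bindFramebuffer(gl.FRAMEBUFFER, this.framebuffers.step);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, this.textures.front, 0);
    var pixel = new Uint8Array(4);
    gl.readPixels(this.begin.x, this.begin.y, 1, 1,
                  gl.RGBA, gl.UNSIGNED_BYTE, pixel);
    var state = Math.round(pixel[0] / 255 * 11);  // decode the red channel back to 0-11
    var ROUTE_STATES = [8, 9, 10, 11];            // hypothetical codes for the four ROUTE states
    return ROUTE_STATES.indexOf(state) !== -1;
};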

In a CPU-based solution, I'd keep a counter and increment it every time an OPEN cell changes state. If the counter doesn't change after an iteration, I'm done. OpenGL 4.2 introduces an atomic counter that could serve this role, but this isn't available in OpenGL ES / WebGL. The only thing left to do is use glReadPixels to pull down the entire thing and check for the end state on the CPU.

The original 2-state automaton above also suffers from this problem.

Encoding Cell State

Cells are stored per pixel in a GPU texture. I spent quite some time trying to brainstorm a clever way to encode the twelve cell states into a vec4 color. Perhaps there's some way to exploit blending to update cell states, or make use of some other kind of built-in pixel math. I couldn't think of anything better than a straight-forward encoding of 0 to 11 into a single color channel (red in my case).

int state(vec2 offset) {
    vec2 coord = (gl_FragCoord.xy + offset) / scale;
    vec4 color = texture2D(maze, coord);
    return int(color.r * 11.0 + 0.5);
}

This leaves three untouched channels for other useful information. I experimented (uncommitted) with writing distance in the green channel. When an OPEN cell becomes a FLOW cell, it adds 1 to its adjacent FLOW cell's distance. I imagine this could be really useful in a real application: put your map on the GPU, run the cellular automaton a sufficient number of times, pull the map back off (glReadPixels), and for every point you know both the path and the total distance to the source point.

Performance

As mentioned above, I ran the GPU maze-solver against A* to test its performance. I didn't yet try running it against Dijkstra’s algorithm on a CPU over the entire grid (one source, many destinations). If I had to guess, I'd bet the GPU would come out on top for grids with a high branching factor (open spaces, etc.) so that its parallelism is most effectively exploited, but Dijkstra's algorithm would win in all other cases.

Overall this is more of a proof of concept than a practical application. It's proof that we can trick OpenGL into solving mazes for us!

tags: [ ai webgl javascript gpgpu ]

Feedback Applet Ported to WebGL

The biggest flaw with so many OpenGL tutorials is trying to teach two complicated topics at once: the OpenGL API and 3D graphics. These are only loosely related and do not need to be learned simultaneously. It's far more valuable to focus on the fundamentals, which can only happen when handled separately. With the programmable pipeline, OpenGL is useful for a lot more than 3D graphics. There are many non-3D directions that tutorials can take.

I think that's why I've been enjoying my journey through WebGL so much. Except for my sphere demo, which was only barely 3D, none of my projects have been what would typically be considered 3D graphics. Instead, each new project has introduced me to some new aspect of OpenGL, accidentally playing out like a great tutorial. I started out drawing points and lines, then took a dive into non-trivial fragment shaders, then textures and framebuffers, then the depth buffer, then general computation with fragment shaders.

The next project introduced me to alpha blending. I ported my old feedback applet to WebGL!

Since finishing the port I've already spent a couple of hours just playing with it. It's mesmerizing. Here's a video demonstration in case WebGL doesn't work for you yet. I'm manually driving it to show off the different things it can do.

Drawing a Frame

On my laptop, the original Java version plods along at about 6 frames per second. That's because it does all of the compositing on the CPU. Each frame it has to blend over 1.2 million color components. This is exactly the sort of thing the GPU is built to do. The WebGL version does the full 60 frames per second (i.e. requestAnimationFrame) without breaking a sweat. The CPU only computes a couple of 3x3 affine transformation matrices per frame: virtually nothing.

Similar to my WebGL Game of Life, there's a texture stored on the GPU that holds almost all of the system state. It's the same size as the display. To draw the next frame, this texture is drawn to the display directly, then transformed (rotated and scaled down slightly), and drawn again to the display. This is the "feedback" part and it's where blending kicks in. It's the core component of the whole project.

Next, some fresh shapes are drawn to the display (i.e. the circle for the mouse cursor) and the entire thing is captured back onto the state texture with glCopyTexImage2D, to be used for the next frame. It's important that glCopyTexImage2D is called before returning to the JavaScript top-level (back to the event loop), because the screen data will no longer be available at that point, even if it's still visible on the screen.
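
Putting the frame together looks roughly like this. It's a sketch with assumed helper and property names, not the project's actual code.

Feedback.prototype.frame = function(identity, transform) {
    var gl = this.gl;
    this.drawState(identity);   // draw the state texture to the display as-is
    this.drawState(transform);  // draw it again, scaled and rotated slightly, blended over the first
    this.drawShapes();          // fresh shapes, e.g. the circle for the mouse cursor
    gl.bindTexture(gl.TEXTURE_2D, this.textures.state);
    gl.copyTexImage2D(gl.TEXTURE_2D, 0, gl.RGBA, 0, 0,
                      this.viewsize.x, this.viewsize.y, 0);  // capture now, before the event loop
    return this;
};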

Alpha Blending

They say a picture is worth a thousand words, and that's literally true with the Visual glBlendFunc + glBlendEquation Tool. A few minutes playing with that tool tells you pretty much everything you need to know.

While you could potentially perform blending yourself in a fragment shader with multiple draw calls, it's much better (and faster) to configure OpenGL to do it. There are two functions to set it up: glBlendFunc and glBlendEquation. There are also "separate" versions of these for treating the RGB and alpha channels separately, but I don't need that for this project.

The enumeration passed to glBlendEquation decides how the source and destination colors are combined. In WebGL our options are GL_FUNC_ADD (a + b), GL_FUNC_SUBTRACT (a - b), and GL_FUNC_REVERSE_SUBTRACT (b - a). In regular OpenGL there's also GL_MIN (min(a, b)) and GL_MAX (max(a, b)).

The function glBlendFunc takes two enumerations, one for the source and one for the destination, choosing how the alpha channels are applied to the colors before the blend equation (above) is applied. The alpha channel could be ignored and the color used directly (GL_ONE) or discarded (GL_ZERO). The alpha channel could be multiplied directly (GL_SRC_ALPHA, GL_DST_ALPHA), or inverted first (GL_ONE_MINUS_SRC_ALPHA). In WebGL there are 72 possible combinations.

gl.enable(gl.BLEND);
gl.blendEquation(gl.FUNC_ADD);
gl.blendFunc(gl.SRC_ALPHA, gl.SRC_ALPHA);

In this project I'm using GL_FUNC_ADD and GL_SRC_ALPHA for both source and destination. The alpha value put out by the fragment shader is the experimentally-determined, magical value of 0.62. A little higher and the feedback tends to blend towards bright white really fast. A little lower and it blends away to nothing really fast. It's a numerical instability that has the interesting side effect of making the demo behave slightly differently depending on the floating point precision of the GPU running it!

Saving a Screenshot

The HTML5 canvas object that provides the WebGL context has a toDataURL() method for grabbing the canvas contents as a friendly base64-encoded PNG image. Unfortunately this doesn't work with WebGL unless the preserveDrawingBuffer option is set, which can introduce performance issues. Without this option, the browser is free to throw away the drawing buffer before the next JavaScript turn, making the pixel information inaccessible.

By coincidence there's a really convenient workaround for this project. Remember that state texture? That's exactly what we want to save. I can attach it to a framebuffer and use glReadPixels just like I did in WebGL Game of Life to grab the simulation state. The pixel data is then drawn to a background canvas (without using WebGL) and toDataURL() is used on that canvas to get a PNG image. I slap this on a link with the new download attribute and call it done.
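
In code, the workaround looks something like this, with assumed names for the framebuffer and state texture.

Feedback.prototype.screenshot = function() {
    var gl = this.gl, w = this.viewsize.x, h = this.viewsize.y;
    gl.bindFramebuffer(gl.FRAMEBUFFER, this.framebuffer);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, this.textures.state, 0);
    var rgba = new Uint8Array(w * h * 4);
    gl.readPixels(0, 0, w, h, gl.RGBA, gl.UNSIGNED_BYTE, rgba);
    var canvas = document.createElement('canvas');  // background 2D canvas
    canvas.width = w;
    canvas.height = h;
    var ctx = canvas.getContext('2d');
    var image = ctx.createImageData(w, h);
    image.data.set(rgba);
    ctx.putImageData(image, 0, 0);
    return canvas.toDataURL('image/png');  // base64 PNG for the download link
};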

Anti-aliasing

At the time of this writing, support for automatic anti-aliasing in WebGL is sparse at best. I've never seen it working anywhere yet, in any browser on any platform. GL_SMOOTH isn't available and the anti-aliasing context creation option doesn't do anything on any of my computers. Fortunately I was able to work around this using a cool smoothstep trick.

The article I linked explains it better than I could, but here's the gist of it. This shader draws a circle in a quad, but leads to jagged, sharp edges.

uniform vec4 color;
varying vec3 coord;  // object space

void main() {
    if (distance(coord.xy, vec2(0, 0)) < 1.0) {
        gl_FragColor = color;
    } else {
        gl_FragColor = vec4(0, 0, 0, 1);
    }
}

The improved version uses smoothstep to fade from inside the circle to outside the circle. Not only does it look nicer on the screen, I think it looks nicer as code, too. Unfortunately WebGL has no fwidth function as explained in the article, so the delta is hardcoded.

uniform vec4 color;
varying vec3 coord;

const vec4 outside = vec4(0, 0, 0, 1);
const float delta = 0.1;

void main() {
    float dist = distance(coord.xy, vec2(0, 0));
    float a = smoothstep(1.0 - delta, 1.0, dist);
    gl_FragColor = mix(color, outside, a);
}

Matrix Uniforms

Up until this point I had avoided matrix uniforms. I was doing transformations individually within the shader. However, as transforms get more complicated, it's much better to express the transform as a matrix and let the shader language handle matrix multiplication implicitly. Rather than pass half a dozen uniforms describing the transform, you pass a single matrix that has the full range of motion.

My Igloo WebGL library originally had a vector library that provided GLSL-style vectors, including full swizzling. My long term goal was to extend this to support GLSL-style matrices. However, writing a matrix library from scratch was turning out to be far more work than I expected. Plus it's reinventing the wheel.

So, instead, I dropped my vector library -- I completely deleted it -- and decided to use glMatrix, a really solid WebGL-friendly matrix library. Highly recommended! It doesn't introduce any new types, it just provides functions for operating on JavaScript typed arrays, the same arrays that get passed directly to WebGL functions. This composes perfectly with Igloo without making it a formal dependency.

Here's my function for creating the mat3 uniform that transforms both the main texture as well as the individual shape sprites. This use of glMatrix looks a lot like java.awt.geom.AffineTransform, does it not? That's one of my favorite parts of Java 2D, and I've been missing it.

/* Translate, scale, and rotate. */
Feedback.affine = function(tx, ty, sx, sy, a) {
    var m = mat3.create();
    mat3.translate(m, m, [tx, ty]);
    mat3.rotate(m, m, a);
    mat3.scale(m, m, [sx, sy]);
    return m;
};

The return value is just a plain Float32Array that I can pass to glUniformMatrix3fv. It becomes the placement uniform in the shader.

attribute vec2 quad;
uniform mat3 placement;
varying vec3 coord;

void main() {
    coord = vec3(quad, 0);
    vec2 position = (placement * vec3(quad, 1)).xy;
    gl_Position = vec4(position, 0, 1);
}
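
For reference, uploading the matrix with raw WebGL, outside of Igloo, is just one call. Here program stands in for the compiled shader program:

var location = gl.getUniformLocation(program, 'placement');
gl.uniformMatrix3fv(location, false, Feedback.affine(0.25, 0.0, 0.5, 0.5, Math.PI / 4));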

To move to 3D graphics from here, I would just need to step up to a mat4 and operate on 3D coordinates instead of 2D. glMatrix would still do the heavy lifting on the CPU side. If this was part of an OpenGL tutorial series, perhaps that's how it would transition to the next stage.
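
As a hypothetical taste of that next step, the CPU side with glMatrix's mat4 might look like this. None of it is part of this project:

var width = 800, height = 600, angle = 0.5;  // stand-in values
var projection = mat4.create();
var modelview = mat4.create();
mat4.perspective(projection, Math.PI / 4, width / height, 0.1, 100.0);
mat4.translate(modelview, modelview, [0, 0, -5]);
mat4.rotateY(modelview, modelview, angle);
// Upload both with gl.uniformMatrix4fv and multiply them in the vertex shader.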

Conclusion

I'm really happy with how this one turned out. The only way it's distinguishable from the original applet is that it runs faster. In preparation for this project, I made a big pile of improvements to Igloo, bringing it up to speed with my current WebGL knowledge. This will greatly increase the speed at which I can code up and experiment with future projects. WebGL + Skewer + Igloo has really become a powerful platform for rapid prototyping with OpenGL.

tags: [ webgl javascript media interactive ]

Emacs Unicode Pitfalls

GNU Emacs is seven years older than Unicode. Support for Unicode had to be added relatively late in Emacs' existence. This means Emacs has existed longer without Unicode support (16 years) than with it (14 years). Despite this, Emacs has excellent Unicode support. It feels as if it was there the whole time.

However, as a natural result of Unicode covering all sorts of edge cases for every known human language, there are pitfalls and complications. As a user of Emacs, you're not particularly affected by these, but extension developers might run into trouble while handling Emacs character-oriented data structures: strings and buffers.

In this article I'll go over Elisp's Unicode surprises. I've been caught by some of these myself. In fact, as a result of writing this article, I've discovered subtle encoding bugs in some of my own extensions. None of these pitfalls are Emacs' fault. They're just the result of complexities of natural language.

Unicode and Code Points

First, there are excellent materials online for learning Unicode. I recommend starting with UTF-8 and Unicode FAQ for Unix/Linux. There's no reason for me to repeat all this information here, but I'll attempt to quickly summarize it.

Unicode maps code points (integers) to specific characters, along with a standard name. As of this writing, Unicode defines over 110,000 characters. For backwards compatibility, the first 128 code points are mapped to ASCII. The trend continues with other established character sets: the first 256 code points match Latin-1, for example.

In Emacs, Unicode characters are entered into a buffer with C-x 8 RET (insert-char). You can enter either the official name of the character (e.g. "GREEK SMALL LETTER PI" for π) or the hexadecimal code point. Outside of Emacs it depends on the application, but C-S-u followed by the hexadecimal code works for most of the applications I care about.

Encodings

The Unicode standard also describes several methods for encoding sequences of code points into sequences of bytes. Obviously a selection of 110,000 characters cannot be encoded with one byte per letter, so these are multibyte encodings. The two most popular encodings are probably UTF-8 and UTF-16.

UTF-8 was designed to be backwards compatible with ASCII, Unix, and existing C APIs (null-terminated C strings). The first 128 code points are encoded directly as a single byte. Every other character is encoded with two to six bytes, with the highest bit of each byte set to 1. This ensures that no part of a multibyte character will be interpreted as ASCII, nor will it contain a null (0). The latter means that C programs and C APIs can handle UTF-8 strings with few or no changes. Most importantly, every ASCII encoded file is automatically a UTF-8 encoded file.

UTF-16 encodes all the characters from the Basic Multilingual Plane (BMP) with two bytes. Even the original ASCII characters get two bytes (16 bits). The BMP covers virtually all modern languages and is generally all you'll ever practically need. However, this doesn't include the important TROPICAL DRINK or PILE OF POO characters from the supplemental ("astral") plane. If you need to use these characters in UTF-16, you're going to run into problems: characters outside the BMP don't fit in two bytes. To accommodate these characters, UTF-16 uses surrogate pairs: these characters are encoded with two 16-bit units.

Because of this last point, UTF-16 offers no practical advantages over UTF-8. Its existence was probably a big mistake. You can't do constant-time character lookup because you have to scan for surrogate pairs. It's not backwards compatible and cannot be stored in null-terminated strings. In both Java and JavaScript, it leads to the awkward situation where the "length" of a string is not the number of characters, code points, or even bytes. Worst of all, it has serious security implications. New applications should avoid it whenever possible.
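
For example, JavaScript strings expose UTF-16 code units directly, so a single astral-plane character reports a length of 2:

var poo = "\uD83D\uDCA9";           // U+1F4A9 PILE OF POO as a surrogate pair
poo.length;                         // => 2, even though it's one character
poo.charCodeAt(0).toString(16);     // => "d83d", just the high surrogate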

Emacs and UTF-8

Emacs internally stores all text as UTF-8. This was an excellent choice! When text leaves Emacs, such as writing to a file or to a process, Emacs automatically converts it to the coding system configured for that particular file or process. When it accepts text from a file or process, it either converts it to UTF-8 or preserves it as raw bytes.

There are two modes for this in Emacs: unibyte and multibyte. Unibyte strings/buffers are just raw bytes. They have O(1) character access but can only hold single-byte values. The byte-code compiler outputs unibyte strings.

Multibyte strings/buffers hold UTF-8 encoded code points. Character access is O(n) because the string/buffer has to be scanned to count characters.

The actual encoding is rarely relevant because there's little way (and little need) to access it directly. Emacs automatically converts text as needed when it leaves and enters Emacs, so there's no need to know the internal encoding. If you really want to see it anyway, you can use string-as-unibyte to get a copy of a string with the exact same bytes, but as a byte-string.

(string-as-unibyte "π")
;; => "\317\200"

This can be reversed with string-as-multibyte, to change a unibyte string holding UTF-8 encoded text back into a multibyte string. Note that these functions are different than string-to-unibyte and string-to-multibyte, which will attempt a conversion rather than preserving the raw bytes.

The length and buffer-size functions always count characters for multibyte strings/buffers and bytes for unibyte ones. Being UTF-8, there are no surrogate pairs to worry about here. The string-bytes and position-bytes functions return byte information for both multibyte and unibyte.

To specify a Unicode character in a string literal without using the character directly, use \uXXXX. The XXXX is the hexadecimal code point for the character and is always 4 digits long. For characters outside the BMP, which won't fit in four digits, use a capital U with eight digits: \UXXXXXXXX.

"\u03C0"
;; => "π"

"\U0001F4A9"
;; => "💩"  (PILE OF POO)

Finally, Emacs extends Unicode with 256 additional "characters" representing raw bytes. This allows raw bytes to be embedded distinctly within UTF-8 sequences. For example, it's used to distinguish the code point U+0041 from the raw byte #x41. As far as I can tell, this isn't used very often.

Combining Characters

Some Unicode characters are defined as combining characters. These characters modify the non-combining character that appears before them, typically with accents or diacritical marks.

For example, the word "naïve" can be written as six characters as "nai\u0308ve". The fourth character, U+0308 (COMBINING DIAERESIS), is a combining character that places the diaeresis over the "i" (U+0069 LATIN SMALL LETTER I).

The most commonly accented characters have a code of their own. These are called precomposed characters. This includes ï (U+00EF LATIN SMALL LETTER I WITH DIAERESIS). This means "naïve" can also be written as five characters as "na\u00EFve".

Normalization

So what happens when comparing two different representations of the same text? They're not equal.

(string= "nai\u0308ve" "na\u00EFve")
;; => nil

To deal with situations like this, the Unicode standard defines four different kinds of normalization. The two most important ones are NFC (composition) and NFD (decomposition). The former uses precomposed characters whenever possible and the latter breaks them apart. The functions ucs-normalize-NFC-string and ucs-normalize-NFD-string perform this operation.

Pitfall #1: Proper string comparison requires normalization. It doesn't matter which normalization you use (though NFD should be slightly faster), you just need to use it consistently. Unfortunately this can get tricky when using equal to compare complex data structures with multiple strings.

(string= (ucs-normalize-NFD-string "nai\u0308ve")
         (ucs-normalize-NFD-string "na\u00EFve"))
;; => t

Emacs itself fails to do this. It doesn't normalize strings before interning them, which is probably a mistake. This means you can have differently defined variables and functions with the same canonical name.

(eq (intern "nai\u0308ve")
    (intern "na\u00EFve"))
;; => nil

(defun print-résumé ()
  "NFC-normalized form."
  (print "I'm going to sabotage your team."))

(defun print-résumé ()
  "NFD-normalized form."
  (print "I'd be a great asset to your team."))

(print-résumé)
;; => "I'm going to sabotage your team."

String Width

There are three ways to quantify multibyte text: bytes, characters, and columns (width). These are often the same value, but in some circumstances they can each be different.

Most of the time, one character is one column (a width of one). Some characters, like combining characters, consume no columns. Many Asian characters consume two columns (U+4000, 䀀). Tabs consume tab-width columns, usually 8.

Generally, a string should have the same width regardless of whether it's NFD or NFC. However, due to bugs and incomplete Unicode support, this isn't strictly true. For example, some combining characters, such as U+20DD ⃝, won't combine correctly in Emacs or in other applications.

Pitfall #2: Always measure text by width, not length, when laying out a buffer. Width is measured with the string-width function. This comes up when laying out tables in a buffer. The number of characters that fit in a column depends on what those characters are.

Fortunately I accidentally got this right in Elfeed because I used the format function for layout. The %s directive operates on width, as would be expected. However, this has the side effect that the output of format may change depending on the current buffer! Pitfall #3: Be mindful of the current buffer when using the format function.

(let ((tab-width 4))
  (length (format "%.6s" "\t")))
;; => 1

(let ((tab-width 8))
  (length (format "%.6s" "\t")))
;; => 0

String Reversal

Say you want to reverse a multibyte string. Simple, right?

(defun reverse-string (string)
  (concat (reverse (string-to-list string))))

(reverse-string "abc")
;; => "cba"

Wrong! The combining characters will get flipped around to the wrong side of the character they're meant to modify.

(reverse-string "nai\u0308ve")
;; => "ev̈ian"

Pitfall #4: Reversing Unicode strings is non-trivial. The Rosetta Code page is full of incorrect examples, and I'm personally guilty of this, too. The other day I submitted a patch to s.el to correct its s-reverse function for Unicode. If it's accepted, you should never need to worry about this.

Regular Expressions

Regular expressions operate on code points. This means combining characters are counted separately and the match may change depending on how characters are composed. To avoid this, you might want to consider NFC normalization before performing some kinds of regular expression matching.

;; Like string= from before:
(string-match-p  "na\u00EFve" "nai\u0308ve")
;; => nil

;; The . only matches part of the composition
(string-match-p "na.ve" "nai\u0308ve")
;; => nil

Pitfall #5: Be mindful of combining characters when using regular expressions. Prefer NFC normalization when dealing with regular expressions.

Another potential problem is ranges, though this is quite uncommon. Ranges of characters can be expressed inside brackets, e.g. [a-zA-Z]. If the range begins or ends with a decomposed combining character you won't get the proper range because its parts are considered separately by the regular expression engine.

(defvar match-weird "[\u00E0-\u00F6]+")

(string-match-p match-weird "áâãäå")
;; => 0  (successful match)

(string-match-p (ucs-normalize-NFD-string match-weird) "áâãäå")
;; => nil

It's especially important to keep all of this in mind when sanitizing untrusted input, such as when using Emacs as a web server. An attacker might use a denormalized or strange grapheme cluster to bypass a filter.

Interacting with the World

Here's a mistake I've made twice now. Emacs uses UTF-8 internally, regardless of whatever encoding the original text arrived in. Pitfall #6: When counting bytes of text, the counts may differ from those of the text's original encoding.

For example, HTTP/1.1 introduced persistent connections. Before this, a client connects to a server and asks for content. The server sends the content and then closes the connection to signal the end of the data. In HTTP/1.1, when Connection: close isn't specified, the server will instead send a Content-Length header indicating the length of the content in bytes. The connection can then be re-used for more requests, or, more importantly, pipelining requests.

The main problem is that HTTP headers usually have a different encoding than the content body. Emacs is not prepared to handle multiple encodings from a single source, so the only correct way to talk HTTP with a network process is raw. My mistake was allowing Emacs to do the UTF-8 conversion, then measuring the length of the content in its UTF-8 encoding. This just happens to work fine about 99.9% of the time since clients tend to speak UTF-8, or something like it, anyway, but it's not correct.

Further Reading

A lot of this investigation was inspired by JavaScript's and other languages' Unicode shortcomings.

Comparatively, Emacs Lisp has really great Unicode support. This isn't too surprising considering that its primary purpose is manipulating text.

tags: [ emacs elisp ]

A GPU Approach to Conway's Game of Life

Update: In the next article, I extend this program to solving mazes.

Conway's Game of Life is another well-matched workload for GPUs. Here's the actual WebGL demo if you want to check it out before continuing.

To quickly summarize the rules: a live cell with two or three live neighbors survives to the next generation, a dead cell with exactly three live neighbors becomes a live cell, and every other cell dies or stays dead.

These simple cellular automata rules lead to surprisingly complex, organic patterns. Cells are updated in parallel, so it's generally implemented using two separate buffers. This makes it a perfect candidate for an OpenGL fragment shader.

Preparing the Textures

The entire simulation state will be stored in a single, 2D texture in GPU memory. Each pixel of the texture represents one Life cell. The texture will have the internal format GL_RGBA. That is, each pixel will have a red, green, blue, and alpha channel. This texture is not drawn directly to the screen, so how exactly these channels are used is mostly unimportant. It's merely a simulation data structure. This is because I'm using the OpenGL programmable pipeline for general computation. I'm calling this the "front" texture.

Four multi-bit (actual width is up to the GPU) channels seem excessive considering that all I really need is a single bit of state for each cell. However, due to framebuffer completion rules, in order to draw onto this texture it must be GL_RGBA. I could pack more than one cell into one texture pixel, but this would reduce parallelism: the shader will run once per pixel, not once per cell.

Because cells are updated in parallel, this texture can't be modified in-place. It would overwrite important state. In order to do any real work I need a second texture to store the update. This is the "back" texture. After the update, this back texture will hold the current simulation state, so the names of the front and back texture are swapped. The front texture always holds the current state, with the back texture acting as a workspace.

GOL.prototype.swap = function() {
    var tmp = this.textures.front;
    this.textures.front = this.textures.back;
    this.textures.back = tmp;
    return this;
};

Here's how a texture is created and prepared. It's wrapped in a function/method because I'll need two identical textures, making two separate calls to this function. All of these settings are required for framebuffer completion (explained later).

GOL.prototype.texture = function() {
    var gl = this.gl;
    var tex = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, tex);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.REPEAT);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.REPEAT);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA,
                  this.statesize.x, this.statesize.y,
                  0, gl.RGBA, gl.UNSIGNED_BYTE, null);
    return tex;
};

A texture wrap of GL_REPEAT means the simulation will be automatically torus-shaped. The interpolation is GL_NEAREST, because I don't want to interpolate between cell states at all. The final OpenGL call initializes the texture size (this.statesize). This size is different than the size of the display because, again, this is actually a simulation data structure for my purposes.

The null at the end would normally be texture data. I don't need to supply any data at this point, so this is left blank. Normally this would leave the texture content in an undefined state, but for security purposes, WebGL will automatically ensure that it's zeroed. Otherwise there's a chance that sensitive data might leak from another WebGL instance on another page or, worse, from another process using OpenGL. I'll make a similar call again later with glTexSubImage2D() to fill the texture with initial random state.

In OpenGL ES, and therefore WebGL, wrapped (GL_REPEAT) texture dimensions must be powers of two, i.e. 512x512, 256x1024, etc. Since I want to exploit the built-in texture wrapping, I've decided to constrain my simulation state size to powers of two. If I manually did the wrapping in the fragment shader, I could make the simulation state any size I wanted.

Framebuffers

A framebuffer is the target of the current glClear(), glDrawArrays(), or glDrawElements(). The user's display is the default framebuffer. New framebuffers can be created and used as drawing targets in place of the default framebuffer. This is how things are drawn off-screen without affecting the display.

A framebuffer by itself is nothing but an empty frame. It needs a canvas. Other resources are attached in order to make use of it. For the simulation I want to draw onto the back texture, so I attach it to a framebuffer. If this framebuffer is bound at the time of the draw call, the output goes onto the texture. This is really powerful because this texture can be used as an input for another draw command, which is exactly what I'll be doing later.

Here's what making a single step of the simulation looks like.

GOL.prototype.step = function() {
    var gl = this.gl;
    gl.bindFramebuffer(gl.FRAMEBUFFER, this.framebuffers.step);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, this.textures.back, 0);
    gl.viewport(0, 0, this.statesize.x, this.statesize.y);
    gl.bindTexture(gl.TEXTURE_2D, this.textures.front);
    this.programs.gol.use()
        .attrib('quad', this.buffers.quad, 2)
        .uniform('state', 0, true)
        .uniform('scale', this.statesize)
        .draw(gl.TRIANGLE_STRIP, 4);
    this.swap();
    return this;
};

First, bind the custom framebuffer as the current framebuffer with glBindFramebuffer(). This framebuffer was previously created with glCreateFramebuffer() and required no initial configuration. The configuration is entirely done here, where the back texture is attached to the current framebuffer. This replaces any texture that might currently be attached to this spot -- like the front texture from the previous iteration. Finally, the size of the drawing area is locked to the size of the simulation state with glViewport().

Using Igloo again to keep the call concise, a fullscreen quad is rendered so that the fragment shader runs exactly once for each cell. That state uniform is the front texture, bound as GL_TEXTURE0.

With the drawing complete, the buffers are swapped. Since every pixel was drawn, there's no need to ever use glClear().

The Game of Life Fragment Shader

The simulation rules are coded entirely in the fragment shader. After initialization, JavaScript's only job is to make the appropriate glDrawArrays() call over and over. To run different cellular automata, all I would need to do is modify the fragment shader and generate an appropriate initial state for it.

uniform sampler2D state;
uniform vec2 scale;

int get(int x, int y) {
    return int(texture2D(state, (gl_FragCoord.xy + vec2(x, y)) / scale).r);
}

void main() {
    int sum = get(-1, -1) +
              get(-1,  0) +
              get(-1,  1) +
              get( 0, -1) +
              get( 0,  1) +
              get( 1, -1) +
              get( 1,  0) +
              get( 1,  1);
    if (sum == 3) {
        gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0);
    } else if (sum == 2) {
        float current = float(get(0, 0));
        gl_FragColor = vec4(current, current, current, 1.0);
    } else {
        gl_FragColor = vec4(0.0, 0.0, 0.0, 1.0);
    }
}

The get(int, int) function returns the value of the cell at (x, y), 0 or 1. For the sake of simplicity, the output of the fragment shader is solid white and black, but just sampling one channel (red) is good enough to know the state of the cell. I've learned that loops and arrays are troublesome in GLSL, so I've manually unrolled the neighbor check. Cellular automata that have more complex state could make use of the other channels and perhaps even exploit alpha channel blending in some special way.

Otherwise, this is just a straightforward encoding of the rules.

Displaying the State

What good is the simulation if the user doesn't see anything? So far all of the draw calls have been done on a custom framebuffer. Next I'll render the simulation state to the default framebuffer.

GOL.prototype.draw = function() {
    var gl = this.gl;
    gl.bindFramebuffer(gl.FRAMEBUFFER, null);
    gl.viewport(0, 0, this.viewsize.x, this.viewsize.y);
    gl.bindTexture(gl.TEXTURE_2D, this.textures.front);
    this.programs.copy.use()
        .attrib('quad', this.buffers.quad, 2)
        .uniform('state', 0, true)
        .uniform('scale', this.viewsize)
        .draw(gl.TRIANGLE_STRIP, 4);
    return this;
};

First, bind the default framebuffer as the current buffer. There's no actual handle for the default framebuffer, so using null sets it to the default. Next, set the viewport to the size of the display. Then use the "copy" program to copy the state to the default framebuffer where the user will see it. One pixel per cell is far too small, so it will be scaled as a consequence of this.viewsize being four times larger.

Here's what the "copy" fragment shader looks like. It's so simple because I'm storing the simulation state in black and white. If the state was in a different format than the display format, this shader would need to perform the translation.

uniform sampler2D state;
uniform vec2 scale;

void main() {
    gl_FragColor = texture2D(state, gl_FragCoord.xy / scale);
}

Since I'm scaling up by four -- i.e. 16 pixels per cell -- this fragment shader is run 16 times per simulation cell. Since I used GL_NEAREST on the texture there's no funny business going on here. If I had used GL_LINEAR, it would look blurry.

You might notice I'm passing in a scale uniform and using gl_FragCoord. The gl_FragCoord variable is in window-relative coordinates, but when I sample a texture I need unit coordinates: values between 0 and 1. To get this, I divide gl_FragCoord by the size of the viewport. Alternatively I could pass the coordinates as a varying from the vertex shader, automatically interpolated between the quad vertices.

An important thing to notice is that the simulation state never leaves the GPU. It's updated there and it's drawn there. The CPU is operating the simulation like the strings on a marionette -- from a thousand feet up in the air.

User Interaction

What good is a Game of Life simulation if you can't poke at it? If all of the state is on the GPU, how can I modify it? This is where glTexSubImage2D() comes in. As its name implies, it's used to set the values of some portion of a texture. I want to write a poke() method that uses this OpenGL function to set a single cell.

GOL.prototype.poke = function(x, y, value) {
    var gl = this.gl,
        v = value * 255;
    gl.bindTexture(gl.TEXTURE_2D, this.textures.front);
    gl.texSubImage2D(gl.TEXTURE_2D, 0, x, y, 1, 1,
                     gl.RGBA, gl.UNSIGNED_BYTE,
                     new Uint8Array([v, v, v, 255]));
    return this;
};

Bind the front texture, set the region at (x, y) of size 1x1 (a single pixel) to a very specific RGBA value. There's nothing else to it. If you click on the simulation in my demo, it will call this poke method. This method could also be used to initialize the entire simulation with random values, though it wouldn't be very efficient doing it one pixel at a time.
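
For example, here's a sketch of a bulk initializer that fills the whole texture in one call. The method name and density parameter are my own, not part of the demo.

GOL.prototype.setRandom = function(density) {
    var gl = this.gl, w = this.statesize.x, h = this.statesize.y;
    var rgba = new Uint8Array(w * h * 4);
    for (var i = 0; i < w * h; i++) {
        var v = Math.random() < density ? 255 : 0;  // alive or dead
        rgba[i * 4 + 0] = rgba[i * 4 + 1] = rgba[i * 4 + 2] = v;
        rgba[i * 4 + 3] = 255;
    }
    gl.bindTexture(gl.TEXTURE_2D, this.textures.front);
    gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, w, h,
                     gl.RGBA, gl.UNSIGNED_BYTE, rgba);
    return this;
};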

Getting the State

What if you wanted to read the simulation state into CPU memory, perhaps to store for reloading later? So far I can set the state and step the simulation, but there's been no way to get at the data. Unfortunately I can't directly access texture data; there's nothing like the inverse of glTexSubImage2D(). The workaround is to attach the texture to a framebuffer and read it back with glReadPixels().

I'm reusing the "step" framebuffer for this since it's already intended for these textures to be its attachments.

GOL.prototype.get = function() {
    var gl = this.gl, w = this.statesize.x, h = this.statesize.y;
    gl.bindFramebuffer(gl.FRAMEBUFFER, this.framebuffers.step);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                            gl.TEXTURE_2D, this.textures.front, 0);
    var rgba = new Uint8Array(w * h * 4);
    gl.readPixels(0, 0, w, h, gl.RGBA, gl.UNSIGNED_BYTE, rgba);
    return rgba;
};

Voilà! This rgba array can be passed directly back to glTexSubImage2D() as a perfect snapshot of the simulation state.

Conclusion

This project turned out to be far simpler than I anticipated, so much so that I was able to get the simulation running within an evening's effort. I learned a whole lot more about WebGL in the process, enough for me to revisit my WebGL liquid simulation. It uses a similar texture-drawing technique, which I really fumbled through that first time. I dramatically cleaned it up, making it fast enough to run smoothly on my mobile devices.

Also, this Game of Life implementation is blazing fast. If rendering is skipped, it can run a 2048x2048 Game of Life at over 18,000 iterations per second! However, this isn't terribly useful because it hits its steady state well before that first second has passed.

tags: [ webgl javascript interactive gpgpu ]

Tag Feeds for null program

I just added a formal tags page along with individual feeds for each tag. I've had tags for a couple of years now, but they were really only useful for traveling sideways to similar articles. So now, if you're only interested in a subset of my content, you can subscribe to one or more tags rather than the main Atom feed.

What prompted this? In my Emacs Chat, Sacha asked me if this blog was part of Planet Emacsen (currently, it's not). If my tags are accurate, only about 25% of my articles are about Emacs, so most of my blog isn't relevant there. Tag feeds will go a long way to help support these "planet" aggregators, should they want to include my articles. For example, Planet Emacsen would use my Emacs feed.

Static Site Generation

I couldn't practically support these extra feeds until recently. Remember, this blog is statically generated. More feeds means more content to generate, because articles are duplicated in whole for each feed. In past years, Jekyll would probably take on the order of an hour to do all this for a single build. Fortunately, Jekyll has improved dramatically, especially in the past year or so, and these feeds have little impact on the total build time. It's currently around 10 seconds or so. Not bad at all!

A consequence of being statically generated is that you can't ask for a combination of tags as a single feed. It would be a combinatorial nightmare (billions of feeds). Plus, the request would have to normalize the tag order (e.g. alphabetical) or else the combinatorial explosion would be far worse (i.e. exceeding the number of atoms in the universe). So I hope you can forgive me for asking you to subscribe to each tag individually.

Duplicate Articles

What if an article matches multiple tags? It will appear in each feed where it's tagged, possibly showing up multiple times in your web feed reader. Fortunately, this is where Atom saves the day! I'm leveraging Atom's prudent design to make this work cleanly. Articles' UUIDs are consistent across all of these feeds, so if your web feed reader is smart enough, it will recognize these as being the same article. For example, this article is f47e5404-cc4a-3cc0-01ce-a844c04721b8 regardless of which feed you see it in.

Unfortunately, Elfeed isn't smart enough for this. Sorry! In order to better support all the broken RSS feeds out there, I had to compromise on entry keying. I couldn't trust RSS feeds to provide me a reasonably unique key, so, transitively, Elfeed doesn't fully trust Atom's UUIDs either. These RSS feeds are broken largely because RSS itself is a broken mess. When making new feeds in the future, please use Atom!

Atom requires that every feed and article have a proper UUID. It doesn't matter where you get the feed from. You could subscribe to the same exact feed at three different URLs (mirrors perhaps) and your reader could reliably use the UUIDs to avoid duplication. Or, if you're subscribed to an aggregator like Planet Emacsen, and it includes content from a feed to which you're also directly subscribed, your reader client should be able to merge these articles. In comparison, RSS not only doesn't require UUIDs, it actively discourages them with its broken guid tag, so merging content from multiple sources is impossible with RSS.

Anyway, if most of my content doesn't suit you, you can now subscribe to the subset that does. Aren't Atom feeds cool?

tags: [ meta rant ]

Per Loop vs. Per Iteration Bindings

The April 5th, 2014 draft of the ECMA-262 6th Edition specification -- a.k.a. the next major version of JavaScript/ECMAScript -- contained a subtle, though very significant, change to the semantics of the for loop (13.6.3.3). Loop variables are now fresh bindings for each iteration of the loop: a per-iteration binding. Previously loop variables were established once for the entire loop, a per-loop binding. The purpose is an attempt to fix an old gotcha that affects many languages.

If you couldn't already tell, this is going to be another language lawyer post!

Backup to C

To explain what this all means in plain English, let's step back a moment and discuss what a for loop really is. I can't find a source for this, but I'm pretty confident the three-part for loop originated in K&R C.

for (INITIALIZATION; CONDITION; ITERATION) {
    BODY;
}
  1. Evaluate INITIALIZATION.
  2. Evaluate CONDITION. If zero (false), exit the for.
  3. Evaluate BODY.
  4. Evaluate ITERATION and go to 2.

In the original C, and all the way up to C89, no variable declarations were allowed in the initialization expression. I can understand why: there's a subtle complication, though it's harmless in C. We'll get to that soon. Here's a typical C89 for loop.

int count = 10;
/* ... */
int i;
for (i = 0; i < count; i++) {
    double foo;
    /* ... */
}

The variable i is established independent of the loop, in the scope outside the for loop, alongside count. This isn't even a per-loop binding. As far as the language is concerned, it's just a variable that the loop happens to access and mutate. It's very assembly-language-like. Because C has block scoping, the body of the for loop is another nested scope. The variable foo is in this scope, reestablished on each iteration of the loop (per-iteration).

As an implementation detail, foo will reside at the same location on the stack each time around the loop. If it's accessed before being initialized, it will probably hold the value from the previous iteration, but, as far as the language is concerned, this is just a happy, though undefined, coincidence.

C99 Loops

Fast forward to the end of the 20th century. At this point, other languages have allowed variable declarations in the initialization part for years, so it's time for C to catch up with C99.

int count = 10;
/* ... */
for (int i = 0; i < count; i++) {
    double foo;
    /* ... */
}

Now consider this: in what scope is the variable i? The outer scope as before? The iteration scope with foo? The answer is neither. In order to make this work, a whole new loop scope is established in between: a per-loop binding. This scope holds for the entire duration of the loop.

The variable i is constrained to the for loop without being limited to the iteration scope. This is important because i is what keeps track of the loop's progress. The semantic equivalent in C89 makes the additional scope explicit with a block.

int count = 10;
/* ... */
{
    int i;
    for (i = 0; i < count; i++) {
        double foo;
        /* ... */
    }
}

This, ladies and gentlemen, is the C-style 3-part for loop. Every language that has this statement, and has block scope, follows these semantics. This included JavaScript up until two months ago; the new draft gives it its own unique behavior.

JavaScript's Let

As it exists today in its practical form, little of the above is relevant to JavaScript. JavaScript has no block scope, just function scope. A three-part for-loop doesn't establish all these scopes, because scopes like these are absent from the language.
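
For example, with var there's only the function scope, so loop variables simply leak out of the loop:

function example() {
    for (var i = 0; i < 3; i++) {
        var foo = i * 2;
    }
    return [i, foo];  // => [3, 4] -- both are still in scope after the loop
}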

An important change coming with 6th edition is the introduction of let declarations. Variables declared with let will have block scope.

let count = 10;
// ...
for(let i = 0; i < count; i++) {
    let foo;
    // ...
}
console.log(foo); // error
console.log(i);   // error

If these variables had been declared with var, the last two lines wouldn't be errors (or worse, global references). count, i, and foo would all be in the same function-level scope. This is really great! I look forward to using let exclusively someday.

The Closure Trap

I mentioned a subtle complication. Most of the time programmers don't need to consider or even be aware of this middle scope. However, when combined with closures it suddenly becomes an issue. Here's an example with Perl,

my @closures;
for (my $i = 0; $i < 2; $i++) {
    push(@closures, sub { return $i; });
}
$closures[0]();  # => 2
$closures[1]();  # => 2

Here's one with Python. Python lacks a three-part for loop, but its standard for loop has similar semantics.

closures = []
for i in xrange(2):
    closures.append(lambda: i)
closures[0]()  # => 1
closures[1]()  # => 1

And now Ruby.

closures = []
for i in (0..1)
  closures << lambda { i }
end
closures[0].call  # => 1
closures[1].call  # => 1

In all three cases, one closure is created per iteration. Each closure captures the loop variable i. It's easy to make the mistake of thinking each closure will return a unique value. However, as pointed out above, this is a per-loop variable, existing in a middle scope. The closures all capture the same variable, merely bound to different values at the time of capture. The solution is to establish a new variable in the iteration scope and capture that instead. Below, I've established a $value variable for this.

my @closures;
for (my $i = 0; $i < 2; $i++) {
    my $value = $i;
    push(@closures, sub { return $value; });
}
$closures[0]();  # => 0
$closures[1]();  # => 1

This is something that newbies easily get tripped up on. Because they're still trying to wrap their heads around the closure concept, this looks like some crazy bug in the interpreter/compiler. I can understand why the ECMA-262 draft was changed to accommodate this situation.

The JavaScript Workaround

The language in the new draft has two items called perIterationBindings and CreatePerIterationEnvironment (in case you're searching for the relevant part of the spec). Like the $value example above, for loops in JavaScript with "lexical" (i.e. let) loop bindings will implicitly mask the loop variable with a variable of the same name in the iteration scope.

let closures = [];
for (let i = 0; i < 2; i++) {
    closures.push(function() { return i; });
}

/* Before the change: */
closures[0]();  // => 2
closures[1]();  // => 2

/* After the change: */
closures[0]();  // => 0
closures[1]();  // => 1

Note: If you try to run this yourself, note that at the time of this writing, the only JavaScript implementation I could find that updated to the latest draft was Traceur. You'll probably see the "before" behavior for now.

You can't see it (I said it's implicit!), but under an updated JavaScript implementation there are actually two i variables here. The closures capture the innermost i, the per-iteration version of i. Let's go back to the original example, JavaScript-style.

let count = 10;
// ...
for (let i = 0; i < count; i++) {
   let foo;
   // ...
}

Here's what the scope looks like for the latest draft. Notice the second i in the iteration scope. The inner i is initially assigned to the value of the outer i.

We could emulate this in an older edition. Imagine writing a macro to do this.

let count = 10;
// ...
for (let i = 0; i < count; i++) {
    let __i = i;  // (possible name collision)
    {
        let i = __i;
        let foo;
        // ...
    }
}

I have to use __i to smuggle the value across scopes without having i reference itself. Unlike Lisp's let, the assignment value for var and let is evaluated in the nested scope, not the outer scope.

Each iteration gets its own i. But what happens when the loop modifies i? Simple, it's copied back out at the end of the body.

let count = 10;
// ...
for (let i = 0; i < count; i++) {
    let __i = i;
    {
        let i = __i;
        let foo;
        // ...
        __i = i;
    }
    i = __i;
}

Now all the expected for semantics work -- the body can also update the loop variable -- but we still get the closure-friendly per-iteration variables.

Conclusion

I'm still not sure if I really like this change. It's a clean fix, but the gotcha hasn't been eliminated. Instead it's been inverted: someday someone will have the unusual need to capture the loop variable itself, shared across iterations, and they'll run into surprising behavior. Because the semantics are a lot more complicated, it's hard to reason about what's not working unless you already know JavaScript has magical for loops.

Perl and C# each also gained per-iteration bindings in their history, but rather than complicate or change their standard for loops, they instead introduced it as a new syntactic construction: foreach.

my @closures;
foreach my $i (0, 1) {
    push(@closures, sub { return $i; });
}
$closures[0]();  # => 0
$closures[1]();  # => 1

In this case, per-iteration bindings definitely make sense. The variable $i is established and bound to each value in turn. As far as control flow goes, it's very functional. The binding is never actually mutated.

I think it could be argued that Python and Ruby's for ... in forms should behave like this foreach. These were probably misdesigned early on, but it's not possible to change their semantics at this point. Because JavaScript's var was improperly designed from the beginning, let offers the opportunity to fix more than just var. We're seeing this right now with these new for semantics.

tags: [ lang javascript ]