null program

Web Tips For Webcomic Authors

My wife and I are huge webcomic fans. The web is the medium that the comic strip industry needed badly for decades, and, with Patreon and such today, we’re now living in a golden age of comics. As of this writing, I currently follow … let’s see … 39 different web comics.

(cl-count-if (lambda (x) (memq 'comic x)) elfeed-feeds)
;; => 39

My first exposure to comics was in my childhood when I got my hands on Bill Watterson’s Something Under the Bed Is Drooling (Calvin and Hobbes). This gave me very high expectations of the Sunday comics section of the newspaper when I’d read it at my grandmother’s house. Those hopes were shattered as I discovered just how awful nationally syndicated comic strips are: mostly watered down, lowest common denominator stuff like Garfield, Family Circus, Cathy, B.C., etc.

During Calvin and Hobbes’s original run, Bill Watterson wrote about his struggles with the newspapers and the Universal Press Syndicate, one of the organizations responsible for this mess. Newspapers and the Syndicate pushed for smaller frames and shorter comics. Authors were required to plan around newspapers removing frames for layout purposes. Many newspapers would drop comics that didn’t meet stringent content limitations — a line that even Calvin and Hobbes crossed on occasion. Authors had little control over how their work was published.

Those days are over. Today’s authors can cheaply host their comics on the web — webcomics — with full control over content, layout, and schedule. If they even try to monetize at all, it’s generally through advertising, merchandising, or reader donations. Some do it all in their free time, while for others it’s part or even full time employment. The number of regular readers of a single webcomic can be just a handful of people, or up to millions of people. The role of the middleman is somewhere between diminished and non-existent. This is great, because newspapers would never publish the vast majority of the comics I read every day.

I’ve been fortunate to meet a couple of my favorite webcomic authors. Here’s a picture of my wife posing with Anthony Clark of Nedroid Picture Diary at the Small Press Expo.

I’ve also met Philippa Rice of My Cardboard Life. (Sorry, no picture for this one, since taking pictures with people isn’t really my thing.)

Over the years I’ve seen webcomic authors blunder with the web as a technology. In my experience the blunders have been disproportionate: mistakes are made more often by webcomic authors than by the bloggers I follow. I suspect this is because the blogs I follow tend to be computing related, so their authors have high proficiency in computing. The same is not necessarily true of the webcomics I follow.

Tips for web authors

Since I want to see this medium continue to thrive, and to do so in a way friendly to my own preferences, I’d like to share some tips to avoid common mistakes. Some of these apply more broadly than webcomics.

If you’re using a host designed for webcomics or similar, such as Tumblr, a lot of this stuff will be correct by default without any additional work on your part. However, you should still be aware of common problems because you may unwittingly go out of your way to break things.

URLs are forever

Every time you publish on the web, your content is accessible through some specific URL: that sequence of characters that starts with “http”. Each individual comic should be accessible through a unique, unchanging URL. That last adjective is critically important. That URL should point to the same comic for as long as possible — ideally until the heat death of the universe. This will be affected by problems such as your host going down, but the impact should only be temporary and short. A URL is a promise.

People will be using this URL to share your comics with others. They’ll make posts on other websites linking to your comic. They’ll e-mail those URLs to friends and family. Once you’ve published, you no longer control how that URL is used.

On several occasions I’ve seen authors break all their URLs after revamping their site. For example, previously the URL contained the date, but the new URL has only the domain and the title. That breaks thousands of links all over the Internet. Visitors using those old links will be welcomed with an ugly “404 Not Found” — or worse, as I’ve seen more than once, a “200 OK” blank page. These are missed opportunities for new readers.

If you really must change your URLs, the next best thing is to use an HTTP “301 Moved Permanently” and redirect to the new URL. This will leave all those old links intact and encourage new links to use the new address. If you don’t know how this works, ask your local computer geek about it.

You should also avoid having multiple URLs for the same content without a redirect. Search engines will punish you for it and it’s confusing for users. Pick one URL as the canonical URL for a comic, and if you’ve published any other URLs (short URLs, etc.), use the previously mentioned “301 Moved Permanently” to redirect to the canonical URL.

Your main page probably lists all your comics starting from the most recent. This is a good design and doesn’t violate anything I previously said. That’s not the URL for any particular comic, but to the main page, which also serves as the list of recent comics. I strongly recommend that the comics on the main page are also hyperlinks to their specific URL. Users naturally expect to find the comic’s URL by clicking on the comic’s image.

Have an Atom or RSS feed

Comics without feeds are much less of a problem than they used to be, but the problem still comes up on occasion. If you need to pick between Atom and RSS, I recommend Atom, but, honestly, it’s only important that you have a valid feed with a date. You don’t even need to put the comic in the feed itself (possibly costing you ad revenue); just a link to the comic’s URL is fine. Its main purpose is to say, “hey, there’s a new comic up!”

You may not use Atom/RSS yourself, but your readers will appreciate it. Many of us don’t use centralized services like Facebook, Twitter, or Google+, and want to follow your work without signing up for a third-party service. Atom/RSS is the widely-accepted decentralized method for syndication on the web.

Web feeds are really easy: a feed is just an XML file on your website that lists the most recent content. A validator can help you ensure you’ve done it correctly.
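
To make this concrete, here’s a minimal sketch of an Atom feed (every name, URL, and date below is a placeholder):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Webcomic</title>
  <id>http://example.com/</id>
  <updated>2015-01-01T12:00:00Z</updated>
  <author><name>Author Name</name></author>
  <entry>
    <title>Comic #42: A Catchy Title</title>
    <id>http://example.com/comic/42</id>
    <link href="http://example.com/comic/42"/>
    <updated>2015-01-01T12:00:00Z</updated>
  </entry>
</feed>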

Pick a good, catchy title

One of the biggest barriers to sharing a comic is a lack of title. For example, if a reader is going to post your comic on reddit, they need to enter the comic’s URL and its title. If the comic doesn’t have a title, then this person will need to make one up. There are two problems with this: the invented title probably won’t be as good as one you’d pick yourself, and different people will invent different titles for the same comic.

At minimum your title should appear in the <title> element of the page so that it shows up in the browser tab and the browser’s window title. The title of the individual comic should come before the title of the whole website, since that shows up better in search engines. The title should also appear somewhere near the top of the page for easy clipboard copying, though it may be worth leaving out depending on the style of your comic.

A page without a <title> element looks amateur, so don’t do that!

Think of the future and include dates

This is one of those things that’s important anywhere on the web and is often violated by blog articles as well. Far too much content is published without a date. Dates put your comic in context, especially if it’s about something topical, and they help users navigate your content through time.

Putting the date in the URL is sufficient — even preferred — if you don’t want to display it on the page proper. Your Atom/RSS feed should always have the comic’s date. I personally benefit from date-time precision down to the publication hour. Some comics/articles are always published as “midnight” even when posted in the afternoon, which has the jarring effect of inserting them in time before a bunch of things I’ve already read.

How do I contact you?

When I notice one of the previous problems, particularly when they arise in comics I’m already following, I’d like to inform you of the problem. Or perhaps I want to compliment you on a well-made comic and you don’t have a comments section. I can only do this if you include some sort of contact information. An e-mail address, even in an anti-spam image form, is preferable but not strictly required.

Take advantage of the medium and go big

Comics published in newspapers are really tiny because newspaper editors want to cram a bunch of them onto a couple of pages. You’re not operating under these limitations, so fight the urge to copy that familiar format. Your canvas is practically infinite, so make big, colorful webcomics. The only limit is your readers’ screen resolution.

A final thanks

Thanks for all the work you do, webcomic authors. You regularly create all this awesome stuff for free. If you’re a webcomic author and you need help with any of the information above, don’t hesitate to contact me. After all, I don’t hesitate to bug you when something’s not right!

tags: [ web ]

Recovering Live Data with GDB

I recently ran into a problem where long-running program output was trapped in a C FILE buffer. The program had been running for two days straight printing its results, but the last few kilobytes of output were missing. It wouldn’t output these last bytes until the program completed its day-long (or worse!) cleanup operation and exited. This is easy to fix — and, honestly, the cleanup step was unnecessary anyway — but I didn’t want to start all over and wait two more days to recompute the result.

Here’s a minimal example of the situation. The first loop represents the long-running computation and the infinite loop represents a cleanup job that will never complete.

#include <stdio.h>

int
main(void)
{
    /* Compute output. */
    for (int i = 0; i < 10; i++)
        printf("%d/%d ", i, i * i);
    putchar('\n');

    /* "Slow" cleanup operation ... */
    for (;;)
        ;
    return 0;
}

Buffered Output Review

Both printf and putchar are C library functions and are usually buffered in some way. That is, each call to these functions doesn’t necessarily send data out of the program. This is in contrast to the POSIX functions read and write, which are unbuffered system calls. Since system calls are relatively expensive, buffered input and output is used to change a large number of system calls on small buffers into a single system call on a single large buffer.

Typically, stdout is line-buffered if connected to a terminal. When the program completes a line of output, the user probably wants to see it immediately. So, if you compile the example program and run it at your terminal you will probably see the output before the program hangs on the infinite loop.

$ cc -std=c99 example.c
$ ./a.out
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81

However, when stdout is connected to a file or pipe, it’s generally buffered to something like 4kB. For this program, the output will remain empty no matter how long you wait. It’s trapped in a FILE buffer in process memory.

$ ./a.out > output.txt

The primary way to fix this is to use the fflush function, to force the buffer empty before starting a long, non-output operation. Unfortunately for me I didn’t think of this two days earlier.
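
In the example above, the fix is one line between the computation loop and the cleanup loop:

fflush(stdout);    /* force the buffered results out to the file */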

Debugger to the Rescue

Fortunately there is a way to interrupt a running program and manipulate its state: a debugger. First, find the process ID of the running program (the one writing to output.txt above).

$ pgrep a.out
12934

Now attach GDB, which will pause the program’s execution.

$ gdb ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
gdb> attach 12934
Attaching to program: /tmp/a.out, process 12934
... snip ...
0x0000000000400598 in main ()

From here I could examine the stdout FILE struct and try to extract the buffer contents by hand. However, the easiest thing to do is perform the call I forgot in the first place: fflush(stdout).

gdb> call fflush(stdout)
$1 = 0
gdb> quit
Detaching from program: /tmp/a.out, process 12934

The program is still running, but the output has been recovered.

$ cat output.txt
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81

Why Cleanup?

As I said, in my case the cleanup operation was entirely unnecessary, so it would be safe to just kill the program at this point. It was taking a really long time to tear down a humongous data structure (on the order of 50GB) one little node at a time with free. Obviously, the memory would be freed much more quickly by the OS when the program exited.

Freeing memory in the program was only to satisfy Valgrind, since it’s so incredibly useful for debugging. Not freeing the data structure would hide actual memory leaks in Valgrind’s final report. For the real “production” run, I should have disabled cleanup.

tags: [ c cpp ]

Shamus Young's Twenty-Sided Tale E-book

Last month I assembled and edited Shamus Young’s Twenty-Sided Tale, originally a series of 84 blog articles, into an e-book. The book is 75,000 words — about the average length of a novel — recording the complete story of one of Shamus’ Dungeons and Dragons campaigns. Since he’s shared the e-book on his blog, I’m now free to pull back the curtain on this little project.

To build the book yourself, you will only need make and pandoc.

Why did I want this?

Ever since I got a tablet a couple years ago, I’ve completely switched over to e-books. Prior to the tablet, if there was an e-book I wanted to read, I’d have to read from a computer monitor while sitting at a desk. Anyone who’s tried it can tell you it’s not a comfortable way to read for long periods, so I only reserved the effort for e-book-only books that were really worth it. However, once comfortable with the tablet, I gave away nearly all my paper books from my bookshelves at home. The paper books that remain are ones where either an e-book version isn’t reasonably available or the book is very graphical and not suited to a screen (full image astronomy books, Calvin and Hobbes collections).

As far as formats go, I prefer PDF and ePub, depending on the contents of the book. Technical books fare better as PDFs due to elaborate typesetting used for diagrams and code samples. For prose-oriented content, particularly fiction, ePub is the better format due to its flexibility and looseness. Twenty-Sided Tale falls in this latter category. The reader gets to decide the font, size, color, contrast, and word wrapping. I kept the ePub’s CSS to a bare minimum as to not get in the reader’s way. Unfortunately I’ve found that most ePub readers are awful at rendering content, so while technically you could do the same fancy typesetting with ePub, it rarely works out well.

The Process

To start, I spent about 8 hours with Emacs manually converting each article into Markdown and concatenating them into a single document. The ePub is generated from the Markdown using the Pandoc “universal document converter.” The markup includes some HTML, because Markdown alone, even Pandoc’s flavor, isn’t expressive enough for the typesetting needs of this particular book. This means it can only reasonably be transformed into HTML-based formats.

Pandoc isn’t good enough for some kinds of publishing, but it was sufficient here. The one feature I really wished it had was support for tagging arbitrary document elements with CSS classes (images, paragraphs, blockquotes, etc.), effectively extending Markdown’s syntax. Currently only headings support extra attributes. Such a feature would have allowed me to bypass all use of HTML, and the classes could maybe have been re-used in other output formats, like LaTeX.

Once I got the book in a comfortable format, I spent another 1.5 weeks combing through the book fixing up punctuation, spelling, grammar, and, in some cases, wording. It was my first time editing a book — fiction in particular — and in many cases I wasn’t sure of the correct way to punctuate and capitalize some particular expression. Is “Foreman” capitalized when talking about a particular foreman? What about “Queen?” How are quoted questions punctuated when the sentence continues beyond the quotes? As an official source on the matter, I consulted the Chicago Manual of Style. The first edition is free online. It’s from 1906, but style really hasn’t changed too much over the past century!

The original articles were written over a period of three years. Understandably, Shamus forgot how some of the story’s proper names were spelled over this time period. There wasn’t a wiki to check. Some proper names had two, three, or even four different spellings. Sometimes I picked the most common usage, sometimes the first usage, and sometimes I had to read the article’s comments written by the game’s players to see how they spelled their own proper names.

I also sunk time into a stylesheet for a straight HTML version of the book, with the images embedded within the HTML document itself. This will be one of the two outputs if you build the book in the repository.

A Process to Improve

Now I’ve got a tidy, standalone e-book version of one of my favorite online stories. When I want to re-read it again in the future, it will be as comfortable as reading any other novel.

This has been a wonderful research project into a new domain (for me): writing and editing, style, and today’s tooling for writing and editing. As a software developer, the latter overlaps my expertise and is particularly fascinating. A note to entrepreneurs: There’s massive room for improvement in this area. Compared to software development, the processes in place today for professional writing and editing are, by my estimates, about 20 years behind. It’s a place where Microsoft Word is still the industry standard. Few authors and editors are using source control or leveraging the powerful tools available for creating and manipulating their writing.

Unfortunately it’s not so much a technical problem as it is a social/educational one. The tools mostly exist in one form or another, but they’re not being put to use. Even if an author or editor learns or builds a more powerful set of tools, they must still interoperate with people who do not. Looking at it optimistically, this is a potential door into the industry for myself: a computer whiz editor who doesn’t require Word-formatted manuscripts; who can make the computer reliably and quickly perform the tedious work. Or maybe that idea only works in fiction.

tags: [ media rant ]

Mandelbrot Set with SIMD Intrinsics

When I started this blog 8 years ago, my first post was about the Mandelbrot set. Since then, both technology and my own skills have improved (or so I like to believe!), so I’m going to take another look at it, this time using three different Single Instruction, Multiple Data (SIMD) instruction sets: SSE2, AVX, and NEON. The latter two didn’t exist when the last article was published. In this article I demonstrate SIMD bringing a 5.8x speedup to a fractal renderer.

If you want to take a look at my code before reading further:

Having multiple CPU cores allows different instructions to operate on (usually) different data independently. In contrast, under SIMD a specific operation (single instruction) acts upon several values (multiple data) at once. It’s another form of parallelization. For example, with image processing – perhaps the most common use case – this means multiple pixels could be computed within the same number of cycles it would normally take to compute just one. SIMD is generally implemented on CPUs through wide registers: 64, 128, 256, and even 512 bits wide. Values are packed into the register like an array and are operated on independently, generally with saturation arithmetic (clamped, non-wrapping).

Rather than hand-code all this in assembly, I’m using yet another technique I picked up from the always-educational Handmade Hero: compiler intrinsics. The code is all C, but in place of C’s operators are pseudo-function calls operating on special SIMD types. These aren’t actual function calls, they’re intrinsics. The compiler will emit a specific assembly instruction for each intrinsic, sort of like an inline function. This is more flexible for mixing with other C code, the compiler will manage all the registers, and the compiler will attempt to re-order and interleave instructions to maximize throughput. It’s a big win!

Some SIMD History

The first widely available consumer SIMD hardware was probably the MMX instruction set, introduced to 32-bit x86 in 1997. This provided eight 64-bit registers, mm0 - mm7, aliasing the older x87 floating point registers and operating on packed integer values. This was extended by AMD with its 3DNow! instruction set, adding floating point instructions.

However, you don’t need to worry about any of that because these both were superseded by Streaming SIMD Extensions (SSE) in 1999. SSE has 128-bit registers – confusingly named xmm0 - xmm7 – and a much richer instruction set. SSE has been extended with SSE2 (2001), SSE3 (2004), SSSE3 (2006), SSE4.1 (2007), and SSE4.2 (2008). x86_64 doesn’t have SSE2 as an extension but instead as a core component of the architecture (adding xmm8 - xmm15), baking it into its ABI.

In 2009, ARM introduced the NEON instruction set as part of ARMv7. Like SSE, it has 128-bit registers, but its instruction set is more consistent and uniform. One of its most visible features over SSE is a stride load parameter making it flexible for a wider variety of data arrangements. NEON is available on your Raspberry Pi, which is why I’m using it here.

In 2011, Intel and AMD introduced the Advanced Vector Extensions (AVX) instruction set. Essentially it’s SSE with 256-bit registers, named ymm0 - ymm15. That means operating on 8 single-precision floats at once! As of this writing, this extension is just starting to become commonplace on desktops and laptops. It also has extensions: AVX2 (2013) and AVX-512 (2015).

Starting with C

Moving on to the code, in mandel.c you’ll find mandel_basic, a straight C implementation that produces a monochrome image. Normally I would post the code here within the article, but it’s 30 lines long and most of it isn’t of any particular interest.

I didn’t use C99’s complex number support because – continuing to follow the approach of Handmade Hero – I intended to port this code directly into SIMD intrinsics. It’s much easier to work from a straight non-SIMD implementation towards one with compiler intrinsics than to code with compiler intrinsics right away. In fact, I’d say it’s almost trivial, since I got it right on the first attempt for all three.

There’s just one unusual part:

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < s->height; y++) {
   /* ... */

This is an Open Multi-Processing (OpenMP) pragma. It’s a higher-level threading API than POSIX or Win32 threads. OpenMP takes care of all thread creation, work scheduling, and cleanup. In this case, the for loop is parallelized such that each row of the image will be scheduled individually to a thread, with one thread spawned for each CPU core. This one line saves all the trouble of managing a work queue and such. I also use it in my SIMD implementations, composing both forms of parallelization for maximum performance.

I did it in single precision because I really want to exploit SIMD. Obviously, being half as wide as double precision, twice as many single precision operands can fit in a SIMD register.

On my wife’s i7-4770 (8 logical cores), it takes 29.9ms to render one image using the defaults (1440x1080, real{-2.5, 1.5}, imag{-1.5, 1.5}, 256 iterations). I’ll use the same machine for both the SSE2 and AVX benchmarks.

SSE2 Mandelbrot Set

The first translation I did was SSE2 (mandel_sse2.c). As with just about any optimization, it’s more complex and harder to read than the straight version. Again, I won’t post the code here, especially when this one has doubled to 60 lines long.

Porting to SSE2 (and SIMD in general) is simply a matter of converting all assignments and arithmetic operators to their equivalent intrinsics. The Intel Intrinsics Guide is a godsend for this step. It’s easy to search for specific operations and it tells you what headers they come from. Notice that there are no C arithmetic operators until the very end, after the results have been extracted from SSE and pixels are being written.

There are two new types present in this version, __m128 and __m128i. These will be mapped to SSE registers by the compiler, sort of like the old (outdated) C register keyword. One big difference is that it’s legal to take the address of these values with &, and the compiler will worry about the store/load operations. The first type is for floating point values and the second is for integer values. At first it’s annoying for these to be separate types (the CPU doesn’t care), but it becomes a set of compiler-checked rails for avoiding mistakes.

Here’s how assignment was written in the straight C version:

float iter_scale = 1.0f / s->iterations;

And here’s the SSE version. SSE intrinsics are prefixed with _mm, and the “ps” stands for “packed single-precision.”

__m128 iter_scale = _mm_set_ps1(1.0f / s->iterations);

This sets all four lanes of the register to the same value (a broadcast). Lanes can also be assigned individually, such as at the beginning of the innermost loop.

__m128 mx = _mm_set_ps(x + 3, x + 2, x + 1, x + 0);

This next part shows why the SSE2 version is longer. Here’s the straight C version:

float zr1 = zr * zr - zi * zi + cr;
float zi1 = zr * zi + zr * zi + ci;
zr = zr1;
zi = zi1;

To make it easier to read in the absence of operator syntax, I broke out the intermediate values. Here’s the same operation across four different complex values simultaneously. The purpose of these intrinsics should be easy to guess from their names.

__m128 zr2 = _mm_mul_ps(zr, zr);
__m128 zi2 = _mm_mul_ps(zi, zi);
__m128 zrzi = _mm_mul_ps(zr, zi);
zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci);

There are a bunch of swizzle instructions added in SSSE3 and beyond for re-arranging bytes within registers. With those I could eliminate that last bit of non-SIMD code at the end of the function for packing pixels. In an earlier version I used them, but since pixel packing isn’t a hot spot in this code (it’s outside the tight, innermost loop), it didn’t impact the final performance, so I took it out for the sake of simplicity.

The running time is now 8.56ms per image, a 3.5x speedup. That’s close to the theoretical 4x speedup from moving to 4-lane SIMD. That’s fast enough to render fullscreen at 60FPS.

AVX Mandelbrot Set

With SSE2 explained, there’s not much to say about AVX (mandel_avx.c). The only difference is the use of __m256, __m256i, the _mm256 intrinsic prefix, and that this operates on 8 points on the complex plane instead of 4.

It’s interesting that the AVX naming conventions are subtly improved over SSE. For example, here are the SSE broadcast intrinsics.
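
__m128i _mm_set1_epi8(char b)
__m128i _mm_set1_epi16(short w)
__m128i _mm_set1_epi32(int i)
__m128i _mm_set1_epi64x(long long q)
__m128d _mm_set1_pd(double d)
__m128  _mm_set_ps1(float f)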

Notice the oddball at the end? That’s discrimination against sufferers of obsessive-compulsive personality disorder. This was fixed in AVX’s broadcast intrinsics:
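
__m256i _mm256_set1_epi8(char b)
__m256i _mm256_set1_epi16(short w)
__m256i _mm256_set1_epi32(int i)
__m256i _mm256_set1_epi64x(long long q)
__m256d _mm256_set1_pd(double d)
__m256  _mm256_set1_ps(float f)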

The running time here is 5.20ms per image, a 1.6x speedup from SSE2. That’s not too far from the theoretical 2x speedup from using twice as many lanes. We can render at 60FPS and spend most of the time waiting around for the next vsync.

NEON Mandelbrot Set

NEON is ARM’s take on SIMD. It’s what you’d find on your phone and tablet rather than desktop or laptop. NEON behaves much like a co-processor: NEON instructions are (cheaply) dispatched asynchronously to their own instruction pipeline, but transferring data back out of NEON is expensive and will stall the ARM pipeline until the NEON pipeline catches up.

Going beyond __m128 and __m256, NEON intrinsics have a type for each of the possible packings. On x86, the old stack-oriented x87 floating-point instructions are replaced with SSE single-value (“ss”, “sd”) instructions. On ARM, there’s no reason to use NEON to operate on single values, so these “packings” don’t exist. Instead there are half-wide packings. Note the lack of double-precision support.
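
For example, here’s a sampling of the NEON vector types (not the full list):

float32x2_t, float32x4_t
int32x2_t,   int32x4_t
uint8x8_t,   uint8x16_t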

Again, the CPU doesn’t really care about any of these types. It’s all to help the compiler help us. For example, we don’t want to multiply a float32x4_t and a float32x2_t since it wouldn’t have a meaningful result.

Otherwise everything is similar (mandel_neon.c). NEON intrinsics are (less-cautiously) prefixed with v and suffixed with a type (_f32, _u32, etc.).

The performance on my Raspberry Pi 2 Model B (900 MHz quad-core ARM Cortex-A7) is 545ms per frame without NEON and 232ms with NEON, a 2.3x speedup. This isn’t nearly as impressive as SSE2, also at 4 lanes. My implementation almost certainly needs more work, especially since I know less about ARM than x86.

Compiling with Intrinsics

For the x86 build, I wanted the same binary to have AVX, SSE2, and plain C versions, selected by a command line switch and feature availability, so that I could easily compare benchmarks. Without any special options, gcc and clang will make conservative assumptions about the CPU features of the target machine. In order to build using AVX intrinsics, I need the compiler to assume the target has AVX. The -mavx argument does this.

mandel_avx.o : mandel_avx.c
    $(CC) -c $(CFLAGS) -mavx -o $@ $^

mandel_sse2.o : mandel_sse2.c
    $(CC) -c $(CFLAGS) -msse2 -o $@ $^

mandel_neon.o : mandel_neon.c
    $(CC) -c $(CFLAGS) -mfpu=neon -o $@ $^

All x86_64 CPUs have SSE2, so that flag is redundant for a 64-bit build and included only for clarity, but it does enable SSE2 for 32-bit x86 builds.

It’s absolutely critical that each is done in a separate translation unit. Suppose I compiled like so in one big translation unit,

gcc -msse2 -mavx mandel.c mandel_sse2.c mandel_avx.c

The compiler will likely use some AVX instructions outside of the explicit intrinsics, meaning it’s going to crash on machines without AVX (“illegal instruction”). The main program needs to be compiled with AVX disabled. That’s where it will test for AVX before executing any special instructions.

Feature Testing

Intrinsics are well-supported across different compilers (surprisingly, even including the late-to-the-party Microsoft). Unfortunately testing for CPU features differs across compilers. Intel advertises a _may_i_use_cpu_feature intrinsic, but it’s not supported in either gcc or clang. gcc has a __builtin_cpu_supports built-in, but it’s only supported by gcc.

The most portable solution I came up with is cpuid.h (x86 specific). It’s supported by at least gcc and clang. The clang version of the header is much better documented, so if you want to read up on how this works, read that one.

#include <cpuid.h>

static inline int
is_avx_supported(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return ecx & bit_AVX ? 1 : 0;
}

And in use:

if (use_avx && is_avx_supported())
    mandel_avx(image, &spec);
else if (use_sse2)
    mandel_sse2(image, &spec);
else
    mandel_basic(image, &spec);

I don’t know how to test for NEON, nor do I have the necessary hardware to test it, so on ARM I assume it’s always available.


Using SIMD intrinsics for the Mandelbrot set was just an exercise to learn how to use them. Unlike in Handmade Hero, where it makes a 1080p 60FPS software renderer feasible, I don’t have an immediate, practical use for CPU SIMD, but, like so many similar techniques, I like having it ready in my toolbelt for the next time an opportunity arises.

tags: [ c ]

Minimal OpenGL 3.3 Core Profile Demo

When I was first attempting to learn OpenGL years ago, what I really wanted was a complete, minimal example program. OpenGL has enormous flexibility and I wanted to fully understand the fundamentals in isolation before moving on to more advanced features. I had been advised to specifically learn core profile, which drops nearly all the legacy parts of the API.

However, since much of the OpenGL-related content to be found online, even today, is outdated – and, worse, it’s not marked as such – good, modern core profile examples have been hard to come by. The relevant examples I could find at the time were more complicated than necessary, due to the common problem that full 3D graphics are too closely conflated with OpenGL. The examples would include matrix libraries, texture loading, etc. This is a big reason I ended up settling on WebGL: a clean slate in a completely different community. (The good news is that this situation has already improved dramatically over the last few years!)

Until recently, all of my OpenGL experience had been WebGL. Wanting to break out of that, earlier this year I set up a minimal OpenGL 3.3 core profile demo in C, using GLFW and gl3w. You can find it here:

No 3D graphics, no matrix library, no textures. It’s just a spinning red square.

It supports both Linux and Windows. The Windows build is static, so it compiles to a single, easily distributable, standalone binary. With some minor tweaking it would probably support the BSDs as well. For simplicity’s sake, the shaders are baked right into the source as strings, but if you’re extending the demo for your own use, you may want to move them out into their own source files.

Why OpenGL 3.3?

I chose OpenGL 3.3 in particular for three reasons:

As far as “desktop” OpenGL goes, 3.3 is currently the prime target.


Why GLFW?

Until EGL someday fills this role, the process for obtaining an OpenGL context is specific to each operating system, where it’s generally a pain in the butt. GLUT, the OpenGL Utility Toolkit, was a library to make this process uniform across the different platforms. It also normalized user input (keyboard and mouse) and provided some basic (and outdated) utility functions.

The original GLUT isn’t quite open source (licensing issues) and it’s no longer maintained. The open source replacement for GLUT is FreeGLUT. It’s what you’d typically find on a Linux system in place of the original GLUT.

I just need a portable library that creates a window, handles keyboard and mouse events in that window, and gives me an OpenGL 3.3 core profile context. FreeGLUT does this well, but we can do better. One problem is that it includes a whole bunch of legacy cruft from GLUT: immediate mode rendering utilities, menus, spaceball support, lots of global state, and only one OpenGL context per process.

One of the biggest problems is that FreeGLUT doesn’t have a swap interval function. This is used to lock the application’s redraw rate to the system’s screen refresh rate, preventing screen tearing and excessive resource consumption. I originally used FreeGLUT for the demo and, as a workaround, added my own code to find and call the system’s swap interval function, but it was a total hack.

The demo was initially written with FreeGLUT, but I switched over to GLFW since it’s smaller, simpler, cleaner, and more modern. GLFW also has portable joystick handling. With the plethora of modern context+window creation libraries out there, it seems there’s not much reason to use FreeGLUT anymore.
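
For illustration, here’s roughly the shape of the context setup with GLFW (a minimal sketch, not the demo’s exact code):

#include <GLFW/glfw3.h>

int
main(void)
{
    if (!glfwInit())
        return 1;
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    GLFWwindow *window = glfwCreateWindow(640, 640, "demo", NULL, NULL);
    if (!window)
        return 1;
    glfwMakeContextCurrent(window);
    glfwSwapInterval(1);  /* lock redraw to the screen refresh rate */
    while (!glfwWindowShouldClose(window)) {
        /* ... draw the scene ... */
        glfwSwapBuffers(window);
        glfwPollEvents();
    }
    glfwTerminate();
    return 0;
}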

SDL 2.0 would also be an excellent choice. It goes beyond GLFW with threading, audio, networking, image loading, and timers: basically all the stuff you’d need when writing a game.

I’m sure there are some other good alternatives, especially when you’re not sticking to plain C, but these are the libraries I’m familiar with at the time of this article.

Why gl3w?

If you didn’t think the interface between OpenGL and the operating system was messy enough, I have good news for you. Neither the operating system nor the video card drivers are going to provide any of the correct headers, nor will you have anything meaningful to link against! For these, you’re on your own.

The OpenGL Extension Wrangler Library (GLEW) was invented to solve this problem. It dynamically loads the system’s OpenGL libraries and finds all the relevant functions at run time. That way your application avoids linking to anything too specific. At compile time, it provides the headers defining all of the OpenGL functions.

Over the years, GLEW has become outdated, to this day having no support for core profile. So instead I used a replacement called gl3w. It’s just like GLEW, but, as the name suggests, oriented around core profile … exactly what I needed. Unlike GLEW, it is generated directly from Khronos’ documentation by a script. In practice, you drop the generated code directly into your project (embedded) rather than rely on the system to provide it as a library.
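
Usage is a single call once the context exists. This sketch follows gl3w’s documented interface, though the error handling is my own:

/* ... after the OpenGL context has been created ... */
if (gl3wInit()) {
    fprintf(stderr, "gl3w: failed to initialize OpenGL\n");
    exit(EXIT_FAILURE);
}
if (!gl3wIsSupported(3, 3)) {
    fprintf(stderr, "gl3w: OpenGL 3.3 not supported\n");
    exit(EXIT_FAILURE);
}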

A great (and probably better) alternative to gl3w is glLoadGen. It’s the same idea – an automatically generated OpenGL loader – but allows for full customization of the output, such as the inclusion of select OpenGL extensions.


While I hope it serves as an educational resource for others, I primarily have it for my own record-keeping, pedagogical, and reference purposes, born out of a weekend’s worth of research. It’s a starting point for future projects, and it’s somewhere easy to start when I want to experiment with an idea.

Plus, someday I want to write a sweet, standalone game with fancy OpenGL graphics.

tags: [ opengl c ]

Raw Linux Threads via System Calls

This article has been translated to Japanese.

Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.

Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.

It’s so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86_64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).

I’ve put the complete demo here if you want to see it all at once:

An x86_64 Primer

I want you to be able to follow along even if you aren’t familiar with x86_64 assembly, so here’s a short primer of the relevant pieces. If you already know x86_64 assembly, feel free to skip to the next section.

x86_64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won’t need them for threading.
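
The sixteen are rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8 through r15.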

The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32-bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32-bits, then to 64-bits. Historically each of these registers had a specific, unique purpose, but on x86_64 they’re almost completely interchangeable.

There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.

The Stack

The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.

The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86_64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order:
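
rdi, rsi, rdx, rcx, r8, r9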

So, for example, to perform a function call like foo(1, 2, 3), store 1, 2 and 3 in rdi, rsi, and rdx, then call the function. The mov instruction stores the source (second) operand in its destination (first) operand. The call instruction pushes the current value of rip onto the stack, then sets rip (jumps) to the address of the target function. When the callee is ready to return, it uses the ret instruction to pop the original rip value off the stack and back into rip, returning control to the caller.

    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    call foo

Called functions must preserve the contents of these registers (the same value must be stored when the function returns):
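
rbx, rsp, rbp, r12, r13, r14, r15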

System Calls

When making a system call, the argument registers are slightly different. Notice rcx has been changed to r10.
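
rdi, rsi, rdx, r10, r8, r9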

Each system call has an integer identifying it. This number is different on each platform, but, in Linux’s case, it will never change. Instead of call, rax is set to the number of the desired system call and the syscall instruction makes the request to the OS kernel. Prior to x86_64, this was done with an old-fashioned interrupt. Because interrupts are slow, a special, statically-positioned “vsyscall” page (now deprecated as a security hazard), later vDSO, is provided to allow certain system calls to be made as function calls. We’ll only need the syscall instruction in this article.

So, for example, the write() system call has this C prototype.

ssize_t write(int fd, const void *buf, size_t count);

On x86_64, the write() system call is at the top of the system call table as system call 1 (read() is 0). Standard output is file descriptor 1 by default (standard input is 0). The following bit of code will write 10 bytes of data from the memory address buffer (a symbol defined elsewhere in the assembly program) to standard output. The number of bytes written, or -1 for error, will be returned in rax.

    mov rdi, 1        ; fd
    mov rsi, buffer
    mov rdx, 10       ; 10 bytes
    mov rax, 1        ; SYS_write
    syscall

Effective Addresses

There’s one last thing you need to know: registers often hold a memory address (i.e. a pointer), and you need a way to read the data behind that address. In NASM syntax, wrap the register in brackets (e.g. [rax]), which, if you’re familiar with C, would be the same as dereferencing the pointer.

These bracket expressions, called an effective address, may be limited mathematical expressions that offset the base address entirely within a single instruction. This expression can include another register (index), a power-of-two scalar (bit shift), and an immediate signed offset. For example, [rax + rdx*8 + 12]. If rax is a pointer to a struct, and rdx is an index into an array on that struct, only a single instruction is needed to read that element. NASM is smart enough to allow the assembly programmer to break this mold a little bit with more complex expressions, so long as it can reduce it to the [base + index*2^exp + offset] form.

The details of addressing aren’t important for this article, so don’t worry too much about it if that didn’t make sense.

Allocating a Stack

Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. That leaves the stack, and before we can spawn a new thread, we need to allocate a stack, which is nothing more than a memory buffer.

The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.

You can’t just read from and write to arbitrary addresses in virtual memory; you first have to ask the kernel to allocate pages. There are two system calls on Linux to do this: brk() and mmap().

On x86_64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.

void *stack_create(void);

The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.

void *mmap(void *addr, size_t length, int prot, int flags);

For flags, we’ll choose a private, anonymous mapping that, being a stack, grows downward. Even with that last flag, the system call will still return the bottom address of the mapping, which will be important to remember later. It’s just a simple matter of setting the arguments in the registers and making the system call.

%define SYS_mmap       9
%define STACK_SIZE     (4096 * 1024)   ; 4 MB
%define PROT_READ      0x1
%define PROT_WRITE     0x2
%define MAP_PRIVATE    0x02
%define MAP_ANONYMOUS  0x20
%define MAP_GROWSDOWN  0x0100

stack_create:
    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov r10, MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

Now we can allocate new stacks (or stack-sized buffers) as needed.

Spawning a Thread

Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.

long sys_clone(unsigned long flags, void *child_stack);

Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.

long thread_create(void (*)(void));

The function pointer argument is passed via rdi, per the ABI. Store this for safekeeping on the stack (push) in preparation for calling stack_create(). When it returns, the address of the low end of stack will be in rax.

thread_create:
    push rdi
    call stack_create
    lea rsi, [rax + STACK_SIZE - 8]
    pop qword [rsi]
    mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
             CLONE_PARENT | CLONE_THREAD | CLONE_IO
    mov rax, SYS_clone
    syscall
    ret

The second argument to clone() is a pointer to the high address of the stack (specifically, just above the stack). So we need to add STACK_SIZE to rax to get the high end. This is done with the lea instruction: load effective address. Despite the brackets, it doesn’t actually read memory at that address, but instead stores the address in the destination register (rsi). I’ve moved it back by 8 bytes because I’m going to place the thread function pointer at the “top” of the new stack in the next instruction. You’ll see why in a moment.

Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.

As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the callee by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).

Now here’s the really cool part, and the reason branching isn’t needed. There’s no reason to check rax to determine if we are the original thread (in which case we return to the caller) or if we’re the new thread (in which case we jump to the thread function). Remember how we seeded the new stack with the thread function? When the new thread returns (ret), it will jump to the thread function with a completely empty stack. The original thread, using the original stack, will return to the caller.

The value returned by thread_create() is the process ID of the new thread, which is essentially the thread object (e.g. Pthread’s pthread_t).

Cleaning Up

The thread function has to be careful not to return (ret) since there’s nowhere to return. It will fall off the stack and terminate the program with a segmentation fault. Remember that threads are just processes? It must use the exit() syscall to terminate. This won’t terminate the other threads.

%define SYS_exit    60

    mov rdi, 0          ; exit status
    mov rax, SYS_exit
    syscall

Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.

More Exploration

If you found this interesting, be sure to check out the full demo link at the top of this article. Now with the ability to spawn threads, it’s a great opportunity to explore and experiment with x86’s synchronization primitives, such as the lock instruction prefix, xadd, and compare-and-exchange (cmpxchg). I’ll discuss these in a future article.

tags: [ x86 linux c tutorial ]

NASM x86 Assembly Major Mode for Emacs

Last weekend I created a new Emacs mode, nasm-mode, for editing Netwide Assembler (NASM) x86 assembly programs. Over the past week I tweaked it until it felt comfortable enough to share on MELPA. It’s got what you’d expect from a standard Emacs programming language mode: syntax highlighting, automatic indentation, and imenu support. It’s not a full parser, but it knows all of NASM’s instructions and directives.

Until recently I didn’t really have preferences about x86 assemblers (GAS, NASM, YASM, FASM, MASM, etc.) or syntax (Intel, AT&T). I stuck to the GNU Assembler (GAS) since it’s already there with all the other GNU development tools I know and love, and it’s required for inline assembly in GCC. However, nasm-mode now marks my commitment to NASM as my primary x86 assembler.


Why NASM?

I need an assembler that can assemble 16-bit code (8086, 8088, 80186, 80286), because real mode is fun. Despite its .code16gcc directive, GAS is not suitable for this purpose. It’s just enough to get the CPU into protected mode – as needed when writing an operating system with GCC – and that’s it. A different assembler is required for serious 16-bit programming.

GAS syntax has problems. I’m not talking about the argument order (source first or destination first), since there’s no right answer to that one. The linked article covers a number of problems, with these being the big ones for me:

Being a portable assembler, GAS is the jack of all instruction sets, master of none. If I’m going to write a lot of x86 assembly, I want a tool specialized for the job.


I also looked at YASM, a rewrite of NASM. It supports 16-bit assembly and mostly uses NASM syntax. In my research I found that NASM used to lag behind in features due to slower development, which is what spawned YASM. In recent years this seems to have flipped around, with YASM lagging behind. If you’re using YASM, nasm-mode should work pretty well for you, since it’s still very similar.

YASM optionally supports GAS syntax, but this reintroduces almost all of GAS’s problems. Even YASM’s improvements (i.e. its ORG directive) become broken when switching to GAS syntax.


FASM is the “flat assembler,” an assembler written in assembly language. This means it’s only available on x86 platforms. While I don’t really plan on developing x86 assembly on a Raspberry Pi, I’d rather not limit my options! I already regard 16-bit DOS programming as a form of embedded programming, and this may very well extend to the rest of x86 someday.

Also, it hasn’t made its way into the various Linux distribution package repositories, including Debian, so it’s already at a disadvantage for me.


MASM is Microsoft’s assembler that comes with Visual Studio. Windows only and not open source, it’s in no way a serious consideration. But since NASM’s syntax was originally derived from MASM, it’s worth mentioning. NASM takes the good parts of MASM and fixes the mistakes (such as the offset operator). It’s different enough that nasm-mode would not work well with MASM.


NASM’s not perfect, but it’s got an excellent manual, it’s a solid program that does exactly what it says it will do, it has a powerful macro system and great 16-bit support, it’s highly portable and easy to build, and its semantics and syntax have been carefully considered. It also comes with a simple, pure binary disassembler (ndisasm). In retrospect it seems like an obvious choice!

My one complaint would be that it’s too flexible about labels. The colon on labels is optional, which can lead to subtle bugs. NASM will warn about this under some conditions (orphan-labels). Combined with the preprocessor, the difference between a macro and a label is ambiguous, short of re-implementing the entire preprocessor in Emacs Lisp.

Why nasm-mode?

Emacs comes with an asm-mode for editing assembly code for various architectures. Unfortunately it’s another jack-of-all-trades that’s not very good. More so, it doesn’t follow Emacs’ normal editing conventions, having unusual automatic indentation and self-insertion behaviors. It’s what prompted me to make nasm-mode.

To be fair, I don’t think it’s possible to write a major mode that covers many different instruction set architectures. Each architecture has its own quirks and oddities that essentially give it a unique language. This is especially true with x86, which, over its 37 year tenure touched by so many different vendors, comes in a number of incompatible flavors. Each assembler/architecture pair needs its own major mode. I hope I just wrote NASM’s.

One area where I’m still stuck is that I can’t find an x86 style guide. It’s easy to find half a dozen style guides of varying authority for any programming language that’s more than 10 years old … except x86. There’s no obvious answer when it comes to automatic indentation. How are comments formatted and indented? How are instructions aligned? Should labels be on the same line as the instruction? Should labels require a colon? (I’ve decided this is “yes.”) What about long label names? How are function prototypes/signatures documented? (The mode could take advantage of such a standard, a la ElDoc.) It seems everyone uses their own style. This is another conundrum for a generic asm-mode.

There are a couple of other nasm-modes floating around with different levels of completeness. Mine should supersede these, and will be much easier to maintain into the future as NASM evolves.

tags: [ emacs x86 ]

A Basic Just-In-Time Compiler

Monday’s /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It’s given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it’s an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0, then u(1) = 1, u(2) = 4, u(3) = 13, and so on.

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86_64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there's less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will cause a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

To keep things simple here, I’m just assuming the page size is 4kB. In a real program, we’d use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86_64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.
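
If portability mattered, the dynamic query is a one-liner. Here's a sketch (this program doesn't actually use it):

#include <unistd.h>

/* Sketch: discover the real page size at run time (unused here). */
long page_size = sysconf(_SC_PAGESIZE);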

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It’s anonymous because it’s not backed by a file.

struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}

Windows doesn’t have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here’s the equivalent in Win32.

struct asmbuf *
asmbuf_create(void)
{
    DWORD type = MEM_RESERVE | MEM_COMMIT;
    return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}

Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

Or on Win32 (that last parameter is not allowed to be NULL),

void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}

Finally, instead of free() it gets unmapped.

void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}

And on Win32,

void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, 0, MEM_RELEASE);
}

I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

void asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
void asmbuf_immediate(struct asmbuf *, int size, const void *value);
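
Here's a rough sketch of what they might look like, though (my guess at minimal definitions, not necessarily the real ones). Instructions are passed most-significant byte first, exactly as they read in the disassembly, so they're emitted in big-endian order; immediates are copied verbatim, which on x86 means little-endian.

#include <string.h>

/* Sketch: emit instruction bytes most-significant first. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Sketch: copy an immediate into the buffer as-is. */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}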

Calling Conventions

We’re only going to be concerned with three of x86_64’s many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we’ll be able to call from C like a normal function. Here’s what it’s prototype will look like. It takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that’s where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   %rdi, %rax

There’s a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.

ret

So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you download the x86_64 documentation, look up the instructions we're using, and manually work out the bytes we need and how the operands fit into them. You know, like they used to do out of necessity back in the '60s.

Fortunately there’s a much easier way. We’ll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to as (GAS). It will produce a.out, which we’ll disassemble with objdump -d.

$ as peek.s
$ objdump -d a.out

a.out:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   c3                      retq

That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   %rdi, %rax
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // retq

For each operation, we’ll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator’s instruction if it’s small (32-bits or fewer), but I’m keeping it simple. To sneakily capture the “template” for this instruction I’m going to use 0x0123456789abcdef as the operand.

mov   $0x0123456789abcdef, %rdi

Which disassembled with objdump -d is,

0:  48 bf ef cd ab 89 67    movabs $0x123456789abcdef,%rdi
7:  45 23 01

Notice the operand is listed in little-endian order immediately after the instruction's opcode. That's also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   operand, %rdi
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
case '+':
    asmbuf_ins(buf, 3, 0x4801f8);   // add   %rdi, %rax
    break;
case '-':
    asmbuf_ins(buf, 3, 0x4829f8);   // sub   %rdi, %rax
    break;
case '*':
    asmbuf_ins(buf, 4, 0x480fafc7); // imul  %rdi, %rax
    break;
case '/':
    asmbuf_ins(buf, 3, 0x4831d2);   // xor   %rdx, %rdx
    asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  %rdi
    break;
}

As an exercise, try adding support for the modulus operator (%), XOR (^), and bit shifts (<, >). With those additions you could define a decent PRNG as a recurrence relation. They would also eliminate the closed-form solution to this problem, so that we'd actually have a reason to do all this! Or, alternatively, switch it all to floating point.
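
For instance, modulus and XOR might look something like this (bytes discovered with the same as/objdump trick shown above; idiv leaves the remainder in rdx):

case '%':
    asmbuf_ins(buf, 3, 0x4831d2);   // xor   %rdx, %rdx
    asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  %rdi
    asmbuf_ins(buf, 3, 0x4889d0);   // mov   %rdx, %rax
    break;
case '^':
    asmbuf_ins(buf, 3, 0x4831f8);   // xor   %rdi, %rax
    break;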

Calling the Generated Code

Once we’re all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.

I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

tags: [ c tutorial netsec x86 ]

Goblin-COM 7DRL 2015

Yesterday I completed my third entry to the annual Seven Day Roguelike (7DRL) challenge (previously: 2013 and 2014). This year’s entry is called Goblin-COM.

As with previous years, the ideas behind the game are not all that original. The goal was to be a fantasy version of classic X-COM with an ANSI terminal interface. You are the ruler of a fledgling human nation that is under attack by invading goblins. You hire heroes, operate squads, construct buildings, and manage resource income.

The inspiration this year came from watching BattleBunny play OpenXCOM, an open source clone of the original X-COM. It had its major 1.0 release last year. Like the early days of OpenTTD, it currently depends on the original game assets. But also like OpenTTD, it surpasses the original game in every way, so there’s no reason to bother running the original anymore. I’ve also recently been watching One F Jef play Silent Storm, which is another turn-based squad game with a similar combat simulation.

As in X-COM, the game is broken into two modes of play: the geoscape (strategic) and the battlescape (tactical). Unfortunately I ran out of time and didn’t get to the battlescape part, though I’d like to add it in the future. What’s left is a sort-of city-builder with some squad management. You can hire heroes and send them out in squads to eliminate goblins, but rather than dropping to the battlescape, battles always auto-resolve in your favor. Despite this, the game still has a story, a win state, and a lose state. I won’t say what they are, so you have to play it for yourself!

Terminal Emulator Layer

My previous entries were HTML5 games, but this entry is a plain old standalone application. C has been my preferred language for the past few months, so that’s what I used. Both UTF-8-capable ANSI terminals and the Windows console are supported, so it should be perfectly playable on any modern machine. Note, though, that some of the poorer-quality terminal emulators that you’ll find in your Linux distribution’s repositories (rxvt and its derivatives) are not Unicode-capable, which means they won’t work with G-COM.

I didn’t make use of ncurses, instead opting to write my own terminal graphics engine. That’s because I wanted a single, small binary that was easy to build, and I didn’t want to mess around with PDCurses. I’ve also been studying the Win32 API lately, so writing my own terminal platform layer would rather easy to do anyway.

I experimented with a number of terminal emulators – LXTerminal, Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator – but the least capable “terminal” by far is the Windows console, so it was the one to dictate the capabilities of the graphics engine. Some ANSI terminals are capable of 256 colors, bold, underline, and strikethrough fonts, but a highly portable API is basically limited to 16 colors (RGBCMYKW with two levels of intensity) for each of the foreground and background, and no other special text properties.

ANSI terminals also have a concept of a default foreground color and a default background color. Most applications that output color (git, grep, ls) leave the background color alone and are careful to choose neutral foreground colors. G-COM always sets the background color, so that the game looks the same no matter what the default colors are. Also, the Windows console doesn’t really have default colors anyway, even if I wanted to use them.

I put in partial support for Unicode because I wanted to use interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported Unicode for a long time now, but since they added it too early, they’re locked into the outdated UTF-16. For me this wasn’t too bad, because few computers, Linux included, are equipped to render characters outside of the Basic Multilingual Plane anyway, so there’s no need to deal with surrogate pairs. This is especially true for the Windows console, which can only render a very small set of characters: another limit on my graphics engine. Internally individual codepoints are handled as uint16_t and strings are handled as UTF-8.
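
Encoding such a codepoint as UTF-8 takes at most three bytes. Here's a sketch of what that encoder might look like (not G-COM's actual code):

#include <stdint.h>

/* Sketch: encode a BMP codepoint as UTF-8, returning the byte count. */
static int
utf8_encode(uint16_t c, uint8_t *buf)
{
    if (c < 0x80) {
        buf[0] = (uint8_t)c;
        return 1;
    } else if (c < 0x800) {
        buf[0] = 0xc0 | (c >> 6);
        buf[1] = 0x80 | (c & 0x3f);
        return 2;
    } else {
        buf[0] = 0xe0 | (c >> 12);
        buf[1] = 0x80 | ((c >> 6) & 0x3f);
        buf[2] = 0x80 | (c & 0x3f);
        return 3;
    }
}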

I said partial support because, in addition to the above, it has no support for combining characters, or any other situation where a codepoint takes up something other than one space in the terminal. That would require lookup tables and dealing with pitfalls, but since I get to control exactly which characters would be used, I didn't need any of that.

In spite of the limitations, I’m really happy with the graphical results. The waves are animated continuously, even while the game is paused, and it looks great. Here’s GNOME Terminal’s rendering, which I think looked the best by default.

I’ll talk about how G-COM actually communicates with the terminal in another article. The interface between the game and the graphics engine is really clean (device.h), so it would be an interesting project to write a back end that renders the game to a regular window, no terminal needed.

Color Directive

I came up with a format directive to help me colorize everything. It runs in addition to the standard printf directives. Here’s an example,

panel_printf(&panel, 1, 1, "Really save and quit? (Rk{y}/Rk{n})");

The color is specified by two characters, and the text it applies to is wrapped in curly brackets. There are eight colors to pick from: RGBCMYKW. That covers all the binary values for red, green, and blue. To specify an “intense” (bright) color, capitalize it. That means the Rk{...} above makes the wrapped text bright red.

Nested directives are also supported. (And, yes, that K means “high-intensity black,” a.k.a. dark gray. A w means “low-intensity white,” a.k.a. light gray.)

panel_printf(p, x, y++, "Kk{♦}    wk{Rk{B}uild}     Kk{♦}");

And it mixes with the normal printf directives:

panel_printf(p, 1, y++, "(Rk{m}) Yk{Mine} [%s]", cost);
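
I won't go into panel_printf's internals, but decoding one of those color characters might look roughly like this, assuming a hypothetical 4-bit representation where the low 3 bits are RGB and bit 3 is intensity (my illustration, not G-COM's actual encoding):

#include <ctype.h>
#include <string.h>

/* Hypothetical: map a directive character to a 4-bit color.
 * Index into "krgybmcw" gives the RGB bits (bit 0 = red). */
static int
color_decode(int c)
{
    static const char colors[] = "krgybmcw";
    const char *p = strchr(colors, tolower(c));
    int intensity = isupper(c) ? 8 : 0;
    return (p ? (int)(p - colors) : 7) | intensity;
}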

Single Binary

The GNU linker has a really nice feature for linking arbitrary binary data into your application. I used this to embed my assets into a single binary so that the user doesn’t need to worry about any sort of data directory or anything like that. Here’s what the make rule would look like:

$(LD) -r -b binary -o $@ $^

The -r specifies that output should be relocatable – i.e. it can be fed back into the linker later when linking the final binary. The -b binary says that the input is just an opaque binary file (“plain” text included). The linker will create three symbols for each input file:

_binary_filename_txt_start
_binary_filename_txt_end
_binary_filename_txt_size

These are for an input file named filename.txt (non-alphanumeric characters become underscores), which you can then access from your C program like so:

extern const char _binary_filename_txt_start[];
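
For example, if a hypothetical story.txt were linked in this way, dumping its contents would look like this (story.txt is my stand-in name; the symbols follow the same mangling):

#include <stdio.h>

extern const char _binary_story_txt_start[];
extern const char _binary_story_txt_end[];

void
story_print(void)
{
    size_t size = _binary_story_txt_end - _binary_story_txt_start;
    fwrite(_binary_story_txt_start, 1, size, stdout);
}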

I used this to embed the story texts, and I’ve used it in the past to embed images and textures. If you were to link zlib, you could easily compress these assets, too. I’m surprised this sort of thing isn’t done more often!

Dumb Game Saves

To save time, and because it doesn’t really matter, saves are just memory dumps. I took another page from Handmade Hero and allocate everything in a single, contiguous block of memory. With one exception, there are no pointers, so the entire block is relocatable. When references are needed, it’s done via integers into the embedded arrays. This allows it to be cleanly reloaded in another process later. As a side effect, it also means there are no dynamic allocations (malloc()) while the game is running. Here’s roughly what it looks like.

typedef struct game {
    uint64_t map_seed;
    map_t *map;
    long time;
    float wood, gold, food;
    long population;
    float goblin_spawn_rate;
    invader_t invaders[16];
    squad_t squads[16];
    hero_t heroes[128];
    game_event_t events[16];
} game_t;

The map pointer is that one exception, but that’s because it’s generated fresh after loading from the map_seed. Saving and loading is trivial (error checking omitted) and very fast.

void
game_save(game_t *game, FILE *out)
{
    fwrite(game, sizeof(*game), 1, out);
}

game_t *
game_load(FILE *in)
{
    game_t *game = malloc(sizeof(*game));
    fread(game, sizeof(*game), 1, in);
    game->map = map_generate(game->map_seed);
    return game;
}

The data isn’t important enough to bother with rename+fsync durability. I’ll risk the data if it makes savescumming that much harder!

The downside to this technique is that saves are generally not portable across architectures (particularly where endianness differs), and may not even be portable between different platforms on the same architecture. I only needed to persist a single game state on the same machine, so this wouldn't be a problem.

Final Results

I’m definitely going to be reusing some of this code in future projects. The G-COM terminal graphics layer is nifty, and I already like it better than ncurses, whose API I’ve always thought was kind of ugly and old-fashioned. I like writing terminal applications.

Just like the last couple of years, the final game is a lot simpler than I had planned at the beginning of the week. Most things take longer to code than I initially expect. I’m still enjoying playing it, which is a really good sign. When I play, I’m having enough fun to deliberately delay the end of the game so that I can sprawl my nation out over the island and generate crazy income.

tags: [ game media ]

Generic C Reference Counting

As a result of making regular use of object-oriented programming in C, I’ve discovered a useful reference counting technique for the occasional dynamically allocated structs that need it. The situation arises when the same struct instance is shared between an arbitrary number of other data structures and I need to keep track of it all.

It’s incredibly simple and lives entirely in a header file, so without further ado (ref.h):

#pragma once

struct ref {
    void (*free)(const struct ref *);
    int count;
};

static inline void
ref_inc(const struct ref *ref)
{
    ((struct ref *)ref)->count++;
}

static inline void
ref_dec(const struct ref *ref)
{
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}

It has only two fields: the reference count and a “method” that knows how to free the object once the reference count hits 0. Structs using this reference counter will know how to free themselves, so callers will never call a specific *_destroy()/*_free() function. Instead they call ref_dec() to decrement the reference counter and let it happen on its own.

I decided to go with a signed count because it allows for better error checking. It may be worth putting an assert() in ref_inc() and ref_dec() to ensure the count is always non-negative. I chose an int because it’s fast, and anything smaller will be padded out to at least that size anyway. On x86_64, struct ref is 16 bytes.
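
Here's what those assertions might look like (a sketch layered on the definitions above):

#include <assert.h>

static inline void
ref_inc(const struct ref *ref)
{
    assert(ref->count >= 0);  /* count must never go negative */
    ((struct ref *)ref)->count++;
}

static inline void
ref_dec(const struct ref *ref)
{
    assert(ref->count > 0);   /* decrementing a dead object is a bug */
    if (--((struct ref *)ref)->count == 0)
        ref->free(ref);
}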

This is basically all there is to a C++ shared_ptr, which leverages C++'s destructors to perform all the increment/decrement work automatically.

Thread Safety

Those increments and decrements aren’t thread safe, so this won’t work as-is when data structures are shared between threads. If you’re sure that you’re using GCC on a capable platform, you can make use of its atomic builtins, making the reference counter completely thread safe.

static inline void
ref_inc(const struct ref *ref)
{
    __sync_add_and_fetch((int *)&ref->count, 1);
}

static inline void
ref_dec(const struct ref *ref)
{
    if (__sync_sub_and_fetch((int *)&ref->count, 1) == 0)
        ref->free(ref);
}

Or if you’re using C11, make use of the new stdatomic.h.

static inline void
ref_inc(const struct ref *ref)
{
    atomic_fetch_add((_Atomic int *)&ref->count, 1);
}

static inline void
ref_dec(const struct ref *ref)
{
    /* atomic_fetch_sub() returns the previous value */
    if (atomic_fetch_sub((_Atomic int *)&ref->count, 1) == 1)
        ref->free(ref);
}

What’s That Const?

There’s a very deliberate decision to make all of the function arguments const, for both reference counting functions and the free() method. This may seem wrong because these functions are specifically intended to modify the reference count. There are dangerous-looking casts in each case to remove the const.

The reason for this is that it's likely for someone holding a const pointer to one of these objects to want to keep their own reference. Their promise not to modify the object doesn't really apply to the reference count, which is merely embedded metadata. Otherwise they would need to cast the const away before being permitted to call ref_inc() and ref_dec(). Rather than litter the program with dangerous casts, the casts are all kept in one place – in the reference counting functions – where they're strictly limited to mutating the reference counting fields.

On a related note, the stdlib.h free() function doesn’t take a const pointer, so the free() method taking a const pointer is a slight departure from the norm. Taking a non-const pointer was a mistake in the C standard library. The free() function mutates the pointer itself – including all other pointers to that object – making it invalid. Semantically, it doesn’t mutate the memory behind the pointer, so it’s not actually violating the const. To compare, the Linux kernel kfree() takes a const void *.

Just as users may need to increment and decrement the counters on const objects, they’ll also need to be able to free() them, so it’s also a const.

Usage Example

So how does one use this generic reference counter? Embed a struct ref in your own structure and use our old friend: the container_of() macro. For anyone who's forgotten, this macro is not part of standard C, but you can define it with offsetof().

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

Here’s a dumb linked list example where each node is individually reference counted. Adding an extra 16 bytes to each of your linked list nodes isn’t normally going to help with much, but if the tail of the linked list is being shared between different data structures (such as other lists), reference counting makes things a lot simpler.

struct node {
    char id[64];
    float value;
    struct node *next;
    struct ref refcount;
};

I put refcount at the end so that we’ll have to use container_of() in this example. It conveniently casts away the const for us.

static void
node_free(const struct ref *ref)
{
    struct node *node = container_of(ref, struct node, refcount);
    struct node *child = node->next;
    free(node);
    if (child)
        ref_dec(&child->refcount);
}

Notice that it recursively decrements its child’s reference count afterwards (intentionally tail recursive). A whole list will clean itself up when the head is freed and no part of the list is shared.

The allocation function sets up the free() function pointer and initializes the count to 1.

struct node *
node_create(char *id, float value)
{
    struct node *node = malloc(sizeof(*node));
    snprintf(node->id, sizeof(node->id), "%s", id);
    node->value = value;
    node->next = NULL;
    node->refcount = (struct ref){node_free, 1};
    return node;
}

(Side note: I used snprintf() because strncpy() is broken and strlcpy() is non-standard, so it's the most straightforward way to do this in standard C.)

And to start making some use of the reference counter, here’s push and pop.

void
node_push(struct node **nodes, char *id, float value)
{
    struct node *node = node_create(id, value);
    node->next = *nodes;
    *nodes = node;
}

struct node *
node_pop(struct node **nodes)
{
    struct node *node = *nodes;
    *nodes = (*nodes)->next;
    if (*nodes)
        ref_inc(&(*nodes)->refcount);
    return node;
}

Notice node_pop() increments the reference count of the new head node before returning. That’s because the node now has an additional reference: from *nodes and from the node that was just popped. It’s up to the caller to free the returned node, which would decrement the count of the new head node, but not free it. Alternatively node_pop() could set next on the returned node to NULL rather than increment the counter, which would also prevent the returned node from freeing the new head when it gets freed. But it’s probably more useful for the returned node to keep functioning as a list. That’s what the reference counting is for, after all.
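
That alternative version might look like this:

struct node *
node_pop(struct node **nodes)
{
    struct node *node = *nodes;
    *nodes = node->next;
    node->next = NULL;  /* detach: freeing it won't touch the new head */
    return node;
}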

Finally, a simple program to exercise it all. It reads ID/value pairs from standard input.

void
node_print(struct node *node)
{
    for (; node; node = node->next)
        printf("%s = %f\n", node->id, node->value);
}

int
main(void)
{
    struct node *nodes = NULL;
    char id[64];
    float value;
    while (scanf(" %63s %f", id, &value) == 2)
        node_push(&nodes, id, value);
    if (nodes != NULL) {
        node_print(nodes);
        struct node *old = node_pop(&nodes);
        node_push(&nodes, "foobar", 0.0f);
        node_print(nodes);
        ref_dec(&old->refcount);
        ref_dec(&nodes->refcount);
    }
    return 0;
}

I’ve used this technique several times over the past few months. It’s trivial to remember, so I just code it up from scratch each time I need it.

tags: [ c ]