null program

Quickly Access x86 Documentation in Emacs

I recently released an Emacs package called x86-lookup. Given a mnemonic, Emacs will open up a local copy of Intel’s software developer manual PDF at the page documenting the instruction. It complements nasm-mode, released earlier this year.

x86-lookup is also available from MELPA.

To use it, you’ll need Poppler’s pdftotext command line program — used to build an index of the PDF — and a copy of the complete Volume 2 of Intel’s instruction set manual. There’s only one command to worry about: M-x x86-lookup.

Minimize documentation friction

This package should be familiar to anyone who’s used javadoc-lookup, one of my older packages. It scratches the same underlying itch: the context switch to read API documentation while coding should have as little friction as possible; otherwise I’m discouraged from doing it. In an ideal world I wouldn’t ever need to check documentation because it’s already in my head. By visiting documentation frequently with ease, it becomes familiar that much faster, and I’ll be reaching for it less and less, approaching the ideal.

I picked up x86 assembly about a year ago and for the first few months I struggled to find a good online reference for the instruction set. There are little scraps here and there, but not much of substance. The big exception is Félix Cloutier’s reference, which is an amazingly well-done HTML conversion of Intel’s PDF manuals. Unfortunately I could never get it working locally to generate my own. There’s also the X86 Opcode and Instruction Reference, but it’s more for machines than humans.

Besides, I often work without an Internet connection, so offline documentation is absolutely essential. (You hear that, Microsoft? Not only do I avoid coding against Win32 because it’s badly designed, but even more so because you don’t offer offline documentation anymore! The friction to reference your API documentation is enormous.)

I avoided the official x86 documentation for a while, thinking it would be too opaque, at least until I became more accustomed to the instruction set. But really, it’s not bad! With a handle on the basics, I would encourage anyone to dive into either Intel’s or AMD’s manuals. The reason there’s not much online in HTML form is because these manuals are nearly everything you need.

I chose Intel’s manual for x86-lookup because I’m more familiar with it, it’s more popular, it’s (slightly) easier to parse, it’s offered as a single PDF, and it’s more complete. The regular expression for finding instructions is tuned for Intel’s manual and it won’t work with AMD’s manuals.

For a couple months prior to writing x86-lookup, I had a couple of scratch functions to very roughly accomplish the same thing. The tipping point for formalizing it was that last month I wrote my own x86 assembler. A single mnemonic often has a dozen or more different opcodes depending on the instruction’s operands, and there are often several ways to encode the same operation. I was frequently looking up opcodes, and navigating the PDF quickly became a real chore. I only needed about 80 different opcodes, so I was just adding them to the assembler’s internal table manually as needed.

How does it work?

Say you want to look up the instruction RDRAND.

Initially Emacs has no idea what page this is on, so the first step is to build an index mapping mnemonics to pages. x86-lookup runs the pdftotext command line program on the PDF and loads the result into a temporary buffer.

The killer feature of pdftotext is that it emits FORM FEED (U+000C) characters between pages. Think of these as page breaks. By counting form feed characters, x86-lookup can track the page for any part of the document. In fact, Emacs is already set up to do this with its forward-page and backward-page commands. So to build the index, x86-lookup steps forward page-by-page looking for mnemonics, keeping note of the page. Since this process typically takes about 10 seconds, the index is cached in a file (see x86-lookup-cache-directory) for future use. It only needs to happen once for a particular manual on a particular computer.
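Stripped down, the page-counting idea looks something like this (a simplified sketch, not x86-lookup’s actual code; the function name is made up, and REGEXP is assumed to capture the mnemonic in group 1):

(defun my-x86-index-pages (regexp)
  "Return an alist of (MNEMONIC . PAGE) for REGEXP matches in this buffer."
  (let ((index ()))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward regexp nil t)
        ;; The page number is one more than the number of form feed
        ;; (page break) characters appearing before the match.
        (let ((page (1+ (how-many "\f" (point-min) (point)))))
          (push (cons (match-string 1) page) index))))
    (nreverse index)))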

The mnemonic listing is slightly incomplete, so x86-lookup expands certain mnemonics into the familiar set. For example, all the conditional jumps are listed under “Jcc,” but this is probably not what you’d expect to look up. I compared x86-lookup’s mnemonic listing against NASM/nasm-mode’s mnemonics to ensure everything was accounted for. Both packages benefited from this process.

Once the index is built, pdftotext is no longer needed. If you’re desperate and don’t have this program available, you can borrow the index file from another computer. But you’re on your own for figuring that out!

So to look up RDRAND, x86-lookup checks the index for the page number and invokes a PDF reader on that page. This is where not all PDF readers are created equal. There’s no convention for opening a PDF to a particular page, and each PDF reader differs. Some don’t even support it. To deal with this, x86-lookup has a customizable function, x86-lookup-browse-pdf-function, analogous to browse-url-browser-function, with specialized support for different PDF readers.
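For example, a function for the zathura PDF reader might look something like this (a sketch: it assumes the function receives the PDF path and the target page number, and that your zathura supports the -P page option):

(defun my-x86-lookup-browse-pdf-zathura (pdf page)
  "Open PDF at PAGE in the zathura PDF reader."
  (start-process "zathura" nil "zathura" "-P" (format "%d" page) pdf))

(setf x86-lookup-browse-pdf-function #'my-x86-lookup-browse-pdf-zathura)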

By default it tries to open the PDF for viewing within Emacs (did you know Emacs is a PDF viewer?), falling back to other options if the feature is unavailable. I welcome pull requests for any PDF readers not yet supported by x86-lookup. Perhaps this functionality deserves its own package.

That’s it! It’s a simple feature that has already saved me a lot of time. If you’re ever programming in x86 assembly, give x86-lookup a spin.

tags: [ x86 emacs ]

RSA Signatures in Emacs Lisp

Emacs comes with a wonderful arbitrary-precision computer algebra system called calc. I’ve discussed it previously and continue to use it on a daily basis. That’s right, people, Emacs can do calculus. Like everything Emacs, it’s programmable and extensible from Emacs Lisp. In this article, I’m going to implement the RSA public-key cryptosystem in Emacs Lisp using calc.

If you want to dive right in first, here’s the repository:

This is only a toy implementation and not really intended for serious cryptographic work. It’s also far too slow when using keys of reasonable length.

Evaluation with calc

The calc package is particularly useful when considering Emacs’ limited integer type. Emacs uses a tagged integer scheme where integers are embedded within pointers. It’s a lot faster than the alternative (individually-allocated integer objects), but it means they’re always a few bits short of the platform’s native integer type.
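For example, on a typical 64-bit build of Emacs, fixnums come up two bits short of a native 64-bit integer:

most-positive-fixnum
;; => 2305843009213693951    ; 2^61 - 1, not 2^63 - 1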

calc has a large API, but the user-friendly porcelain for it is the under-documented calc-eval function. It evaluates an expression string with format-like argument substitutions ($n).

(calc-eval "2^16 - 1")
;; => "65535"

(calc-eval "2^$1 - 1" nil 128)
;; => "340282366920938463463374607431768211455"

Notice it returns strings, which is one of the ways calc represents arbitrary precision numbers. For arguments, it accepts regular Elisp numbers and strings just like this function returns. The implicit radix is 10. To explicitly set the radix, prefix the number with the radix and #. This is the same as in the user interface of calc. For example:

(calc-eval "16#deadbeef")
;; => "3735928559"

The second argument (optional) to calc-eval adjusts its behavior. Given nil, it simply evaluates the string and returns the result. The manual documents the different options, but the only other relevant option for RSA is the symbol pred, which asks it to return a boolean “predicate” result.

(calc-eval "$1 < $2" 'pred "4000" "5000")
;; => t

Generating primes

RSA is founded on the difficulty of factoring large composites with large factors. Generating an RSA keypair starts with generating two prime numbers, p and q, and using these primes to compute two mathematically related composite numbers.

calc has a function calc-next-prime for finding the next prime number following any arbitrary number. It uses a probabilistic primality test to efficiently test large integers, incrementing the input until it finds a result that passes enough iterations of the test.

(calc-eval "nextprime($1)" nil "100000000000000000")
;; => "100000000000000003"

So to generate a random n-bit prime, first generate a random n-bit number and then increment it until a prime number is found.

;; Generate a 128-bit prime, 10 iterations (0.000084% error rate)
(calc-eval "nextprime(random(2^$1), 10)" nil 128)

Unfortunately calc’s random function is based on Emacs’ random function, which is entirely unsuitable for cryptography. In the real implementation I read n bits from /dev/urandom to generate an n-bit number.

;; The enclosing defun was missing here; the name below is illustrative.
(defun rsa--random (bits)
  "Return a random BITS-bit calc integer read from /dev/urandom."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (call-process "head" "/dev/urandom" t nil "-c" (format "%d" (/ bits 8)))
    (let ((f (apply-partially #'format "%02x")))
      (concat "16#" (mapconcat f (buffer-string) "")))))

(Note: /dev/urandom is the right choice. There’s no reason to use /dev/random for generating keys.)

Computing e and d

From here the code just follows along from the Wikipedia article. After generating the primes p and q, two composites are computed, n = p * q and i = (p - 1) * (q - 1). Lacking any reason to do otherwise, I chose 65,537 for the public exponent e.

The function rsa--inverse is just a straight Emacs Lisp + calc implementation of the extended Euclidean algorithm from the Wikipedia article pseudocode, computing d ≡ e^-1 (mod i). It’s not much use sharing it here, so take a look at the repository if you’re curious.

(defun rsa-generate-keypair (bits)
  "Generate a fresh RSA keypair plist of BITS length."
  (let* ((p (rsa-generate-prime (+ 1 (/ bits 2))))
         (q (rsa-generate-prime (+ 1 (/ bits 2))))
         (n (calc-eval "$1 * $2" nil p q))
         (i (calc-eval "($1 - 1) * ($2 - 1)" nil p q))
         (e (calc-eval "2^16+1"))
         (d (rsa--inverse e i)))
    `(:public  (:n ,n :e ,e) :private (:n ,n :d ,d))))

The public key is n and e and the private key is n and d. From here we can compute and verify cryptographic signatures.


To compute signature s of an integer m (where m < n), compute s ≡ m^d (mod n). I chose the right-to-left binary method, again straight from the Wikipedia pseudocode (lazy!). I’ll share this one since it’s short. The backslash denotes integer division.

(defun rsa--mod-pow (base exponent modulus)
  (let ((result 1))
    (setf base (calc-eval "$1 % $2" nil base modulus))
    (while (calc-eval "$1 > 0" 'pred exponent)
      (when (calc-eval "$1 % 2 == 1" 'pred exponent)
        (setf result (calc-eval "($1 * $2) % $3" nil result base modulus)))
      (setf exponent (calc-eval "$1 \\ 2" nil exponent)
            base (calc-eval "($1 * $1) % $2" nil base modulus)))
    result))

Verifying the signature is the same process, but with the public key’s e: m ≡ s^e (mod n). If the signature is valid, m will be recovered. In theory, only someone who knows d can feasibly compute s from m. If n is small enough to factor, revealing p and q, then d can be feasibly recomputed from the public key. So mind your Ps and Qs.

So that leaves one problem: generally users want to sign strings and files and such, not integers. A hash function is used to reduce an arbitrary quantity of data into an integer suitable for signing. Emacs comes with a bunch of them, accessible through secure-hash. It hashes strings and buffers.

(secure-hash 'sha224 "Hello, world!")
;; => "8552d8b7a7dc5476cb9e25dee69a8091290764b7f2a64fe6e78e9568"

Since the result is hexadecimal, just prefix 16# to turn it into a calc integer.

Here’s the signature and verification functions. Any string or buffer can be signed.

(defun rsa-sign (private-key object)
  (let ((n (plist-get private-key :n))
        (d (plist-get private-key :d))
        (hash (concat "16#" (secure-hash 'sha384 object))))
    ;; truncate hash such that hash < n
    (while (calc-eval "$1 > $2" 'pred hash n)
      (setf hash (calc-eval "$1 \\ 2" nil hash)))
    (rsa--mod-pow hash d n)))

(defun rsa-verify (public-key object sig)
  (let ((n (plist-get public-key :n))
        (e (plist-get public-key :e))
        (hash (concat "16#" (secure-hash 'sha384 object))))
    ;; truncate hash such that hash < n
    (while (calc-eval "$1 > $2" 'pred hash n)
      (setf hash (calc-eval "$1 \\ 2" nil hash)))
    (let* ((result (rsa--mod-pow sig e n)))
      (calc-eval "$1 == $2" 'pred result hash))))

Note the hash truncation step. If this is actually necessary, then your n is very easy to factor! It’s in there since this is just a toy and I want it to work with small keys.

Putting it all together

Here’s the whole thing in action with an extremely small, 128-bit key.

(setf message "hello, world!")

(setf keypair (rsa-generate-keypair 128))
;; => (:public  (:n "74924929503799951536367992905751084593"
;;               :e "65537")
;;     :private (:n "74924929503799951536367992905751084593"
;;               :d "36491277062297490768595348639394259869"))

(setf sig (rsa-sign (plist-get keypair :private) message))
;; => "31982247477262471348259501761458827454"

(rsa-verify (plist-get keypair :public) message sig)
;; => t

(rsa-verify (plist-get keypair :public) (capitalize message) sig)
;; => nil

Each of these operations took less than a second. For larger, secure-length keys, this implementation is painfully slow. For example, generating a 2048-bit key takes my laptop about half an hour, and computing a signature with that key (any size message) takes about a minute. That’s probably a little too slow for, say, signing ELPA packages.

tags: [ emacs elisp lisp ]

Counting Processor Cores in Emacs

One of the great advantages of dependency analysis is parallelization. Modern processors reorder instructions whose results don’t affect each other. Compilers reorder expressions and statements to improve throughput. Build systems know which outputs are inputs for other targets and can choose any arbitrary build order within that constraint. This article involves the last case.

The build system I use most often is GNU Make, either directly or indirectly (Autoconf, CMake). It’s far from perfect, but it does what I need. I almost always invoke it from within Emacs rather than in a terminal. In fact, I do it so often that I’ve wrapped Emacs’ compile command for rapid invocation.

I recently helped a co-worker set this up for himself, so it had me thinking about the problem again. The situation in my config is much more complicated than it needs to be, so I’ll share a simplified version instead.

First bring in the usual goodies (we’re going to be making closures):

;;; -*- lexical-binding: t; -*-
(require 'cl-lib)

We need a couple of configuration variables.

(defvar quick-compile-command "make -k ")
(defvar quick-compile-build-file "Makefile")

Then a couple of interactive functions to set these on the fly. It’s not strictly necessary, but I like giving each a key binding. I also like having a history available via read-string, so I can switch between a couple of different options with ease.

(defun quick-compile-set-command (command)
  (interactive
   (list (read-string "Command: " quick-compile-command)))
  (setf quick-compile-command command))

(defun quick-compile-set-build-file (build-file)
  (interactive
   (list (read-string "Build file: " quick-compile-build-file)))
  (setf quick-compile-build-file build-file))

Now finally to the good part. Below, quick-compile is a non-interactive function that returns an interactive closure ready to be bound to any key I desire. It takes an optional target. This means I don’t use the above quick-compile-set-command to choose a target, only for setting other options. That will make more sense in a moment.

(cl-defun quick-compile (&optional (target ""))
  "Return an interactive closure that runs `compile' for TARGET."
  (lambda ()
    (interactive)
    (save-buffer)  ; so I don't get asked
    (let ((default-directory
            (locate-dominating-file
             default-directory quick-compile-build-file)))
      (if default-directory
          (compile (concat quick-compile-command " " target))
        (error "Cannot find %s" quick-compile-build-file)))))

It traverses up (down?) the directory hierarchy towards root looking for a Makefile — or whatever is set for quick-compile-build-file — then invokes the build system there. I don’t believe in recursive make.

So how do I put this to use? I clobber some key bindings I don’t otherwise care about. A better choice might be the F-keys, but my muscle memory is already committed elsewhere.

(global-set-key (kbd "C-x c") (quick-compile)) ; default target
(global-set-key (kbd "C-x C") (quick-compile "clean"))
(global-set-key (kbd "C-x t") (quick-compile "test"))
(global-set-key (kbd "C-x r") (quick-compile "run"))

Each of those invokes a different target without second guessing me. Let me tell you, having “clean” at the tip of my fingers is wonderful.

Parallel Builds

An extension common to many different make programs is -j, which asks make to build targets in parallel where possible. These days, when multi-core machines are the norm, you nearly always want to use this option, ideally set to the number of logical processor cores on your system. It’s a huge time-saver.
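For example, to keep going after errors while running up to four jobs at once:

$ make -kj4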

My recent revelation was that my default build command could be better: make -k is minimal. It should at least include -j, but choosing an argument (number of processor cores) is a problem. Today I use different machines with 2, 4, or 8 cores, so most of the time any given number will be wrong. I could use a per-system configuration, but I’d rather not. Unfortunately GNU Make will not automatically detect the number of cores. That leaves the matter up to Emacs Lisp.

Emacs doesn’t currently have a built-in function that returns the number of processor cores. I’ll need to reach into the operating system to figure it out. My usual development environments are Linux, Windows, and OpenBSD, so my solution should work on each. I’ve ranked them by order of importance.

Number of cores on Linux

Linux has the /proc virtual filesystem in the fashion of Plan 9, allowing different aspects of the system to be explored through the standard filesystem API. The relevant file here is /proc/cpuinfo, listing useful information about each of the system’s processors. To get the number of processors, count the number of processor entries in this file. I’ve wrapped it in a file-exists-p check so that it returns nil on other operating systems instead of throwing an error.

(when (file-exists-p "/proc/cpuinfo")
  (with-temp-buffer
    (insert-file-contents "/proc/cpuinfo")
    (how-many "^processor[[:space:]]+:")))

Number of cores on Windows

When I was first researching how to do this on Windows, I thought I would need to invoke the wmic command line program and hope the output could be parsed the same way on different versions of the operating system and tool. However, it turns out the solution for Windows is trivial. The environment variable NUMBER_OF_PROCESSORS gives every process the answer for free. Being an environment variable, it will need to be parsed.

(let ((number-of-processors (getenv "NUMBER_OF_PROCESSORS")))
  (when number-of-processors
    (string-to-number number-of-processors)))

Number of cores on BSD

This seems to work the same across all the BSDs, including OS X, though I haven’t yet tested it exhaustively. Invoke sysctl, which returns an undecorated number to be parsed.

(with-temp-buffer
  (when (zerop (call-process "sysctl" nil t nil "-n" "hw.ncpu"))
    (string-to-number (buffer-string))))

Also not complicated, but it’s the heaviest solution of the three.

Putting it all together

Join all these together with or, call it numcores, and ta-da.
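Here’s what that might look like assembled into one function: a sketch combining the three snippets above, with an ignore-errors guard added in case sysctl doesn’t exist.

(defun numcores ()
  "Return the number of logical processor cores, or 1 if unknown."
  (or ;; Linux
      (when (file-exists-p "/proc/cpuinfo")
        (with-temp-buffer
          (insert-file-contents "/proc/cpuinfo")
          (how-many "^processor[[:space:]]+:")))
      ;; Windows
      (let ((n (getenv "NUMBER_OF_PROCESSORS")))
        (when n (string-to-number n)))
      ;; BSD and OS X
      (with-temp-buffer
        (when (eql 0 (ignore-errors
                       (call-process "sysctl" nil t nil "-n" "hw.ncpu")))
          (string-to-number (buffer-string))))
      1))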

(setf quick-compile-command (format "make -kj%d" (numcores)))

Now make is invoked correctly on any system by default.

tags: [ emacs c cpp ]

Web Tips For Webcomic Authors

My wife and I are huge webcomic fans. The web is the medium that the comic strip industry needed badly for decades, and, with Patreon and such today, we’re now living in a golden age of comics. As of this writing, I currently follow … let’s see … 39 different web comics.

(cl-count-if (lambda (x) (memq 'comic x)) elfeed-feeds)
;; => 39

My first exposure to comics was in my childhood when I got my hands on Bill Watterson’s Something Under the Bed Is Drooling (Calvin and Hobbes). This gave me very high expectations of the Sunday comics section of the newspaper when I’d read it at my grandmother’s house. Those hopes were shattered as I discovered just how awful nationally syndicated comic strips are: mostly watered down, lowest common denominator stuff like Garfield, Family Circus, Cathy, B.C., etc.

During Calvin and Hobbes’s original run, Bill Watterson wrote about his struggles with the newspapers and the Universal Press Syndicate, one of the organizations responsible for this mess. Newspapers and the Syndicate pushed for smaller frames and shorter comics. Authors were required to plan around newspapers removing frames for layout purposes. Many newspapers would drop comics that didn’t meet stringent content limitations — a line that even Calvin and Hobbes crossed on occasion. Authors had little control over how their work was published.

Those days are over. Today’s authors can cheaply host their comics on the web — webcomics — with full control over content, layout, and schedule. If they even try to monetize at all, it’s generally through advertising, merchandising, or reader donations. Some do it all in their free time, while for others it’s part or even full time employment. The number of regular readers of a single webcomic can be just a handful of people or up to millions of people. The role of the middleman is somewhere between diminished and non-existent. This is great, because newspapers would never publish the vast majority of the comics I read every day.

I’ve been fortunate to meet a couple of my favorite webcomic authors. Here’s a picture of my wife posing with Anthony Clark of Nedroid Picture Diary at the Small Press Expo.

I’ve also met Philippa Rice of My Cardboard Life. (Sorry, no picture for this one, since taking pictures with people isn’t really my thing.)

Over the years I’ve seen webcomic authors blunder with the web as a technology. In my experience the mistakes have been disproportionate, made more often by them than by the bloggers I follow. I suspect this is because the blogs I follow tend to be computing related, so their authors have high proficiency in computing. The same is not necessarily true of the webcomics I follow.

Tips for web authors

Since I want to see this medium continue to thrive, and to do so in a way friendly to my own preferences, I’d like to share some tips to avoid common mistakes. Some of these apply more broadly than webcomics.

If you’re using a host designed for webcomics or similar, such as Tumblr, a lot of this stuff will be correct by default without any additional work on your part. However, you should still be aware of common problems because you may unwittingly go out of your way to break things.

URLs are forever

Every time you publish on the web, your content is accessible through some specific URL: that sequence of characters that starts with “http”. Each individual comic should be accessible through a unique, unchanging URL. That last adjective is critically important. That URL should point to the same comic for as long as possible — ideally until the heat death of the universe. This will be affected by problems such as your host going down, but the impact should only be temporary and short. A URL is a promise.

People will be using this URL to share your comics with others. They’ll make posts on other websites linking to your comic. They’ll e-mail those URLs to friends and family. Once you’ve published, you no longer control how that URL is used.

On several occasions I’ve seen authors break all their URLs after revamping their site. For example, previously the URL contained the date, but the new URL has only the domain and the title. That breaks thousands of links all over the Internet. Visitors using those old links will be welcomed with an ugly “404 Not Found” — or worse, as I’ve seen more than once, a blank page served with a “200 OK”. These are missed opportunities for new readers.

If you really must change your URLs, the next best thing is to use an HTTP “301 Moved Permanently” and redirect to the new URL. This will leave all those old links intact and encourage new links to use the new address. If you don’t know how this works, ask your local computer geek about it.
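For example, here’s a hypothetical rule for an Apache .htaccess file that permanently redirects an old URL to its new date-based home:

Redirect 301 /comic/the-punchline /2015/11/01/the-punchline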

You should also avoid having multiple URLs for the same content without a redirect. Search engines will punish you for it and it’s confusing for users. Pick one URL as the canonical URL for a comic, and if you’ve published any other URLs (short URLs, etc.), use the previously mentioned “301 Moved Permanently” to redirect to the canonical URL.

Your main page probably lists all your comics starting from the most recent. This is a good design and doesn’t violate anything I previously said. That’s not the URL for any particular comic; it’s the URL of the main page, which also serves as the list of recent comics. I strongly recommend that the comics on the main page also be hyperlinks to their specific URLs. Users naturally expect to find a comic’s URL by clicking on the comic’s image.

Have an Atom or RSS feed

Comics without feeds are much less of a problem than they used to be, but the issue still comes up on occasion. If you need to pick between Atom and RSS, I recommend Atom, but, honestly, it’s only important that you have a valid feed with a date. You don’t even need to put the comic in the feed itself (possibly costing you ad revenue); just a link to the comic’s URL is fine. Its main purpose is to say, “hey, there’s a new comic up!”

You may not use Atom/RSS yourself, but your readers will appreciate it. Many of us don’t use centralized services like Facebook, Twitter, or Google+, and want to follow your work without signing up for a third-party service. Atom/RSS is the widely-accepted decentralized method for syndication on the web.

Web feeds are really easy; it’s just an XML file on your website that lists the most recent content. A validator can help you ensure you’ve done it correctly.
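For illustration, here’s the rough shape of a minimal Atom feed, with hypothetical names and URLs:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Webcomic</title>
  <id>http://example.com/</id>
  <link href="http://example.com/"/>
  <updated>2015-11-01T12:00:00Z</updated>
  <author><name>Comic Author</name></author>
  <entry>
    <title>The Punchline</title>
    <id>http://example.com/2015/11/01/the-punchline</id>
    <link href="http://example.com/2015/11/01/the-punchline"/>
    <updated>2015-11-01T12:00:00Z</updated>
  </entry>
</feed>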

Pick a good, catchy title

One of the biggest barriers to sharing a comic is a lack of title. For example, if a reader is going to post your comic on reddit, they need to enter the comic’s URL and its title. If the comic doesn’t have a title, then this person will need to make one up. There are two problems with this: it adds friction to sharing, and the title the reader invents may not represent the comic the way you’d like.

At minimum your title should appear in the <title> element of the page so that it shows up in the browser tab and browser’s window title. The title of the individual comic should come before the title of the whole website, since that shows up better in search engines. The title should also appear somewhere near the top of the page for easy clipboard copying, though it may be worth leaving out depending on the style of your comic.

A page without a <title> element looks amateur, so don’t do that!

Think of the future and include dates

This is one of those things that’s important anywhere on the web and is often violated by blog articles as well. Far too much content is published without a date. Dates put your comic in context, especially if it’s about something topical. They also help users navigate your content through time.

Putting the date in the URL is sufficient — even preferred — if you didn’t want to display it on the page proper. Your Atom/RSS should always have the comic’s date. I personally benefit from a date-time precision down to the publication hour. Some comics/articles are always published as “midnight” even when posted in the afternoon, which has the jarring effect of inserting it in time before a bunch of things I’ve already read.

How do I contact you?

When I notice one of the previous problems, particularly when they arise in comics I’m already following, I’d like to inform you of the problem. Or perhaps I want to compliment you on a well-made comic and you don’t have a comments section. I can only do this if you include some sort of contact information. An e-mail address, even in an anti-spam image form, is preferable but not strictly required.

Take advantage of the medium and go big

Comics published in newspapers are really tiny because newspaper editors want to cram a bunch of them onto a couple of pages. You’re not operating under these limitations, so fight the urge to copy that familiar format. Your canvas is practically infinite, so make big, colorful webcomics. The only limit is your readers’ screen resolution.

A final thanks

Thanks for all the work you do, webcomic authors. You regularly create all this awesome stuff for free. If you’re a webcomic author and you need help with any of the information above, don’t hesitate to contact me. After all, I don’t hesitate to bug you when something’s not right!

tags: [ web ]

Recovering Live Data with GDB

I recently ran into a problem where long-running program output was trapped in a C FILE buffer. The program had been running for two days straight printing its results, but the last few kilobytes of output were missing. It wouldn’t output these last bytes until the program completed its day-long (or worse!) cleanup operation and exited. This is easy to fix — and, honestly, the cleanup step was unnecessary anyway — but I didn’t want to start all over and wait two more days to recompute the result.

Here’s a minimal example of the situation. The first loop represents the long-running computation and the infinite loop represents a cleanup job that will never complete.

#include <stdio.h>

int
main(void)
{
    /* Compute output. */
    for (int i = 0; i < 10; i++)
        printf("%d/%d ", i, i * i);
    putchar('\n');

    /* "Slow" cleanup operation ... */
    for (;;)
        ;
    return 0;
}

Buffered Output Review

Both printf and putchar are C library functions and are usually buffered in some way. That is, each call to these functions doesn’t necessarily send data out of the program. This is in contrast to the POSIX functions read and write, which are unbuffered system calls. Since system calls are relatively expensive, buffered input and output is used to change a large number of system calls on small buffers into a single system call on a single large buffer.

Typically, stdout is line-buffered if connected to a terminal. When the program completes a line of output, the user probably wants to see it immediately. So, if you compile the example program and run it at your terminal you will probably see the output before the program hangs on the infinite loop.

$ cc -std=c99 example.c
$ ./a.out
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81

However, when stdout is connected to a file or pipe, it’s generally buffered to something like 4kB. For this program, the output will remain empty no matter how long you wait. It’s trapped in a FILE buffer in process memory.

$ ./a.out > output.txt

The primary way to fix this is to use the fflush function, to force the buffer empty before starting a long, non-output operation. Unfortunately for me I didn’t think of this two days earlier.
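In the example program, that’s one extra line just before the cleanup loop:

    /* Force out any buffered output before the slow part. */
    fflush(stdout);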

Debugger to the Rescue

Fortunately there is a way to interrupt a running program and manipulate its state: a debugger. First, find the process ID of the running program (the one writing to output.txt above).

$ pgrep a.out
12934

Now attach GDB, which will pause the program’s execution.

$ gdb ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
gdb> attach 12934
Attaching to program: /tmp/a.out, process 12934
... snip ...
0x0000000000400598 in main ()

From here I could examine the stdout FILE struct and try to extract the buffer contents by hand. However, the easiest thing to do is perform the call I forgot in the first place: fflush(stdout).

gdb> call fflush(stdout)
$1 = 0
gdb> quit
Detaching from program: /tmp/a.out, process 12934

The program is still running, but the output has been recovered.

$ cat output.txt
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81

Why Cleanup?

As I said, in my case the cleanup operation was entirely unnecessary, so it would be safe to just kill the program at this point. It was taking a really long time to tear down a humongous data structure (on the order of 50GB) one little node at a time with free. Obviously, the memory would be freed much more quickly by the OS when the program exited.

Freeing memory in the program was only to satisfy Valgrind, since it’s so incredibly useful for debugging. Not freeing the data structure would hide actual memory leaks in Valgrind’s final report. For the real “production” run, I should have disabled cleanup.

tags: [ c cpp ]

Shamus Young's Twenty-Sided Tale E-book

Last month I assembled and edited Shamus Young’s Twenty-Sided Tale, originally a series of 84 blog articles, into an e-book. The book is 75,000 words — about the average length of a novel — recording the complete story of one of Shamus’ Dungeons and Dragons campaigns. Since he’s shared the e-book on his blog, I’m now free to pull back the curtain on this little project.

To build the book yourself, you will only need make and pandoc.

Why did I want this?

Ever since I got a tablet a couple years ago, I’ve completely switched over to e-books. Prior to the tablet, if there was an e-book I wanted to read, I’d have to read from a computer monitor while sitting at a desk. Anyone who’s tried it can tell you it’s not a comfortable way to read for long periods, so I only reserved the effort for e-book-only books that were really worth it. However, once comfortable with the tablet, I gave away nearly all my paper books from my bookshelves at home. The remaining use of paper books is because either an e-book version isn’t reasonably available or the book is very graphical, not suited to read/view on a screen (full image astronomy books, Calvin and Hobbes collections).

As far as formats go, I prefer PDF and ePub, depending on the contents of the book. Technical books fare better as PDFs due to elaborate typesetting used for diagrams and code samples. For prose-oriented content, particularly fiction, ePub is the better format due to its flexibility and looseness. Twenty-Sided Tale falls in this latter category. The reader gets to decide the font, size, color, contrast, and word wrapping. I kept the ePub’s CSS to a bare minimum as to not get in the reader’s way. Unfortunately I’ve found that most ePub readers are awful at rendering content, so while technically you could do the same fancy typesetting with ePub, it rarely works out well.

The Process

To start, I spent about 8 hours with Emacs manually converting each article into Markdown and concatenating them into a single document. The ePub is generated from the Markdown using the Pandoc “universal document converter.” The markup includes some HTML, because Markdown alone, even Pandoc’s flavor, isn’t expressive enough for the typesetting needs of this particular book. This means it can only reasonably be transformed into HTML-based formats.

Pandoc isn’t good enough for some kinds of publishing, but it was sufficient here. The one feature I really wished it had was support for tagging arbitrary document elements with CSS classes (images, paragraphs, blockquotes, etc.), effectively extending Markdown’s syntax. Currently only headings support extra attributes. Such a feature would have allowed me to bypass all use of HTML, and the classes could maybe have been re-used in other output formats, like LaTeX.

Once I got the book in a comfortable format, I spent another 1.5 weeks combing through the book fixing up punctuation, spelling, grammar, and, in some cases, wording. It was my first time editing a book — fiction in particular — and in many cases I wasn’t sure of the correct way to punctuate and capitalize some particular expression. Is “Foreman” capitalized when talking about a particular foreman? What about “Queen?” How are quoted questions punctuated when the sentence continues beyond the quotes? As an official source on the matter, I consulted the Chicago Manual of Style. The first edition is free online. It’s from 1906, but style really hasn’t changed too much over the past century!

The original articles were written over a period of three years. Understandably, Shamus forgot how some of the story’s proper names were spelled over this time period. There wasn’t a wiki to check. Some proper names had two, three, or even four different spellings. Sometimes I picked the most common usage, sometimes the first usage, and sometimes I had to read the article’s comments written by the game’s players to see how they spelled their own proper names.

I also sunk time into a stylesheet for a straight HTML version of the book, with the images embedded within the HTML document itself. This will be one of the two outputs if you build the book in the repository.

A Process to Improve

Now I’ve got a tidy, standalone e-book version of one of my favorite online stories. When I want to re-read it again in the future, it will be as comfortable as reading any other novel.

This has been a wonderful research project into a new domain (for me): writing and editing, style, and today’s tooling for writing and editing. As a software developer, the latter overlaps my expertise and is particularly fascinating. A note to entrepreneurs: there’s massive room for improvement in this area. Compared to software development, the processes in place today for professional writing and editing are, by my estimate, about 20 years behind. It’s a place where Microsoft Word is still the industry standard. Few authors and editors are using source control or leveraging the powerful tools available for creating and manipulating their writing.

Unfortunately it’s not so much a technical problem as it is a social/educational one. The tools mostly exist in one form or another, but they’re not being put to use. Even if an author or editor learns or builds a more powerful set of tools, they must still interoperate with people who do not. Looking at it optimistically, this is a potential door into the industry for myself: a computer whiz editor who doesn’t require Word-formatted manuscripts; who can make the computer reliably and quickly perform the tedious work. Or maybe that idea only works in fiction.

tags: [ media rant ]

Mandelbrot Set with SIMD Intrinsics

When I started this blog 8 years ago, my first post was about the Mandelbrot set. Since then, both technology and my own skills have improved (or so I like to believe!), so I’m going to take another look at it, this time using three different Single Instruction, Multiple Data (SIMD) instruction sets: SSE2, AVX, and NEON. The latter two didn’t exist when the last article was published. In this article I demonstrate SIMD bringing a 5.8x speedup to a fractal renderer.

If you want to take a look at my code before reading further:

Having multiple CPU cores allows different instructions to operate on (usually) different data independently. In contrast, under SIMD a specific operation (single instruction) acts upon several values (multiple data) at once. It’s another form of parallelization. For example, with image processing – perhaps the most common use case – this means multiple pixels could be computed within the same number of cycles it would normally take to compute just one. SIMD is generally implemented on CPUs through wide registers: 64, 128, 256, and even 512 bits wide. Values are packed into the register like an array and are operated on independently, generally with saturation arithmetic (clamped, non-wrapping).

Rather than hand-code all this in assembly, I’m using yet another technique I picked up from the always-educational Handmade Hero: compiler intrinsics. The code is all C, but in place of C’s operators are pseudo-function calls operating on special SIMD types. These aren’t actual function calls, they’re intrinsics. The compiler will emit a specific assembly instruction for each intrinsic, sort of like an inline function. This is more flexible for mixing with other C code, the compiler will manage all the registers, and the compiler will attempt to re-order and interleave instructions to maximize throughput. It’s a big win!

Some SIMD History

The first SIMD hardware widely available to consumers was probably the MMX instruction set, introduced to 32-bit x86 in 1997. This provided eight 64-bit registers, mm0 - mm7, aliasing the older x87 floating point registers, which operated on packed integer values. This was extended by AMD with its 3DNow! instruction set, adding floating point instructions.

However, you don’t need to worry about any of that because both of these were superseded by Streaming SIMD Extensions (SSE) in 1999. SSE has 128-bit registers – confusingly named xmm0 - xmm7 – and a much richer instruction set. SSE has been extended with SSE2 (2001), SSE3 (2004), SSSE3 (2006), SSE4.1 (2007), and SSE4.2 (2008). x86_64 doesn’t have SSE2 as an extension but instead as a core component of the architecture (adding xmm8 - xmm15), baking it into its ABI.

In 2009, ARM introduced the NEON instruction set as part of ARMv7. Like SSE, it has 128-bit registers, but its instruction set is more consistent and uniform. One of its most visible features over SSE is a stride load parameter, making it flexible for a wider variety of data arrangements. NEON is available on your Raspberry Pi, which is why I’m using it here.

In 2011, Intel and AMD introduced the Advanced Vector Extensions (AVX) instruction set. Essentially it’s SSE with 256-bit registers, named ymm0 - ymm15. That means operating on 8 single-precision floats at once! As of this writing, this extension is just starting to become commonplace on desktops and laptops. It also has extensions: AVX2 (2013) and AVX-512 (2015).

Starting with C

Moving on to the code, in mandel.c you’ll find mandel_basic, a straight C implementation that produces a monochrome image. Normally I would post the code here within the article, but it’s 30 lines long and most of it isn’t of any particular interest.

I didn’t use C99’s complex number support because – continuing to follow the approach of Handmade Hero – I intended to port this code directly into SIMD intrinsics. It’s much easier to work from a straight non-SIMD implementation towards one with compiler intrinsics than to code with compiler intrinsics right away. In fact, I’d say it’s almost trivial, since I got it right on the first attempt for all three.

There’s just one unusual part:

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < s->height; y++) {
   /* ... */

This is an Open Multi-Processing (OpenMP) pragma. It’s a higher-level threading API than POSIX or Win32 threads. OpenMP takes care of all thread creation, work scheduling, and cleanup. In this case, the for loop is parallelized such that each row of the image will be scheduled individually to a thread, with one thread spawned for each CPU core. This one line saves all the trouble of managing a work queue and such. I also use it in my SIMD implementations, composing both forms of parallelization for maximum performance.
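As a practical note, with gcc and clang all it takes is the -fopenmp flag at compile and link time, something like:

$ cc -std=c99 -fopenmp -O3 mandel.c -o mandel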

I did it in single precision because I really want to exploit SIMD. Obviously, being half as wide as double precision, twice as many single precision operands can fit in a SIMD register.

On my wife’s i7-4770 (8 logical cores), it takes 29.9ms to render one image using the defaults (1440x1080, real{-2.5, 1.5}, imag{-1.5, 1.5}, 256 iterations). I’ll use the same machine for both the SSE2 and AVX benchmarks.

SSE2 Mandelbrot Set

The first translation I did was SSE2 (mandel_sse2.c). As with just about any optimization, it’s more complex and harder to read than the straight version. Again, I won’t post the code here, especially when this one has doubled to 60 lines long.

Porting to SSE2 (and SIMD in general) is simply a matter of converting all assignments and arithmetic operators to their equivalent intrinsics. The Intel Intrinsics Guide is a godsend for this step. It’s easy to search for specific operations and it tells you what headers they come from. Notice that there are no C arithmetic operators until the very end, after the results have been extracted from SSE and pixels are being written.

There are two new types present in this version, __m128 and __m128i. These will be mapped to SSE registers by the compiler, sort of like the old (outdated) C register keyword. One big difference is that it’s legal to take the address of these values with &, and the compiler will worry about the store/load operations. The first type is for floating point values and the second is for integer values. At first it’s annoying for these to be separate types (the CPU doesn’t care), but it becomes a set of compiler-checked rails for avoiding mistakes.

Here’s how assignment was written in the straight C version:

float iter_scale = 1.0f / s->iterations;

And here’s the SSE version. SSE intrinsics are prefixed with _mm, and the “ps” stands for “packed single-precision.”

__m128 iter_scale = _mm_set_ps1(1.0f / s->iterations);

This sets all four lanes of the register to the same value (a broadcast). Lanes can also be assigned individually, such as at the beginning of the innermost loop.

__m128 mx = _mm_set_ps(x + 3, x + 2, x + 1, x + 0);

This next part shows why the SSE2 version is longer. Here’s the straight C version:

float zr1 = zr * zr - zi * zi + cr;
float zi1 = zr * zi + zr * zi + ci;
zr = zr1;
zi = zi1;

To make it easier to read in the absence of operator syntax, I broke out the intermediate values. Here’s the same operation across four different complex values simultaneously. The purpose of these intrinsics should be easy to guess from their names.

__m128 zr2 = _mm_mul_ps(zr, zr);
__m128 zi2 = _mm_mul_ps(zi, zi);
__m128 zrzi = _mm_mul_ps(zr, zi);
zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci);

There are a bunch of swizzle instructions added in SSSE3 and beyond for re-arranging bytes within registers. With those I could eliminate that last bit of non-SIMD code at the end of the function for packing pixels. In an earlier version I used them, but since pixel packing isn’t a hot spot in this code (it’s outside the tight, innermost loop), it didn’t impact the final performance, so I took it out for the sake of simplicity.

The running time is now 8.56ms per image, a 3.5x speedup. That’s close to the theoretical 4x speedup from moving to 4-lane SIMD. That’s fast enough to render fullscreen at 60FPS.

AVX Mandelbrot Set

With SSE2 explained, there’s not much to say about AVX (mandel_avx.c). The only difference is the use of __m256, __m256i, the _mm256 intrinsic prefix, and that this operates on 8 points on the complex plane instead of 4.

It’s interesting that the AVX naming conventions are subtly improved over SSE. For example, here are the SSE broadcast intrinsics (signatures per the Intel Intrinsics Guide):

__m128d _mm_set1_pd (double)
__m128i _mm_set1_epi8 (char)
__m128i _mm_set1_epi16 (short)
__m128i _mm_set1_epi32 (int)
__m128i _mm_set1_epi64x (long long)
__m128  _mm_set_ps1 (float)

Notice the oddball at the end? That’s discrimination against sufferers of obsessive-compulsive personality disorder. This was fixed in AVX’s broadcast intrinsics:

__m256d _mm256_set1_pd (double)
__m256i _mm256_set1_epi8 (char)
__m256i _mm256_set1_epi16 (short)
__m256i _mm256_set1_epi32 (int)
__m256i _mm256_set1_epi64x (long long)
__m256  _mm256_set1_ps (float)

The running time here is 5.20ms per image, a 1.6x speedup from SSE2. That’s not too far from the theoretical 2x speedup from using twice as many lanes. We can render at 60FPS and spend most of the time waiting around for the next vsync.

NEON Mandelbrot Set

NEON is ARM’s take on SIMD. It’s what you’d find on your phone and tablet rather than desktop or laptop. NEON behaves much like a co-processor: NEON instructions are (cheaply) dispatched asynchronously to their own instruction pipeline, but transferring data back out of NEON is expensive and will stall the ARM pipeline until the NEON pipeline catches up.

Going beyond __m128 and __m256, NEON intrinsics have a type for each of the possible packings. On x86, the old stack-oriented x87 floating-point instructions are replaced with SSE single-value (“ss”, “sd”) instructions. On ARM, there’s no reason to use NEON to operate on single values, so these “packings” don’t exist. Instead there are half-wide packings: for example, float32x2_t alongside float32x4_t, and int32x2_t alongside int32x4_t. Note the lack of double-precision support.

Again, the CPU doesn’t really care about any of these types. It’s all to help the compiler help us. For example, we don’t want to multiply a float32x4_t and a float32x2_t since it wouldn’t have a meaningful result.

Otherwise everything is similar (mandel_neon.c). NEON intrinsics are (less-cautiously) prefixed with v and suffixed with a type (_f32, _u32, etc.).

The performance on my Raspberry Pi 2 (900 MHz quad-core ARM Cortex-A7) is 545ms per frame without NEON and 232ms with NEON, a 2.3x speedup. This isn’t nearly as impressive as SSE2, also at 4 lanes. My implementation almost certainly needs more work, especially since I know less about ARM than x86.

Compiling with Intrinsics

For the x86 build, I wanted the same binary to have AVX, SSE2, and plain C versions, selected by a command line switch and feature availability, so that I could easily compare benchmarks. Without any special options, gcc and clang will make conservative assumptions about the CPU features of the target machine. In order to build using AVX intrinsics, I need the compiler to assume the target has AVX. The -mavx argument does this.

mandel_avx.o : mandel_avx.c
    $(CC) -c $(CFLAGS) -mavx -o $@ $^

mandel_sse2.o : mandel_sse2.c
    $(CC) -c $(CFLAGS) -msse2 -o $@ $^

mandel_neon.o : mandel_neon.c
    $(CC) -c $(CFLAGS) -mfpu=neon -o $@ $^

All x86_64 CPUs have SSE2, but I included the flag anyway for clarity; it also enables SSE2 for 32-bit x86 builds.

It’s absolutely critical that each is done in a separate translation unit. Suppose I compiled like so in one big translation unit,

gcc -msse2 -mavx mandel.c mandel_sse2.c mandel_avx.c

The compiler will likely use some AVX instructions outside of the explicit intrinsics, meaning it’s going to crash on machines without AVX (“illegal instruction”). The main program needs to be compiled with AVX disabled. That’s where it will test for AVX before executing any special instructions.

Feature Testing

Intrinsics are well-supported across different compilers (surprisingly, even including the late-to-the-party Microsoft). Unfortunately testing for CPU features differs across compilers. Intel advertises a _may_i_use_cpu_feature intrinsic, but it’s not supported in either gcc or clang. gcc has a __builtin_cpu_supports built-in, but it’s only supported by gcc.

The most portable solution I came up with is cpuid.h (x86 specific). It’s supported by at least gcc and clang. The clang version of the header is much better documented, so if you want to read up on how this works, read that one.

#include <cpuid.h>

static inline int
is_avx_supported(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return ecx & bit_AVX ? 1 : 0;
}

And in use:

if (use_avx && is_avx_supported())
    mandel_avx(image, &spec);
else if (use_sse2)
    mandel_sse2(image, &spec);
else
    mandel_basic(image, &spec);

I don’t know how to test for NEON, nor do I have the necessary hardware to test it, so on ARM I assume it’s always available.


Using SIMD intrinsics for the Mandelbrot set was just an exercise to learn how to use them. Unlike in Handmade Hero, where it makes a 1080p 60FPS software renderer feasible, I don’t have an immediate, practical use for CPU SIMD, but, like so many similar techniques, I like having it ready in my toolbelt for the next time an opportunity arises.

tags: [ c ]

Minimal OpenGL 3.3 Core Profile Demo

When I was first attempting to learn OpenGL years ago, what I really wanted was a complete, minimal example program. OpenGL has enormous flexibility and I wanted to fully understand the fundamentals in isolation before moving on to more advanced features. I had been advised to specifically learn core profile, which drops nearly all the legacy parts of the API.

However, since much of the OpenGL-related content to be found online, even today, is outdated – and, worse, it’s not marked as such – good, modern core profile examples have been hard to come by. The relevant examples I could find at the time were more complicated than necessary, due to the common problem that full 3D graphics are too closely conflated with OpenGL. The examples would include matrix libraries, texture loading, etc. This is a big reason I ended up settling on WebGL: a clean slate in a completely different community. (The good news is that this situation has already improved dramatically over the last few years!)

Until recently, all of my OpenGL experience had been WebGL. Wanting to break out of that, earlier this year I set up a minimal OpenGL 3.3 core profile demo in C, using GLFW and gl3w. You can find it here:

No 3D graphics, no matrix library, no textures. It’s just a spinning red square.

It supports both Linux and Windows. The Windows build is static, so it compiles to a single, easily distributable, standalone binary. With some minor tweaking it would probably support the BSDs as well. For simplicity’s sake, the shaders are baked right into the source as strings, but if you’re extending the demo for your own use, you may want to move them out into their own source files.
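For illustration, a baked-in shader looks something like this (my own sketch of the idea, not the demo’s actual source):

static const char *vert_shader =
    "#version 330\n"
    "layout(location = 0) in vec2 point;\n"
    "uniform float angle;\n"
    "void main() {\n"
    "    mat2 rotate = mat2(cos(angle), -sin(angle),\n"
    "                       sin(angle),  cos(angle));\n"
    "    gl_Position = vec4(rotate * point, 0.0, 1.0);\n"
    "}\n";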

Why OpenGL 3.3?

I chose OpenGL 3.3 in particular for three reasons:

As far as “desktop” OpenGL goes, 3.3 is currently the prime target.

Why GLFW?

Until EGL someday fills this role, the process for obtaining an OpenGL context is specific to each operating system, where it’s generally a pain in the butt. GLUT, the OpenGL Utility Toolkit, was a library to make this process uniform across the different platforms. It also normalized user input (keyboard and mouse) and provided some basic (and outdated) utility functions.

The original GLUT isn’t quite open source (licensing issues) and it’s no longer maintained. The open source replacement for GLUT is FreeGLUT. It’s what you’d typically find on a Linux system in place of the original GLUT.

I just need a portable library that creates a window, handles keyboard and mouse events in that window, and gives me an OpenGL 3.3 core profile context. FreeGLUT does this well, but we can do better. One problem is that it includes a whole bunch of legacy cruft from GLUT: immediate mode rendering utilities, menus, spaceball support, lots of global state, and only one OpenGL context per process.

One of the biggest problems is that FreeGLUT doesn’t have a swap interval function. This is used to lock the application’s redraw rate to the system’s screen refresh rate, preventing screen tearing and excessive resource consumption. As a workaround, I had added my own code to find and call the system’s swap interval function, but it was a total hack.

The demo was initially written with FreeGLUT, but I switched over to GLFW since it’s smaller, simpler, cleaner, and more modern. GLFW also has portable joystick handling. With the plethora of modern context+window creation libraries out there, it seems there’s not much reason to use FreeGLUT anymore.

SDL 2.0 would also be an excellent choice. It goes beyond GLFW with threading, audio, networking, image loading, and timers: basically all the stuff you’d need when writing a game.

I’m sure there are some other good alternatives, especially when you’re not sticking to plain C, but these are the libraries I’m familiar with at the time of this article.

Why gl3w?

If you didn’t think the interface between OpenGL and the operating system was messy enough, I have good news for you. Neither the operating system nor the video card drivers are going to provide any of the correct headers, nor will you have anything meaningful to link against! For these, you’re on your own.

The OpenGL Extension Wrangler Library (GLEW) was invented to solve this problem. It dynamically loads the system’s OpenGL libraries and finds all the relevant functions at run time. That way your application avoids linking to anything too specific. At compile time, it provides the headers defining all of the OpenGL functions.

Over the years, GLEW has become outdated, to this day having no support for core profile. So instead I used a replacement called gl3w. It’s just like GLEW but, as the name suggests, oriented around core profile … exactly what I needed. Unlike GLEW, it is generated directly from Khronos’ documentation by a script. In practice, you drop the generated code directly into your project (embedded) rather than rely on the system to provide it as a library.

A great (and probably better) alternative to gl3w is glLoadGen. It’s the same idea – an automatically generated OpenGL loader – but allows for full customization of the output, such as the inclusion of select OpenGL extensions.


While I hope it serves as an educational resource for others, I primarily have it for my own record-keeping, pedagogical, and reference purposes, born out of a weekend’s worth of research. It’s a starting point for future projects, and it’s somewhere easy to start when I want to experiment with an idea.

Plus, someday I want to write a sweet, standalone game with fancy OpenGL graphics.

tags: [ opengl c ]

Raw Linux Threads via System Calls

This article has been translated to Japanese.

Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.

Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.

It’s so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86_64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).

I’ve put the complete demo here if you want to see it all at once.

An x86_64 Primer

I want you to be able to follow along even if you aren’t familiar with x86_64 assembly, so here’s a short primer of the relevant pieces. If you already know x86_64 assembly, feel free to skip to the next section.

x86_64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8 through r15. There are many more registers with more specific purposes, but we won’t need them for threading.

The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32 bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32 bits, then to 64 bits. Historically each of these registers had a specific, unique purpose, but on x86_64 they’re almost completely interchangeable.

There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.

The Stack

The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.

The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86_64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order: rdi, rsi, rdx, rcx, r8, r9.

So, for example, to perform a function call like foo(1, 2, 3), store 1, 2, and 3 in rdi, rsi, and rdx, then call the function. The mov instruction stores the source (second) operand in its destination (first) operand. The call instruction pushes the current value of rip onto the stack, then sets rip (jumps) to the address of the target function. When the callee is ready to return, it uses the ret instruction to pop the original rip value off the stack and back into rip, returning control to the caller.

    mov rdi, 1
    mov rsi, 2
    mov rdx, 3
    call foo

Called functions must preserve the contents of these registers (the same value must be stored when the function returns): rbx, rsp, rbp, and r12 through r15.

System Calls

When making a system call, the argument registers are slightly different: rdi, rsi, rdx, r10, r8, r9. Notice rcx has been changed to r10.

Each system call has an integer identifying it. This number is different on each platform, but, in Linux’s case, it will never change. Instead of call, rax is set to the number of the desired system call and the syscall instruction makes the request to the OS kernel. Prior to x86_64, this was done with an old-fashioned interrupt. Because interrupts are slow, a special, statically-positioned “vsyscall” page (now deprecated as a security hazard), and later the vDSO, allows certain system calls to be made as ordinary function calls. We’ll only need the syscall instruction in this article.

So, for example, the write() system call has this C prototype.

ssize_t write(int fd, const void *buf, size_t count);

On x86_64, the write() system call is at the top of the system call table as system call 1 (read() is 0). Standard output is file descriptor 1 by default (standard input is 0). The following bit of code will write 10 bytes of data from the memory address buffer (a symbol defined elsewhere in the assembly program) to standard output. The number of bytes written, or a negative errno value on error, will be returned in rax.

    mov rdi, 1        ; fd
    mov rsi, buffer
    mov rdx, 10       ; 10 bytes
    mov rax, 1        ; SYS_write
    syscall
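
For comparison, here’s the same request made from C through glibc’s syscall(2) wrapper, a sketch with a made-up buffer:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        static const char buffer[10] = "0123456789";
        /* write(1, buffer, 10) as a raw system call */
        long n = syscall(SYS_write, 1, buffer, 10);
        return n == 10 ? 0 : 1;
    }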

Effective Addresses

There’s one last thing you need to know: registers often hold a memory address (i.e. a pointer), and you need a way to read the data behind that address. In NASM syntax, wrap the register in brackets (e.g. [rax]), which, if you’re familiar with C, would be the same as dereferencing the pointer.

These bracket expressions, called effective addresses, can be limited mathematical expressions that offset the base address, evaluated entirely within a single instruction. The expression can include another register (index), a power-of-two scalar (a bit shift), and an immediate signed offset: for example, [rax + rdx*8 + 12]. If rax is a pointer to a struct and rdx is an index into an array on that struct, only a single instruction is needed to read that element. NASM is smart enough to let the assembly programmer break this mold a little with more complex expressions, so long as it can reduce them to the [base + index*2^exp + offset] form.
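
As a C analogy of my own (not something from the assembly above), [rax + rdx*8 + 12] is exactly the kind of address a compiler computes for an expression like the following, typically folded into a single mov:

    #include <stdint.h>

    /* base + index*8 + 12, all in one effective address */
    int64_t read_element(const char *base, int64_t index)
    {
        return *(const int64_t *)(base + index * 8 + 12);
    }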

The details of addressing aren’t important for this article, so don’t worry too much about it if that didn’t make sense.

Allocating a Stack

Threads share everything except registers, a stack, and thread-local storage (TLS). The OS and underlying hardware automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article; in practice, the stack is often used for thread-local data anyway. That leaves the stack: before we can spawn a new thread, we need to allocate a stack for it, which is nothing more than a memory buffer.

The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.

You can’t just read from and write to arbitrary addresses in virtual memory; you first have to ask the kernel to allocate pages. There are two system calls on Linux for this: brk() and mmap().

On x86_64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.

void *stack_create(void);

The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.

void *mmap(void *addr, size_t length, int prot, int flags);

For flags, we’ll choose a private, anonymous mapping that, being a stack, grows downward: MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN. Even with that last flag, the system call will still return the bottom address of the mapping, which will be important to remember later.
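
In C terms, here’s the mapping we want, a sketch using the constants from <sys/mman.h> (the glibc wrapper still insists on the two ignored arguments):

    #include <sys/mman.h>

    #define STACK_SIZE (4096 * 1024)  /* 4 MB, matching the assembly below */

    void *stack_create(void)
    {
        /* fd (-1) and offset (0) are the two ignored arguments */
        return mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    }

From assembly, it’s just a simple matter of setting the arguments in the registers and making the system call.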

%define SYS_mmap    9
%define STACK_SIZE  (4096 * 1024)   ; 4 MB

stack_create:
    mov rdi, 0                        ; addr: let the kernel choose
    mov rsi, STACK_SIZE               ; length
    mov rdx, PROT_WRITE | PROT_READ   ; protections
    mov r10, MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

Now we can allocate new stacks (or stack-sized buffers) as needed.

Spawning a Thread

Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.

long sys_clone(unsigned long flags, void *child_stack);

Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.

long thread_create(void (*)(void));

The function pointer argument is passed via rdi, per the ABI. Store it for safekeeping on the stack (push) in preparation for calling stack_create(). When stack_create() returns, the address of the low end of the new stack will be in rax.

thread_create:
    push rdi                          ; save the thread function pointer
    call stack_create
    lea rsi, [rax + STACK_SIZE - 8]   ; compute the top of the new stack
    pop qword [rsi]                   ; seed it with the thread function
    mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT | CLONE_THREAD | CLONE_IO
    mov rax, SYS_clone                ; 56
    syscall
    ret

The second argument to clone() is a pointer to the high address of the stack (specifically, just above the stack). So we need to add STACK_SIZE to rax to get the high end. This is done with the lea instruction: load effective address. Despite the brackets, it doesn’t actually read memory at that address, but instead stores the address in the destination register (rsi). I’ve moved it back by 8 bytes because I’m going to place the thread function pointer at the “top” of the new stack in the next instruction. You’ll see why in a moment.

Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.

As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the child by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.

A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).

Now here’s the really cool part, and the reason branching isn’t needed. There’s no reason to check rax to determine if we are the original thread (in which case we return to the caller) or if we’re the new thread (in which case we jump to the thread function). Remember how we seeded the new stack with the thread function? When the new thread returns (ret), it will jump to the thread function with a completely empty stack. The original thread, using the original stack, will return to the caller.

The value returned by thread_create() is the process ID of the new thread, which is essentially the thread object (e.g. Pthread’s pthread_t).

Cleaning Up

The thread function has to be careful not to return (ret) since there’s nowhere to return to: it would fall off the stack and terminate the program with a segmentation fault. Remember that threads are just processes? It must use the exit() syscall to terminate. This won’t terminate the other threads.

%define SYS_exit    60

    mov rdi, 0        ; exit status
    mov rax, SYS_exit
    syscall

Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.
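
To tie it all together from the C side, here’s a hypothetical driver for these assembly routines. The thread_entry function and the crude sleep() are my own inventions for illustration; a real program would wait4() on the thread instead.

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Implemented in assembly, as described above. */
    long thread_create(void (*)(void));

    static void thread_entry(void)
    {
        static const char msg[] = "hello from the new thread\n";
        syscall(SYS_write, 1, msg, sizeof(msg) - 1);
        syscall(SYS_exit, 0);  /* threads must exit, never return */
    }

    int main(void)
    {
        thread_create(thread_entry);
        sleep(1);  /* crude: give the thread a moment before main exits */
        return 0;
    }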

More Exploration

If you found this interesting, be sure to check out the full demo link at the top of this article. Now with the ability to spawn threads, it’s a great opportunity to explore and experiment with x86’s synchronization primitives, such as the lock instruction prefix, xadd, and compare-and-exchange (cmpxchg). I’ll discuss these in a future article.

tags: [ x86 linux c tutorial ]

NASM x86 Assembly Major Mode for Emacs

Last weekend I created a new Emacs mode, nasm-mode, for editing Netwide Assembler (NASM) x86 assembly programs. Over the past week I tweaked it until it felt comfortable enough to share on MELPA. It’s got what you’d expect from a standard Emacs programming language mode: syntax highlighting, automatic indentation, and imenu support. It’s not a full parser, but it knows all of NASM’s instructions and directives.

Until recently I didn’t really have preferences about x86 assemblers (GAS, NASM, YASM, FASM, MASM, etc.) or syntax (Intel, AT&T). I stuck to the GNU Assembler (GAS) since it’s already there with all the other GNU development tools I know and love, and it’s required for inline assembly in GCC. However, nasm-mode now marks my commitment to NASM as my primary x86 assembler.


I need an assembler that can assemble 16-bit code (8086, 8088, 80186, 80286), because real mode is fun. Despite its .code16gcc directive, GAS is not suitable for this purpose. It’s just enough to get the CPU into protected mode – as needed when writing an operating system with GCC – and that’s it. A different assembler is required for serious 16-bit programming.

GAS syntax has problems. I’m not talking about the argument order (source first or destination first), since there’s no right answer to that one. The linked article covers a number of its problems in detail, several of which are big ones for me.

Being a portable assembler, GAS is the jack of all instruction sets, master of none. If I’m going to write a lot of x86 assembly, I want a tool specialized for the job.


I also looked at YASM, a rewrite of NASM. It supports 16-bit assembly and mostly uses NASM syntax. In my research I found that NASM used to lag behind in features due to slower development, which is what spawned YASM. In recent years this seems to have flipped around, with YASM lagging behind. If you’re using YASM, nasm-mode should work pretty well for you, since it’s still very similar.

YASM optionally supports GAS syntax, but this reintroduces almost all of GAS’s problems. Even YASM’s improvements (e.g. its ORG directive) break when switching to GAS syntax.


FASM is the “flat assembler,” an assembler written in assembly language. This means it’s only available on x86 platforms. While I don’t really plan on developing x86 assembly on a Raspberry Pi, I’d rather not limit my options! I already regard 16-bit DOS programming as a form of embedded programming, and this may very well extend to the rest of x86 someday.

Also, it hasn’t made its way into the various Linux distribution package repositories, including Debian, so it’s already at a disadvantage for me.


MASM is Microsoft’s assembler, bundled with Visual Studio. Windows-only and not open source, it’s in no way a serious consideration. But since NASM’s syntax was originally derived from MASM, it’s worth mentioning. NASM takes the good parts of MASM and fixes the mistakes (such as the offset operator). It’s different enough that nasm-mode would not work well with MASM.


NASM isn’t perfect, but it’s got an excellent manual, it’s a solid program that does exactly what it says it will do, it has a powerful macro system and great 16-bit support, it’s highly portable and easy to build, and its semantics and syntax have been carefully considered. It also comes with a simple, pure binary disassembler (ndisasm). In retrospect it seems like an obvious choice!

My one complaint would be that it’s too flexible about labels. The colon on labels is optional, which can lead to subtle bugs. NASM will warn about this under some conditions (orphan-labels). Combined with the preprocessor, the difference between a macro and a label is ambiguous, short of re-implementing the entire preprocessor in Emacs Lisp.

Why nasm-mode?

Emacs comes with an asm-mode for editing assembly code for various architectures. Unfortunately it’s another jack of all trades that’s not very good. Worse, it doesn’t follow Emacs’ normal editing conventions, having unusual automatic indentation and self-insertion behaviors. It’s what prompted me to write nasm-mode.

To be fair, I don’t think it’s possible to write a single major mode that covers many different instruction set architectures. Each architecture has its own quirks and oddities that essentially give it a unique language. This is especially true of x86, which, over its 37-year tenure touched by so many different vendors, comes in a number of incompatible flavors. Each assembler/architecture pair needs its own major mode. I hope I’ve just written NASM’s.

One area where I’m still stuck is that I can’t find an x86 style guide. It’s easy to find half a dozen style guides of varying authority for any programming language that’s more than 10 years old … except x86. There’s no obvious answer when it comes to automatic indentation. How are comments formatted and indented? How are instructions aligned? Should labels be on the same line as the instruction? Should labels require a colon? (I’ve decided this is “yes.”) What about long label names? How are function prototypes/signatures documented? (The mode could take advantage of such a standard, a la ElDoc.) It seems everyone uses their own style. This is another conundrum for a generic asm-mode.

There are a couple of other nasm-modes floating around with different levels of completeness. Mine should supersede these, and will be much easier to maintain into the future as NASM evolves.

tags: [ emacs x86 ]