null program

Makefile Assignments are Turing-Complete

For over a decade now, GNU Make has almost exclusively been my build system of choice, either directly or indirectly. Unfortunately this means I unnecessarily depend on some GNU extensions — an annoyance when porting to the BSDs. In an effort to increase the portability of my Makefiles, I recently read the POSIX make specification. I learned two important things: 1) POSIX make is so barren it’s not really worth striving for, and 2) make’s macro assignment mechanism is Turing-complete.

If you want to see it in action for yourself before reading further, here’s a Makefile that implements Conway’s Game of Life (40x40) using only macro assignments.

Run it with any make program in an ANSI terminal. It must literally be named life.mak.

make -f life.mak

It’s 100% POSIX-compatible except for the sleep 0.1 (fractional sleep), which is only needed for visual effect.

A POSIX workaround

Unlike virtually every real world implementation, POSIX make doesn’t support conditional parts. For example, you might want your Makefile’s behavior to change depending on the value of certain variables. In GNU Make it looks like this:

ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
    EXTRA_FLAGS = -Wbar

Or BSD-style:

.ifdef USE_FOO
    EXTRA_FLAGS = -ffoo -lfoo
    EXTRA_FLAGS = -Wbar

If the goal is to write a strictly POSIX Makefile, how could I work around the lack of conditional parts and maintain a similar interface? The selection of macro/variable to evaluate can be dynamically selected, allowing for some useful tricks. First define the option’s default:


Then define both sets of flags:

EXTRA_FLAGS_1 = -ffoo -lfoo

Now dynamically select one of these macros for assignment to EXTRA_FLAGS.


The assignment on the command line overrides the assignment in the Makefile, so the user gets to override USE_FOO.

$ make              # EXTRA_FLAGS = -Wbar
$ make USE_FOO=0    # EXTRA_FLAGS = -Wbar
$ make USE_FOO=1    # EXTRA_FLAGS = -ffoo -lfoo

Before reading the POSIX specification, I didn’t realize that the left side of an assignment can get the same treatment. For example, if I really want the “if defined” behavior back, I can use the macro to mangle the left-hand side. For example,


Caveat: If DEBUG is set to empty, it may still result in true for ifdef depending on which make flavor you’re using, but will always appear to be unset in this hack.

$ make             # EXTRA_FLAGS = -O3 -DNDEBUG
$ make DEBUG=yes   # EXTRA_FLAGS = -O0 -g3

This last case had me thinking: This is very similar to the (ab)use of the x86 mov instruction in mov is Turing-complete. These macro assignments alone should be enough to compute any algorithm.

Macro Operations

Macro names are just keys to a global associative array. This can be used to build lookup tables. Here’s a Makefile to “compute” the square root of integers between 0 and 10.

sqrt_0  = 0.000000
sqrt_1  = 1.000000
sqrt_2  = 1.414214
sqrt_3  = 1.732051
sqrt_4  = 2.000000
sqrt_5  = 2.236068
sqrt_6  = 2.449490
sqrt_7  = 2.645751
sqrt_8  = 2.828427
sqrt_9  = 3.000000
sqrt_10 = 3.162278
result := $(sqrt_$(n))

The BSD flavors of make have a -V option for printing variables, which is an easy way to retrieve output. I used an “immediate” assignment (:=) for result since some versions of make won’t evaluate the expression before -V printing.

$ make -f sqrt.mak -V result n=8

Without -V, a default target could be used instead:

output :
        @printf "$(result)\n"

There are no math operators, so performing arithmetic requires some creativity. For example, integers could be represented as a series of x characters. The number 4 is xxxx, the number 6 is xxxxxx, etc. Addition is concatenation (note: macros can have + in their names):

A      = xxx
B      = xxxx
A+B    = $(A)$(B)

However, since there’s no way to “slice” a value, subtraction isn’t possible. A more realistic approach to arithmetic would require lookup tables.


Branching could be achieved through more lookup tables. For example,

square_0  = 1
square_1  = 2
square_2  = 4
# ...
result := $($(op)_$(n))

And called as:

$ make n=5 op=sqrt    # 2.236068
$ make n=5 op=square  # 25

Or using the DEBUG trick above, use the condition to mask out the results of the unwanted branch. This is similar to the mov paper.

result           := $(op)($(n)) = $($(op)_$(n))
result$(verbose) := $($(op)_$(n))

And its usage:

$ make n=5 op=square             # 25
$ make n=5 op=square verbose=1   # square(5) = 25

What about loops?

Looping is a tricky problem. However, one of the most common build (anti?)patterns is the recursive Makefile. Borrowing from the mov paper, which used an unconditional jump to restart the program from the beginning, for a Makefile Turing-completeness I can invoke the Makefile recursively, restarting the program with a new set of inputs.

Remember the print target above? I can loop by invoking make again with new inputs in this target,

output :
    @printf "$(result)\n"
    @$(MAKE) $(args)

Before going any further, now that loops have been added, the natural next question is halting. In reality, the operating system will take care of that after some millions of make processes have carelessly been invoked by this horribly inefficient scheme. However, we can do better. The program can clobber the MAKE variable when it’s ready to halt. Let’s formalize it.

loop = $(MAKE) $(args)
output :
    @printf "$(result)\n"

To halt, the program just needs to clear loop.

Suppose we want to count down to 0. There will be an initial count:

count = 6

A decrement table:

6  = 5
5  = 4
4  = 3
3  = 2
2  = 1
1  = 0
0  = loop

The last line will be used to halt by clearing the name on the right side. This is three star territory.

$($($(count))) =

The result (current iteration) loop value is computed from the lookup table.

result = $($(count))

The next loop value is passed via args. If loop was cleared above, this result will be discarded.

args = count=$(result)

With all that in place, invoking the Makefile will print a countdown from 5 to 0 and quit. This is the general structure for the Game of Life macro program.

Game of Life

A universal Turing machine has been implemented in Conway’s Game of Life. With all that heavy lifting done, one of the easiest methods today to prove a language’s Turing-completeness is to implement Conway’s Game of Life. Ignoring the criminal inefficiency of it, the Game of Life Turing machine could be run on the Game of Life simulation running on make’s macro assignments.

In the Game of Life program — the one linked at the top of this article — each cell is stored in a macro named xxyy, after its position. The top-left most cell is named 0000, then going left to right, 0100, 0200, etc. Providing input is a matter of assigning each of these macros. I chose X for alive and - for dead, but, as you’ll see, any two characters permitted in macro names would work as well.

$ make 0000=X 0100=- 0200=- 0300=X ...

The next part should be no surprise: The rules of the Game of Life are encoded as a 512-entry lookup table. The key is formed by concatenating the cell’s value along with all its neighbors, with itself in the center.

The “beginning” of the table looks like this:

--------- = -
X-------- = -
-X------- = -
XX------- = -
--X------ = -
X-X------ = -
-XX------ = -
XXX------ = X
---X----- = -
X--X----- = -
-X-X----- = -
XX-X----- = X
# ...

Note: The two right-hand X values here are the cell coming to life (exactly three living neighbors). Computing the next value (n0101) for 0101 is done like so:

n0101 = $($(0000)$(0100)$(0200)$(0001)$(0101)$(0201)$(0002)$(0102)$(0202))

Given these results, constructing the input to the next loop is simple:

args = 0000=$(n0000) 0100=$(n0100) 0200=$(n0200) ...

The display output, to be given to printf, is built similarly:

output = $(n0000)$(n0100)$(n0200)$(n0300)...

In the real version, this is decorated with an ANSI escape code that clears the terminal. The printf interprets the escape byte (\033) so that it doesn’t need to appear literally in the source.

And that’s all there is to it: Conway’s Game of Life running in a Makefile. Life, uh, finds a way.

tags: [ lang compsci ]

Mapping Multiple Memory Views in User Space

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

Some useful applications of this feature include:

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will not longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for use (NULL). If we don’t plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping, and MapViewOfFileEx.

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
                             0, size,

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.

The full source is here: memalias.c


Starting with the simpler of the two functions, the POSIX implementation looks like so:

memory_alias_unmap(size_t size, size_t naddr, void **addrs)
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);

The complex part is creating the mapping:

memory_alias_map(size_t size, size_t naddr, void **addrs)
    char path[128];
    snprintf(path, sizeof(path), "/%s(%lu,%p)",
             __FUNCTION__, (long)getpid(), addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            return -1;
    return 0;

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.


The Windows version is very similar.

memory_alias_unmap(size_t size, size_t naddr, void **addrs)
    for (size_t i = 0; i < naddr; i++)

Since Windows tracks the size internally, it’s unneeded and ignored.

memory_alias_map(size_t size, size_t naddr, void **addrs)
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 0, size,
    if (m == NULL)
        return -1;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            return -1;
    return 0;

In the future I’d like to find some unique applications of these multiple memory views.

tags: [ c linux win32 ]

Hotpatching a C Function on x86

In this post I’m going to do a silly, but interesting, exercise that should never be done in any program that actually matters. I’m going write a program that changes one of its function definitions while it’s actively running and using that function. Unlike last time, this won’t involve shared libraries, but it will require x86_64 and GCC. Most of the time it will work with Clang, too, but it’s missing an important compiler option that makes it stable.

If you want to see it all up front, here’s the full source: hotpatch.c

Here’s the function that I’m going to change:


It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:

    static int x;
    printf("goodbye %d\n", x++);

I was only going change the string, but I figured I should make it a little more interesting.

Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a far more complex problem.

But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with:

The solution is the ms_hook_prologue function attribute. This tells GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP that I can safely clobber. This idea originated in Microsoft’s Win32 API, hence the “ms” in the name.

The solution is the aligned function attribute, ensuring the hotpatch prologue is properly aligned.

As you might have guessed, this is primarily fixed with the noinline function attribute. Since GCC may also clone the function and call that instead, so it also needs the noclone attribute.

Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.

What does the function look like now?

__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))

And what does the assembly look like?

$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
  400848:       48 8d a4 24 00 00 00    lea    rsp,[rsp+0x0]
  40084f:       00
  400850:       bf d4 09 40 00          mov    edi,0x4009d4
  400855:       e9 06 fe ff ff          jmp    400660 <puts@plt>

It’s 8-byte aligned and it has the 8-byte NOP: that lea instruction does nothing. It copies rsp into itself and changes no flags. Why not 8 1-byte NOPs? I need to replace exactly one instruction with exactly one other instruction. I can’t have another thread in between those NOPs.


Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.

hotpatch(void *target, void *replacement)
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);

It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.

There are a number of things going on here, so let’s go through them one at a time:

Make the function writeable

The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.

It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry hagepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.

What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.

At the end of the function, I mprotect() the page back to non-writeable.

Create the instruction

I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work within anyway. Looking that up in the Intel manual, I see this:

Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.

To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.

I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.

Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.

    *(uint64_t *)target = instruction.value;

Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even through I’ve clobbered it, but that’s fine.

Trying it out

Here’s what my test program looks like:

void *
worker(void *arg)
    for (;;) {
    return NULL;

    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;

I fire off the other thread to keep it pinging at hello(). In the main thread, it waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the “worker” thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.

Further Reading

These related articles have been shared with me since publishing this article:

tags: [ x86 c linux ]

Calling the Native API While Freestanding

When developing minimal, freestanding Windows programs, it’s obviously beneficial to take full advantage of dynamic libraries that are already linked rather than duplicate that functionality in the application itself. Every Windows process automatically, and involuntarily, has kernel32.dll and ntdll.dll loaded into its process space before it starts. As discussed previously, kernel32.dll provides the Windows API (Win32). The other, ntdll.dll, provides the Native API for user space applications, and is the focus of this article.

The Native API is a low-level API, a foundation for the implementation of the Windows API and various components that don’t use the Windows API (drivers, etc.). It includes a runtime library (RTL) suitable for replacing important parts of the C standard library, unavailable to freestanding programs. Very useful for a minimal program.

Unfortunately, using the Native API is a bit of a minefield. Not all of the documented Native API functions are actually exported by ntdll.dll, making them inaccessible both for linking and GetProcAddress(). Some are exported, but not documented as such. Others are documented as exported but are not documented when (which release of Windows). If a particular function wasn’t exported until Windows 8, I don’t want to use when supporting Windows 7.

This is further complicated by the Microsoft Windows SDK, where many of these functions are just macros that alias C runtime functions. Naturally, MinGW closely follows suit. For example, in both cases, here is how the Native API function RtlCopyMemory is “declared.”

#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))

This is certainly not useful for freestanding programs, though it has a significant benefit for hosted programs: The C compiler knows the semantics of memcpy() and can properly optimize around it. Any C compiler worth its salt will replace a small or aligned, fixed-sized memcpy() or memmove() with the equivalent inlined code. For example:

    char buffer0[16];
    char buffer1[16];
    // ...
    memcpy(buffer0, buffer1, 16);
    // ...

On x86_64 (GCC 4.9.3, -Os), this memmove() call is replaced with two instructions. This isn’t possible when calling an opaque function in a non-standard dynamic library. The side effects could be anything.

    movaps  xmm0, [rsp + 48]
    movaps  [rsp + 32], xmm0

These Native API macro aliases are what have allowed certain Wine issues to slip by unnoticed for years. Very few user space applications actually call Native API functions, even when addressed directly by name in the source. The development suite is pulling a bait and switch.

Like last time I danced at the edge of the compiler, this has caused headaches in my recent experimentation with freestanding executables. The MinGW headers assume that the programs including them will link against a C runtime. Dirty hack warning: To work around it, I have to undo the definition in the MinGW headers and make my own. For example, to use the real RtlMoveMemory():

#include <windows.h>

#undef RtlMoveMemory
void RtlMoveMemory(void *, const void *, size_t);

Anywhere where I might have previously used memmove() I can instead use RtlMoveMemory(). Or I could trivially supply my own wrapper:

void *
memmove(void *d, const void *s, size_t n)
    RtlMoveMemory(d, s, n);
    return d;

As of this writing, the same approach is not reliable with RtlCopyMemory(), the cousin to memcpy(). As far as I can tell, it was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June 2015). Use RtlMoveMemory() instead. The overlap-handling overhead is negligible compared to the function call overhead anyway.

As a side note: one reason besides minimalism for not implementing your own memmove() is that it’s not legal to implement efficiently in straight C. According to the language specification, your implementation of memmove() would not be permitted to compare its pointer arguments with <, >, <=, or >=. That would lead to undefined behavior if they pointed to unrelated objects (ISO/IEC 9899:2011 §6.5.8¶5). So the only legal approach is to allocate a temporary buffer, copy the source buffer into it, then copy it into the destination buffer.

I keep mentioning Wine because I’ve been careful to ensure my applications run correctly with it. So far it’s worked perfectly with both Windows API and Native API functions. Thanks to the hard work behind the Wine project, despite being written sharply against the Windows API, these tiny programs remain relatively portable (x86 and ARM). It’s a good fit for graphical applications (games), but I would never write a command line application like this. The command line has always been a second class citizen on Windows.

Mostly for my own future reference, here are export lists for two different versions of kernel32.dll and ntdll.dll:

As I collect more of these export lists, I’ll be able to paint a full picture of when particular functions first appeared as exports. These lists were generated with objdump -p <path_to_dll>.

Now that I’ve got these Native API issues sorted out, I’ve significantly expanded the capabilities of my tiny, freestanding programs without adding anything to their size. Functions like RtlUnicodeToUTF8N() and RtlUTF8ToUnicodeN() will surely be handy.

tags: [ c win32 x86 ]

Small, Freestanding Windows Executables

Recently I’ve been experimenting with freestanding C programs on Windows. Freestanding refers to programs that don’t link, either statically or dynamically, against a standard library (i.e. libc). This is typical for operating systems and similar, bare metal situations. Normally a C compiler can make assumptions about the semantics of functions provided by the C standard library. For example, the compiler will likely replace a call to a small, fixed-size memmove() with move instructions. Since a freestanding program would supply its own, it may have different semantics.

My usual go to for C/C++ on Windows is Mingw-w64, which has greatly suited my needs the past couple of years. It’s packaged on Debian, and, when combined with Wine, allows me to fully develop Windows applications on Linux. Being GCC, it’s also great for cross-platform development since it’s essentially the same compiler as the other platforms. The primary difference is the interface to the operating system (POSIX vs. Win32).

However, it has one glaring flaw inherited from MinGW: it links against msvcrt.dll, an ancient version of the Microsoft C runtime library that currently ships with Windows. Besides being dated and quirky, it’s not an official part of Windows and never has been, despite its inclusion with every release since Windows 95. Mingw-w64 doesn’t have a C library of its own, instead patching over some of the flaws of msvcrt.dll and linking against it.

Since so much depends on msvcrt.dll despite its unofficial nature, it’s unlikely Microsoft will ever drop it from future releases of Windows. However, if strict correctness is a concern, we must ask Mingw-w64 not to link against it. An alternative would be PlibC, though the LGPL licensing is unfortunate. Another is Cygwin, which is a very complete POSIX environment, but is heavy and GPL-encumbered.

Sometimes I’d prefer to be more direct: skip the C library altogether and talk directly to the operating system. On Windows that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only links against system DLLs.

Linux vs. Windows

The most important benefit of a standard library like libc is a portable, uniform interface to the host system. So long as the standard library suits its needs, the same program can run anywhere. Without it, the programs needs an implementation of each host-specific interface.

On Linux, operating system requests at the lowest level are made directly via system calls. This requires a bit of assembly language for each supported architecture (int 0x80 on x86, syscall on x86_64, swi on ARM, etc.). The POSIX functions of the various Linux libc implementations are built on top of this mechanism.

For example, here’s a function for a 1-argument system call on x86_64.

syscall1(long n, long arg)
    long result;
    __asm__ volatile (
        : "=a"(result)
        : "a"(n), "D"(arg)
    return result;

Then exit() is implemented on top. Note: A real libc would do cleanup before exiting, like calling registered atexit() functions.

#include <syscall.h>  // defines SYS_exit

exit(int code)
    syscall1(SYS_exit, code);

The situation is simpler on Windows. Its low level system calls are undocumented and unstable, changing across even minor updates. The formal, stable interface is through the exported functions in kernel32.dll. In fact, kernel32.dll is essentially a standard library on its own (making the term “freestanding” in this case dubious). It includes functions usually found only in user-space, like string manipulation, formatted output, font handling, and heap management (similar to malloc()). It’s not POSIX, but it has analogs to much of the same functionality.

Program Entry

The standard entry for a C program is main(). However, this is not the application’s true entry. The entry is in the C library, which does some initialization before calling your main(). When main() returns, it performs cleanup and exits. Without a C library, programs don’t start at main().

On Linux the default entry is the symbol _start. It’s prototype would look like so:

void _start(void);

Returning from this function leads to a segmentation fault, so it’s up to your application to perform the exit system call rather than return.

On Windows, the entry depends on the type of application. The two relevant subsystems today are the console and windows subsystems. The former is for console applications (duh). These programs may still create windows and such, but must always have a controlling console. The latter is primarily for programs that don’t run in a console, though they can still create an associated console if they like. In Mingw-w64, give -mconsole (default) or -mwindows to the linker to choose the subsystem.

The default entry for each is slightly different.

int WINAPI mainCRTStartup(void);
int WINAPI WinMainCRTStartup(void);

Unlike Linux’s _start, Windows programs can safely return from these functions, similar to main(), hence the int return. The WINAPI macro means the function may have a special calling convention, depending on the platform.

On any system, you can choose a different entry symbol or address using the --entry option to the GNU linker.

Disabling libgcc

One problem I’ve run into is Mingw-w64 generating code that calls __chkstk_ms() from libgcc. I believe this is a long-standing bug, since -ffreestanding should prevent these sorts of helper functions from being used. The workaround I’ve found is to disable stack protection.

-fno-stack-check -fno-stack-protector -mno-stack-arg-probe

(If you always write perfect code you don’t need this, amiright?)

Alternatively you could link against libgcc (statically) with -lgcc, but, again, I’m going for a tiny executable.

A freestanding example

Here’s an example of a Windows “Hello, World” that doesn’t use a C library.

#include <windows.h>

    char msg[] = "Hello, world!\n";
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(stdout, msg, sizeof(msg), (DWORD[]){0}, NULL);
    return 0;

To build it:

x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -fno-stack-check -fno-stack-protector -mno-stack-arg-probe \
    -o example.exe example.c \

Notice I manually linked against kernel32.dll. The stripped final result is only 4kB, mostly PE padding. There are techniques to trim this down even further, but for a substantial program it wouldn’t make a significant difference.

From here you could create a GUI by linking against user32.dll and gdi32.dll (both also part of Win32) and calling the appropriate functions. I already ported my OpenGL demo to a freestanding .exe, dropping GLFW and directly using Win32 and WGL. It’s much less portable, but the final .exe is only 4kB, down from the original 104kB (static linking against GLFW).

I may go this route for the upcoming 7DRL 2016 in March.

tags: [ c x86 linux win32 ]

9 Elfeed Features You Might Not Know

It’s been two years since I last wrote about Elfeed, my Atom/RSS feed reader for Emacs. I’ve used it every single day since, and I continue to maintain it with help from the community. So far 18 people besides me have contributed commits. Over the last couple of years it’s accumulated some new features, some more obvious than others.

Every time I mark a new release, I update the ChangeLog at the top of elfeed.el which lists what’s new. Since it’s easy to overlook many of the newer useful features, I thought I’d list the more important ones here.

Custom Entry Colors

You can now customize entry faces through elfeed-search-face-alist. This variable maps tags to faces. An entry inherits the face of any tag it carries. Previously “unread” was a special tag that got a bold face, but this is now implemented as nothing more than an initial entry in the alist.

I’ve been using it to mark different kinds of content (videos, podcasts, comics) with different colors.


You can specify the starting tags for entries from particular feeds directly in the feed listing. This has been a feature for awhile now, but it’s not something you’d want to miss. It started out as a feature in my personal configuration that eventually migrated into Elfeed proper.

For example, your elfeed-feeds may initially look like this, especially if you imported from OPML.


If you wanted certain tags applied to entries from each, you would need to putz around with elfeed-make-tagger. For the most common case — apply certain tags to all entries from a URL — it’s much simpler to specify the information as part of the listing itself,

(("" blog emacs)
 ("" webcomic)
 ("" youtube))

Today I only use custom tagger functions in my own configuration to filter within a couple of particularly noisy feeds.

Arbitrary Metadata

Metadata is more for Elfeed extensions (i.e. elfeed-org) than regular users. You can attach arbitrary, readable metadata to any Elfeed object (entry, feed). This metadata is automatically stored in the database. It’s a plist.

Metadata is accessed entirely through one setf-able function: elfeed-meta. For example, you might want to track when you’ve read something, not just that you’ve read it. You could use this to selectively update certain feeds or just to evaluate your own habits.

(defun my-elfeed-mark-read (entry)
  (elfeed-untag entry 'unread)
  (let ((date (format-time-string "%FT%T%z")))
    (setf (elfeed-meta entry :read-date) date)))

Two things motivated this feature. First, without a plist, if I added more properties in the future, I would need to change the database format to support them. I modified the database format to add metadata, requiring an upgrade function to quietly upgrade older databases as they were loaded. I’d really like to avoid this in the future.

Second, I wanted to make it easy for extension authors to store their own data. I still imagine an extension someday to update feeds intelligently based on their history. For example, the database doesn’t track when the feed was last fetched, just the date of the most recent entry (if any). A smart-update extension could use metadata to tag feeds with this information.

Elfeed itself already uses two metadata keys: :failures on feeds and :title on both. :failures counts the total number of times fetching that feed resulted in an error. You could use this get a listing of troublesome feeds like so,

(cl-loop for url in (elfeed-feed-list)
         for feed = (elfeed-db-get-feed url)
         for failures = (elfeed-meta feed :failures)
         when failures
         collect (cons url failures))

The :title property allows for a custom title for both feeds and entries in the search buffer listing, assuming you’re using the default function (see below). It overrides the title provided by the feed itself. This is different than elfeed-entry-title and elfeed-feed-title, which is kept in sync with feed content. Metadata is not kept in sync with the feed itself.

Filter Inversion

You can invert filter components by prefixing them with !. For example, say you’re looking at all my posts from the past 6 months:


But say you’re tired of me and decide you want to see every entry from the past 6 months excluding my posts.

@6-months !

Filter Limiter

Normally you limit the number of results by date, but you can now limit the result by count using #n. For example, to see my most recent 12 posts regardless of date, #12

This is used internally in the live filter to limit the number of results to the height of the screen. If you noticed that live filtering has been much more responsive in the last few months, this is probably why.

Bookmark Support

Elfeed properly integrates with Emacs’ bookmarks (thanks to groks). You can bookmark the current filter with M-x bookmark-set (C-x r m). By default, Emacs will persist bookmarks between sessions. To revisit a filter in the future, M-x bookmark-jump (C-x r b).

Since this requires no configuration, this may serve as an easy replacement for manually building “view” toggles — filters bound to certain keys — which I know many users have done, including me.

New Header

If you’ve updated very recently, you probably noticed Elfeed got a brand new header. Previously it faked a header by writing to the first line of the buffer. This is because somehow I had no idea Emacs had official support for buffer headers (despite notmuch using them all this time).

The new header includes additional information, such as the current filter, the number of unread entries, the total number of entries, and the number of unique feeds currently in view. You’ll see this as <unread>/<total>:<feeds> in the middle of the header.

As of this writing, the new header has not been made part of a formal release. So if you’re only tracking stable releases, you won’t see this for awhile longer.

You can supply your own header via elfeed-search-header-function (thanks to Gergely Nagy).

Scoped Updates

As you already know, in the search buffer listing you can press G to update your feeds. But did you know you it takes a prefix argument? Run as C-u G, it only updates feeds with entries currently listed in the buffer.

As of this writing, this is another feature not yet in a formal release. I’d been wanting something like this for awhile but couldn’t think of a reasonable interface. Directly prompting the user for feeds is neither elegant nor composable. However, groks suggested the prefix argument, which composes perfectly with Elfeed’s existing idioms.

Listing Customizations

In addition to custom faces, there are a number of ways to customize the listing.

Gergely Nagy has been throwing lots of commits at me over the last couple of weeks to open up lots of Elfeed’s behavior to customization, so there are more to come.

Thank You, Emacs Community

Apologies about any features I missed or anyone I forgot to mention who’s made contributions. The above comes from my ChangeLogs, the commit log, the GitHub issue listing, and my own memory, so I’m likely to have forgotten some things. A couple of these features I had forgotten about myself!

tags: [ emacs elfeed ]

Quickly Access x86 Documentation in Emacs

I recently released an Emacs package called x86-lookup. Given a mnemonic, Emacs will open up a local copy of an Intel’s software developer manual PDF at the page documenting the instruction. It complements nasm-mode, released earlier this year.

x86-lookup is also available from MELPA.

To use it, you’ll need Poppler’s pdftotext command line program — used to build an index of the PDF — and a copy of the complete Volume 2 of Intel’s instruction set manual. There’s only one command to worry about: M-x x86-lookup.

Minimize documentation friction

This package should be familiar to anyone who’s used javadoc-lookup, one of my older packages. It has a common underlying itch: the context switch to read API documentation while coding should have as little friction as possible, otherwise I’m discouraged from doing it. In an ideal world I wouldn’t ever need to check documentation because it’s already in my head. By visiting documentation frequently with ease, it’s going to become familiar that much faster and I’ll be reaching for it less and less, approaching the ideal.

I picked up x86 assembly [about a year ago][x86] and for the first few months I struggled to find a good online reference for the instruction set. There are little scraps here and there, but not much of substance. The big exception is Félix Cloutier’s reference, which is an amazingly well-done HTML conversion of Intel’s PDF manuals. Unfortunately I could never get it working locally to generate my own. There’s also the X86 Opcode and Instruction Reference, but it’s more for machines than humans.

Besides, I often work without an Internet connection, so offline documentation is absolutely essential. (You hear that Microsoft? Not only do I avoid coding against Win32 because it’s badly designed, but even more so because you don’t offer offline documentation anymore! The friction to API reference your documentation is enormous.)

I avoided the official x86 documentation for awhile, thinking it would be too opaque, at least until I became more accustomed to the instruction set. But really, it’s not bad! With a handle on the basics, I would encourage anyone to dive into either Intel’s or AMD’s manuals. The reason there’s not much online in HTML form is because these manuals are nearly everything you need.

I chose Intel’s manuals for x86-lookup because I’m more familiar with it, it’s more popular, it’s (slightly) easier to parse, it’s offered as a single PDF, and it’s more complete. The regular expression for finding instructions is tuned for Intel’s manual and it won’t work with AMD’s manuals.

For a couple months prior to writing x86-lookup, I had a couple of scratch functions to very roughly accomplish the same thing. The tipping point for formalizing it was that last month I wrote my own x86 assembler. A single mnemonic often has a dozen or more different opcodes depending on the instruction’s operands, and there are often several ways to encode the same operation. I was frequently looking up opcodes, and navigating the PDF quickly became a real chore. I only needed about 80 different opcodes, so I was just adding them to the assembler’s internal table manually as needed.

How does it work?

Say you want to look up the instruction RDRAND.

Initially Emacs has no idea what page this is on, so the first step is to build an index mapping mnemonics to pages. x86-lookup runs the pdftotext command line program on the PDF and loads the result into a temporary buffer.

The killer feature of pdftotext is that it emits FORM FEED (U+0012) characters between pages. Think of these as page breaks. By counting form feed characters, x86-lookup can track the page for any part of the document. In fact, Emacs is already set up to do this with its forward-page and backward-page commands. So to build the index, x86-lookup steps forward page-by-page looking for mnemonics, keeping note of the page. Since this process typically takes about 10 seconds, the index is cached in a file (see x86-lookup-cache-directory) for future use. It only needs to happen once for a particular manual on a particular computer.

The mnemonic listing is slightly incomplete, so x86-lookup expands certain mnemonics into the familiar set. For example, all the conditional jumps are listed under “Jcc,” but this is probably not what you’d expect to look up. I compared x86-lookup’s mnemonic listing against NASM/nasm-mode’s mnemonics to ensure everything was accounted for. Both packages benefited from this process.

Once the index is built, pdftotext is no longer needed. If you’re desperate and don’t have this program available, you can borrow the index file from another computer. But you’re on your own for figuring that out!

So to look up RDRAND, x86-lookup checks the index for the page number and invokes a PDF reader on that page. This is where not all PDF readers are created equal. There’s no convention for opening a PDF to a particular page and each PDF reader differs. Some don’t even support it. To deal with this, x86-lookup has a function specialized for different PDF readers. Similar to browse-url-browser-function, x86-lookup has x86-lookup-browse-pdf-function.

By default it tries to open the PDF for viewing within Emacs (did you know Emacs is a PDF viewer?), falling back to on options if the feature is unavailable. I welcome pull requests for any PDF readers not yet supported by x86-lookup. Perhaps this functionality deserves its own package.

That’s it! It’s a simple feature that has already saved me a lot of time. If you’re ever programming in x86 assembly, give x86-lookup a spin.

tags: [ x86 emacs ]

RSA Signatures in Emacs Lisp

Emacs comes with a wonderful arbitrary-precision computer algebra system called calc. I’ve discussed it previously and continue to use it on a daily basis. That’s right, people, Emacs can do calculus. Like everything Emacs, it’s programmable and extensible from Emacs Lisp. In this article, I’m going to implement the RSA public-key cryptosystem in Emacs Lisp using calc.

If you want to dive right in first, here’s the repository:

This is only a toy implementation and not really intended for serious cryptographic work. It’s also far too slow when using keys of reasonable length.

Evaluation with calc

The calc package is particularly useful when considering Emacs’ limited integer type. Emacs uses a tagged integer scheme where integers are embedded within pointers. It’s a lot faster than the alternative (individually-allocated integer objects), but it means they’re always a few bits short of the platform’s native integer type.

calc has a large API, but the user-friendly porcelain for it is the under-documented calc-eval function. It evaluates an expression string with format-like argument substitutions ($n).

(calc-eval "2^16 - 1")
;; => "65535"

(calc-eval "2^$1 - 1" nil 128)
;; => "340282366920938463463374607431768211455"

Notice it returns strings, which is one of the ways calc represents arbitrary precision numbers. For arguments, it accepts regular Elisp numbers and strings just like this function returns. The implicit radix is 10. To explicitly set the radix, prefix the number with the radix and #. This is the same as in the user interface of calc. For example:

(calc-eval "16#deadbeef")
;; => "3735928559"

The second argument (optional) to calc-eval adjusts its behavior. Given nil, it simply evaluates the string and returns the result. The manual documents the different options, but the only other relevant option for RSA is the symbol pred, which asks it to return a boolean “predicate” result.

(calc-eval "$1 < $2" 'pred "4000" "5000")
;; => t

Generating primes

RSA is founded on the difficulty of factoring large composites with large factors. Generating an RSA keypair starts with generating two prime numbers, p and q, and using these primes to compute two mathematically related composite numbers.

calc has a function calc-next-prime for finding the next prime number following any arbitrary number. It uses a probabilistic primarily test — the Fermat Miller-Rabin primality test — to efficiently test large integers. It increments the input until it finds a result that passes enough iterations of the primality test.

(calc-eval "nextprime($1)" nil "100000000000000000")
;; => "100000000000000003"

So to generate a random n-bit prime, first generate a random n-bit number and then increment it until a prime number is found.

;; Generate a 128-bit prime, 10 iterations (0.000084% error rate)
(calc-eval "nextprime(random(2^$1), 10)" nil 128)

Unfortunately calc’s random function is based on Emacs’ random function, which is entirely unsuitable for cryptography. In the real implementation I read n bits from /dev/urandom to generate an n-bit number.

  (set-buffer-multibyte nil)
  (call-process "head" "/dev/urandom" t nil "-c" (format "%d" (/ bits 8)))
  (let ((f (apply-partially #'format "%02x")))
    (concat "16#" (mapconcat f (buffer-string) ""))))

(Note: /dev/urandom is the right choice. There’s no reason to use /dev/random for generating keys.)

Computing e and d

From here the code just follows along from the Wikipedia article. After generating the primes p and q, two composites are computed, n = p * q and i = (p - 1) * (q - 1). Lacking any reason to do otherwise, I chose 65,537 for the public exponent e.

The function rsa--inverse is just a straight Emacs Lisp + calc implementation of the extended Euclidean algorithm from the Wikipedia article pseudocode, computing d ≡ e^-1 (mod i). It’s not much use sharing it here, so take a look at the repository if you’re curious.

(defun rsa-generate-keypair (bits)
  "Generate a fresh RSA keypair plist of BITS length."
  (let* ((p (rsa-generate-prime (+ 1 (/ bits 2))))
         (q (rsa-generate-prime (+ 1 (/ bits 2))))
         (n (calc-eval "$1 * $2" nil p q))
         (i (calc-eval "($1 - 1) * ($2 - 1)" nil p q))
         (e (calc-eval "2^16+1"))
         (d (rsa--inverse e i)))
    `(:public  (:n ,n :e ,e) :private (:n ,n :d ,d))))

The public key is n and e and the private key is n and d. From here we can compute and verify cryptographic signatures.


To compute signature s of an integer m (where m < n), compute s ≡ m^d (mod n). I chose the right-to-left binary method, again straight from the Wikipedia pseudocode (lazy!). I’ll share this one since it’s short. The backslash denotes integer division.

(defun rsa--mod-pow (base exponent modulus)
  (let ((result 1))
    (setf base (calc-eval "$1 % $2" nil base modulus))
    (while (calc-eval "$1 > 0" 'pred exponent)
      (when (calc-eval "$1 % 2 == 1" 'pred exponent)
        (setf result (calc-eval "($1 * $2) % $3" nil result base modulus)))
      (setf exponent (calc-eval "$1 \\ 2" nil exponent)
            base (calc-eval "($1 * $1) % $2" nil base modulus)))

Verifying the signature is the same process, but with the public key’s e: m ≡ s^e (mod n). If the signature is valid, m will be recovered. In theory, only someone who knows d can feasibly compute s from m. If n is small enough to factor, revealing p and q, then d can be feasibly recomputed from the public key. So mind your Ps and Qs.

So that leaves one problem: generally users want to sign strings and files and such, not integers. A hash function is used to reduce an arbitrary quantity of data into an integer suitable for signing. Emacs comes with a bunch of them, accessible through secure-hash. It hashes strings and buffers.

(secure-hash 'sha224 "Hello, world!")
;; => "8552d8b7a7dc5476cb9e25dee69a8091290764b7f2a64fe6e78e9568"

Since the result is hexadecimal, just prefix 16# to turn it into a calc integer.

Here’s the signature and verification functions. Any string or buffer can be signed.

(defun rsa-sign (private-key object)
  (let ((n (plist-get private-key :n))
        (d (plist-get private-key :d))
        (hash (concat "16#" (secure-hash 'sha384 object))))
    ;; truncate hash such that hash < n
    (while (calc-eval "$1 > $2" 'pred hash n)
      (setf hash (calc-eval "$1 \\ 2" nil hash)))
    (rsa--mod-pow hash d n)))

(defun rsa-verify (public-key object sig)
  (let ((n (plist-get public-key :n))
        (e (plist-get public-key :e))
        (hash (concat "16#" (secure-hash 'sha384 object))))
    ;; truncate hash such that hash < n
    (while (calc-eval "$1 > $2" 'pred hash n)
      (setf hash (calc-eval "$1 \\ 2" nil hash)))
    (let* ((result (rsa--mod-pow sig e n)))
      (calc-eval "$1 == $2" 'pred result hash))))

Note the hash truncation step. If this is actually necessary, then your n is very easy to factor! It’s in there since this is just a toy and I want it to work with small keys.

Putting it all together

Here’s the whole thing in action with an extremely small, 128-bit key.

(setf message "hello, world!")

(setf keypair (rsa-generate-keypair 128))
;; => (:public  (:n "74924929503799951536367992905751084593"
;;               :e "65537")
;;     :private (:n "74924929503799951536367992905751084593"
;;               :d "36491277062297490768595348639394259869"))

(setf sig (rsa-sign (plist-get keypair :private) message))
;; => "31982247477262471348259501761458827454"

(rsa-verify (plist-get keypair :public) message sig)
;; => t

(rsa-verify (plist-get keypair :public) (capitalize message) sig)
;; => nil

Each of these operations took less than a second. For larger, secure-length keys, this implementation is painfully slow. For example, generating a 2048-bit key takes my laptop about half an hour, and computing a signature with that key (any size message) takes about a minute. That’s probably a little too slow for, say, signing ELPA packages.

tags: [ emacs elisp lisp ]