Articles tagged win32 at null program

Frankenwine: Multiple personas in a Wine process

2026-01-19T21:51:38Z

I came across a recent article on making Linux system calls from a Wine process. Windows programs running under Wine are still normal Linux processes and may interact with the Linux kernel like any other process. None of this was surprising, and the demonstration works just as I expect. Still, it got the wheels spinning and I realized an almost practical application: build my pkg-config implementation such that on Windows pkg-config.exe behaves as a native pkg-config, but when run under Wine this same binary takes the persona of a Linux program and becomes a cross toolchain pkg-config, bypassing Win32 and talking directly with the Linux kernel. Cosmopolitan Libc cleverly does this out-of-the-box, but in this article we’ll mash together a couple of existing sources with a bit of glue.

The results are in the merge-demo branch of u-config, and took hardly any work:

$ git show --stat
...
 main_linux_amd64.c |   8 ++---
 main_wine.c        | 101 +++++++++++++++++++++++++++++++++++++++++
 src/linux_noarch.c |  16 ++++-----
 src/u-config.c     |   1 +
 4 files changed, 114 insertions(+), 12 deletions(-)

A platform layer, main_wine.c, is a merge of two existing platform layers, one of which required unavoidable tweaks. We’ll get to those details in a moment. First we’ll need to detect if we’re running under Wine, and the best solution I found was to locate ntdll!wine_get_version. If this function exists, we’re in Wine. That works out to a pretty one-liner because ntdll.dll is already loaded:

bool running_on_wine()
{
    return GetProcAddress(GetModuleHandleA("ntdll"), "wine_get_version");
}

An x86-64 Linux syscall wrapper with thorough inline assembly:

ptrdiff_t syscall3(int n, ptrdiff_t a, ptrdiff_t b, ptrdiff_t c)
{
    ptrdiff_t r;
    asm volatile (
        "syscall"
        : "=a"(r)
        : "a"(n), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory"
    );
    return r;
}

ptrdiff_t write(int fd, void *buf, ptrdiff_t len)
{
    return syscall3(SYS_write, fd, (ptrdiff_t)buf, len);
}

I’d normally use long for all these integers because Linux is LP64 (long is pointer-sized), but Windows is LLP64 (only long long is 64 bits). It’s so bizarre to interface with Linux from LLP64, and this will have consequences later. With these pieces we can see the basic shape of a split personality program:

    if (running_on_wine()) {
        write(1, "hello, wine\n", 12);
    } else {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        WriteFile(h, "hello, windows\n", 15, 0, 0);
    }

We can cram two programs into this binary and select which program at run time depending on what we see. In typical programs locating and calling into glibc would be a challenge, particularly with the incompatible ABIs involved. We’re avoiding it here by interfacing directly with the kernel.

Application to u-config

Luckily u-config has completely-optional platform layers implemented with Linux system calls. The POSIX platform layer works fine, and that’s what distributions should generally use, but these bonus platforms are unhosted and do not require libc. That means we can shove it into a Windows build with relatively little trouble.

Before we do that, let’s think about what we’re doing. Debian has great cross toolchain support, including Mingw-w64. There are even a few Windows libraries in the Debian package repository, such as zlib, and we can build Windows programs against them. If you’re cross-building and using pkg-config, you ought to use the cross toolchain pkg-config, which in GNU ecosystems gets an architecture prefix like the other cross tools. Debian cross toolchains each include a cross pkg-config, and it sometimes almost works correctly! Here’s what I get on Debian 13:

$ x86_64-w64-mingw32-pkg-config --cflags --libs zlib
-I/usr/x86_64-w64-mingw32/include -L/usr/x86_64-w64-mingw32/lib -lz

Note the architecture in the -I and -L options. It really is querying the cross sysroot. However, these paths are the cross sysroot’s own system paths, and so should not be listed by pkg-config. It’s suboptimal and indicates this pkg-config is probably misconfigured. In other cases it’s far from correct:

$ x86_64-w64-mingw32-pkg-config --variable pc_path pkg-config
/usr/local/lib/x86_64-linux-gnu/pkgconfig:...

A tool prefixed x86_64-w64-mingw32- should not produce paths containing x86_64-linux-gnu (the host architecture in this case). Our version won’t have these issues.

The u-config platform interface is five functions:

filemap os_mapfile(os *, arena *, s8 path);  // read whole files
s8node *os_listing(os *, arena *, s8 path);  // list directories
void    os_write(os *, i32 fd, s8);          // standard out/err
void    os_fail(os *);                       // non-zero exit

void uconfig(config *);

Platforms implement the first four functions, and call uconfig() with the platform’s configuration, context pointer (os *), command line arguments, environment, and some memory (all in the config object). My strategy is to link two platforms into the binary, and the first challenge is they both define os_write, etc. I did not plan nor intend for one binary to contain more than one platform layer. Unity builds offer a fix without changing a single line of code:

#define os_fail     win32_fail
#define os_listing  win32_listing
#define os_mapfile  win32_mapfile
#define os_write    win32_write
#include "main_windows.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

#define os_fail     linux_fail
#define os_listing  linux_listing
#define os_mapfile  linux_mapfile
#define os_write    linux_write
#include "main_linux_amd64.c"
#undef os_write
#undef os_mapfile
#undef os_listing
#undef os_fail

This dirty but effective trick may look familiar. It also doesn’t interfere with the other builds. Next I define the real platform functions as a dispatch based on our run-time situation:

b32 wine_detected;

filemap os_mapfile(os *ctx, arena *a, s8 path)
{
    if (wine_detected) {
        return linux_mapfile(ctx, a, path);
    } else {
        return win32_mapfile(ctx, a, path);
    }
}

If I were serious about keeping this experiment, I’d lift os as I did the functions (as win32_os, linux_os) and include wine_detected in the context, eliminating this global variable. That cannot be done with simple hacks and macros.
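To make the renaming and dispatch concrete, here is a minimal single-file sketch. It is not u-config code: the two "platforms" are tiny stand-in functions defined inline rather than included sources, the os struct and its fields are hypothetical, and it sidesteps the hard part by sharing one context type between both platforms. It only shows the macro rename and a dispatch that reads the flag from the context instead of a global.

```c
#include <stdio.h>

typedef struct {
    int  wine_detected;  // hypothetical: replaces the global flag
    char out[64];        // captured output, for demonstration only
} os;

// "Platform one", written as if it were the only platform in the program.
#define os_write win32_write
static void os_write(os *ctx, const char *s)
{
    snprintf(ctx->out, sizeof(ctx->out), "win32:%s", s);
}
#undef os_write

// "Platform two", using the same name in its source, renamed the same way.
#define os_write linux_write
static void os_write(os *ctx, const char *s)
{
    snprintf(ctx->out, sizeof(ctx->out), "linux:%s", s);
}
#undef os_write

// The real os_write dispatches on the context rather than a global.
void os_write(os *ctx, const char *s)
{
    if (ctx->wine_detected) {
        linux_write(ctx, s);
    } else {
        win32_write(ctx, s);
    }
}
```

Each stand-in compiles under its renamed symbol, so all three definitions coexist in one translation unit.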

The next challenge is that I wrote the Linux platform layer assuming LP64, and so it uses long instead of an equivalent platform-agnostic type like ptrdiff_t. I never thought this would be an issue because this source literally contains asm blocks and no conditional compilation, yet here we are. Lesson learned. I wanted to try an extremely janky #define on long to fix it, but this source file has a couple long long that won’t play along. These multi-token type names of C are antithetical to its preprocessor! So I adjusted the source manually instead.

The Windows and Linux platform entry points are completely different, both in name and form, and so co-exist naturally. The merged platform layer is a new entry point that will pass control to the appropriate entry point:

void entrypoint(ptrdiff_t *stack);  // Linux
void __stdcall mainCRTStartup();    // Windows

On Linux stack is the initial value of the stack pointer, which points to argc, argv, envp, and auxv. We’ll need to construct an artificial “stack” for the Linux platform layer to harvest. On Windows this is the process entry point, and it will find the rest on its own as a normal Windows process. Ultimately this ended up simpler than I expected:

void __stdcall merge_entrypoint()
{
    wine_detected = running_on_wine();
    if (wine_detected) {
        u8 *fakestack[CMDLINE_ARGV_MAX+1];
        c16 *cmd = GetCommandLineW();
        fakestack[0] = (u8 *)(iz)cmdline_to_argv8(cmd, fakestack+1);
        // TODO: append envp to the fake stack
        entrypoint((iz *)fakestack);
    } else {
        mainCRTStartup();
    }
}

Where cmdline_to_argv8 is my Windows argument parser, already used by u-config, and I reserve one element at the front to store argc. Since this is just a proof-of-concept I didn’t bother fabricating and pushing envp onto the fake stack. The Linux entry point doesn’t need auxv, so it can be omitted. Once in the Linux entry point it’s essentially a Linux process from then on, except that the x64 calling convention is still in use internally.
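For readers unfamiliar with the layout being faked, here is a rough sketch of how an entry point might harvest argc, argv, and envp from such a stack image. The Args struct and harvest function are hypothetical names for illustration, not u-config’s actual code, and the sketch assumes pointers and ptrdiff_t have the same size and representation, as on x86-64.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    int    argc;
    char **argv;
    char **envp;
} Args;

// Read argc, argv, and envp from a Linux-style initial stack image:
//   slot 0              argc (an integer in a pointer-sized slot)
//   slots 1..argc       argv pointers
//   slot argc+1         null terminator
//   slots argc+2..      envp pointers, then another null
Args harvest(void *stack)
{
    Args r;
    ptrdiff_t argc;
    memcpy(&argc, stack, sizeof(argc));  // slot 0 holds an integer
    r.argc = (int)argc;
    r.argv = (char **)stack + 1;
    r.envp = r.argv + r.argc + 1;
    return r;
}
```

A fake stack for this function is just a pointer array whose first element stores the count, followed by the argument pointers and the terminating nulls.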

Finally, I configure the Linux platform layer for Debian’s cross sysroot:

#define PKG_CONFIG_LIBDIR "/usr/x86_64-w64-mingw32/lib/pkgconfig"
#define PKG_CONFIG_SYSTEM_INCLUDE_PATH "/usr/x86_64-w64-mingw32/include"
#define PKG_CONFIG_SYSTEM_LIBRARY_PATH "/usr/x86_64-w64-mingw32/lib"

And that’s it! We have our platform merge. Build (w64devkit):

$ cc -nostartfiles -e merge_entrypoint -o pkg-config.exe main_wine.c

On Debian use x86_64-w64-mingw32-gcc for cc. The -e linker option selects the new, higher level entry point. After installing Wine binfmt, here’s how it looks on Debian:

$ ./pkg-config.exe --cflags --libs zlib
-lz

That’s the correct output, but is it using the cross sysroot? Ask it to include the -I argument despite it being in the cross sysroot:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-I/usr/x86_64-w64-mingw32/include -lz

Looking good! It passes the pc_path test, too:

$ ./pkg-config.exe --variable pc_path pkg-config
/usr/x86_64-w64-mingw32/lib/pkgconfig

Running this same binary on Windows after installing zlib in w64devkit:

$ ./pkg-config.exe --cflags --libs --keep-system-cflags zlib
-IC:/w64devkit/include -lz

Also:

$ ./pkg-config.exe --variable pc_path pkg-config
C:/w64devkit/lib/pkgconfig;C:/w64devkit/share/pkgconfig

My Frankenwine is a success!

Closures as Win32 window procedures

2025-12-12T19:52:10Z

Back in 2017 I wrote about a technique for creating closures in C using a JIT-compiled wrapper. It’s neat, though rarely necessary in real programs, so I don’t think about it often. I applied it to qsort, which sadly accepts no context pointer. More practical would be working around insufficient custom allocator interfaces, to create allocation functions at run-time bound to a particular allocation region. I’ve learned a lot since I last wrote about this subject, and a recent article had me thinking about it again, and how I could do better than before. In this article I will enhance Win32 window procedure callbacks with a fifth argument, allowing us to more directly pass extra context. I’m using w64devkit on x64, but everything here should work out-of-the-box with any x64 toolchain that speaks GNU assembly.

A window procedure has this prototype:

LRESULT Wndproc(
  HWND hWnd,
  UINT Msg,
  WPARAM wParam,
  LPARAM lParam
);

To create a window we must first register a class with RegisterClass, which accepts a set of properties describing a window class, including a pointer to one of these functions.

    MyState *state = ...;

    RegisterClassA(&(WNDCLASSA){
        // ...
        .lpfnWndProc   = my_wndproc,
        .lpszClassName = "my_class",
        // ...
    });

    HWND hwnd = CreateWindowExA(0, "my_class", ..., state);

The thread drives a message pump with events from the operating system, dispatching them to this procedure, which then manipulates the program state in response:

    for (MSG msg; GetMessageW(&msg, 0, 0, 0);) {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);  // calls the window procedure
    }

All four WNDPROC parameters are determined by Win32. There is no context pointer argument. So how does this procedure access the program state? We generally have two options:

Global variables. Yucky but easy. Frequently seen in tutorials.
A GWLP_USERDATA pointer attached to the window.

The second option takes some setup. Win32 passes the last CreateWindowEx argument to the window procedure when the window is created, via WM_CREATE. The procedure attaches the pointer to its window as GWLP_USERDATA. This pointer is passed indirectly, through a CREATESTRUCT. So ultimately it looks like this:

    case WM_CREATE:
        CREATESTRUCT *cs = (CREATESTRUCT *)lParam;
        void *arg = (struct state *)cs->lpCreateParams;
        SetWindowLongPtr(hwnd, GWLP_USERDATA, (LONG_PTR)arg);
        // ...

In future messages we can retrieve it with GetWindowLongPtr. Every time I go through this I wish there was a better way. What if there was a fifth window procedure parameter through which we could pass a context?

typedef LRESULT Wndproc5(HWND, UINT, WPARAM, LPARAM, void *);

We’ll build just this as a trampoline. The x64 calling convention passes the first four arguments in registers, and the rest are pushed on the stack, including this new parameter. Our trampoline cannot just stuff the extra parameter in a register, but will actually have to build a stack frame. Slightly more complicated, but barely so.

Allocating executable memory

In previous articles, and in the programs where I’ve applied techniques like this, I’ve allocated executable memory with VirtualAlloc (or mmap elsewhere). This introduces a small challenge for solving the problem generally: Allocations may be arbitrarily far from our code and data, out of reach of relative addressing. If they’re further than 2G apart, we need to encode absolute addresses, and in the simple case would just assume they’re always too far apart.

These days I’ve more experience with executable formats, and allocation, and I immediately see a better solution: Request a block of writable, executable memory from the loader, then allocate our trampolines from it. Other than being executable, this memory isn’t special, and allocation works the usual way, using functions unaware it’s executable. By allocating through the loader, this memory will be part of our loaded image, guaranteed to be close to our other code and data, allowing our JIT compiler to assume a small code model.
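As a small illustration of the reach limit mentioned above, here is a sketch of the check a JIT compiler would need if it could not assume its code sits near the target. The function name fits_rel32 is hypothetical, and the sketch assumes 64-bit addresses and two’s complement conversion. A relative call encodes a signed 32-bit displacement measured from the end of the call instruction.

```c
#include <stdint.h>

// Nonzero if a call located so that its *next* instruction is at
// next_insn can reach target with a signed 32-bit displacement.
int fits_rel32(uint64_t next_insn, uint64_t target)
{
    int64_t d = (int64_t)(target - next_insn);
    return d >= INT32_MIN && d <= INT32_MAX;
}
```

With memory obtained from the loader as part of the image, this check is expected to always pass for a small code model, which is what lets the trampoline compiler skip it.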

There are a number of ways to do this, and here’s one way to do it with GNU-styled toolchains targeting COFF:

        .section .exebuf,"bwx"
        .globl exebuf
exebuf:	.space 1<<21

This assembly program defines a new section named .exebuf containing 2M of writable ("w"), executable ("x") memory, allocated at run time just like .bss ("b"). We’ll treat this like an arena out of which we can allocate all trampolines we’ll probably ever need. With careful use of .pushsection this could be basic inline assembly, but I’ve left it as a separate source. On the C side I retrieve this like so:

typedef struct {
    char *beg;
    char *end;
} Arena;

Arena get_exebuf()
{
    extern char exebuf[1<<21];
    Arena r = {exebuf, exebuf+sizeof(exebuf)};
    return r;
}

Unfortunately I have to repeat myself on the size. There are different ways to deal with this, but this is simple enough for now. I would have loved to define the array in C with the GCC section attribute, but as is usually the case with this attribute, it’s not up to the task, lacking the ability to set section flags. Besides, by not relying on the attribute, any C compiler could compile this source, and we only need a GNU-style toolchain to create the tiny COFF object containing exebuf.

While we’re at it, a reminder of some other basic definitions we’ll need:

#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

Str clone(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, r.len, char);
    memcpy(r.data, s.data, (size_t)r.len);
    return r;
}

Which have been discussed at length in previous articles.
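The alloc function behind the new macro is not shown here. As a refresher, this is a minimal bump-allocator sketch in the same style, repeating the definitions above so it stands alone. Treat it as an assumption about the shape of alloc rather than the exact code from those articles: this version allocates from the front, zero-initializes, and simply aborts when out of memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define S(s)            (Str){s, sizeof(s)-1}
#define new(a, n, t)    (t *)alloc(a, n, sizeof(t), _Alignof(t))

typedef struct { char *beg; char *end; } Arena;
typedef struct { char *data; ptrdiff_t len; } Str;

// Bump allocation: align the front of the arena, then advance it.
void *alloc(Arena *a, ptrdiff_t count, ptrdiff_t size, ptrdiff_t align)
{
    // Bytes needed to round beg up to a multiple of align (a power of 2).
    ptrdiff_t pad   = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    ptrdiff_t avail = a->end - a->beg - pad;
    if (count < 0 || avail < 0 || count > avail/size) {
        abort();  // out of memory: this sketch just gives up
    }
    char *r = a->beg + pad;
    a->beg += pad + count*size;
    return memset(r, 0, (size_t)(count*size));
}

Str clone(Arena *a, Str s)
{
    Str r = s;
    r.data = new(a, r.len, char);
    memcpy(r.data, s.data, (size_t)r.len);
    return r;
}
```

Because the arena is just a pair of pointers, the same functions work unchanged whether the backing memory is ordinary or executable.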

Trampoline compiler

From here the plan is to create a function that accepts a Wndproc5 and a context pointer to bind, and returns a classic WNDPROC:

WNDPROC make_wndproc(Arena *, Wndproc5, void *arg);

Our window procedure now gets a fifth argument with the program state:

LRESULT my_wndproc(HWND, UINT, WPARAM, LPARAM, void *arg)
{
    MyState *state = arg;
    // ...
}

When registering the class we wrap it in a trampoline compatible with RegisterClass:

    RegisterClassA(&(WNDCLASSA){
        // ...
        .lpfnWndProc   = make_wndproc(a, my_wndproc, state),
        .lpszClassName = "my_class",
        // ...
    });

All windows using this class will readily have access to this state object through their fifth parameter. It turns out setting up exebuf was the more complicated part, and make_wndproc is quite simple!

WNDPROC make_wndproc(Arena *a, Wndproc5 proc, void *arg)
{
    Str thunk = S(
        "\x48\x83\xec\x28"      // sub   $40, %rsp
        "\x48\xb8........"      // movq  $arg, %rax
        "\x48\x89\x44\x24\x20"  // mov   %rax, 32(%rsp)
        "\xe8...."              // call  proc
        "\x48\x83\xc4\x28"      // add   $40, %rsp
        "\xc3"                  // ret
    );
    Str r   = clone(a, thunk);
    int rel = (int)((uintptr_t)proc - (uintptr_t)(r.data + 24));
    memcpy(r.data+ 6, &arg, sizeof(arg));
    memcpy(r.data+20, &rel, sizeof(rel));
    return (WNDPROC)r.data;
}

The assembly allocates a new stack frame, with callee shadow space, and with room for the new argument, which also happens to re-align the stack. It stores the new argument for the Wndproc5 just above the shadow space. Then calls into the Wndproc5 without touching other parameters. There are two “patches” to fill out, which I’ve initially filled with dots: the context pointer itself, and a 32-bit signed relative address for the call. It’s going to be very near the callee. The only thing I don’t like about this function is that I’ve manually worked out the patch offsets.
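One way to avoid working out the patch offsets by hand is to emit the thunk piece by piece and record the offsets along the way. The following is only a sketch of that idea, with hypothetical names (Thunk, emit, layout_thunk); it never executes the bytes, so it runs on any platform and merely confirms the offsets 6, 20, and 24 used above.

```c
#include <string.h>

typedef struct {
    unsigned char code[32];
    int len;      // bytes emitted so far
    int arg_off;  // offset of the 8-byte context pointer
    int rel_off;  // offset of the 4-byte call displacement
    int rel_end;  // offset just past the call, the base for rel32
} Thunk;

static void emit(Thunk *t, const char *bytes, int n)
{
    memcpy(t->code + t->len, bytes, (size_t)n);
    t->len += n;
}

Thunk layout_thunk(void)
{
    Thunk t = {0};
    emit(&t, "\x48\x83\xec\x28", 4);      // sub   $40, %rsp
    emit(&t, "\x48\xb8", 2);              // movq  $arg, %rax
    t.arg_off = t.len;
    emit(&t, "........", 8);              //   placeholder for arg
    emit(&t, "\x48\x89\x44\x24\x20", 5);  // mov   %rax, 32(%rsp)
    emit(&t, "\xe8", 1);                  // call  proc
    t.rel_off = t.len;
    emit(&t, "....", 4);                  //   placeholder for rel32
    t.rel_end = t.len;
    emit(&t, "\x48\x83\xc4\x28", 4);      // add   $40, %rsp
    emit(&t, "\xc3", 1);                  // ret
    return t;
}
```

The recorded offsets could then replace the literal 6, 20, and 24 in the two memcpy patches and the displacement calculation.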

It’s probably not useful, but it’s easy to update the context pointer at any time if you hold onto the trampoline pointer:

void set_wndproc_arg(WNDPROC p, void *arg)
{
    memcpy((char *)p+6, &arg, sizeof(arg));
}

So, for instance:

    MyState *state[2] = ...;  // multiple states
    WNDPROC proc = make_wndproc(a, my_wndproc, state[0]);
    // ...
    set_wndproc_arg(proc, state[1]);  // switch states

Though I expect the most common case is just creating multiple procedures:

    WNDPROC procs[] = {
        make_wndproc(a, my_wndproc, state[0]),
        make_wndproc(a, my_wndproc, state[1]),
    };

To my slight surprise these trampolines still work with an active Control Flow Guard system policy. Trampolines do not have stack unwind entries, and I thought Windows might refuse to pass control to them.

Here’s a complete, runnable example if you’d like to try it yourself: main.c and exebuf.s

Better cases

This is more work than going through GWLP_USERDATA, and real programs have a small, fixed number of window procedures — typically one — so this isn’t the best example, but I wanted to illustrate with a real interface. Again, perhaps the best real use is a library with a weak custom allocator interface:

typedef struct {
    void *(*malloc)(size_t);   // no context pointer!
    void  (*free)(void *);     // "
} Allocator;

void *arena_malloc(size_t, Arena *);

// ...

    Allocator perm_allocator = {
        .malloc = make_trampoline(exearena, arena_malloc, perm),
        .free   = noop_free,
    };
    Allocator scratch_allocator = {
        .malloc = make_trampoline(exearena, arena_malloc, scratch),
        .free   = noop_free,
    };

Something to keep in my back pocket for the future.

Meet the new xxd for w64devkit: rexxd

2025-02-17T00:49:49Z

xxd is a versatile hexdump utility with a “reverse” feature, originally written between 1990–1996. The Vim project soon adopted it, and it’s lived there ever since. If you have Vim, you also have xxd. Its primary use cases are (1) the basis for a hex editor due to its -r reverse option that can unhexdump its previous output, and (2) a data embedding tool for C and C++ (-i). The former provides Vim’s rudimentary hex editor functionality. The second case is of special interest to w64devkit: xxd -i appears in many builds that embed arbitrary data. It’s important that w64devkit has a compatible implementation, and a freshly rewritten, improved xxd, rexxd, now replaces the original xxd (as xxd).

For those unfamiliar with xxd, examples are in order. Its default hexdump output looks like this:

$ echo hello world | xxd | tee dump
00000000: 6865 6c6c 6f20 776f 726c 640a            hello world.

Octets display in pairs with an ASCII text listing on the right. All configurable. I can run this in reverse (-r), recovering the original input:

$ xxd -r dump
hello world

The tool reads the offset before the colon, the hexadecimal octets, and ignores the text column. By editing dump with a text editor, I can change the raw octets of the original input. From this point of view, the hexdump is actually a program of two alternating instructions: seek and write. xxd seeks to the offset, writes the octets, then repeats. It also doesn’t truncate the output file, so a hexdump can express binary patches as a seek/write program.

$ echo hello world >hello
$ echo 6: 65766572796f6e650a | xxd -r - hello
$ cat hello
hello everyone

That seeks to offset 0x6, then writes the 9 octets. The xxd parser is flexible, and I did not need to follow the default format. It figured out the format on its own, and rexxd further improves on this. We can use it to create large files out of thin air, too:

$ echo 3fffffff: 00 | xxd -r - >1G

This command creates an all-zero, 1GiB file, 1G, by seeking to just before 1GiB then writing a zero. I used >1G so that the shell would truncate the file before starting xxd — in case it was larger or contained non-zeros.
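The seek/write reading of a hexdump line can be sketched as a tiny interpreter over an in-memory buffer. This is a deliberately simplified illustration, not how xxd or rexxd parses: the function name patch_line is hypothetical, it accepts only "offset: octets" with at most one space between groups, and it treats a second consecutive space (or any non-hex character) as the start of the ignored text column.

```c
#include <stddef.h>

static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Apply one hexdump line to buf: "seek" to the offset, then "write"
// the octets. Returns the number of octets written, or -1 if the line
// is malformed or would write beyond cap.
ptrdiff_t patch_line(unsigned char *buf, size_t cap, const char *line)
{
    size_t off = 0;
    int digits = 0;
    for (; hexval(*line) >= 0; line++, digits++) {
        off = off*16 + (size_t)hexval(*line);  // the seek target
    }
    if (!digits || *line++ != ':') return -1;
    if (*line == ' ') line++;

    ptrdiff_t n = 0;
    for (;;) {
        int hi = hexval(line[0]);
        if (hi < 0) break;
        int lo = hexval(line[1]);
        if (lo < 0) break;
        if (off >= cap) return -1;
        buf[off++] = (unsigned char)(hi<<4 | lo);  // the write
        n++;
        line += 2;
        if (*line == ' ') line++;  // a second space ends the octets
    }
    return n;
}
```

Applied to a buffer holding "hello world" and the line "6: 65766572796f6e650a", it overwrites from offset 6 and leaves the first six bytes alone, just like the patch shown earlier.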

This is a “smart seek” of course, and it’s not literally seeking on every line. The tool tracks its file position and only seeks when necessary. If seeking fails, it simulates the seek using a write if possible. When would it not be possible? Lines need not be in order, of course, and so it may need to seek backwards. Lines can also overlap in contents. If it weren’t for buffering, or if rexxd had a unified buffer cache, then by using the same file for input and output an “xxd program” could write new instructions for itself and accidentally become Turing-complete.
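The decision logic of a smart seek is small enough to sketch. This is a simulation under stated assumptions, not rexxd’s code: the Output struct and both functions are hypothetical, and counters stand in for the real lseek and write calls so the logic can be exercised without touching a file.

```c
typedef struct {
    long long pos;       // tracked output position
    int       seekable;  // 0 simulates a pipe, where lseek fails
    int       seeks;     // number of real seeks that would be issued
    long long padded;    // zero bytes written in place of a seek
} Output;

// Move the tracked position to target. Returns 1 on success, 0 if the
// target cannot be reached.
int smart_seek(Output *o, long long target)
{
    if (target == o->pos) {
        return 1;                      // already there: no system call
    } else if (o->seekable) {
        o->seeks++;                    // stands in for a real lseek()
    } else if (target > o->pos) {
        o->padded += target - o->pos;  // stands in for writing zeros
    } else {
        return 0;                      // cannot go backwards on a pipe
    }
    o->pos = target;
    return 1;
}

// Stands in for write(): only advances the tracked position.
void note_write(Output *o, long long len)
{
    o->pos += len;
}
```

For sequential hexdump lines the target always equals the tracked position, so the common case costs no seeks at all.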

The other common mode, -i, looks like this:

$ echo hello world >hello
$ xxd -i hello hello.c

Which produces this hello.c:

unsigned char hello[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int hello_len = 12;

Note how it converted the file name into variable names. Characters disallowed in variable names become underscores _. When reading from standard input, xxd only emits the octets. Unless the new-ish -n name option is given, in which case that becomes the variable name. This remains popular because, #embed notwithstanding, as of this writing all major toolchains remain stubborn about embedding data on their own.
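The name conversion can be sketched in a few lines. This is a simplified illustration with a hypothetical function name, not the logic of either xxd or rexxd: it keeps ASCII letters, digits, and underscores, passes through bytes at or above 0x80 so UTF-8 names survive, replaces everything else with an underscore, and does nothing special about a leading digit.

```c
#include <stddef.h>

// Copy name into out (capacity cap, including the terminator),
// replacing bytes that cannot appear in an identifier with '_'.
void to_identifier(char *out, size_t cap, const char *name)
{
    size_t i = 0;
    for (; cap && i < cap-1 && name[i]; i++) {
        unsigned char c = (unsigned char)name[i];
        int keep = (c >= '0' && c <= '9') ||
                   (c >= 'a' && c <= 'z') ||
                   (c >= 'A' && c <= 'Z') ||
                   c == '_' || c >= 0x80;  // assume UTF-8 is acceptable
        out[i] = keep ? (char)c : '_';
    }
    if (cap) {
        out[i] = 0;
    }
}
```

The length variable would then be this identifier with _len appended, as in the hello_len example above.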

The case for replacement

The idea of replacing it began with backporting the -n name option to Vim 9.0 xxd. The feature did not appear in a release until a year ago, 28 years after -i, despite its obviousness. I’ve also felt that xxd is slower than it could be, and a momentary examination reveals it’s buggier than it ought to be. As expected, a few seconds of fuzz testing xxd -r reveals bugs, and it doesn’t even require writing a single line of code:

$ afl-gcc -fsanitize=address,undefined xxd.c
$ mkdir inputs
$ echo >inputs/sample
$ afl-fuzz -i inputs/ -o fuzzout/ ./a.out -r

The Windows port is lacking in the usual ways, unable to handle Unicode paths. The new Vim 9.1 xxd -R color feature broke the Windows port, and if w64devkit included Vim 9.1 then I’d need to patch out the new bugs. As demonstrated above, at least it’s trivial to compile! It’s a single source file, xxd.c, and requires no configuration. I love that.

The more I looked, the more problems I found. It’s not doing anything terribly complex, so I expected it wouldn’t be difficult to rewrite it with a better foundation. So I did. Ignoring tests and documentation, my rewrite is about twice as long. In exchange, it’s substantially faster:

$ dd if=/dev/urandom of=bigfile bs=1M count=64

$ time orig-xxd bigfile dump
real    0m 4.40s
user    0m 2.89s
sys     0m 1.46s

$ time rexxd bigfile dump
real    0m 0.31s
user    0m 0.07s
sys     0m 0.21s

Same in reverse:

$ time orig-xxd -r dump nul
real    0m 5.81s
user    0m 5.67s
sys     0m 0.07s

$ time rexxd -r dump nul
real    0m 0.33s
user    0m 0.23s
sys     0m 0.09s

Or embedding data with rexxd:

$ time orig-xxd -i bigfile bigfile.c
real    0m 10.32s
user    0m 9.85s
sys     0m 0.37s

$ time rexxd -i bigfile bigfile.c
real    0m 0.40s
user    0m 0.07s
sys     0m 0.34s

I wanted to keep it portable and simple, so that’s without fancy SIMD processing. Just SWAR parsing, branch avoidance, no division on hot paths, and sound architecture. I also optimized for the typical case at the cost of the atypical case. It’s a little unfair to compare it to a program probably first written on a 16-bit machine, but there was time for it to pick up these techniques over the decades, too.
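For readers who haven’t met SWAR ("SIMD within a register"), here is a generic textbook illustration of the idea, not code taken from rexxd: testing eight bytes at once for a particular value, such as a newline, using plain 64-bit integer arithmetic. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

// Nonzero if any of the eight bytes in v is zero.
static uint64_t haszero(uint64_t v)
{
    return (v - 0x0101010101010101u) & ~v & 0x8080808080808080u;
}

// Nonzero if any of the eight bytes at p equals b.
int has_byte(const void *p, unsigned char b)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));  // unaligned-safe load
    // XOR with b repeated in every lane turns matches into zero bytes.
    return haszero(v ^ (0x0101010101010101u * b)) != 0;
}
```

A parser can use a test like this to skip ahead a word at a time, falling back to byte-at-a-time handling only for words that contain something interesting.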

Unicode support works well:

$ cat π
3.14159265358979323846264338327950288419716939937510582097494
$ rexxd -i π π.c

Producing this source with Unicode variables:

unsigned char π[] = {
  0x33, 0x2e, 0x31, 0x34, 0x31, 0x35, 0x39, 0x32, 0x36, 0x35, 0x33, 0x35,
  // ...
  0x34, 0x0a
};
unsigned int π_len = 62;

Whereas the original xxd on Windows has the usual CRT problems:

$ orig-xxd -i π
orig-xxd: p: No such file or directory

It also struggles with 64-bit offsets, particularly on 32-bit hosts and LLP64 hosts like Windows. In contrast, I designed rexxd to robustly process file offsets as 64-bit on all hosts. Its tests operate on a virtual file system with virtual files at those sizes, so those paths really have been tested, too.

The original xxd only uses static allocation, which places small range limits on the configuration:

$ orig-xxd -c 1000
orig-xxd: invalid number of columns (max. 256).

In rexxd everything is arena allocated of course, and options are limited only by the available memory, so the above, and more, would work. The arena helps make the SWAR tricks possible, too, providing a fast runway to load more data at a time.

While reverse engineering the original, I documented the bugs I discovered, marking each with a BUG: comment in case you want to see more. I’m not aiming for bug compatibility, so these are not present in rexxd.

Platform layer

The xxd man page suggests using strace to examine the execution of -r reverse. That is, to monitor the seeks and writes of a binary patch in order to debug it. That’s so insightful that I decided to build that as a new -x option (think sh -x). That is, rexxd has a built-in strace on all platforms! The trace is expressed in terms of unix system calls, even on Windows:

$ printf '00:41 \n02:42 \n04:43' | rexxd -x -r - data.bin
open("data.bin", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 19
write(1, "A", 1) = 1
lseek(1, 2, SEEK_SET) = 2
read(0, ..., 4096) = 0
write(1, "B", 1) = 1
lseek(1, 4, SEEK_SET) = 4
write(1, "C", 1) = 1
exit(0) = ?

Is this doing some kind of self-ptrace debugger voodoo? Nope. Like u-config, it has a platform layer, and it simply logs the platform layer calls — except for the trace printout itself of course. While the intention is to debug binary patches, it was also quite insightful in examining rexxd itself. It helped me spot that rexxd flushed more often than strictly necessary.

To port rexxd to any system, define Plt as needed, implement these five plt_ functions, then call xxd. The five functions mostly have the expected unix-like semantics:

typedef struct Plt Plt;
b32  plt_open(Plt *, i32 fd, u8 *path, b32 trunc, Arena *);
i64  plt_seek(Plt *, i32 fd, i64 off, i32 whence);
i32  plt_read(Plt *, u8 *buf, i32 len);
b32  plt_write(Plt *, i32 fd, u8 *buf, i32 len);
void plt_exit(Plt *, i32);
i32  xxd(i32 argc, u8 **argv, Plt *, byte *heap, iz heapsize);

If the platform wants these functions to be “virtual” then it can put function pointers in the Plt struct. Otherwise it stores anything it might need in Plt. Global variables are never necessary. The application layer doesn’t use the standard library except (indirectly) memset and memcpy, and it allocates everything it uses from the provided heap parameter.

plt_open is a little unusual in that it picks the file descriptor: 0 to replace standard input, or 1 to replace standard output. All platforms currently use a virtual file descriptor table, and these do not map onto the real process file descriptors. But they could! Calls are straced in the application layer, so they log virtual file descriptors as seen by rexxd. The arena parameter offers scratch space for the Windows platform layer to convert paths from narrow to wide for CreateFileW, so it can handle long path names with ease.

plt_read doesn’t accept a file descriptor because there’s only one from which to read, 0. plt_write on the other hand allows writing to standard error, 2.

plt_exit doesn’t return, of course. In tests it longjmps back to the top level, as though returning from xxd with a status. This lets me skip allocation null pointer checks, with OOM unwinding safely back to the top level. Since rexxd allocates everything from the arena, it’s all automatically deallocated, so it’s a clean exit.
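The longjmp arrangement can be sketched as follows. This is an assumed shape for a test harness, not rexxd’s actual test code: the Plt fields, the stand-in app function, and run_app are all hypothetical. The status field is volatile because it is modified between setjmp and longjmp and read afterward.

```c
#include <setjmp.h>

typedef struct {
    jmp_buf      top;     // where plt_exit unwinds to
    volatile int status;  // exit status recorded by plt_exit
} Plt;

void plt_exit(Plt *plt, int status)
{
    plt->status = status;
    longjmp(plt->top, 1);  // never returns to the caller
}

// Stand-in for the application: "fails" deep inside when asked to,
// without checking or propagating any error code.
static int app(Plt *plt, int fail)
{
    if (fail) {
        plt_exit(plt, 3);
    }
    return 0;
}

// Top level of a test: behaves as though app returned the exit status.
int run_app(int fail)
{
    Plt plt;
    plt.status = 0;
    if (setjmp(plt.top)) {
        return plt.status;  // arrived here via plt_exit
    }
    return app(&plt, fail);
}
```

Since the application holds no resources outside its arena, jumping over its stack frames leaves nothing behind to clean up.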

On Windows, plt_seek calls SetFilePointerEx. I learned the hard way that the behavior of calling it on a non-file is undefined, not an error, so at least one GetFileType call is mandatory. I also learned that Windows will successfully seek all the way to INT64_MAX. If the file system doesn’t support that offset, it’s a write failure later. For correct operation, rexxd must take care not to overflow its own internal file position tracking near these offsets with Windows allowing seeks to operate at the edge until the first flush. Tests run on a virtual file system thanks to the platform layer, and some tests permit huge seeks and simulate impossibly enormous files in order to probe behavior at the extremes.

This is in contrast to Linux, where seeking beyond the underlying file system’s supported file size is a seek error. For example, on ext4 with the default configuration:

$ echo ffffffff000: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040320, SEEK_SET) = 17592186040320
read(0, ..., 4096) = 0
write(1, "\0", 1) = -1
exit(3) = ?

We can see the seek succeeded, then the write failed because it went one byte beyond the file system limit. Seeking one byte further causes the seek itself to fail (22 EINVAL), and rexxd falls back on write until it fills the storage and runs out of space:

$ echo ffffffff001: 00 | rexxd -x -r - somefile
open("somefile", O_CREAT|O_WRONLY, 0666) = 1
read(0, ..., 4096) = 16
lseek(1, 17592186040321, SEEK_SET) = -1
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
write(1, "\0\0\0\0\0\0...\0\0\0\0\0\0", 4096) = 4096
...

Mostly for fun, I wrote a libc-free platform layer using raw Linux system calls, and it maps almost perfectly onto the kernel interface:

struct Plt { int fds[3]; };

b32 plt_open(Plt *plt, i32 fd, u8 *path, b32 trunc, Arena *)
{
    i32 mode = fd ? O_CREAT|O_WRONLY : 0;
    mode |= trunc ? O_TRUNC : 0;
    plt->fds[fd] = (i32)syscall3(SYS_open, (uz)path, mode, 0666);
    return plt->fds[fd] >= 0;
}

i64 plt_seek(Plt *plt, i32 fd, i64 off, i32 whence)
{
    return syscall3(SYS_lseek, plt->fds[fd], off, whence);
}

i32 plt_read(Plt *plt, u8 *buf, i32 len)
{
    return (i32)syscall3(SYS_read, plt->fds[0], (uz)buf, len);
}

b32 plt_write(Plt *plt, i32 fd, u8 *buf, i32 len)
{
    return len == syscall3(SYS_write, plt->fds[fd], (uz)buf, len);
}

void plt_exit(Plt *, i32 r)
{
    syscall3(SYS_exit, r, 0, 0);
}

On Windows I use the artisanal function prototypes of which I’ve grown so fond. It’s also my first time using w64devkit’s -lmemory in a serious application. I’m using -lchkstk in the “xxd as a DLL” platform layer, too, but that one’s just a toy. In that one I use alloca to allocate an arena, which is a rather novel combination, and the large stack frame requires a stack probe. Otherwise none of rexxd requires stack probes.

w64devkit’s new xxd.exe is delightfully tidy as viewed by peports:

$ du -h xxd.exe
28.0K   xxd.exe
$ peports xxd.exe
KERNEL32.dll
        0       CreateFileW
        0       ExitProcess
        0       GetCommandLineW
        0       GetFileType
        0       GetStdHandle
        0       MultiByteToWideChar
        0       ReadFile
        0       SetFilePointerEx
        0       VirtualAlloc
        0       WideCharToMultiByte
        0       WriteFile
SHELL32.dll
        0       CommandLineToArgvW

Other notes

Buffered output and buffered input are custom tailored for rexxd. When parsing line-oriented input, like -r, it attempts to parse from a view of the input buffer, no copying. The view is the usual string representation:

typedef struct {
    u8 *data;
    iz  len;
} Str;

Does it fail if the line is longer than the buffer? If it straddles reads, does that hurt efficiency? The answer to both is “no” due to the spillover arena. Input is the buffered input struct, and here’s the interface to get the next line:

Str nextline(Input *, Arena *);

If the line isn’t entirely contained in the input buffer, the complete line is concatenated in the arena. So it comfortably handles huge lines while no-copy optimizing for typical short, non-straddling lines. With a per-iteration arena, any arena-backed line is automatically freed at the end of the iteration, so it’s all transparent:

    for (;;) {
        Arena scratch = perm;
        Str line = nextline(b, &scratch);
        // ... line may point into an Input or scratch ...
    }

If the line doesn’t fit in the arena, it triggers OOM handling. That is, it calls plt_exit and something platform-appropriate happens without returning. Beats the pants off old getline!
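To make the spillover idea concrete, here is a minimal sketch of such a nextline. It is not rexxd's implementation: the Input layout, the 8-byte buffer (tiny on purpose, to force straddling), the in-memory source, and the refill and spill helpers are all assumptions for illustration, and abort stands in for plt_exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char u8;
typedef ptrdiff_t     iz;

typedef struct { u8 *data; iz len; } Str;
typedef struct { u8 *beg, *end; } Arena;

// Hypothetical Input: a small buffer refilled from an in-memory source,
// standing in for reads through the platform layer.
typedef struct {
    u8  buf[8];
    iz  len, off;
    u8 *src;
    iz  srclen;
} Input;

static int refill(Input *in)
{
    iz cap = sizeof(in->buf);
    iz n   = in->srclen < cap ? in->srclen : cap;
    if (n) memcpy(in->buf, in->src, (size_t)n);
    in->src    += n;
    in->srclen -= n;
    in->len = n;
    in->off = 0;
    return n > 0;
}

// Append tail to head, where head is empty or the arena's latest
// allocation, so it can be extended in place.
static Str spill(Arena *a, Str head, Str tail)
{
    if (tail.len > a->end - a->beg) {
        abort();  // out of memory; stands in for plt_exit
    }
    if (!head.data) {
        head.data = a->beg;
    }
    assert(head.data + head.len == a->beg);
    memcpy(a->beg, tail.data, (size_t)tail.len);
    a->beg   += tail.len;
    head.len += tail.len;
    return head;
}

// Returns the next line without its newline. A null data pointer
// means end of input.
static Str nextline(Input *in, Arena *a)
{
    Str line = {0};
    for (;;) {
        if (in->off == in->len && !refill(in)) {
            return line;  // end of input, maybe with a final partial line
        }
        u8 *beg   = in->buf + in->off;
        iz  avail = in->len - in->off;
        u8 *nl    = memchr(beg, '\n', (size_t)avail);
        iz  n     = nl ? nl - beg : avail;
        in->off += n + !!nl;
        if (nl && !line.data) {
            return (Str){beg, n};  // common case: a view, no copy
        }
        line = spill(a, line, (Str){beg, n});  // straddles reads: copy
        if (nl) {
            return line;
        }
    }
}
```

A line found whole in the buffer comes back as a view into it, valid only until the next call. Anything that straddles a refill is assembled in the arena, which the caller resets each iteration.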

I came up with a maxof macro that evaluates the maximum of any integral type, signed or unsigned. It appears in overflow checks and more, and I really like how it turned out. For example:

    if (pos > maxof(i64) - off) {
        // overflow
    }
    pos += off;

Or:

i32 trunc32(iz n)
{
    return n>maxof(i32) ? maxof(i32) : (i32)n;
}
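The macro itself isn't shown here, so the following is one possible definition, an assumption rather than a copy of what rexxd uses. It relies on 8-bit bytes and integer types without padding bits. The add_offset helper is a hypothetical wrapper around the first overflow check above, and the standard fixed-width types stand in for i32 and i64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Signed types: (t)-1 is negative, so build 2^(bits-1)-1 in two steps
// that never overflow. Unsigned types: (t)-1 is already the maximum.
#define maxof(t) \
    ((t)-1 < 1 ? (((t)1 << (sizeof(t)*8 - 2)) - 1)*2 + 1 : (t)-1)

// Add a non-negative off to *pos, refusing instead of overflowing.
static int add_offset(int64_t *pos, int64_t off)
{
    assert(off >= 0);
    if (*pos > maxof(int64_t) - off) {
        return 0;  // would overflow
    }
    *pos += off;
    return 1;
}

// Clamp a pointer-sized count to what fits in 32 bits.
static int32_t trunc32(ptrdiff_t n)
{
    return n > maxof(int32_t) ? maxof(int32_t) : (int32_t)n;
}
```

For types narrower than int the result has type int after promotion, which is harmless in comparisons like these.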

Now that I have -lmemory and generally solved string function issues for myself, I leaned into __builtin_memset and __builtin_memcpy for this project. Despite restrict, it’s surprisingly difficult to get compilers to optimize loops into semantically equivalent string function calls. An explicit built-in solves that. It also produces faster debug builds, which is what I run while I work. At -O0, rexxd is about half the speed of a release build.

Other than -x, I don’t plan on inventing new features. I’d like to maintain compatibility with the xxd found everywhere else, and I don’t expect adoption beyond w64devkit. Overall the project took about twice as long as I anticipated — two weekends instead of one — but it turned out better than I expected and I’m very pleased with the results.

Windows dynamic linking depends on the active code page

2024-10-07T19:50:17Z

Windows paths have been WTF-16-encoded for decades, but module names in the import tables of Portable Executable are octets. If a name contains values beyond ASCII — technically out of spec — then the dynamic linker must somehow decode those octets into Unicode in order to construct a lookup path. There are multiple ways this could be done, and the most obvious is the process’s active code page (ACP), which is exactly what happens. As a consequence, the specific DLL loaded by the linker may depend on the system code page. In this article I’ll contrive such a situation.

LoadLibraryA is a similar situation, and potentially applies the code page to a longer portion of the module path. LoadLibraryW is unaffected, at least for the directly-named module, because it’s Unicode all the way through.

For my contrived demonstration I came up with two names that to English-reading eyes appear as two words with extraneous markings:

Ãµral.dll: CP-1252="C3 B5 …"
õral.dll: CP-1252="F5 …"; UTF-8="C3 B5 …"

Both end with ral.dll. I’ve included the CP-1252 encoding for the differing prefixes, and the UTF-8 encoding for the second. I’m using CP-1252 because it’s the most common system code page in the world, especially the Western hemisphere. Due to case insensitivity, the actual DLL may be named ãµral.dll — i.e. to match the second library case — but the module name must be encoded as uppercase when building the import library. Alternatively the second could be Õral.dll, particularly because I won’t use it when constructing an import library.

The plan is to store the octets C3 B5 … in the import table. A process using CP-1252 decodes it to Ãµral.dll. In the UTF-8 code page it decodes to õral.dll. For testing we can use an application manifest to control the code page for a particular PE image — a lot easier than changing the system code page. Otherwise, this trick could dynamically change the behavior of a program in response to the system code page without actually inspecting the active code page.
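To see why the same two octets name different files, here is a rough sketch of both decodings. The function names are mine, and both decoders are deliberately partial: the CP-1252 one only covers ASCII and 0xA0 through 0xFF, where the byte value equals the code point, and the UTF-8 one stops at two-byte sequences.

```c
#include <assert.h>
#include <stddef.h>

// Decode one CP-1252 character into *cp, returning bytes consumed or
// zero if unsupported. The 0x80..0x9F range needs a lookup table that
// this sketch leaves out.
static int decode_cp1252(const unsigned char *s, ptrdiff_t len, long *cp)
{
    assert(s && cp);
    if (len < 1 || (s[0] >= 0x80 && s[0] <= 0x9f)) {
        return 0;
    }
    *cp = s[0];
    return 1;
}

// Decode one UTF-8 character into *cp, returning bytes consumed or
// zero if unsupported. Handles one- and two-byte sequences only.
static int decode_utf8(const unsigned char *s, ptrdiff_t len, long *cp)
{
    assert(s && cp);
    if (len >= 1 && s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (len >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *cp = (long)(s[0] & 0x1f) << 6 | (s[1] & 0x3f);
        return *cp >= 0x80 ? 2 : 0;  // reject overlong encodings
    }
    return 0;
}
```

Fed C3 B5, the first produces U+00C3 then U+00B5 (Ã and µ) from one byte each, while the second consumes both bytes and produces U+00F5 (õ).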

The libraries will have a single function get, which returns a string indicating which library was loaded:

#define X(s) #s
#define S(s) X(s)
__declspec(dllexport) char *get(void) { return S(V); }

Constructing the import library can be tricky because you must consider how the toolchain, editors, and shells decode and encode text, which may involve the build system’s code page. It’s shockingly difficult to script! Binutils dlltool cannot process these names and cannot be used at all. With bleeding edge w64devkit I could reliably construct the DLLs and import library like so, even in a script (Windows 10 and later only):

$ gcc -shared -DV=UTF-8 -o Õral.dll  detect.c
$ gcc -shared -DV=ANSI  -o Ãµral.dll detect.c -Wl,--out-implib=detect.lib

That produces two DLLs and one import library, detect.lib, with the desired module name octets. A straightforward MSVC cl invocation also works so long as it’s not from a batch file. It will quite correctly warn about the strange name situation, which I like. My test program, main.c:

#include <stdio.h>

char *get(void);
int main(void) { puts(get()); }

I link detect.lib when I build it:

$ cc -o main.exe main.c detect.lib

I designed peports to print non-ASCII octets unambiguously (\xXX), and it’s the only tool I know that does so:

$ peports main.exe | tail -n 2
\xc3\xb5ral.dll
        1       get

The module name has the C3 B5 … prefix octets. When I run it under my system code page, CP-1252:

$ ./main
ANSI

If I add a UTF-8 manifest, even just a “side-by-side” manifest, it loads the other library despite an identical import table:

$ cc -o main.exe main.c detect.lib libwinsane.o
$ ./main
UTF-8

Again, without the manifest, if I switched my system code page to UTF-8 then UTF-8 would still be the result.

I can’t think of much practical use for this trick outside of malware. In a real program it would be simpler to inspect the code page, and there’s no benefit to avoiding such a check if it’s needed. Malware could use it to trick inspection tools and scanners that decode module names differently than the dynamic linker. Such tools often incorrectly assume UTF-8, which is what motivated this article.

Slim Reader/Writer Locks are neato

2024-10-03T22:40:13Z

I’m 18 years late, but Slim Reader/Writer Locks have a fantastic interface: pointer-sized (“slim”), zero-initialized, and non-allocating. Lacking cleanup, they compose naturally with arena allocation. Sounds like a futex? That’s because they’re built on futexes introduced at the same time. They’re also complemented by condition variables with the same desirable properties. My only quibble is that slim locks could easily have been 32-bit objects, but it hardly matters. This article, while treating Win32 as a foreign interface, discusses a paper-thin C++ wrapper interface around locks and condition variables, in my own style.

If you’d like to see/try a complete, working demonstration before diving into the details: demo.cpp. We’re going to build this from the ground up, so let’s establish a few primitive integer definitions:

using b32 = signed;
using i32 = signed;
using uz  = decltype(0uz);

Think of uz as like uintptr_t. This implementation will support both 32-bit and 64-bit targets, and we’ll need it as the basis for locks and condition variables:

enum Lock : uz;
enum Cond : uz;

Opaque enums provide additional type safety: They have the properties of an integer, including trivial destruction, but are distinct types which compilers forbid mixing with other integers. We can’t, say, accidentally cross condition variable and lock parameters — my main concern. Aside from zero-initialization, we do not actually care about the values of these variables, so enumerators are unnecessary. (Caveat: GDB cannot display opaque enums, which is slightly irritating.)

The documentation doesn’t explicitly mention zero initialization, but the official *_INIT constants are defined as zero. That locks in zero at the ABI level, so we can count on it.

All the functions we’ll need are exported by kernel32.dll. Locks have two variations on lock/unlock: “exclusive” (write) and “shared” (read). There are also “try” versions, but I won’t be using them.

#define W32(r, p) extern "C" __declspec(dllimport) r __stdcall p noexcept
W32(void, AcquireSRWLockExclusive(Lock *));
W32(void, AcquireSRWLockShared(Lock *));
W32(void, ReleaseSRWLockExclusive(Lock *));
W32(void, ReleaseSRWLockShared(Lock *));

Declaring Win32 functions in C++ is a mouthful, and everything must be written in just the right order, but it’s mostly tucked away in a macro. Usually there’s a stack discipline to these locks, so an RAII scoped guard is in order:

struct Guard {
    Lock *l;
    Guard(Lock *l) : l{l} { AcquireSRWLockExclusive(l); }
    ~Guard()              { ReleaseSRWLockExclusive(l); }
};

struct RGuard {
    Lock *l;
    RGuard(Lock *l) : l{l} { AcquireSRWLockShared(l); }
    ~RGuard()              { ReleaseSRWLockShared(l); }
};

Dead simple. (What about rule of three? Instead of working around this language design flaw, reach into the distant future where it’s been fixed: -Werror=deprecated-copy-dtor.) Usage might look like:

struct Example {
    Lock lock = {};
    i32  value;
};

i32 incr(Example *e)
{
    Guard g(&e->lock);
    return ++e->value;
}

Note the = {} to guarantee the lock is always ready for use. It gets more interesting with condition variables in the mix. That’s three more functions:

W32(b32,  SleepConditionVariableSRW(Cond *, Lock *, i32, b32));
W32(void, WakeAllConditionVariable(Cond *));
W32(void, WakeConditionVariable(Cond *));

The last parameter on SleepConditionVariableSRW indicates if the lock was acquired shared. Why do locks have distinct acquire and release functions while condition variables use a flag for the same purpose? Beats me. I’ll unfold it into two functions, selected by type, with a default infinite wait:

b32 wait(Cond *c, Guard *g, i32 ms = -1)
{
    return SleepConditionVariableSRW(c, g->l, ms, 0);
}

b32 wait(Cond *c, RGuard *g, i32 ms = -1)
{
    return SleepConditionVariableSRW(c, g->l, ms, 1);
}

Usage might look like:

for (RGuard g(&lock); remaining;) {
    wait(&done, &g);
}

The other side is nothing more than a rename (but could also be accomplished through linking):

void signal(Cond *c)
{
    WakeConditionVariable(c);
}

void broadcast(Cond *c)
{
    WakeAllConditionVariable(c);
}

And a couple examples of its usage:

if (Guard g(&lock); !--remaining) {
    signal(&done);
}

// Or:

Guard g(&lock);
ready = true;
broadcast(&init);
while (remaining) {
    wait(&done, &g);
}

A satisfying, powerful synchronization interface with hardly any code!

Giving C++ std::regex a C makeover

2024-09-04T17:15:07Z

Suppose you’re working in C using one of the major toolchains — that is, it’s mainly a C++ implementation — and you need regular expressions. You could integrate a library, but there’s a regex implementation in the C++ standard library included with your compiler, just within reach. As a resourceful engineer, using an asset already in hand seems prudent. But it’s a C++ interface, and you’re using C instead of C++ for a reason, perhaps to avoid dealing with C++. Have no worries. This article is about wrapping std::regex in a tidy C interface which not only hides all the C++ machinery, but utterly tames it. It’s not so much practical as a potpourri of interesting techniques.

If you’d like to skip ahead, here’s the full source up front. Tested with w64devkit, MSVC cl, and clang-cl: scratch/regex-wrap

Interface design

The C interface I came up with, regex.h:

#pragma once
#include <stddef.h>

#define S(s) (str){s, sizeof(s)-1}

typedef struct {
    char     *data;
    ptrdiff_t len;
} str;

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct regex regex;

typedef struct {
    str      *data;
    ptrdiff_t len;
} strlist;

regex  *regex_new(str, arena *);
strlist regex_match(regex *, str, arena *);

Longtime readers will find it familiar: my favorite non-owning, counted strings in place of null-terminated strings (similar to C++ std::string_view) and arena allocation. Yes, such fundamental types wouldn’t “belong” to a regex library like this, but imagine they’re standardized by the project or whatever. Also, this is purely a C header, not a C/C++ polyglot, and will not be used by the C++ portion.

In particular note the lack of “free” functions. The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc. So in a sense, it could be called a non-allocating library. This requires a bit of C++ abuse: I will not call some C++ regex destructors. It shouldn’t matter because they only redundantly manage memory in the arena. (If regex objects are holding file handles or something else unnecessary then its implementation is so poor as to not be worth using, and we should just use a better regex library.)

Now’s a good time to mention a caveat: In order to pull this off the regex library lives in its own Dynamic-Link Library with its own copy of the C++ standard library, i.e. statically linked. My demo is Windows-only, but this concept theoretically extends to shared objects on Linux. Since it’s a C interface that doesn’t expose standard library objects, the DLL can be used by programs compiled with different toolchains. Though that wouldn’t apply to my inciting hypothetical.

Example usage:

regex  *re = regex_new(S("(\\w+)"), perm);
str     s  = S("Hello, world! This is a test.");
strlist m  = regex_match(re, s, perm);
for (ptrdiff_t i = 0; i < m.len; i++) {
    printf("%2td = %.*s\n", i, (int)m.data[i].len, m.data[i].data);
}

This program prints:

 0 = Hello
 1 = world
 2 = This
 3 = is
 4 = a
 5 = test

If matching lots of source strings, scope the arena to the loop and then the results, and any regex working memory, are automatically freed in O(1) at the end of each iteration:

for (ptrdiff_t i = 0; i < ninputs; i++) {
    arena   scratch = *perm;
    strlist matches = regex_match(re, inputs[i], &scratch);
    // ... consume matches ...
}

C++ implementation

On the C++ side the first thing I do is replace new and delete, which is how I force it to allocate from the arena. This replaces new/delete globally, but recall that the regex library has its own, private C++ implementation. Replacements apply only to itself even if there’s other C++ present in the process. If this is the only C++ in the process then it doesn’t require such careful isolation.

I can’t tell std::regex about the arena — it calls operator new the usual way, without extra arguments — so I have to smuggle it in through a thread-local variable:

static thread_local arena *perm;

If I’m sure the library is only used by a single thread then I can omit thread_local, but it’s useful here to demonstrate and measure. Using it in my operator replacements:

void *operator new(size_t size, std::align_val_t align)
{
    arena    *a     = perm;
    ptrdiff_t ssize = size;
    ptrdiff_t pad   = (uintptr_t)a->end & ((int)align - 1);
    if (ssize < 0 || ssize > a->end - a->beg - pad) {
        throw std::bad_alloc{};
    }
    return a->end -= size + pad;
}

void *operator new(size_t size)
{
    return operator new(
        size,
        std::align_val_t(__STDCPP_DEFAULT_NEW_ALIGNMENT__)
    );
}

Starting in C++17, replacing the global allocator requires definitions for both plain new/delete and aligned new/delete. The many other variants, including arrays, call these four and so may be skipped. Allocating over-aligned objects isn’t a special case for arenas, so I implemented plain new by calling aligned new. I’d prefer to allocate through a template so that I can “see” the type, but that’s not an option in this case.

After converting to signed sizes because they’re simpler, it’s the usual from-the-end allocation. I prefer -fno-exceptions but std::regex is inherently exceptional — and I mean that in at least two bad ways — so they’re required. The good news is this library gracefully and reliably handles out-of-memory errors. (The arena makes this trivial to test, so try it for yourself!)

I added a little extra flair replacing delete:

void operator delete(void *) noexcept {}
void operator delete(void *, std::align_val_t) noexcept {}

void operator delete(void *p, size_t size) noexcept
{
    arena *a = perm;
    if (a->end == (char *)p) {
        a->end += size;
    }
}

The two mandatory replacements are no-ops because that’s simply how arenas work. We don’t free individual objects, but many at once. It’s completely optional, but I also replaced sized delete for little other reason than sized deallocation is cool. C++ destructs in reverse order, so this is likely to work out. At least with GCC libstdc++, it freed about a third of the workspace memory before returning to C. I’d rather it didn’t try to free anything at all, but since it’s going to call delete anyway I can get some use out of it.
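The same from-the-end allocation and last-in, first-out reclaim can be sketched in plain C, which is also roughly what a C caller of this interface might use for its own objects. The names alloc and release are mine, and alloc returns null where the C++ version throws:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;
    char *end;
} arena;

// Allocate size bytes from the high end. align must be a power of two.
// The result is aligned as long as size is a multiple of align, which
// holds for sizeof/alignof pairs. Returns null when the arena is full.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align)
{
    assert(align > 0 && !(align & (align - 1)));
    ptrdiff_t pad = (uintptr_t)a->end & (align - 1);
    if (size < 0 || size > a->end - a->beg - pad) {
        return 0;
    }
    return a->end -= size + pad;
}

// Counterpart of sized delete: reclaim only when p is the most recent
// allocation, i.e. it sits at the current end. Otherwise do nothing.
// Any alignment padding is not recovered.
static void release(arena *a, void *p, ptrdiff_t size)
{
    if (a->end == (char *)p) {
        a->end += size;
    }
}
```

Releasing in reverse order of allocation unwinds the arena completely. Releasing out of order is silently ignored, and that memory comes back only when the whole arena is reset.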

Interesting side note: In a rough benchmark these replacements made MSVC std::regex matching four times faster! I expected a small speedup, but not that. In the typical case it appears to be wasting most of its time on allocation. On the other hand, libstdc++ std::regex is overall quite a bit slower than MSVC, and my replacements had no performance effect. It’s spending its time elsewhere, and the small gains are lost interacting with the thread-local.

Finally the meat:

extern "C" std::regex *regex_new(str re, arena *a)
{
    perm = a;
    try {
        return new std::regex(re.data, re.data+re.len);
    } catch (...) {
        return {};
    }
}

It sets the thread-local to the arena, then constructs with “iterators” at each end of the input. All exceptions are caught and turned into a null return. Depending on need, we may want to indicate why it failed — out of memory, invalid regex, etc. — by returning an error value of some sort. An exercise for the reader.

The matcher is a little more complicated:

extern "C" strlist regex_match(std::regex *re, str s, arena *a)
{
    perm = a;
    try {
        std::cregex_iterator it(s.data, s.data+s.len, *re);
        std::cregex_iterator end;

        strlist r = {};
        r.len  = std::distance(it, end);
        r.data = new str[r.len]();
        for (ptrdiff_t i = 0; it != end; it++, i++) {
            r.data[i].data = s.data + it->position();
            r.data[i].len  = it->length();
        }
        return r;

    } catch (...) {
        return {};
    }
}

I create a char * “cregex” iterator, again giving it each end of the input. I hope it’s not just making a copy (MSVC std::regex does grumble grumble). The result is allocated out of the arena. As before, exceptions convert to a null return. Callers can distinguish errors because no-match results have a non-null pointer. The iterator, being a local variable, is destroyed before returning, uselessly calling delete. I could avoid this by allocating it with new, but in practice it doesn’t matter.

You might have noticed the lack of __declspec(dllexport). DEF files are great, and I’ve come to appreciate and prefer them. GCC and MSVC accept them as another input on the command line, and the source need not be aware of exports. My regex.def:

LIBRARY regex
EXPORTS
regex_new
regex_match

In w64devkit, the command to build the DLL:

$ g++ -shared -std=c++17 -o regex.dll regex.cpp regex.def

The MSVC command almost maps 1:1 to the GCC command:

$ cl /LD /std:c++17 /EHsc regex.cpp regex.def

In either case only the C interface is exported (via peports):

$ peports -e regex.dll
EXPORTS
        1       regex_match
        2       regex_new

Reasons against

Though this library is conveniently on hand, and my minimalist C wrapper interface is nicer than a typical C regex library interface, and even hides some std::regex problems, trade-offs must be considered:

No Unicode support, particularly UTF-8
std::regex implementations are universally poor and slow
libstdc++ std::regex is especially slow to compile
Isolating in a DLL (if needed) is inconvenient
DLL is 200K (MSVC) to 700K (GCC) or so

Depending on what I’m doing, some of these may have me looking elsewhere.

An improved chkstk function on Windows

2024-02-05T17:56:05Z

If you’ve spent much time developing with Mingw-w64 you’ve likely seen the symbol ___chkstk_ms, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).

The MSVC toolchain has an identical (x64) or similar (x86) function named __chkstk. We’ll discuss that as well, and w64devkit will include x86 and x64 implementations, useful when linking with MSVC object files. The new x86 __chkstk in particular is also better than the MSVC definition.

A note on spelling: ___chkstk_ms is spelled with three underscores, and __chkstk is spelled with two. On x86, cdecl functions are decorated with a leading underscore, and so may be rendered, e.g. in error messages, with one fewer underscore. The true name is undecorated, and the raw symbol name is identical on x86 and x64. Further complicating matters, libgcc defines a ___chkstk with three underscores. As far as I can tell, this spelling arose from confusion regarding name decoration, but nobody’s noticed for the past 28 years. libgcc’s x64 ___chkstk is obviously and badly broken, so I’m sure nobody has ever used it anyway, not even by accident thanks to the misspelling. I’ll touch on that below.

When referring to a particular instance, I will use a specific spelling. Otherwise the term “chkstk” refers to the family. If you’d like to skip ahead to the source for libchkstk: libchkstk.S.

A gradually committed stack

The header of a Windows executable lists two stack sizes: a reserve size and an initial commit size. The first is the largest the main thread stack can grow, and the second is the amount committed when the program starts. A program gradually commits stack pages as needed up to the reserve size. Binutils objdump option -p lists the sizes. Typical output for a Mingw-w64 program:

$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve      0000000000200000
SizeOfStackCommit       0000000000001000

The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB initially committed. With the Binutils linker, ld, you can set them at link time using --stack. Via gcc, use -Xlinker. For example, to reserve an 8MiB stack and commit half of it:

$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...

MSVC link.exe similarly has /stack.

The purpose of this mechanism is to avoid paying the commit charge for unused stack. It made sense 30 years ago when stacks were a potentially large portion of physical memory. These days it’s a rounding error and silly we’re still dealing with it. Using the above options you can choose to commit the entire stack up front, at which point a chkstk helper is no longer needed (-mno-stack-arg-probe, /Gs2147483647). This requires link-time control of the main module, which isn’t always an option, like when supplying a DLL for someone else to run.

The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.

In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:

void callee(char *);

void example(void)
{
    char large[1<<20];
    callee(large);
}

Compiled with 64-bit gcc -O:

example:
    movl    $1048616, %eax
    call    ___chkstk_ms
    subq    %rax, %rsp
    leaq    32(%rsp), %rcx
    call    callee
    addq    $1048616, %rsp
    ret

I used GCC, but this is practically identical to the code generated by MSVC and Clang. Note the call to ___chkstk_ms in the function prologue before allocating the stack frame (subq). Also note that it sets eax. As a volatile register, this would normally accomplish nothing because it’s done just before a function call, but recall that ___chkstk_ms has a custom ABI. That’s the argument to chkstk. Further note that it uses rax on the return. That’s not the value returned by chkstk, but rather that x64 chkstk preserves all registers.

Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.

Implementing chkstk

In a high level language, chkstk might look something like so:

// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size];  // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0;  // touch the guard page
    }
}

This wouldn’t work for a number of reasons, but if it did, volatile would serve two purposes. First, forcing the side effect to occur. The second is more subtle: The loop must happen in exactly this order, from high to low. Without volatile, loop iterations would be independent — as there are no dependencies between iterations — and so a compiler could reverse the loop direction.

The store can happen anywhere within the guard page, so it’s not necessary to align frame to the page. Simply touching at least one byte per page is enough. This is essentially the definition of libgcc ___chkstk_ms.

How many iterations occur? In the example above, the stack frame will be around 1MiB (2²⁰). With pages of 4KiB (2¹²) that’s 256 iterations. The loop happens unconditionally, meaning every function call requires 256 iterations of this loop. Wouldn’t it be better if the loop ran only as needed, i.e. the first time? MSVC x64 __chkstk skips iterations if possible, and the same goes for my new ___chkstk_ms. Much like the command line string, the low address of the current thread’s guard page is accessible through the Thread Information Block (TIB). A chkstk can cheaply query this address, only looping during initialization or so. (In contrast to Linux, a thread’s stack is fundamentally managed by the operating system.)

Taking that into account, an improved algorithm:

1. Push registers that will be used
2. Compute the low address of the new stack frame (F)
3. Retrieve the low address of the committed stack (C)
4. Go to 7
5. Subtract the page size from C
6. Touch memory at C
7. If C > F, go to 5
8. Pop registers to restore them and return

A little unusual for an unconditional forward jump in pseudo-code, but this closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:

___chkstk_ms:
    push %rax              // 1.
    push %rcx              // 1.
    neg  %rax              // 2. rax = frame low address
    add  %rsp, %rax        // 2. "
    mov  %gs:(0x10), %rcx  // 3. rcx = stack low address
    jmp  1f                // 4.
0:  sub  $0x1000, %rcx     // 5.
    test %eax, (%rcx)      // 6. page fault (very slow!)
1:  cmp  %rax, %rcx        // 7.
    ja   0b                // 7.
    pop  %rcx              // 8.
    pop  %rax              // 8.
    ret                    // 8.

I’ve labeled each instruction with its corresponding pseudo-code. Step 6 is unusual among chkstk implementations: It’s not a store, but a load, still sufficient to fault the page. That test instruction is just two bytes, and unlike other two-byte options, doesn’t write garbage onto the stack (which would be allowed) nor use an extra register. I searched through single byte instructions that can page fault, all of which involve implicit addressing through rdi or rsi, but they increment rdi or rsi, and would require another instruction to correct it.

Because of the return address and two push operations, the low stack frame address is technically too low by 24 bytes. That’s fine. If this exhausts the stack, the program is really cutting it close and the stack is too small anyway. I could be more precise — which, as we’ll soon see, is required for x86 __chkstk — but it would cost an extra instruction byte.

On x64, ___chkstk_ms and __chkstk have identical semantics, so name it __chkstk — which I’ve done in libchkstk — and it works with MSVC. The only practical difference between my chkstk and MSVC __chkstk is that mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking the optimization, is libgcc ___chkstk_ms, weighing 50 bytes, or in practice, due to an unfortunate Binutils default of padding sections, 64 bytes.

I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.
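To see the effect of the fast path without reading assembly, steps 2 through 7 can be modeled in portable C with plain integers standing in for addresses. This is only a simulation under my own assumptions (the Stack struct and probe are hypothetical); the real function has to be assembly because of its custom ABI, and in reality the operating system, not chkstk, moves the committed boundary while handling each fault.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE = 0x1000 };

// Hypothetical model of one thread's stack.
typedef struct {
    uintptr_t sp;         // current stack pointer
    uintptr_t committed;  // low address of the committed region (C)
    long      faults;     // total pages touched so far
} Stack;

// Probe for a frame of frame_size bytes. Returns how many pages this
// call had to touch.
static long probe(Stack *s, uintptr_t frame_size)
{
    assert(frame_size <= s->sp);       // this model ignores wraparound
    uintptr_t f = s->sp - frame_size;  // 2. low address of the new frame
    uintptr_t c = s->committed;        // 3. low address of committed stack
    long touched = 0;
    while (c > f) {                    // 7. loop only while C > F
        c -= PAGE;                     // 5.
        touched++;                     // 6. the real code faults here
    }
    s->committed = c;  // stands in for the OS moving the guard page
    s->faults   += touched;
    return touched;
}
```

Starting with a single committed page, the first probe for a 1MiB frame touches a couple hundred pages, and every later probe of the same size or smaller touches none.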

Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (2), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.
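The wraparound is easy to reproduce with unsigned arithmetic. In this sketch frame_low mirrors the neg/add pair, while frame_low_checked is a hypothetical remedy that clamps to zero, shown only to illustrate the idea and not necessarily what the w64devkit source does:

```c
#include <assert.h>
#include <stdint.h>

// Low frame address as the assembly computes it: sp - size modulo 2^64.
static uint64_t frame_low(uint64_t sp, uint64_t size)
{
    return sp - size;  // wraps to a huge address when size > sp
}

// Hypothetical remedy: clamp to zero on wraparound, so the loop keeps
// probing downward until the stack is exhausted and the process faults.
static uint64_t frame_low_checked(uint64_t sp, uint64_t size)
{
    return size > sp ? 0 : sp - size;
}

// Does the probe loop run at all on entry? It runs only while the
// committed low address is above the frame low address.
static int probes(uint64_t sp, uint64_t committed, uint64_t size, int checked)
{
    assert(committed <= sp);
    uint64_t f = checked ? frame_low_checked(sp, size) : frame_low(sp, size);
    return committed > f;
}
```

With an absurd size the unchecked frame address wraps to somewhere above the stack pointer, the comparison says everything is already committed, and the probe is skipped.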

32-bit chkstk

On x86 ___chkstk_ms has identical semantics to x64. Mine is a copy-paste of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC was ahead of the curve on this design.

However, x86 __chkstk is bonkers. It not only commits the stack, but also allocates the stack frame. That is, it returns with a different stack pointer. The return pointer is initially inside the new stack frame, so chkstk must retrieve it and return by other means. It must also precisely compute the low frame address.

__chkstk:
    push %ecx               // 1.
    neg  %eax               // 2.
    lea  8(%esp,%eax), %eax // 2.
    mov  %fs:(0x08), %ecx   // 3.
    jmp  1f                 // 4.
0:  sub  $0x1000, %ecx      // 5.
    test %eax, (%ecx)       // 6. page fault (very slow!)
1:  cmp  %eax, %ecx         // 7.
    ja   0b                 // 7.
    pop  %ecx               // 8.
    xchg %eax, %esp         // ?. allocate frame
    jmp  *(%eax)            // 8. return

The main differences are:

eax is treated as volatile, so it is not saved
The low frame address is precisely computed with lea (2)
The frame is allocated at step (?) by swapping F and the stack pointer
Post-swap F now points at the return address, so jump through it

MSVC x86 __chkstk does not query the TIB (3), and so unconditionally runs the loop. So there’s an advantage to my implementation besides size.

libgcc x86 ___chkstk has this behavior, and so it’s also a suitable __chkstk aside from the misspelling. Strangely, libgcc x64 ___chkstk also allocates the stack frame, which is never how chkstk was supposed to work on x64. I can only conclude it’s never been used.

Optimization in practice

Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which max out around 64KiB.

_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }

    int64_t size = getfilesize("π.txt");
    // ...
}

Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.
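The iteration counts above are just the frame size divided by the page size. A tiny sketch of that arithmetic, assuming 4KiB pages and the worst case where none of the frame has been committed yet (the function name is mine):

```c
#include <stddef.h>

// Number of 4KiB pages spanned by a frame of the given size, which is
// an upper bound on probe loop iterations for that frame.
size_t probe_pages(size_t frame_size)
{
    return (frame_size + 4095) / 4096;
}
```

A 64KiB frame gives 16 and a 1MiB frame gives 256, matching the figures in the text. With the skip-the-loop optimization the actual count is often zero, since the pages are usually committed already.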

So ultimately after writing an improved ___chkstk_ms I could only measure a tiny difference in contrived programs, and none in any real application. Though there’s still one more benefit I haven’t yet mentioned…

“The first thing we do, let’s kill all the lawyers”.

My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.

Every once in a while ___chkstk_ms rears its ugly head, and I have to make a decision. Do I re-work my code to avoid it? Do I take the reins of the linker and disable stack probes? I haven’t necessarily allocated a large local array: A bit of luck with function inlining can combine several smaller stack frames into one that’s just large enough to require chkstk.

Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!

(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)

MSVC is worse. Hardly anyone follows its license, but fortunately for most the license is practically unenforced. Its chkstk, which currently resides in a loose chkstk.obj, falls into what Microsoft calls “Distributable Code.” Its license requires “external end users to agree to terms that protect the Distributable Code.” In other words, if you compile a program with MSVC, you’re required to have a EULA including the relevant terms from the Visual Studio license. You’re not legally permitted to distribute software in the manner of w64devkit — no installer, just a portable zip distribution — if that software has been built with MSVC. At least not without special care which nobody does. (Don’t worry, I won’t tell.)

How to use libchkstk

To avoid libgcc entirely you need -nostdlib. Otherwise it’s implicitly offered to the linker, and you’d need to manually check if it picked up code from libgcc. If ld complains about a missing chkstk, use -lchkstk to get a definition. If you use -lchkstk when it’s not needed, nothing happens, so it’s safe to always include.

I also recently added a libmemory to w64devkit, providing tiny, public domain definitions of memset, memcpy, memmove, memcmp, and strlen. All compilers fabricate calls to these five functions even if you don’t call them yourself, which is how they were selected. (Not because I like them. I really don’t.) If a -nostdlib build complains about these, too, then add -lmemory.

$ gcc -nostdlib ... -lchkstk -lmemory

In MSVC the equivalent option is /nodefaultlib, after which you may see missing chkstk errors, and perhaps more. libchkstk.a is compatible with MSVC, and link.exe doesn’t care that the extension is .a rather than .lib, so supply it at link time. Same goes for libmemory.a if you need any of those, too.

$ cl ... /link /nodefaultlib libchkstk.a libmemory.a

While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.

Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!

How to link identical function names from different DLLs

2023-08-27T01:46:31Z

For the typical DLL function call you declare the function prototype (via header file), you inform the link editor (ld, link) that the DLL exports a symbol with that name (import library), it matches the declared name with this export, and it becomes an import in your program’s import table. What happens when two different DLLs export the same symbol? The link editor will pick the first found. But what if you want to use both exports? If they have the same name, how could the program or link editor distinguish them? In this article I’ll demonstrate a technique to resolve this by creating a program which links with and directly uses two different C runtimes (CRTs) simultaneously.

In PE executable images, an import isn’t just a symbol, but a tuple of DLL name and symbol. For human display, a tuple is typically formatted with an exclamation point delimiter, as in msvcrt.dll!malloc, though sometimes without the .dll suffix. You’ve likely seen this in stack traces. Because it’s a tuple and not just a symbol, it’s possible to refer to, and import, the same symbol from different DLLs. Contrast that with ELF, which has a list of shared objects, and a separate list of symbols, with the dynamic linker pairing them up at load time. That permits cool tricks like LD_PRELOAD, but for the same reason loading is less predictable.

Windows comes with several CRTs, and various libraries and applications use one or another (or none) depending on how they were built. As C standard library implementations they export mostly the same symbols, malloc, printf, etc. With imports as tuples, it’s not so unusual for an application to load multiple CRTs at once. Typically coexistence is transitive. That is, a module does not directly access both CRTs but depends on modules that use different CRTs. One module calls, say, msvcrt.dll!malloc, and another module calls ucrtbase.dll!malloc. With DLL-qualified symbols, this is sound so long as modules don’t cross the streams, e.g. an allocation in one module must not be freed in the other. Libraries in this ecosystem must avoid exposing their CRT through their interfaces, such as expecting the library’s caller to free() objects: The caller might not have access to the right free!
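As a sketch of what "not exposing the CRT" looks like in practice, a library can pair every allocating function with its own release function, so the caller never needs the matching free. The widget type and both function names here are hypothetical, made up purely for illustration:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical library object: the caller never sees malloc or free.
typedef struct {
    char *name;
} widget;

// Allocates with whatever CRT the library itself was linked against.
widget *widget_new(const char *name)
{
    widget *w = malloc(sizeof(*w));
    if (!w) {
        return 0;
    }
    size_t len = strlen(name) + 1;
    w->name = malloc(len);
    if (!w->name) {
        free(w);
        return 0;
    }
    memcpy(w->name, name, len);
    return w;
}

// Frees with the same CRT that allocated, whatever the caller's CRT is.
void widget_free(widget *w)
{
    if (w) {
        free(w->name);
        free(w);
    }
}
```

Because both functions live in the same module, they resolve to the same DLL-qualified malloc and free, and the streams never cross.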

Contrast again with the unix ecosystem generally, where a process can only load one libc and everyone is expected to share. Libraries commonly expect callers to free() their objects (e.g. libreadline, xcb), blending their interface with libc.

Suppose you’re in such a situation where, due to unix-oriented libraries, your application must use functions from two different CRTs at once. One might have been compiled with Mingw-w64 and linked with MSVCRT, and the other compiled with MSVC and linked with UCRT. We need to call malloc and free in each, but they have the same name. What a pickle!

There’s an obvious, and probably most common, solution: run-time dynamic linking. Use load-time linking on one CRT, and LoadLibrary on the other CRT with GetProcAddress to obtain function pointers. However, it’s possible to do this entirely with load-time linking!

A malloc by any other name would allocate as well

Think about it a moment and you might wonder: If the names are the same, how can I pick which I’m calling? The tuple representation won’t work because ! cannot appear in an identifier, which is, after all, why it was chosen. The trick is that we’re going to rename one of them! To demonstrate, I’ll use my Windows development kit, w64devkit, a Mingw-w64 distribution that links MSVCRT. I’m going to use UCRT as the second CRT to access ucrtbase.dll!malloc.

I can choose whatever valid identifier I’d like, so I’m going to pick ucrt_malloc. This will require a declaration:

__declspec(dllimport) void *ucrt_malloc(size_t);

If I stop here and try to use it, of course it won’t work:

ld: undefined reference to `__imp_ucrt_malloc'

The linker hasn’t yet been informed of the change in management. For that we’ll need an import library. I’ll define one using a .def file, which I’ll name ucrtbase.def:

LIBRARY ucrtbase.dll
EXPORTS
ucrt_malloc == malloc

The last line says that this library has the symbol ucrt_malloc, but that it should be imported as malloc. This line is the lynchpin to the whole scheme. Note: The double equals is important, as a single equals sign means something different. Next, use dlltool to build the import library:

$ dlltool -d ucrtbase.def -l ucrtbase.lib

The equivalent MSVC tool is lib, but as far as I know it cannot quite do this sort of renaming. However, MSVC link will work just fine with this dlltool-created import library. The name ucrtbase.lib, while obvious, is irrelevant. It’s that LIBRARY line that ties it to the DLL. My test source file looks like this:

#include <stdlib.h>

__declspec(dllimport) void *ucrt_malloc(size_t);

int main(void)
{
    void *msvcrt[] = {malloc(1), malloc(1), malloc(1)};
    void *ucrt[] = {ucrt_malloc(1), ucrt_malloc(1), ucrt_malloc(1)};
    return 0;
}

It compiles successfully:

$ cc -g3 -o main.exe main.c ucrtbase.lib

I can see the two malloc imports with objdump:

$ objdump -p main.exe
...
DLL Name: msvcrt.dll
...
844a	 1021  malloc
...
DLL Name: ucrtbase.dll
847e	    1  malloc

It loads and runs successfully, too:

$ gdb main.exe
Reading symbols from main.exe...
(gdb) break 9
Breakpoint 1 at 0x1400013cd: file main.c, line 9.
(gdb) run
Thread 1 hit Breakpoint 1, main () at main.c:9
9           return 0;
(gdb) p msvcrt
$1 = {0xd06a30, 0xd06a70, 0xd06ab0}
(gdb) p ucrt
$2 = {0x6e9490, 0x6eb7c0, 0x6eb800}

The pointer addresses confirm that these are two distinct allocators. Perhaps you’re wondering what happens if I cross the streams?

int main(void)
{
    free(ucrt_malloc(1));
}

The MSVCRT allocator justifiably panics over the bad pointer:

$ cc -g3 -o chaos.exe chaos.c ucrtbase.lib
$ gdb -ex run chaos.exe
Starting program: chaos.exe
warning: HEAP[chaos.exe]:
warning: Invalid address specified to RtlFreeHeap
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffc42c369af in ntdll!RtlRegisterSecureMemoryCacheCallback ()
(gdb)

While you’re probably not supposed to meddle with ucrtbase.dll like this, the general principle of export renames is reasonable. I don’t expect I’ll ever need to do it, but I like that I have the option.

Everything you never wanted to know about Win32 environment blocks

2023-08-23T21:51:10Z

In an effort to avoid programming by superstition, I did a deep dive into the Win32 “environment block,” the data structure holding a process’s environment variables, in order to better understand it. Along the way I discovered implied and undocumented behaviors. (The environment block must not be confused with the Process Environment Block (PEB) which is different.) Because I cannot possibly retain all the quirky details in my head for long, I’m writing them down for future reference. I ran my tests on different Windows versions as far back as Windows XP SP3 in order to fill in gaps where documentation is ambiguous, incomplete, or wrong. Overall conclusion: Correct, direct manipulation of an environment block is impossible in the general case due to under-specified and incorrect documentation. This has important consequences mainly for programming language runtimes.

Win32 has two interfaces for interacting with environment variables:

The first, which I’ll call get/set, is the easy interface, with Windows doing all the searching and sorting on your behalf. It’s also the only supported interface through which a process can manipulate its own variables. It has no function for enumerating variables.

The second, which I’ll call get/free, allocates a copy of the environment block. Calls to get/set do not modify existing copies. Similarly, manipulating this block has no effect on the environment as viewed through get/set. In other words, it’s read only. We can enumerate our environment variables by walking the environment block. As I will discuss below, enumeration is its only consistently useful purpose!

Technically it’s possible to access the actual environment block through undocumented fields in the PEB. It’s the same content as returned by get/free except that it’s not a copy. It cannot be accessed safely, so I’m ignoring this route.

The environment block format is a null-terminated block of null-terminated strings:

keyA=a\0keyBB=bb\0keyCCC=ccc\0\0

Each string ~~begins with a character other than = and~~ contains at least one =. In my tests this rule was strictly enforced by Windows, and I could not construct an environment block that broke this rule. This list is usually, but not always, sorted. It may contain repeated variables, but they’re always assigned the same value, which is also strictly enforced by Windows.
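Walking this format takes only a few lines. The following is a minimal sketch with function names of my own choosing. It assumes, in line with the correction noted in this article, that a leading = belongs to the variable name, as in the hidden per-drive entries like =C:, so the search for the separator starts at the second character.

```c
#include <stddef.h>
#include <wchar.h>

// Counts the variables in a double-null-terminated environment block.
// A block that starts with a null terminator is treated as empty.
size_t env_count(const wchar_t *env)
{
    size_t count = 0;
    for (const wchar_t *var = env; *var; var += wcslen(var) + 1) {
        count++;
    }
    return count;
}

// Returns the length of the variable name in one "name=value" entry.
// The search for = starts at the second character so that a leading =
// is treated as part of the name (an assumption, see above).
size_t env_namelen(const wchar_t *var)
{
    if (!*var) {
        return 0;
    }
    return 1 + wcscspn(var + 1, L"=");
}
```

For the three-variable block shown above, env_count returns 3, and env_namelen on the second entry returns 5 for keyBB.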

~~The get/free interface has no “set” function, and a process cannot set its own environment block to a custom buffer.~~ (Update: Stefan Kanthak points out SetEnvironmentStringsW. I missed it because it was only officially documented a few months before this article was written.) There is one interface where a process gets to provide a raw environment block: CreateProcess. That is, a parent can construct one for its children.

    wchar_t env[] = L"HOME=C:\\Users\\me\0PATH=C:\\bin;C:\\Windows\0";
    CreateProcessW(L"example.exe", ..., env, ...);

Windows imposes some rules upon this environment block:

~~If an element begins with = or does not contain =, CreateProcess fails.~~
Repeated variables are modified to match the first instance. If you’re potentially overriding using a duplicate, put the override first.
Some cases of bad formatting become memory access violations.
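Since the first instance wins, an override can be applied by placing it in front of an existing block. Here is a sketch of that idea using a hypothetical helper, env_prepend, that writes into a caller-supplied buffer:

```c
#include <stddef.h>
#include <wchar.h>

// Copies "entry" followed by every variable in the double-null-terminated
// block "env" into dst, producing a new block with the override first.
// Returns the number of wchar_t written including the final terminator,
// or 0 if dst (capacity cap, in wchar_t) is too small.
size_t env_prepend(wchar_t *dst, size_t cap, const wchar_t *entry,
                   const wchar_t *env)
{
    size_t len = wcslen(entry) + 1;
    const wchar_t *end = env;
    while (*end) {
        end += wcslen(end) + 1;
    }
    size_t envlen = (size_t)(end - env) + 1;  // include block terminator
    if (len > cap || envlen > cap - len) {
        return 0;
    }
    wmemcpy(dst, entry, len);
    wmemcpy(dst + len, env, envlen);
    return len + envlen;
}
```

The old value stays in the block as a duplicate, and per the rule above it would be rewritten to match the first instance in the child.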

As usual for Win32, there are no rules against ill-formed UTF-16, and I could always pass such “UTF-16” through into the child environment block. Keep that in mind even when using the get/set interface.

The SetEnvironmentVariable documentation gives a maximum variable size:

The maximum size of a user-defined environment variable is 32,767 characters. There is no technical limitation on the size of the environment block.

At least on more recent versions of Windows, my experiments proved exactly the opposite. There is no limit on the size of a user-defined environment variable, but environment blocks are limited to 2GiB, for both 32-bit and 64-bit processes. I could even create such huge environments in large address aware 32-bit processes, though the interfaces are prone to error due to allocation problems.

There’s one special case where CreateProcess is illogical, and it’s certainly a case of confusion within its implementation. An environment block is not allowed to be empty. An empty environment is represented as a block containing one empty (zero length) element. That is, two null terminators in a row. It’s the one case where an environment block may contain an element without a =. The logical empty environment block would be just one null terminator, to terminate the block itself, because it contains no variables. You can safely pretend that’s the case when parsing an environment block, as this special case is superfluous.

However, CreateProcess partially enforces this silly, unnecessary special case! If an environment block begins with a null terminator, the next character must be in a mapped memory region because it will read this character. If it’s not mapped, the result is a memory access violation. Its actual value doesn’t matter, and CreateProcess will treat it as though it was another null terminator. Surely someone at Microsoft would have noticed by now that this behavior makes no sense, but I guess it’s kept for backwards compatibility?
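One way to stay clear of this quirk when copying or constructing a block is to never allocate fewer than two elements. A small sketch, with a helper name of my own invention, that computes how many wchar_t to allocate for a copy that is safe to hand to CreateProcess:

```c
#include <stddef.h>
#include <wchar.h>

// Number of wchar_t to allocate for a copy of the block, including the
// final terminator. An empty block is rounded up to two elements so
// that the extra character CreateProcess reads is always mapped. Only
// reads up to the first block terminator of the input.
size_t env_alloc_len(const wchar_t *env)
{
    const wchar_t *end = env;
    while (*end) {
        end += wcslen(end) + 1;
    }
    size_t len = (size_t)(end - env) + 1;
    return len < 2 ? 2 : len;
}
```

The destination should be zero-filled before copying, so that in the empty case the second element is also a null terminator.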

The CreateProcess documentation says that “the system uses a sorted environment” but this made no difference in my tests. The word “must” appears in this sentence, but it’s unclear if it applies to sorting, or even outside the special case being discussed. GetEnvironmentVariable works fine on an unsorted environment block. SetEnvironmentVariable maintains sorting, but given an unsorted block it goes somewhere in the middle, probably wherever a bisection happens to land. Perhaps look-ups in sorted blocks are faster, but environment blocks are so small — ~~a maximum of 32K characters~~ (Update: only true for ANSI) — that, in practice, it really does not matter.

Suppose you’re meticulous and want to sort your environment block before spawning a process. How do you go about it? There’s the rub: The official documentation is incomplete! The Changing Environment Variables page says:

All strings in the environment block must be sorted alphabetically by name. The sort is case-insensitive, Unicode order, without regard to locale.

What do they mean by “case-insensitive” sort? Does “Unicode order” mean case folding? A reasonable guess, but no, that’s not how get/set works. Besides, how does “Unicode order” apply to ill-formed UTF-16? Worse, get/set sorting is certainly not “Unicode order” even outside of case-insensitivity! For example, U+1F31E (SUN WITH FACE) sorts ahead of U+FF01 (FULLWIDTH EXCLAMATION MARK) because the former encodes in UTF-16 as U+D83C U+DF1E. Maybe it’s case-insensitive only in ASCII? Nope, π (U+03C0) and Π (U+03A0) are considered identical. Windows uses some kind of case-insensitive, but not case-folded, undocumented early 1990s UCS-2 sorting logic for environment variables.

Update: John Doty suspects the RtlCompareUnicodeString function for sorting. It lines up perfectly with get/set for all possible inputs.
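The SUN WITH FACE example is easy to reproduce with a plain code unit comparison. This sketch deliberately covers only the ordering half of the story: it compares UTF-16 code units with no surrogate decoding, and it omits the case-insensitive part entirely, since that depends on upcase tables I am not going to guess at. The function name is hypothetical.

```c
#include <stdint.h>

// Compares two null-terminated UTF-16 strings by code unit, with no
// case handling and no surrogate pair decoding. Returns negative,
// zero, or positive like strcmp.
int utf16_cmp_units(const uint16_t *a, const uint16_t *b)
{
    for (; *a && *a == *b; a++, b++) {}
    return (*a > *b) - (*a < *b);
}
```

Comparing {0xD83C, 0xDF1E, 0} against {0xFF01, 0} gives a negative result because 0xD83C is less than 0xFF01, even though the code point U+1F31E is greater than U+FF01.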

Without better guidance, the only reliable way to “correctly” sort an environment block is to build it with get/set, then retrieve the result with get/free. The algorithm looks like:

Get a copy of the environment with GetEnvironmentStrings.
Walk the environment and call SetEnvironmentVariable on each name with a null pointer as the value. This clears out the environment.
Call SetEnvironmentVariable for each variable in the new environment.
Get a sorted copy of the new environment with GetEnvironmentStrings.

Unfortunately that’s all global state, so you can only construct one new environment block at a time.

If you know all your variable names ahead of time, then none of this is a problem. Determine what Windows thinks the order should be, then use that in your program when constructing the environment block. It’s the general case where this is a challenge, such as a language runtime designed to operate on arbitrary environment variables with behavior congruent to the rest of the system.

There are similar issues with looking up variables in an environment block. How does case-insensitivity work? Sorting is “without regard to locale” but what about when comparing variable names? The documentation doesn’t say. When enumerating variables using get/free, you might read what get/set considers to be duplicates, though at least values will always agree with get/set, i.e. they’re aliases of one variable. Windows maintains that invariant in my tests. The above algorithm would also delete these duplicates.

For example, if someone passed you a “dirty” environment with duplicates, or that was unsorted, this would clean it up in a way that allows get/free to be traversed in order without duplicates.

    wchar_t *env = GetEnvironmentStringsW();

    // Clear out the environment
    for (wchar_t *var = env; *var;) {
        size_t len = wcslen(var);
        size_t split = wcscspn(var, L"=");
        var[split] = 0;
        SetEnvironmentVariableW(var, 0);
        var[split] = '=';
        var += len + 1;
    }

    // Restore the original variables
    for (wchar_t *var = env; *var;) {
        size_t len = wcslen(var);
        size_t split = wcscspn(var, L"=");
        var[split] = 0;
        SetEnvironmentVariableW(var, var+split+1);
        var += len + 1;
    }

    FreeEnvironmentStringsW(env);

On the second pass, SetEnvironmentVariableW will gobble up all the duplicates.

As a final note, the CreateProcess page had said this up until February 2023 about the environment block parameter:

If this parameter is NULL and the environment block of the parent process contains Unicode characters, you must also ensure that dwCreationFlags includes CREATE_UNICODE_ENVIRONMENT.

That seems to indicate it’s virtually always wrong to call CreateProcess without that flag — that is, Windows will trash the child’s environment unless this flag is passed — which is a bonkers default. Fortunately this appears to be wrong, which is probably why the documentation was finally corrected (after several decades). Omitting this flag was fine under all my tests, and I was unable to produce surprising behavior on any system.

In summary:

Prefer get/set for all operations except enumeration
Environment blocks are not necessarily sorted
Repeat variables are forced to the value of the first instance
Variables may contain ill-formed UTF-16
Empty environment blocks have a superfluous special case
~~Entries cannot begin with =~~
Entries must contain at least one =
Sort order is ambiguous, so you cannot reliably do it yourself
Case-insensitivity of names is ambiguous, so rely on get/set
CREATE_UNICODE_ENVIRONMENT necessary only for non-null environment

Update September 2024: Correction from Kasper Brandt regarding variables beginning with =. I misunderstood how it was parsed and came to the wrong conclusion.

Hand-written Windows API prototypes: fast, flexible, and tedious

2023-05-31T01:38:31Z

I love fast builds, and for years I’ve been bothered by the build penalty for translation units including windows.h. This header has an enormous number of definitions and declarations and so, for C programs, it tends to dominate the build time of those translation units. Most programs, especially systems software, only need a tiny portion of it. For example, when compiling u-config with GCC, two thirds of the debug build was spent processing windows.h just for 4 types, 16 definitions, and 16 prototypes.

To give a sense of the numbers, here’s empty.c, which does nothing but include windows.h.

#include <windows.h>

With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):

$ gcc -E empty.c | grep -vc '^$'
82041

With w64devkit this takes my system ~450ms to compile with GCC:

$ time gcc -c empty.c
real    0m 0.45s
user    0m 0.00s
sys     0m 0.00s

Compiling an actually empty source file takes ~10ms, so it really is spending practically all that time processing headers. MSVC is a faster compiler, and this extends to processing an even larger windows.h that crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:

$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real    0m 0.12s
user    0m 0.09s
sys     0m 0.01s

That’s just low enough to be tolerable, but I’d like the situation with GCC to be better. Defining WIN32_LEAN_AND_MEAN reduces the number of included headers, which has a significant effect:

$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real    0m 0.30s
user    0m 0.00s
sys     0m 0.00s

$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real    0m 0.07s
user    0m 0.01s
sys     0m 0.01s

Precompiled headers

The official solution is precompiled headers. Put all the system header includes, or similar, into a dedicated header, then compile that header into a special format. For example, headers.h:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

Then main.c includes windows.h through this header:

#include "headers.h"

int mainCRTStartup(void)
{
    return 0;
}

If I ask GCC to compile headers.h:

$ gcc headers.h

It produces headers.h.gch. When a source includes headers.h, GCC first searches for an appropriate .gch. Not only must the name match, but so must all the definitions at the moment of inclusion: headers.h should always be the first included header, otherwise it may not work. Now when I compile main.c:

$ time gcc -c main.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

Much better! MSVC has a conventional name for this header recognizable to every Visual Studio user: stdafx.h. It works a bit differently, and I’ve never used it myself, but I trust it has similar results.

Precompiled headers require some extra steps that vary by toolchain. Can we do better? That depends on your definition of “better!”

Artisan, handcrafted prototypes

As mentioned, systems software tends to need only a few declarations: open, read, write, stat, etc. What if I wrote these out manually? A bit tedious, but it doesn’t require special precompiled header handling. It also creates some new possibilities. To illustrate, a CRT-free “hello world” program:

#include <windows.h>

int mainCRTStartup(void)
{
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    char message[] = "Hello, world!\n";
    DWORD len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

This takes my system half a second to compile — quite long to produce just 26 assembly instructions:

$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.50s
user    0m 0.00s
sys     0m 0.00s
$ ./hello.exe
Hello, world!

The program requires prototypes only for GetStdHandle and WriteFile, a definition for STD_OUTPUT_HANDLE, and some typedefs. Starting with the easy stuff, the definition and types look like this:

#define STD_OUTPUT_HANDLE ((DWORD)-11)

typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;

By the way, here’s a cheat code for quickly finding preprocessor definitions, faster than looking them up elsewhere:

$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)

Did you catch the pattern? It’s -10 - fd, where fd is the conventional unix file descriptor number: a kind of mnemonic.
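Written out as code, the mnemonic is a one-liner. The function name is made up for illustration:

```c
// Maps a conventional unix file descriptor number (0, 1, 2) to the
// matching STD_*_HANDLE value: -10 for input, -11 for output, -12 for
// error.
int std_handle_id(int fd)
{
    return -10 - fd;
}
```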

Prototypes are a little trickier, especially if you care about 32-bit. The Windows API uses the “stdcall” calling convention, which is distinct from the “cdecl” calling convention on x86, though the same on x64. Of course, you must already be aware of this merely using the API, as your own callbacks must usually be stdcall themselves. Further, API functions are DLL imports and should be declared as such. Putting it together, here’s GetStdHandle:

__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);

This works with both Mingw-w64 and MSVC. MSVC requires __stdcall between the return type and function name, so don’t get clever about it. If you only care about GCC then you can declare both at once using attributes:

HANDLE GetStdHandle(DWORD)
    __attribute__((dllimport,stdcall));

I like to hide all this behind a macro, with a “table” of all my imports listed just below:

#define W32(r) __declspec(dllimport) r __stdcall
W32(HANDLE) GetStdHandle(DWORD);
W32(BOOL)   WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);

In WriteFile you may have noticed I’m taking shortcuts. The “official” definition uses an ugly pointer typedef, LPCVOID, instead of pointer syntax, but I skipped that type definition. I also replaced the last argument, an OVERLAPPED pointer, with a generic pointer. I only need to pass null. I can keep sanding it down to something more ergonomic:

W32(int)    WriteFile(void *, void *, int, int *, void *);

That’s how I typically write these prototypes. I dropped the const because it doesn’t help me. I used signed sizes because I like them better and it’s what I’m usually holding at the call site. But doesn’t changing the signedness potentially break compatibility? It makes no difference to any practical ABI: It’s passed the same way. In general, signedness is a matter for operators, and only some of them — mainly comparisons (<, >, etc.) and division. It’s a similar story for pointers starting with the 32-bit era, so I can choose whatever pointer types are convenient.
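A quick way to convince yourself of the signedness claim is to look at the bits directly. In this sketch, with helper names of my own, a signed 32-bit value has the same bit pattern as its unsigned conversion, and only the comparison operator tells the two interpretations apart:

```c
#include <stdint.h>
#include <string.h>

// Returns the raw 32-bit pattern of a signed value, i.e. what actually
// lands in the argument register or stack slot.
uint32_t bits_of_signed(int32_t x)
{
    uint32_t r;
    memcpy(&r, &x, sizeof(r));
    return r;
}

// Same operand bits, different answers: signedness lives in the operator.
int less_signed(int32_t a, int32_t b)     { return a < b; }
int less_unsigned(uint32_t a, uint32_t b) { return a < b; }
```

So passing -11 as an int or as (DWORD)-11 hands the callee identical bits, 0xFFFFFFF5, while -1 < 1 holds for signed operands and fails once both are treated as unsigned.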

In general, I can do anything I want so long as I know my compiler will produce an appropriate function call. These are not standard functions, like printf or memcpy, which are implemented in part by the compiler itself, but foreign functions. It’s no different than teaching an FFI how to make a call. This is also, in essence, how OpenGL and Vulkan work, with applications defining the API for themselves.

Considering all this, my new hello world:

#define W32(r) __declspec(dllimport) r __stdcall
W32(void *) GetStdHandle(int);
W32(int)    WriteFile(void *, void *, int, int *, void *);

int mainCRTStartup(void)
{
    void *stdout = GetStdHandle(-10 - 1);
    char message[] = "Hello, world!\n";
    int len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}

You know, there’s a kind of beauty to a program that requires no external definitions. It builds quickly and produces a binary bit-for-bit identical to the original:

$ time cc -nostartfiles -o hello.exe hello.c
real    0m 0.04s
user    0m 0.00s
sys     0m 0.00s

$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real    0m 0.03s
user    0m 0.00s
sys     0m 0.00s

I’ve also been using this to patch over API rough edges. For example, WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus takes OVERLAPPED. These types are explicitly compatible, and only defined separately for annoying technical reasons. I must use the same overlapped object with both APIs at once, meaning I would normally need ugly pointer casts on my Winsock calls, or vice versa with I/O completion ports. But because I’m writing all these definitions myself, I can define a common overlapped structure for both!
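As a sketch of what such a common structure might look like: the type and field names below are mine, and the layout is intended to follow the documented OVERLAPPED layout with the Offset/Pointer union flattened into two 32-bit fields. The prototypes are shown as comments, in the same hand-written style as above, so the snippet stays compiler-neutral.

```c
#include <stdint.h>

// Hypothetical hand-written stand-in for both OVERLAPPED and
// WSAOVERLAPPED, so one object can be passed to both APIs.
typedef struct {
    uintptr_t internal;
    uintptr_t internal_high;
    uint32_t  offset;
    uint32_t  offset_high;
    void     *event;
} Overlapped;

// Both prototypes then take the same pointer type, no casts needed:
//
// W32(int) WSARecvFrom(uintptr_t, void *, int, int *, int *,
//                      void *, int *, Overlapped *, void *);
// W32(int) GetQueuedCompletionStatus(void *, int *, uintptr_t *,
//                                    Overlapped **, int);
```

The structure is 32 bytes on 64-bit targets and 20 bytes on 32-bit targets, which is worth checking with a static assertion against the real headers at least once.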

Perhaps you’re worried that this would be too fragile. Well, as a legacy software aficionado, I enjoy building and running my programs on old platforms. So far these programs still work properly going back 30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag, it’s always been a bug (now long fixed) in the old operating system, not in my programs or these prototypes. So, in effect, this technique has worked well for the past 30 years!

Writing out these definitions is a bit of a chore, but after paying that price I’ve been quite happy with the results. I will likely continue doing it in the future, at least for non-graphical applications.

CRT-free in 2023: tips and tricks

2023-02-15T02:12:00Z

Seven years ago I wrote about “freestanding” Windows executables. After an additional seven years of practical experience both writing and distributing such programs, half using a custom-built toolchain, it’s time to revisit these cabalistic incantations and otherwise scant details. I’ve tweaked my older article over the years as I’ve learned, but this is a full replacement and does not assume you’ve read it. The “why” has been covered and the focus will be on the “how”. Both the GNU and MSVC toolchains will be considered.

I no longer call these “freestanding” programs since that term is, at best, inaccurate. In fact, we will be actively avoiding GCC features associated with that label. Instead I call these CRT-free programs, where CRT stands for the C runtime, the Windows-oriented term for libc. This term communicates both intent and scope.

Entry point

You should already know that main is not the program’s entry point, but a C application’s entry point. The CRT provides the entry point, where it initializes the CRT, including parsing command line options, then calls the application’s main. The real entry point doesn’t have a name. It’s just the address of the function to be called by the loader without arguments.

You might naively assume you could continue using the name main and tell the linker to use it as the entry point. You would be wrong. Avoid the name main! It has a special meaning in C and gets special treatment. Using it without a conventional CRT will confuse your tools and may cause build issues.

While you can use almost any other name you like, the conventional names are mainCRTStartup (console subsystem) and WinMainCRTStartup (windows subsystem). It’s easy to remember: Append CRTStartup to the name you’d use in a normal CRT-linking application. I strongly recommend using these names because it reduces friction. Your tools are already familiar with them, so you won’t need to do anything special.

int mainCRTStartup(void);     // console subsystem
int WinMainCRTStartup(void);  // windows subsystem

The MSVC linker documentation says the entry point uses the __stdcall calling convention. Ignore this and do not use __stdcall for your entry point! Since entry points take no arguments, there is no practical difference from the __cdecl calling convention, so it matters little. Rather, the goal is to avoid __stdcall function decorations. In particular, the GNU linker --entry option does not understand them, nor can it find decorated entry points on its own. If you use __stdcall, then the 32-bit GNU linker will silently (!) choose the beginning of your .text section as the entry point. (This bug was fixed in Binutils 2.42, released January 2024. __stdcall entry points now link correctly.)

If you’re using C++, then of course you will also need to use extern "C" so that it’s not name-mangled. Otherwise the results are similarly bad.

If using -fwhole-program, you will need to mark your entry point as externally visible for GCC so that it knows it’s an entry point. While linkers are familiar with conventional entry point names, GCC the compiler is not. Normally you do not need to worry about this.

__attribute((externally_visible))  // for -fwhole-program
int mainCRTStartup(void)
{
    return 0;
}

The entry point returns int. If there are no other threads then the process will exit with the returned value as its exit status. In practice this is only useful for console programs. Windows subsystem programs have threads started automatically, without warning, and it’s almost certain your main thread is not the last thread. You probably want to use ExitProcess or even TerminateProcess instead of returning. The latter exits more abruptly and can avoid issues with certain subsystems, like DirectSound, not shutting down gracefully: It doesn’t even let them try.

int WinMainCRTStartup(void)
{
    // ...
    TerminateProcess(GetCurrentProcess(), 0);
}

Compilation

Starting with the GNU toolchain, you have two ways to get into “CRT-free mode”: -nostartfiles and -nostdlib. The former is more dummy-proof, and it’s what I use in build documentation. The latter can be more complicated, but when it succeeds you get guarantees about the result. I use it in build scripts I intend to run myself, which I want to fail if they don’t do exactly what I expect. To illustrate, consider this trivial program:

#include <windows.h>

int mainCRTStartup(void)
{
    ExitProcess(0);
}

This program uses ExitProcess from kernel32.dll. Compiling is easy:

$ cc -nostartfiles example.c

The -nostartfiles prevents it from linking the CRT entry point, but it still implicitly passes other “standard” linker flags, including libraries -lmingw32 and -lkernel32. Programs can use kernel32.dll functions without explicitly linking that DLL. But, hey, isn’t -lmingw32 the CRT, the thing we’re avoiding? It is, but it wasn’t actually linked because the program didn’t reference it.

$ objdump -p a.exe | grep -Fi .dll
        DLL Name: KERNEL32.dll

However, -nostdlib does not pass any of these libraries, so you need to do so explicitly.

$ cc -nostdlib example.c -lkernel32

The MSVC toolchain behaves a little like -nostartfiles, not linking a CRT unless you need it, semi-automatically. However, you’ll need to list kernel32.dll and tell it which subsystem you’re using.

$ cl example.c /link /subsystem:console kernel32.lib

However, MSVC has a handy little feature to list these arguments in the source file.

#ifdef _MSC_VER
  #pragma comment(linker, "/subsystem:console")
  #pragma comment(lib, "kernel32.lib")
#endif

This information must go somewhere, and I prefer the source file rather than a build script. Then anyone can point MSVC at the source without worrying about options.

$ cl example.c

I try to make all my Windows programs so simply built.

Stack probes

On Windows, it’s expected that stacks will commit dynamically. That is, the stack is merely reserved address space, and it’s only committed when the stack actually grows into it. This made sense 30 years ago as a memory saving technique, but today it no longer makes sense. However, programs are still built to use this mechanism.

To function properly, programs must touch each stack page for the first time in order. Normally that’s not an issue, but if your stack frame exceeds the page size, there’s a chance it might step over a page. When a function has a large stack frame, GCC inserts a call to a “stack probe” in libgcc that touches its pages in the prologue. It’s not unlike stack clash protection.
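
To make "touch each page in order" concrete, here is a conceptual model of what a probe does. It is only a sketch: the function name and the 4096-byte page size are assumptions, and a real probe works on the live stack pointer rather than on a buffer handed to it.

```c
#include <stddef.h>

enum { PROBE_PAGE = 4096 };  // assumed page size

// Touch one byte in every page-sized stride of a frame, highest address
// first, which is the order a downward-growing stack is committed in.
// Returns the number of strides touched.
static int probe_frame(volatile char *frame, size_t size)
{
    int touched = 0;
    size_t off = size;
    while (off > PROBE_PAGE) {
        off -= PROBE_PAGE;
        frame[off] |= 0;  // read-modify-write that leaves the byte as-is
        touched++;
    }
    if (size) {
        frame[0] |= 0;    // lowest stride, touched last
        touched++;
    }
    return touched;
}
```

A frame smaller than a page needs a single touch, which ordinary use of the frame already provides. That is why only large frames get a probe.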

For example, if I have a 4KiB local variable:

int mainCRTStartup(void)
{
    char buf[1<<12] = {0};
    return 0;
}

When I compile with -nostdlib:

$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'

It’s trying to link the libgcc stack probe. You can disable this behavior with -mno-stack-arg-probe.

$ cc -mno-stack-arg-probe -nostdlib example.c

Or you can just link -lgcc to provide a definition:

$ cc -nostdlib example.c -lgcc

Had you used -nostartfiles, you wouldn’t have noticed because it passes -lgcc automatically. It’s “dummy-proof” because this sort of issue goes away before it comes up, though for the same reason it’s harder to tell exactly what went into a program.

If you disable the probe altogether — my preference — you’ve only solved the linker problem, but the underlying stack commit problem remains and your program may crash. You can solve that by telling the linker to ask the loader to commit a larger stack up front rather than grow it at run time. Say, 2MiB:

$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c

Of course, I wish that this was simply the default behavior because it’s far more sensible! A much better option is to avoid large stack frames in the first place. Allocate locals larger than, say, 1KiB in a scratch arena instead of on the stack.
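
A scratch arena can be very small. The following is a minimal sketch, not a definitive implementation: the Arena type and arena_alloc are hypothetical names, and the backing memory is assumed to come from somewhere outside the stack, such as a static buffer or a single up-front allocation from the operating system.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *beg;  // next free byte
    char *end;  // one past the last usable byte
} Arena;

// Carve size bytes, aligned to align (assumed to be a power of two), off
// the front of the arena. Returns null if the arena is too small.
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (size < 0 || size > a->end - a->beg - pad) {
        return 0;
    }
    char *p = a->beg + pad;
    a->beg = p + size;
    return p;
}
```

A function that would otherwise declare char buf[1<<12] on its stack instead calls arena_alloc(&scratch, 1<<12, 1). If the arena is passed by value, everything allocated from that copy is released implicitly when the function returns.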

MSVC doesn’t have libgcc of course, but it still generates stack probes both for growing the stack and for security checks. The latter requires kernel32.dll, so if I compile the same program with MSVC, I get a bunch of linker failures:

$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...

Using /Gs1000000000 turns off the stack probes, /GS- turns off the checks, /stack commits a larger stack:

$ cl /GS- /Gs1000000000 example.c /link
     /subsystem:console /stack:0x200000,0x200000

Though, as before, better to avoid large stack frames in the first place.

Built-in functions… ugh

The three major C and C++ compilers — GCC, MSVC, Clang — share a common, evil weakness: “built-in” functions. No matter what, they each assume you will supply definitions for standard string functions at link time, particularly memset and memcpy. They do this no matter how many “seriously now, do not use standard C functions” options you pass. When you don’t link a CRT, you may need to define them yourself.

With GCC there’s a catch: it will transform your memset definition — that is, in a function named memset — into a call to itself. After all, it looks an awful lot like memset! This typically manifests as an infinite loop. Use -fno-builtin to prevent GCC from mis-compiling built-in functions.

Even with -fno-builtin, both GCC and Clang will continue inserting calls to built-in functions elsewhere. For example, making an especially large local variable (and using volatile to prevent it from being optimized out):

int mainCRTStartup(void)
{
    volatile char buf[1<<14] = {0};
    return 0;
}

As of this writing, the latest GCC and Clang will generate a memset call despite -fno-builtin:

$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...

To be absolutely pure, you will need to address this in just about any non-trivial program. On the other hand, -nostartfiles will grab a definition from msvcrt.dll for you:

$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
        DLL Name: msvcrt.dll

To be clear, this is a completely legitimate and pragmatic route! You get the benefits of both worlds: the CRT is still out of the way, but there’s also no hassle from misbehaving compilers. If this sounds like a good deal, then do it! (For on-lookers feeling smug: there is no such easy, general solution for this problem on Linux.)

When you write your own definitions, I suggest putting each definition in its own section so that they can be discarded via -Wl,--gc-sections when unused:

__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
    // ...
}

So far, for all three compilers, I’ve only needed to provide definitions for memset and memcpy.
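
For reference, straightforward byte-at-a-time definitions are enough to satisfy the linker. In this sketch they carry the hypothetical names crt_memset and crt_memcpy so that it can be compiled and tested next to an ordinary libc. In an actual CRT-free build they would be named memset and memcpy, keep the section attribute shown above, and be compiled with -fno-builtin so GCC does not turn the loops back into calls to themselves.

```c
#include <stddef.h>

// Fill n bytes at d with the byte value c. Returns d, like memset.
void *crt_memset(void *d, int c, size_t n)
{
    unsigned char *p = d;
    for (size_t i = 0; i < n; i++) {
        p[i] = (unsigned char)c;
    }
    return d;
}

// Copy n bytes from s to d. The regions must not overlap, like memcpy.
void *crt_memcpy(void *restrict d, const void *restrict s, size_t n)
{
    unsigned char *dst = d;
    const unsigned char *src = s;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return d;
}
```

These are deliberately naive. The compiler only inserts such calls for large objects, so whether they need to be fast depends on the program.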

Stack alignment on 32-bit x86

GCC expects a 16-byte aligned stack and generates code accordingly. Such is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However, the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal with it, there will likely be unaligned loads. Some may not be valid (e.g. SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a function attribute for this:

__attribute((force_align_arg_pointer))
int mainCRTStartup(void)
{
    // ...
}

GCC will now align the stack in this function’s prologue. Adjustment is only necessary at entry points, as GCC will maintain alignment through its own frames. This includes all entry points, not just the program entry point, particularly thread start functions. Rule of thumb for i686 GCC: If WINAPI or __stdcall appears in a definition, the stack likely requires alignment.

__attribute((force_align_arg_pointer))
DWORD WINAPI mythread(void *arg)
{
    // ...
}

It’s harmless to use this attribute on x64. The prologue will just be a smidge larger. If you’re worried about it, use #ifdef __i686__ to limit it to 32-bit builds.

Putting it all together

If I’ve written a graphical application with WinMainCRTStartup, used large stack frames, marked my entry point as externally visible, plan to support 32-bit builds, and defined a couple of needed string functions, my optimal entry point may look something like:

#ifdef __GNUC__
__attribute((externally_visible))
#endif
#ifdef __i686__
__attribute((force_align_arg_pointer))
#endif
int WinMainCRTStartup(void)
{
    // ...
}

Then my “optimize all the things” release build may look something like:

$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32

Or with MSVC:

$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows

Or if I’m taking it easy maybe just:

$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c

Or with MSVC (linker flags in source):

$ cl /O2 app.c

SDL2 common mistakes and how to avoid them

2023-01-08T02:09:26Z

This article was discussed on reddit.

SDL has grown on me over the past year. I didn’t understand its value until viewing it in the right lens: as a complete platform and runtime replacing the host’s runtime, possibly including libc. Ideally an SDL application links exclusively against SDL and otherwise not directly against host libraries, though in practice it’s somewhat porous. With care — particularly in avoiding mistakes covered in this article — that ideal is quite achievable for C applications that fit within SDL’s feature set.

SDL applications are always interesting one way or another, so I like to dig in when I come across them. The items in this article are mistakes I’ve either made myself or observed across many such passion projects in the wild.

Mistake 1: Not using `sdl2-config`

This shell script comes with SDL2 and smooths over differences between platforms, even when cross compiling. It informs your compiler where to find and how to link SDL2. The script even works on Windows if you have a unix shell, such as via w64devkit. Use it as a command substitution at the end of the build command, particularly when using --libs. A one-shot or unity build (my preference) looks like so:

$ cc app.c $(sdl2-config --cflags --libs)

Or under separate compilation:

$ cc -c app.c $(sdl2-config --cflags)
$ cc app.o $(sdl2-config --libs)

Alternatively, static link by replacing --libs with --static-libs, though this is discouraged by the SDL project. When dynamically linked, users can, and do, trivially substitute a different SDL2 binary, such as one patched for their system. In my experience, static linking works reliably on Windows but poorly on Linux.

Alternatively, use the general purpose pkg-config. Don’t forget eval!

$ eval cc app.c $(pkg-config sdl2 --cflags --libs)

I wrote a pkg-config for Windows specifically for this case.

Caveats:

Some circumstances require special treatment, and sdl2-config may be too blunt a tool. That’s fine, but generally prefer sdl2-config as the default approach.
sdl2-config does not support extensions such as SDL2_image, so you will need to use pkg-config. Personally I don’t think they’re worth the trouble when there’s stb, or QOI instead of PNG.
There’s an alternative build option using CMake, without any use of sdl2-config, but I won’t discuss it here.

Mistake 2: Including `SDL2/SDL.h`

A lot of examples, including tutorials linked from the official SDL website, have SDL2/ in their include paths. That’s because they’re making mistake 1, not using sdl2-config, and are instead relying on Linux distributions having installed SDL2 in a place coincidentally accessible through that include path.

This is annoying when SDL2 is not installed there, or if I don’t want it using the system’s SDL2. Worse, it can result in subtly broken builds as it mixes and matches different SDL installations. The correct SDL2 include is the following:

#include "SDL.h"

Note the quotes, which helps prevent picking up an arbitrary system header by accident. When carefully and narrowly targeting SDL-the-platform, this will be the only “system” include anywhere in your application.

Mistake 3: Not surrendering `main`

A conventional SDL application has a main function defined in its source, but despite the name, this is distinct from C main. To smooth over platform differences, SDL may rename the application’s main to SDL_main and substitute its own C main. Because of this, main must have the conventional argc/argv prototype and must return a value. (As a special case, C permits main to implicitly return 0, so it’s an easy mistake to make.)

With this in mind, the bare minimum SDL2 application:

#include "SDL.h"

int main(int argc, char **argv)
{
    return 0;
}

Caveat: Like with sdl2-config, some special circumstances require control over the application entry point — see SDL_MAIN_HANDLED and SDL_SetMainReady — but that should be reserved until there’s a need.

One such special case is avoiding linking a CRT on Windows. In principle it’s this simple:

#include "SDL.h"

int WinMainCRTStartup(void)
{
    SDL_SetMainReady();
    // ...
    return 0;
}

Then it’s the usual compiler and linker flags:

$ cc -nostdlib -o app.exe app.c $(sdl2-config --cflags --libs)

This will create a tiny .exe that doesn’t link any system DLL, just SDL2.dll. Quite platform agnostic indeed!

$ objdump -p app.exe | grep -Fi .dll
        DLL Name: SDL2.dll

Alas, as of this writing, this does not work reliably. SDL2’s accelerated renderers on Windows do not clean up properly in SDL_QuitSubSystem nor SDL_Quit, so the process cannot exit without calling ExitProcess in kernel32.dll (or similar). This is still an open experiment.

Mistake 4: Using the SDL wiki for API documentation

The SDL wiki is not authoritative documentation, merely a convenient web-linkable — and downloadable (see “offline html”) — information source. However, anyone who’s spent time on it can tell you it’s incomplete. The authoritative API documentation is the SDL headers, which fortunately are already on hand for building SDL applications. The SDL maintainers themselves use the headers, not the wiki.

If, like me, you’re using ctags, this is actually good news! With a bit of configuration, you can jump to any bit of SDL documentation at any time in your editor, treating the SDL headers like a hyperlinked wiki built into your editor. Just like building, sdl2-config can tell ctags where to find those headers:

$ ctags -a -R --kinds-c=dept $(sdl2-config --prefix)/include/SDL2

I’m using -a (--append) to append to the tags file I’ve already generated for my own program, -R (--recurse) to automatically find all the headers, and --kinds-c=dept to capture exactly the kinds of symbols I care about (#define, enum, prototypes, typedef), no more, no less.

In Vim I CTRL-] over any SDL symbol to jump to its documentation, and then I can use it again within its documentation comment to jump further still to any symbols it mentions, then finally use the jump or tag stack to return. As long as I have t in 'complete' ('cpt'), which is the default, I can also “tab”-complete any SDL symbol using the tags table. There are a few rough edges here and there, but overall it’s a solid editing paradigm.

By the way, with sdl2-config in your $PATH, all the above works out of the box in w64devkit! That’s where I’ve mostly been working with SDL.

Mistake 5: Using stdio streams

A common bit of code in real SDL programs and virtually every tutorial:

if (SDL_Init(...)) {
    fprintf(stderr, "SDL_Init(): %s\n", SDL_GetError());
    return 1;
}

This is not ideal:

fprintf is not part of the SDL platform. This is going behind SDL’s back, reaching around the abstraction to a different platform. Strictly speaking, this API may not even be available to an SDL application.
SDL applications are graphical, so stderr is likely disconnected from anything useful. Few would ever see this message.

Fortunately SDL provides two alternatives:

SDL_Log: like C printf, but SDL will strive to connect it to somewhere useful. If the application was launched from a terminal or console, SDL will find it and hook it up to the logger. On Windows, if there’s a debugger attached, SDL will use OutputDebugString to send logs to the debugger.
SDL_ShowSimpleMessageBox: using any means possible, attempt to display a message to the user. Like SDL_Log, it’s safe to use before/without initializing SDL subsystems.

If you’re paranoid, you could even use both:

if (SDL_Init(...)) {
    SDL_ShowSimpleMessageBox(
        SDL_MESSAGEBOX_ERROR, "SDL_Init()", SDL_GetError(), 0
    );
    SDL_Log("SDL_Init(): %s", SDL_GetError());
    return 1;
}

Though note that SDL_ShowSimpleMessageBox can fail, which will set a new, different error message for SDL_Log!

There’s a similar story again with fopen and loading assets. SDL has an I/O API, SDL_RWops. It’s probably better than the host’s C equivalent, particularly with regards to paths. If you’re not already embedding your assets, use the SDL API instead.

Mistake 6: Using `SDL_RENDERER_ACCELERATED`

This flag and its surrounding bit set, SDL_RendererFlags, are a subtle design flaw in the SDL2 API. Its existence is misleading and has caused widespread misuse. It does not help that the documentation, both header and wiki, is incomplete and unclear. The SDL_CreateRenderer function accepts a bit set as its third argument, and it serves two simultaneous purposes:

Indicates mandatory properties of the renderer. Examples: “must use accelerated rendering,” “must use software rendering,” “must support vertical synchronization (vsync).” Drivers without the chosen properties are skipped.
If SDL_RENDERER_PRESENTVSYNC is set, also enables vsync in the created renderer.

The common mistake is thinking that this bit indicates preference: “prefer an accelerated renderer if possible”. But it really means “accelerated renderer or bust.”

Given a zero for renderer flags, SDL will first attempt to create an accelerated renderer. Failing that, it will then attempt to create a software renderer. A software renderer fallback is exactly the behavior you want! After all, this fallback is one of the primary features of the SDL renderer API. This is so straightforward there are no caveats.

Mistake 7: Not accounting for vsync

For a game, you probably ought to enable vsync in your renderer. The hint: You’re using SDL_PollEvent in your main event loop. Otherwise you will waste lots of resources rendering thousands of frames per second. If my laptop fan spins up running your SDL application, it’s probably because you didn’t do this. The following should be the most conventional SDL renderer configuration:

r = SDL_CreateRenderer(window, -1, SDL_RENDERER_PRESENTVSYNC);

The software renderer supports vsync, so it will not be excluded from the driver search when vsync is requested.

That’s only for SDL renderers. If you’re using OpenGL, set a non-zero SDL_GL_SetSwapInterval so that SDL_GL_SwapWindow synchronizes. For the other rendering APIs, consult their documentation. (I can only speak to SDL and OpenGL from experience.)

Caveat: Beware accidentally relying on vsync for timing in your game. You don’t want your game’s physics to depend on the host’s display speed. Even the pros make this mistake from time to time.
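
One common way to keep physics independent of the display rate is a fixed timestep driven by a millisecond clock. The sketch below is an illustration only: StepClock, clock_steps, and the 10ms step are hypothetical, and the tick count is assumed to come from something like SDL_GetTicks.

```c
#include <stdint.h>

enum { STEP_MS = 10 };  // hypothetical physics step, i.e. 100 updates/sec

typedef struct {
    uint32_t last;   // tick count seen at the previous frame
    uint32_t accum;  // elapsed time not yet simulated, in milliseconds
} StepClock;

// Given the current millisecond tick count, return how many fixed-size
// physics steps to run this frame. Leftover time carries to the next one.
static int clock_steps(StepClock *c, uint32_t now)
{
    c->accum += now - c->last;  // unsigned subtraction survives wraparound
    c->last = now;
    int steps = (int)(c->accum / STEP_MS);
    c->accum %= STEP_MS;
    return steps;
}
```

Each frame runs the physics update clock_steps times, then renders once. A faster or slower display changes how often frames are drawn, not how quickly the simulation advances.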

However, if you’re not making a game, perhaps instead an IMGUI application without active animations, there’s a good chance you don’t need or want vsync. The hint: You’re using SDL_WaitEvent in your main event loop.

In summary, graphical SDL applications fall into one of two cases:

SDL_PollEvent with vsync
SDL_WaitEvent without vsync

Mistake 8: Using `assert.h` instead of `SDL_assert`

Alright, this one isn’t so common, but I’d like to highlight it. The SDL_assert macro is fantastic, easily beating assert.h which doesn’t even break in the right place. It uses SDL to present a user interface to the assertion, with support for retrying and ignoring. It also works great under debuggers, breaking exactly as it should. I have nothing but praise for it, so don’t pass up the chance to use it when you can.

While I’m at it: during development and testing, always always always run your application under a debugger. Don’t close the debugger, just launch through it again after rebuilding. Also, enable UBSan and ASan when available for the extra assertions.

SDL wishlist

For months I had wondered why SDL provides no memory allocation API. I’m fine if it doesn’t have a general purpose allocator since I just want to grab a chunk of host memory for an arena. However, SDL does have allocation functions: SDL_malloc, etc. I didn’t know about them until I stopped making mistake 4.

It was the same story again with math functions: I’d like not to stray from SDL as a platform, but what if I need transcendental functions? I could whip up crude implementations myself, but I’d prefer not. SDL has those too: SDL_sin, etc. Caveat: The math.h functions are built-ins, and compilers use that information to better optimize programs, e.g. cool stuff like -mrecip, or SIMD vectorization. That cannot be done with SDL’s equivalents.

I’m surprised SDL has no random number generator considering how important it is to games. Since I prefer to handle this myself, I don’t mind that so much, but it does leave a lot of toy programs out there calling C rand. I would like it if SDL provided a single, good seed early during startup. There isn’t even a wall clock function for the classic srand(time(0)) seeding event! My solution has been to mix event timestamps into the random state:

static Uint32 rand32(Uint64 *);

Uint64 rng = 0;
for (SDL_Event e; SDL_PollEvent(&e);) {
    rng ^= e.common.timestamp;
    rand32(&rng);  // stir
    switch (e.type) { /* ... */ }
}
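
The snippet above only declares rand32. One possible definition, shown purely as an assumption about what such a function might look like, is a 64-bit linear congruential step that returns the high half of the state. It uses stdint.h types in place of SDL's Uint32 and Uint64 so that it stands alone, and the multiplier is an arbitrary choice for illustration.

```c
#include <stdint.h>

// Advance the 64-bit state by one LCG step and return its upper 32 bits,
// which are the better-mixed half of the state.
static uint32_t rand32(uint64_t *s)
{
    *s = *s * 0x3243f6a8885a308d + 1;
    return (uint32_t)(*s >> 32);
}
```

Because the event loop XORs a timestamp into the state before each call, the sequence quickly depends on when events arrived, not just on the zero seed.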

As I learn more in the future, I may come back and add to this list. At the very least I expect to use SDL increasingly in my own projects.

How to build a WaitGroup from a 32-bit integer

2022-10-05T03:19:07Z

Go has a nifty synchronization utility called a WaitGroup, on which one or more goroutines can wait for concurrent task completion. In other languages, the usual task completion convention is joining threads doing the work. In Go, goroutines aren’t values and lack handles, so a WaitGroup replaces joins. Building a WaitGroup using typical, portable primitives is a messy affair involving constructors and destructors, managing lifetimes. However, on at least Linux and Windows, we can build a WaitGroup out of a zero-initialized integer, much like my 32-bit queue and 32-bit barrier.

In case you’re not familiar with it, a typical WaitGroup use case in Go:

var wg sync.WaitGroup
for _, task := range tasks {
    wg.Add(1)
    go func(t Task) {
        // ... do task ...
        wg.Done()
    }(task)
}
wg.Wait()

I zero-initialize the WaitGroup, the main goroutine increments the counter before starting each task goroutine, each goroutine decrements the counter when done, and the main goroutine waits until the counter reaches zero. My goal is to build the same mechanism in C:

void workfunc(task t, int *wg)
{
    // ... do task ...
    waitgroup_done(wg);
}

int main(void)
{
    // ...
    int wg = 0;
    for (int i = 0; i < ntasks; i++) {
        waitgroup_add(&wg, 1);
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);
    // ...
}

When it’s done, the WaitGroup is back to zero, and no cleanup is required.

I’m going to take it a little further than that: Since its meaning and contents are explicit, you may initialize a WaitGroup to any non-negative task count! In other words, waitgroup_add is optional if the total number of tasks is known up front.

    int wg = ntasks;
    for (int i = 0; i < ntasks; i++) {
        go(workfunc, tasks[i], &wg);
    }
    waitgroup_wait(&wg);

A sneak peek at the full source: waitgroup.c

The four elements (of synchronization)

To build this WaitGroup, we’re going to need four primitives from the host platform, each operating on an int. The first two are atomic operations, and the second two interact with the system scheduler. To port the WaitGroup to a platform you need only implement these four functions, typically as one-liners.

static int  load(int *);           // atomic load
static int  addfetch(int *, int);  // atomic add-then-fetch
static void wait(int *, int);      // wait on change at address
static void wake(int *);           // wake all waiters by address

The first two should be self-explanatory. The wait function waits for the pointed-at integer to change its value, and the second argument is its expected current value. The scheduler will double-check the integer before putting the thread to sleep in case it changes at the last moment — in other words, an atomic check-then-maybe-sleep. The wake function is the other half. After changing the integer, a thread uses it to wake all threads waiting for the pointed-at integer to change. Together, this mechanism is known as a futex.

I’m going to simplify the WaitGroup semantics a bit in order to make my implementation even simpler. Go’s WaitGroup allows adding negatives, and the Add method essentially does double-duty. My version forbids adding negatives. That means the “add” operation is just an atomic increment:

void waitgroup_add(int *wg, int delta)
{
    addfetch(wg, delta);
}

Since it cannot bring the counter to zero, there’s nothing else to do. The “done” operation can decrement to zero:

void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        wake(wg);
    }
}

If the atomic decrement brought the count to zero, we finished the last task, so we need to wake the waiters. We don’t know if anyone is actually waiting, but that’s fine. Some futex use cases will avoid making the relatively expensive system call if nobody’s waiting — i.e. don’t waste time on a system call for each unlock of an uncontended mutex — but in the typical WaitGroup case we expect a waiter when the count finally goes to zero. That’s the common case.

The most complicated of the three is waiting:

void waitgroup_wait(int *wg)
{
    for (;;) {
        int c = load(wg);
        if (!c) {
            break;
        }
        wait(wg, c);
    }
}

First check if the count is already zero and return if it is. Otherwise use the futex to wait for it to change. Unfortunately that’s not exactly the semantics we want, which would be to wait for a certain target. This doesn’t break the wait, but it’s a potential source of inefficiency. If a thread finishes a task between our load and wait, we don’t go to sleep, and instead try again. However, in practice, I ran thousands of threads through this thing concurrently and I couldn’t observe such a “miss.” As far as I can tell, it’s so rare it doesn’t matter.

If this was a concern, the WaitGroup could instead be a pair of integers: the counter and a “latch” that is either 0 or 1. Waiters wait on the latch, and the latch is modified (atomically) when the counter transitions to or from zero. That gives waiters a stable value on which to wait, proxying the counter. However, since this doesn’t seem to matter in practice, I prefer the elegance and simplicity of the single-integer WaitGroup.

Four elements: Linux

With the WaitGroup done at a high level, we now need the per-platform parts. Both GCC and Clang support GNU-style atomics, so I’ll just assume these are available on Linux without worrying about the compiler. The first two functions wrap these built-ins:

static int load(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

For wait and wake we need the futex(2) system call. In an attempt to discourage its direct use, glibc doesn’t wrap this system call in a function, so we must make the system call ourselves.

static void wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}

static void wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}

The INT_MAX means “wake as many as possible.” The other common value is 1 for waking a single waiter. Also, these system calls can’t meaningfully fail, so there’s no need to check the return value. If wait wakes up early (e.g. EINTR), it’s going to check the counter again anyway. In fact, if your kernel is more than 20 years old, predating futexes, and returns ENOSYS (“Function not implemented”), it will still work correctly, though it will be incredibly inefficient.
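
To see the Linux pieces working together, here is a small self-contained harness. It is a sketch under a few assumptions: the futex wrappers are named futex_wait and futex_wake rather than wait and wake to stay clear of the libc wait function, the helper names struct job, worker, and run_jobs are made up, and pthreads stands in for the unspecified go function from earlier.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load(int *p) { return __atomic_load_n(p, __ATOMIC_SEQ_CST); }

static int addfetch(int *p, int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

static void futex_wait(int *p, int current)
{
    syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}

static void futex_wake(int *p)
{
    syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}

static void waitgroup_done(int *wg)
{
    if (!addfetch(wg, -1)) {
        futex_wake(wg);
    }
}

static void waitgroup_wait(int *wg)
{
    for (int c; (c = load(wg));) {
        futex_wait(wg, c);
    }
}

struct job { int *wg, *sum, value; };

static void *worker(void *arg)
{
    struct job *j = arg;
    addfetch(j->sum, j->value);  // the "task": add my value to the total
    waitgroup_done(j->wg);
    return 0;
}

// Run up to 64 jobs on detached threads, synchronizing only through the
// WaitGroup, and return the sum 1 + 2 + ... + n computed by the workers.
static int run_jobs(int n)
{
    static struct job jobs[64];
    static int wg, sum;
    n = n < 0 ? 0 : n > 64 ? 64 : n;
    wg = n;  // task count known up front, so no waitgroup_add
    sum = 0;
    for (int i = 0; i < n; i++) {
        jobs[i] = (struct job){&wg, &sum, i + 1};
        pthread_t t;
        if (pthread_create(&t, 0, worker, &jobs[i])) {
            worker(&jobs[i]);  // no thread available: run the job inline
        } else {
            pthread_detach(t);
        }
    }
    waitgroup_wait(&wg);
    return load(&sum);
}
```

Build with -pthread. The threads are detached, so nothing is ever joined. The only thing telling the caller that all work has finished is the integer reaching zero.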

Four elements: Windows

Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was still supporting versions of Windows without them into 2020, so they’re still relatively “new” for this platform. Nonetheless, they’re now mature enough that we can count on them being available.

I’d like to support both GCC-ish (via Mingw-w64) and MSVC-ish compilers. Mingw-w64 provides a compatible intrin.h, so I can stick to MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t define atomics for int (or even int32_t), strictly long, so I have to sneak in a little cast. (Recall: sizeof(long) == sizeof(int) on every version of Windows supporting futexes.) The other option is to typedef the WaitGroup so that it’s int on Linux (for the futex) and long on Windows (for atomics).

static int load(int *p)
{
    return _InterlockedOr((long *)p, 0);
}

static int addfetch(int *p, int addend)
{
    return addend + _InterlockedExchangeAdd((long *)p, addend);
}
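For comparison, the typedef alternative mentioned above might look something like the following sketch. The type name wg_int is hypothetical, and I’m assuming intrin.h provides the MSVC-style intrinsics on the Windows side, as it does for the cast-based version.

```c
#ifdef _WIN32
#  include <intrin.h>

typedef long wg_int;  // matches the type the MSVC-style intrinsics expect

static wg_int load(wg_int *p)
{
    return _InterlockedOr(p, 0);
}

static wg_int addfetch(wg_int *p, wg_int addend)
{
    // The intrinsic returns the old value, so add to get the new one.
    return addend + _InterlockedExchangeAdd(p, addend);
}

#else

typedef int wg_int;  // matches the 32-bit word a Linux futex operates on

static wg_int load(wg_int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static wg_int addfetch(wg_int *p, wg_int addend)
{
    return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}

#endif
```

The trade-off is that the platform difference leaks into the public type, whereas the cast keeps it hidden inside two small functions.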

The official, sanctioned futex functions are WaitOnAddress and WakeByAddressAll. They used to be in kernel32.dll, but as of this writing they live in API-MS-Win-Core-Synch-l1-2-0.dll, linked via -lsynchronization. Gross. Since I can’t stomach this, I instead call the low-level RTL functions where they’re actually implemented: RtlWaitOnAddress and RtlWakeAddressAll. These live in the nice neighborhood of ntdll.dll. They’re undocumented as far as I can tell, but thankfully Wine comes to the rescue, providing both documentation and several different implementations. Reading through it is educational, and hints at ways to construct futexes on systems lacking them.

These functions aren’t declared in any headers, so I have to do it myself. On the plus side, so far I haven’t paid the substantial compile-time costs of including windows.h, and so I can continue avoiding it. These functions are listed in the ntdll.dll import library, so I don’t need to invent the import library entries.

__declspec(dllimport)
long __stdcall RtlWaitOnAddress(void *, void *, size_t, void *);
__declspec(dllimport)
long __stdcall RtlWakeAddressAll(void *);

Rather conveniently, the semantics perfectly line up with Linux futexes!

static void wait(int *p, int current)
{
    RtlWaitOnAddress(p, &current, sizeof(*p), 0);
}

static void wake(int *p)
{
    RtlWakeAddressAll(p);
}

Like with Linux, there’s no meaningful failure, so the return values don’t matter.

That’s the whole implementation. Considering just a single platform, a flexible, lightweight, and easy-to-use synchronization facility in ~50 lines of relatively simple code is a pretty good deal if you ask me!

My new debugbreak command

2022-07-31T12:59:59Z

I previously mentioned the Windows feature where pressing F12 in a debuggee window causes it to break in the debugger. It works with any debugger — GDB, RemedyBG, Visual Studio, etc. — since the hotkey simply raises a breakpoint structured exception. It’s been surprisingly useful, and I’ve wanted it available in more contexts, such as console programs or even on Linux. The result is a new debugbreak command, now included in w64devkit. Though, of course, you already have everything you need to build it and try it out right now. I’ve also worked out a Linux implementation.

It’s named after an MSVC intrinsic and Win32 function. It takes no arguments, and its operation is indiscriminate: It raises a breakpoint exception in all debuggee processes system-wide. Reckless? Perhaps, but certainly convenient. You don’t need to tell it which process you want to pause. It just works, and a good debugging experience is one of ease and convenience.

The linchpin is DebugBreakProcess. The command walks the process list and fires this function at each process. Nothing happens for programs without a debugger attached, so it doesn’t even bother checking if it’s a debuggee. It couldn’t be simpler. I’ve used it on everything from Windows XP to Windows 11, and it’s worked flawlessly.

HANDLE s = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
PROCESSENTRY32W p = {sizeof(p)};
for (BOOL r = Process32FirstW(s, &p); r; r = Process32NextW(s, &p)) {
    HANDLE h = OpenProcess(PROCESS_ALL_ACCESS, 0, p.th32ProcessID);
    if (h) {
        DebugBreakProcess(h);
        CloseHandle(h);
    }
}

I use it almost exclusively from Vim, where I’ve given it a leader mapping. With the editor focused, I can type backslash then d to pause the debuggee.

map <leader>d :call system("debugbreak")<cr>

With the debuggee paused, I’m free to add new breakpoints or watchpoints, or print the call stack to see what the heck it’s busy doing. The mechanism behind DebugBreakProcess is to create a new thread in the target, with that thread raising the breakpoint exception. The debugger will be stopped in this new thread. In GDB you can use the thread command to switch over to the thread that actually matters, usually thr 1.

debugbreak on Linux

On unix-like systems the equivalent of a breakpoint exception is a SIGTRAP. There’s already a standard command for sending signals, kill, so a debugbreak command can be built using nothing more than a few lines of shell script. However, unlike DebugBreakProcess, signaling every process with SIGTRAP will only end in tears. The script will need a way to determine which processes are debuggees.

Linux exposes processes in the file system as virtual files under /proc, where each process appears as a directory. Its status file includes a TracerPid field, which will be non-zero for debuggees. The script inspects this field, and if non-zero sends a SIGTRAP.

#!/bin/sh
set -e
for pid in $(find /proc -maxdepth 1 -printf '%f\n' | grep '^[0-9]\+$'); do
    grep -q '^TracerPid:\s[^0]' /proc/$pid/status 2>/dev/null &&
        kill -TRAP $pid
done

This script, now part of my dotfiles, has worked very well so far, and effectively smooths over some debugging differences between Windows and Linux, reducing my context switching mental load. There’s probably a better way to express this script, but that’s the best I could do so far. On the BSDs you’d need to parse the output of ps, though each system seems to do its own thing for distinguishing debuggees.
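As an illustration of the TracerPid check on its own, here is a rough C sketch of the same test the script performs with grep. The function names tracer_pid and pid_is_traced are hypothetical, and the 4kB read assumes the field appears near the top of the status file, as it does on the systems I’ve looked at.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Returns the TracerPid value found in the text of a /proc/<pid>/status
// file, or -1 if the field is missing.
static long tracer_pid(const char *status)
{
    static const char key[] = "TracerPid:";
    const char *line = status;
    while (line && *line) {
        if (!strncmp(line, key, sizeof(key) - 1)) {
            // strtol skips the tab separating the key from the value.
            return strtol(line + sizeof(key) - 1, 0, 10);
        }
        line = strchr(line, '\n');
        if (line) {
            line++;  // step past the newline to the next field
        }
    }
    return -1;
}

// Returns 1 if the process currently has a tracer attached, otherwise 0.
// A process that cannot be inspected is treated as not traced.
static int pid_is_traced(long pid)
{
    char path[64];
    char buf[4096];
    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) {
        return 0;
    }
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = 0;
    return tracer_pid(buf) > 0;
}
```

A full command would still need to walk /proc and send SIGTRAP to each match, which is exactly the part the shell script already handles in a couple of lines.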

A missing feature

I had originally planned for one flag, -k. Rather than breakpoint debuggees, it would terminate all debuggee processes. This is especially important on Windows where debuggee processes block builds due to file locking shenanigans. I’d just run debugbreak -k as part of the build. However, it’s not possible to terminate debuggees paused in the debugger, which is the common situation. I’ve given up on this for now.

The wild west of Windows command line parsing

2022-02-18T03:52:12Z

I’ve been experimenting again lately with writing software without a runtime aside from the operating system itself, both on Linux and Windows. Another way to look at it: I write and embed a bespoke, minimal runtime within the application. One of the runtime’s core jobs is retrieving command line arguments from the operating system. On Windows this is a deeper rabbit hole than I expected, and far more complex than I realized. There is no standard, and every runtime does it a little differently. Five different applications may see five different sets of arguments — even different argument counts — from the same input, and this is before any sort of option parsing. It’s truly a modern day Tower of Babel: “Confound their command line parsing, that they may not understand one another’s arguments.”

Unix-like systems pass the argv array directly from parent to child. On Linux it’s literally copied onto the child’s stack just above the stack pointer on entry. The runtime just bumps the stack pointer address a few bytes and calls it argv. Here’s a minimalist x86-64 Linux runtime in just 6 instructions (22 bytes):

_start: mov   edi, [rsp]     ; argc
        lea   rsi, [rsp+8]   ; argv
        call  main
        mov   edi, eax
        mov   eax, 60        ; SYS_exit
        syscall

It’s 5 instructions (20 bytes) on ARM64:

_start: ldr  w0, [sp]        ; argc
        add  x1, sp, 8       ; argv
        bl   main
        mov  w8, 93          ; SYS_exit
        svc  0

On Windows, argv is passed in serialized form as a string. That’s how MS-DOS did it (via the Program Segment Prefix), because that’s how CP/M did it. It made more sense when processes were mostly launched directly by humans: The string was literally typed by a human operator, and somebody has to parse it after all. Today, processes are nearly always launched by other programs, but despite this, must still serialize the argument array into a string as though a human had typed it out.

Windows itself provides an operating system routine for parsing command line strings: CommandLineToArgvW. Fetch the command line string with GetCommandLineW, pass it to this function, and you have your argc and argv. Plus maybe LocalFree to clean up. It’s only available in “wide” form, so if you want to work in UTF-8 you’ll also need WideCharToMultiByte. It’s around 20 lines of C rather than 6 lines of assembly, but it’s not too bad.

My GetCommandLineW

GetCommandLineW returns a pointer into static storage, which is why it doesn’t need to be freed. More specifically, it comes from the Process Environment Block. This got me thinking: Could I locate this address myself without the API call? First I needed to find the PEB. After some research I found a PEB pointer in the Thread Information Block, itself found via the gs register (x64, fs on x86), an old 386 segment register. Buried in the PEB is a UNICODE_STRING, with the command line string address. I worked out all the offsets for both x86 and x64, and the whole thing is just three instructions:

wchar_t *cmdline_fetch(void)
{
    void *cmd = 0;
    #if __amd64
    __asm ("mov %%gs:(0x60), %0\n"
           "mov 0x20(%0), %0\n"
           "mov 0x78(%0), %0\n"
           : "=r"(cmd));
    #elif __i386
    __asm ("mov %%fs:(0x30), %0\n"
           "mov 0x10(%0), %0\n"
           "mov 0x44(%0), %0\n"
           : "=r"(cmd));
    #endif
    return cmd;
}

From Windows XP through Windows 11, this returns exactly the same address as GetCommandLineW. There’s little reason to do it this way other than to annoy Raymond Chen, but it’s still neat and maybe has some super niche use. Technically some of these offsets are undocumented and/or subject to change, except Microsoft’s own static link CRT also hardcodes all these offsets. It’s easy to find: disassemble any statically linked program, look for the gs register, and you’ll find it using these offsets, too.

If you look carefully at the UNICODE_STRING you’ll see the length is given by a USHORT in units of bytes, despite being a 16-bit wchar_t string. This is the source of Windows’ maximum command line length of 32,767 characters (including terminator).
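The arithmetic behind that limit can be sketched with a stand-in for the structure. The type and function names here are hypothetical, and I’m assuming the terminator must also fit within the 16-bit byte count of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

// Stand-in for the UNICODE_STRING layout (hypothetical name).
typedef struct {
    uint16_t  length;          // bytes in use, excluding the terminator
    uint16_t  maximum_length;  // bytes allocated for the buffer
    uint16_t *buffer;          // UTF-16 code units
} unicode_string_sketch;

// Largest number of UTF-16 code units, terminator included, whose byte
// count still fits in a 16-bit field.
static int max_units_with_terminator(void)
{
    return UINT16_MAX / (int)sizeof(uint16_t);
}
```

That works out to 65,535 / 2 = 32,767, leaving 32,766 characters for the command line itself. The buffer pointer also lands one pointer-width past the start of the structure after padding, which would be consistent with the final load in cmdline_fetch if the string sits at 0x70 (x64) or 0x40 (x86).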

GetCommandLineW is from kernel32.dll, but CommandLineToArgvW is a bit more off the beaten path in shell32.dll. If you wanted to avoid linking to shell32.dll for important reasons, you’d need to do the command line parsing yourself. Many runtimes, including Microsoft’s own CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s messier than I expected, and when I started digging into it I wasn’t expecting it to involve a few days of research.

The CommandLineToArgvW documentation has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there’s something about counting backslashes, but only if they stop on a quote. It’s not quite enough to implement your own, and if you test against it, it’s quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by CommandLineToArgvW, nor is it used by any runtime I could find. It even varies between Microsoft’s own CRTs. There is no canonical command line parsing result, not even a de facto standard.

I eventually came across David Deley’s How Command Line Parameters Are Parsed, which is the closest there is to an authoritative document on the matter. Unfortunately it focuses on runtimes rather than CommandLineToArgvW, and so some of those details aren’t captured. In particular, the first argument (i.e. argv[0]) follows entirely different rules, which really confused me for a while. The Wine documentation was helpful particularly for CommandLineToArgvW. As far as I can tell, they’ve re-implemented it perfectly, matching it bug-for-bug as they do.

My CommandLineToArgvW

Before finding any of this, I started building my own implementation, which I now believe matches CommandLineToArgvW. These other documents helped me figure out what I was missing. In my usual fashion, it’s a little state machine: cmdline.c. The interface:

int cmdline_to_argv8(const wchar_t *cmdline, char **argv);

Unlike the others, mine encodes straight into WTF-8, a superset of UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative lines of code: invisible since it involves not reacting to ill-formed input. If you use the new-ish UTF-8 manifest Win32 feature then your program cannot handle command line strings with ill-formed UTF-16, a problem solved by WTF-8.
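To show what “not reacting to ill-formed input” can mean, here is a small sketch of a UTF-16 to WTF-8 encoder. It is not the code from cmdline.c, and the name wtf8_encode is hypothetical. A proper surrogate pair is combined into a 4-byte sequence, while an unpaired surrogate simply falls through and is encoded as three bytes like any other 16-bit value.

```c
#include <stddef.h>
#include <stdint.h>

// Encodes n UTF-16 code units from src as WTF-8 into dst and returns the
// number of bytes written. dst must have room for 3*n bytes.
static size_t wtf8_encode(unsigned char *dst, const uint16_t *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < n &&
            src[i+1] >= 0xdc00 && src[i+1] <= 0xdfff) {
            // Well-formed pair: combine into one supplementary code point.
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00);
        }
        // No branch for unpaired surrogates: they take the 3-byte path.
        if (c < 0x80) {
            dst[len++] = (unsigned char)c;
        } else if (c < 0x800) {
            dst[len++] = (unsigned char)(0xc0 | (c >> 6));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else if (c < 0x10000) {
            dst[len++] = (unsigned char)(0xe0 | (c >> 12));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        } else {
            dst[len++] = (unsigned char)(0xf0 | (c >> 18));
            dst[len++] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
            dst[len++] = (unsigned char)(0x80 | (c & 0x3f));
        }
    }
    return len;
}
```

A strict UTF-8 encoder would need an extra branch to reject or replace the unpaired surrogate. Leaving that branch out is the whole difference.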

As documented, that argv must be a particular size, a pointer-aligned 224kB (x64) or 160kB (x86) buffer, which covers the absolute worst case. That’s not too bad when the command line is limited to 32,766 UTF-16 characters. The worst case argument is a single long sequence of 3-byte UTF-8. 4-byte UTF-8 requires 2 UTF-16 code units, so there would only be half as many. The worst case argc is 16,383 (plus one more argv slot for the null pointer terminator), which is one argument for each pair of command line characters. The second half (roughly) of the argv is actually used as a char buffer for the arguments, so it’s all a single, fixed allocation. There is no error case since it cannot fail.

int mainCRTStartup(void)
{
    static char *argv[CMDLINE_ARGV_MAX];
    int argc = cmdline_to_argv8(cmdline_fetch(), argv);
    return main(argc, argv);
}
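The buffer sizes quoted above can be reproduced with a little arithmetic. This sketch simply adds the two worst cases from the text; the names are hypothetical and it is not how CMDLINE_ARGV_MAX is necessarily defined in the source.

```c
#include <stddef.h>

enum {
    MAX_UNITS = 32766,          // command line characters, per the text
    MAX_ARGC  = MAX_UNITS / 2,  // 16,383: one argument per two characters
    MAX_BYTES = MAX_UNITS * 3   // every character a 3-byte UTF-8 sequence
};

// Upper bound on the bytes needed for the pointer slots plus the
// character data, given the size of a pointer on the target.
static size_t worst_case_bytes(size_t pointer_size)
{
    size_t slots = (size_t)MAX_ARGC + 1;  // plus the null terminator slot
    return slots*pointer_size + MAX_BYTES;
}
```

With 8-byte pointers that comes to 131,072 + 98,298 = 229,370 bytes, just under 224kB. With 4-byte pointers it is 65,536 + 98,298 = 163,834 bytes, just under 160kB.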

Also: Note the FUZZ option in my source. It has been pretty thoroughly fuzz tested. It didn’t find anything, but it does make me more confident in the result.

I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.

Building a command line string

I’ve always been boggled as to why there’s no complementary inverse to CommandLineToArgvW. When spawning processes with arbitrary arguments, everyone is left to implement the inverse of this under-specified and non-trivial command line format to serialize an argv. Hopefully the receiver parses it compatibly! There’s no falling back on a system routine to help out. This has led to a lot of repeated effort: it’s not limited to high level runtimes, but almost any extensible application (itself a kind of runtime). Fortunately serializing is not quite as complex as parsing since many of the edge cases simply don’t come up if done in a straightforward way.

Naturally, I also wrote my own implementation (same source):

int cmdline_from_argv8(wchar_t *cmdline, int len, char **argv);

Like before, it accepts a WTF-8 argv, meaning it can correctly pass through ill-formed UTF-16 arguments. It returns the actual command line length. Since this one can fail when argv is too large, it returns zero for an error.

char *argv[] = {"python.exe", "-c", code, 0};
wchar_t cmd[CMDLINE_CMD_MAX];
if (!cmdline_from_argv8(cmd, CMDLINE_CMD_MAX, argv)) {
    return "argv too large";
}
if (!CreateProcessW(0, cmd, /*...*/)) {
    return "CreateProcessW failed";
}
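To give a feel for what serializing one argument involves, here is a sketch of the widely used quoting scheme: wrap the argument in double quotes when needed, double any backslashes that precede a quote, and escape the quote itself. This is not cmdline_from_argv8. The name quote_arg is hypothetical, it works on plain bytes rather than producing UTF-16, and I’m assuming the receiver follows the backslash-and-quote convention of CommandLineToArgvW and the modern Microsoft CRTs for arguments after argv[0].

```c
#include <stddef.h>
#include <string.h>

// Writes arg, quoted for a Windows command line, into out (capacity cap,
// terminator included). Returns the length written, or 0 if it might not
// fit.
static size_t quote_arg(char *out, size_t cap, const char *arg)
{
    size_t n = strlen(arg);
    if (cap < 2*n + 3) {
        return 0;  // conservative: assume every character doubles
    }
    if (n && !strpbrk(arg, " \t\"")) {
        memcpy(out, arg, n + 1);  // nothing special, copy as-is
        return n;
    }

    size_t len = 0;
    out[len++] = '"';
    for (size_t i = 0;; i++) {
        size_t slashes = 0;
        while (arg[i] == '\\') {
            slashes++;
            i++;
        }
        if (!arg[i]) {
            // Trailing backslashes precede the closing quote: double them.
            for (size_t j = 0; j < 2*slashes; j++) out[len++] = '\\';
            break;
        } else if (arg[i] == '"') {
            // Double the backslashes, then one more to escape the quote.
            for (size_t j = 0; j < 2*slashes + 1; j++) out[len++] = '\\';
            out[len++] = '"';
        } else {
            // Backslashes not followed by a quote are literal.
            for (size_t j = 0; j < slashes; j++) out[len++] = '\\';
            out[len++] = arg[i];
        }
    }
    out[len++] = '"';
    out[len] = 0;
    return len;
}
```

A successful result is never zero since even the empty argument serializes to a pair of quotes. Note this says nothing about cmd.exe metacharacters, which are a separate layer of quoting entirely.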

How do others handle this?

The aged Emacs implementation is written in C rather than Lisp, steeped in history with vestigial wrong turns. Emacs still only calls the “narrow” CreateProcessA despite having every affordance to do otherwise, and uses the wrong encoding at that. A personal source of headaches.
CPython uses Python rather than C via subprocess.list2cmdline. While undocumented, it’s accessible on any platform and easy to test against various inputs. Try it out!
Go (gc) is just as delightfully boring as I’d expect.
OpenJDK optimistically optimizes for command line strings under 80 bytes, and like Emacs, displays the weathering of long use.

I don’t plan to write a language implementation anytime soon, where this might be needed, but it’s nice to know I’ve already solved this problem for myself!

Some sanity for C and C++ development on Windows

2021-12-30T23:25:53Z

A hard reality of C and C++ software development on Windows is that there has never been a good, native C or C++ standard library implementation for the platform. A standard library should abstract over the underlying host facilities in order to ease portable software development. On Windows, C and C++ are so poorly hooked up to operating system interfaces that most portable or mostly-portable software, meaning programs which work perfectly elsewhere, is subtly broken on Windows, particularly outside of the English-speaking world. The reasons are almost certainly political, originally motivated by vendor lock-in, rather than technical, which adds insult to injury. This article is about what’s wrong, how it’s wrong, and some easy techniques to deal with it in portable software.

There are multiple C implementations, so how could they all be bad, even the early ones? Microsoft’s C runtime has defined how the standard library should work on the platform, and everyone else followed along for the sake of compatibility. I’m excluding Cygwin and its major fork, MSYS2, even though they don’t inherit any of these flaws. They change so much that they’re effectively whole new platforms, not truly “native” to Windows.

In practice, C++ standard libraries are implemented on top of a C standard library, which is why C++ shares the same problems. CPython dodges these issues: Though written in C, on Windows it bypasses the broken C standard library and directly calls the proprietary interfaces. Other language implementations, such as “gc” Go, simply aren’t built on C at all, and instead do things correctly in the first place: the behaviors the C runtimes should have had all along.

If you’re just working on one large project, bypassing the C runtime isn’t such a big deal, and you’re likely already doing so to access important platform functionality. You don’t really even need a C runtime. However, if you write many small programs, as I do, writing the same special Windows support for each one ends up being most of the work, and honestly makes properly supporting Windows not worth the trouble. I end up just accepting the broken defaults most of the time.

Before diving into the details, if you’re looking for a quick-and-easy solution for the Mingw-w64 toolchain, including w64devkit, which magically makes your C and C++ console programs behave well on Windows, I’ve put together a “library” named libwinsane. It solves all problems discussed in this article, except for one. No source changes required, simply link it into your program.

What exactly is broken?

The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix, and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy API, where an active code page maps 256 bytes onto (up to) 256 specific characters. On typical machines configured for European languages, this means code page 1252. Roughly speaking, Windows internally uses UTF-16, and calls through the narrow interface use the active code page to translate the narrow strings to wide strings. The result is that calls through the narrow API have limited access to the system.

The UTF-8 encoding was invented in 1992 and standardized by January 1993. UTF-8 was adopted by the unix world over the following years due to its backwards-compatibility with its existing interfaces. Programs could read and write Unicode data, access Unicode paths, pass Unicode arguments, and get and set Unicode environment variables without needing to change anything. Today UTF-8 has become the dominant text encoding format in the world, in large part due to the world wide web.

In July 1993, Microsoft introduced the wide Windows API with the release of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather than UTF-8. This turned out to be a mistake, since UTF-16 is inferior to UTF-8 in practically every way, though admittedly some problems weren’t so obvious at the time.

The major problem: The C and C++ standard libraries only hook up to the narrow Windows interfaces. The standard library, and therefore typical portable software on Windows, cannot handle anything but ASCII. The effective result is that these programs:

Cannot accept non-ASCII arguments
Cannot get/set non-ASCII environment variables
Cannot access non-ASCII paths
Cannot read and write non-ASCII on a console

Doing any of these requires calling proprietary functions, treating Windows as a special target. It’s part of what makes correctly porting software to Windows so painful.

The sensible solution would have been for the C runtime to speak UTF-8 and connect to the wide API. Alternatively, the narrow API could have been changed over to UTF-8, phasing out the old code page concept. In theory this is what the UTF-8 “code page” is about, though it doesn’t always work. There would have been compatibility problems with abruptly making such a change, but until very recently, this wasn’t even an option. Why couldn’t there be a switch I could flip to get sane behavior that works like every other platform?

How to mostly fix Unicode support

In 2019, Microsoft introduced a feature to allow programs to request UTF-8 as their active code page on start, along with supporting UTF-8 on more narrow API functions. This is like the magic switch I wanted, except that it involves embedding some ugly XML into your binary in a particular way. At least it’s now an option.

For Mingw-w64, that means writing a resource file like so:

#include <winuser.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"

Compiling it with windres:

$ windres -o manifest.o manifest.rc

Then linking that into your program. Amazingly it mostly works! Programs can access Unicode arguments, Unicode environment variables, and Unicode paths, including with fopen, just as it’s worked on other platforms for decades. Since the active code page is set at load time, it happens before argv is constructed (from GetCommandLineA), which is why that works out.

Alternatively you could create a “side-by-side assembly” placing that XML in a file with the same name as your EXE but with .manifest suffix (after the .exe suffix), then placing that next to your EXE. Just be mindful that there’s a “side-by-side” cache (WinSxS), and so it might not immediately pick up your changes.

What doesn’t work is console input and output since the console is external to the process, and so isn’t covered by the process’s active code page. It must be configured separately using a proprietary call:

SetConsoleOutputCP(CP_UTF8);

Annoying, but at least it’s not that painful. This only covers output, though, meaning programs can only print UTF-8. Unfortunately UTF-8 input still doesn’t work, and setting the input code page doesn’t do anything despite reporting success:

SetConsoleCP(CP_UTF8);  // doesn't work

If you care about reading interactive Unicode input, you’re stuck bypassing the C runtime since it’s still broken.

Text stream translation

Another long-standing issue is that C and C++ on Windows have distinct “text” and “binary” streams, which they inherited from DOS. Mainly this means automatic newline conversion between CRLF and LF. The C standard explicitly allows for this, though unix-like platforms have never actually distinguished between text and binary streams.

The standard also specifies that standard input, output, and error are all open as text streams, and there’s no portable method to change the stream mode to binary — a serious deficiency with the standard. On unix-likes this doesn’t matter, but on Windows it means programs can’t read or write binary data on standard streams without calling a non-standard function. It also means reading and writing standard streams is slow, frequently a bottleneck unless I route around it.

Personally, I like writing binary data to standard output, including video, and sometimes binary filters that also read binary input. I do it so often that in probably half my C programs I have this snippet in main just so they work correctly on Windows:

    #ifdef _WIN32
    int _setmode(int, int);
    _setmode(0, 0x8000);
    _setmode(1, 0x8000);
    #endif

That incantation sets standard input and output in the C runtime to binary mode without the need to include a header, making it compact, simple, and self-contained.
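For programs that don’t mind the extra headers, the same idea can be spelled with named constants. This is only a sketch: the function name stdio_set_binary is hypothetical, and it assumes a CRT providing _setmode, _fileno, and _O_BINARY (the 0x8000 above) through io.h, stdio.h, and fcntl.h.

```c
#include <stdio.h>
#ifdef _WIN32
#  include <fcntl.h>
#  include <io.h>
#endif

// Switches standard input and output to binary mode where the distinction
// exists. Returns 0 on success and -1 on failure. On other platforms there
// is nothing to do.
static int stdio_set_binary(void)
{
#ifdef _WIN32
    if (_setmode(_fileno(stdin), _O_BINARY) == -1) {
        return -1;
    }
    if (_setmode(_fileno(stdout), _O_BINARY) == -1) {
        return -1;
    }
#endif
    return 0;
}
```

It’s longer and drags in two Windows-specific headers, which is why I tend to prefer the compact version.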

This built-in newline translation, along with the Windows standard text editor, Notepad, lagging decades behind, meant that many other programs, including Git, grew their own, annoying, newline conversion misfeatures that cause other problems.

libwinsane

I introduced libwinsane at the beginning of the article, which fixes all this simply by being linked into a program. It includes the magic XML manifest .rsrc section, configures the console for UTF-8 output, and sets standard streams to binary before main (via a GCC constructor). I called it a “library”, but it’s actually a single object file. It can’t be a static library since it must be linked into the program despite not actually being referenced by the program.

So normally this program:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *arg = argv[argc-1];
    size_t len = strlen(arg);
    printf("%zu %s\n", len, arg);
}

Compiled and run:

C:\>cc -o example example.c
C:\>example π
1 p

As usual, the Unicode argument is silently mangled into one byte. Linked with libwinsane, it just works like everywhere else:

C:\>gcc -o example example.c libwinsane.o
C:\>example π
2 π

If you’re maintaining a substantial program, you probably want to copy and integrate the necessary parts of libwinsane into your project and build, rather than always link against this loose object file. This is more for convenience and for succinctly capturing the concept. You may even want to enable ANSI escape processing in your version.

Update December 2024: Pavel Galkin demonstrates how libwinsane.o changes the console state, which affects all processes associated with the terminal. This is mostly unavoidable, and it’s one reason I’ve since concluded that UTF-8 manifests are a poor solution. Better to solve the problem using a platform layer.

More DLL fun with w64devkit: Go, assembly, and Python

2021-06-29T21:50:30Z

My previous article explained how to work with dynamic-link libraries (DLLs) using w64devkit. These techniques also apply to other circumstances, including with languages and ecosystems outside of C and C++. In particular, w64devkit is a great complement to Go and reliably fulfills all the needs of cgo (Go’s C interop) and can even bootstrap Go itself. As before, this article is in large part an exercise in capturing practical information I’ve picked up over time.

Go: bootstrap and cgo

The primary Go implementation, confusingly named “gc”, is an incredible piece of software engineering. This is apparent when building the Go toolchain itself, a process that is fast, reliable, easy, and simple. It was originally written in C, but was re-written in Go starting with Go 1.5. The C compiler in w64devkit can build the original C implementation which then can be used to bootstrap any more recent version. It’s so easy that I personally never use official binary releases and always bootstrap from source.

You will need the Go 1.4 source, go1.4-bootstrap-20171003.tar.gz. This “bootstrap” tarball is the last Go 1.4 release plus a few additional bugfixes. You will also need the source of the actual version of Go you want to use, such as Go 1.16.5 (latest version as of this writing).

Start by building Go 1.4 using w64devkit. On Windows, Go is built using a batch script and no special build system is needed. Since it shouldn’t be invoked with the BusyBox ash shell, I use cmd.exe explicitly.

$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ && cmd /c make)

In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use it to build the desired toolchain. You can move this new toolchain after it’s built if necessary.

$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ && cmd /c make)

At this point you can delete the bootstrap toolchain. You probably also want to put Go on your PATH.

$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" >>~/.profile
$ source ~/.profile

Not only is Go now available, so is the full power of cgo. (Including its costs if used.)

Vim suggestions

Since w64devkit is oriented so much around Vim, here’s my personal Vim configuration for Go. I don’t need or want fancy plugins, just access to goimports and a couple of corrections to Vim’s built-in Go support ([[ and ]] navigation). The included ctags understands Go, so tags navigation works the same as it does with C. \i saves the current buffer, runs goimports, and populates the quickfix list with any errors. Similarly :make invokes go build and, as expected, populates the quickfix list.

autocmd FileType go setlocal makeprg=go\ build
autocmd FileType go map <silent> <buffer> <leader>i
    \ :update \|
    \ :cexpr system("goimports -w " . expand("%")) \|
    \ :silent edit<cr>
autocmd FileType go map <buffer> [[
    \ ?^\(func\\|var\\|type\\|import\\|package\)\><cr>
autocmd FileType go map <buffer> ]]
    \ /^\(func\\|var\\|type\\|import\\|package\)\><cr>

Go only comes with gofmt but goimports is just one command away, so there’s little excuse not to have it:

$ go install golang.org/x/tools/cmd/goimports@latest

Thanks to GOPROXY, all Go dependencies are accessible without (or before) installing Git, so this tool installation works with nothing more than w64devkit and a bootstrapped Go toolchain.

cgo DLLs

The intricacies of cgo are beyond the scope of this article, but the gist is that a Go source file contains C source in a comment followed by import "C". The imported C object provides access to C types and functions. Go functions marked with an //export comment, as well as the commented C code, are accessible to C. The latter means we can use Go to implement a C interface in a DLL, and the caller will have no idea they’re actually talking to Go.

To illustrate, here’s a little C interface. To keep it simple, I’ve specifically sidestepped some more complicated issues, particularly involving memory management.

// Which DLL am I running?
int version(void);

// Generate 64 bits from a CSPRNG.
unsigned long long rand64(void);

// Compute the Euclidean norm.
float dist(float x, float y);

Here’s a C implementation which I’m calling “version 1”.

#include <math.h>
#include <windows.h>
#include <ntsecapi.h>

__declspec(dllexport)
int
version(void)
{
    return 1;
}

__declspec(dllexport)
unsigned long long
rand64(void)
{
    unsigned long long x;
    RtlGenRandom(&x, sizeof(x));
    return x;
}

__declspec(dllexport)
float
dist(float x, float y)
{
    return sqrtf(x*x + y*y);
}

As discussed in the previous article, each function is exported using __declspec so that they’re available for import. As before:

$ cc -shared -Os -s -o hello1.dll hello1.c

Side note: This could be trivially converted into a C++ implementation just by adding extern "C" to each declaration. It disables C++ features like name mangling, and follows the C ABI so that the C++ functions appear as C functions. Compiling the C++ DLL is exactly the same.

Suppose we wanted to implement this in Go instead of C. We already have all the tools needed to do so. Here’s a Go implementation, “version 2”:

package main

import "C"
import (
	"crypto/rand"
	"encoding/binary"
	"math"
)

//export version
func version() C.int {
	return 2
}

//export rand64
func rand64() C.ulonglong {
	var buf [8]byte
	rand.Read(buf[:])
	r := binary.LittleEndian.Uint64(buf[:])
	return C.ulonglong(r)
}

//export dist
func dist(x, y C.float) C.float {
	return C.float(math.Sqrt(float64(x*x + y*y)))
}

func main() {
}

Note the use of C types for all arguments and return values. The main function is required since this is the main package, but it will never be called. The DLL is built like so:

$ go build -buildmode=c-shared -o hello2.dll hello2.go

Without the -o option, the DLL will lack an extension. This works fine since it’s mostly only convention on Windows, but it may be confusing without it.

What if we need an import library? This will be required when linking with the MSVC toolchain. In the previous article we asked Binutils to generate one using --out-implib. For Go we have to handle this ourselves via gendef and dlltool.

$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def

The only way anyone upgrading would know version 2 was implemented in Go is that the DLL is a lot bigger (a few MB vs. a few kB) since it now contains an entire Go runtime.

NASM assembly DLL

We could also go the other direction and implement the DLL using plain assembly. It won’t even require linking against a C runtime.

w64devkit includes two assemblers: GAS (Binutils) which is used by GCC, and NASM which has friendlier syntax. I prefer the latter whenever possible — exactly why I included NASM in the distribution. So here’s how I implemented “version 3” in NASM assembly.

bits 64

section .text

global DllMainCRTStartup
export DllMainCRTStartup
DllMainCRTStartup:
	mov eax, 1
	ret

global version
export version
version:
	mov eax, 3
	ret

global rand64
export rand64
rand64:
	rdrand rax
	ret

global dist
export dist
dist:
	mulss  xmm0, xmm0
	mulss  xmm1, xmm1
	addss  xmm0, xmm1
	sqrtss xmm0, xmm0
	ret

The global directive is common in NASM assembly and causes the named symbol to have the external linkage needed when linking the DLL. The export directive is Windows-specific and is equivalent to dllexport in C.

Every DLL must have an entrypoint, usually named DllMainCRTStartup. The return value indicates if the DLL successfully loaded. So far this has been handled automatically by the C implementation, but at this low level we must define it explicitly.
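For comparison, here is a minimal sketch of the same do-nothing entry point written in C, as one might for a DLL built without the C runtime's startup code (typically by also passing -nostartfiles to GCC). The stand-in typedefs in the #else branch are my own assumption and exist only so that the sketch compiles outside of Windows; on Windows the real definitions come from windows.h.

```c
#if defined(_WIN32)
#  include <windows.h>
#else
/* Stand-ins (assumptions) so this sketch also compiles off Windows. */
typedef int            BOOL;
typedef unsigned long  DWORD;
typedef void          *HINSTANCE;
typedef void          *LPVOID;
#  define WINAPI
#  define TRUE 1
#  define DLL_PROCESS_ATTACH 1
#endif

/* Raw DLL entry point. The loader calls it when the DLL is attached to
 * or detached from a process or thread. Returning TRUE on attach tells
 * the loader that the DLL loaded successfully, like "mov eax, 1". */
BOOL WINAPI DllMainCRTStartup(HINSTANCE dll, DWORD reason, LPVOID reserved)
{
    (void)dll;
    (void)reason;
    (void)reserved;
    return TRUE;
}
```

The three parameters are unused here, just as the assembly version ignores its argument registers.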

Here’s how to assemble and link the DLL:

$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o

Call the DLLs from Python

Python has a nice, built-in C interop, ctypes, that allows Python to call arbitrary C functions in shared libraries, including DLLs, without writing C to glue it together. To tie this all off, here’s a Python program that loads all of the DLLs above and invokes each of the functions:

import ctypes

def load(version):
    hello = ctypes.CDLL(f"./hello{version}.dll")
    hello.version.restype = ctypes.c_int
    hello.version.argtypes = ()
    hello.dist.restype = ctypes.c_float
    hello.dist.argtypes = (ctypes.c_float, ctypes.c_float)
    hello.rand64.restype = ctypes.c_ulonglong
    hello.rand64.argtypes = ()
    return hello

for hello in load(1), load(2), load(3):
    print("version", hello.version())
    print("rand   ", f"{hello.rand64():016x}")
    print("dist   ", hello.dist(3, 4))

After loading the DLL with CDLL the program defines each function prototype so that Python knows how to call it. Unfortunately it’s not possible to build Python with w64devkit, so you’ll also need to install the standard CPython distribution in order to run it. Here’s the output:

$ python finale.py
version 1
rand    b011ea9bdbde4bdf
dist    5.0
version 2
rand    f7c86ff06ae3d1a2
dist    5.0
version 3
rand    2a35a05b0482c898
dist    5.0

That output is the result of four different languages interfacing in one process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!

How to build and use DLLs on Windows

2021-05-31T02:13:40Z

I’ve recently been involved with a couple of discussions about Windows’ dynamic linking. One was with Joe Nelson, who was considering how to make libderp accessible on Windows, and the other was about w64devkit, my Mingw-w64 distribution. I use these techniques so infrequently that I need to figure it all out again each time I need it. Unfortunately there’s a whole lot of outdated and incorrect information online which gets in the way every time this happens. While it’s all fresh in my head, I will now document what I know works.

In this article, all commands and examples are being run in the context of w64devkit (1.8.0).

Mingw-w64

If all you care about is the GNU toolchain then DLLs are straightforward, working mostly like shared objects on other platforms. To illustrate, let’s build a “square” library with one “exported” function, square, that returns the square of its input (square.c):

long square(long x)
{
    return x * x;
}

The header file (square.h):

#ifndef SQUARE_H
#define SQUARE_H

long square(long);

#endif

To build a stripped, size-optimized DLL, square.dll:

$ cc -shared -Os -s -o square.dll square.c

Now a test program to link against it (main.c), which “imports” square from square.dll:

#include <stdio.h>
#include "square.h"

int main(void)
{
    printf("%ld\n", square(2));
}

Linking and testing it:

$ cc -Os -s main.c square.dll
$ ./a
4

It’s that simple. Or more traditionally, using the -l flag:

$ cc -Os -s -L. main.c -lsquare

Given -lxyz GCC will look for xyz.dll in the library path.

Viewing exported symbols

Printing a list of the functions a DLL exports is not so straightforward. For ELF shared objects there’s nm -D, but despite what the internet will tell you, this tool does not support DLLs. objdump will print the exports as part of the “private” headers (-p). A bit of awk can cut this down to just a list of exports. Since we’ll need this a few times, here’s a script, exports.sh, that composes objdump and awk into the tool I want:

#!/bin/sh
set -e
printf 'LIBRARY %s\nEXPORTS\n' "$1"
objdump -p "$1" | awk '/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'

Running this on square.dll above:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square

This can be helpful when debugging. It also works outside of Windows, such as on Linux. By the way, the output format is no accident: This is the .def file format, which will be particularly useful in a moment.

Mingw-w64 has a gendef tool to produce the above output, and this tool is now included in w64devkit. To print the exports to standard output:

$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square

Alternatively Visual Studio provides dumpbin. It’s not as concise as exports.sh but it’s a lot less verbose than objdump -p.

$ dumpbin /nologo /exports square.dll
...
          1    0 000012B0 square
...

Mingw-w64 (improved)

You can get by without knowing anything more, which is usually enough for those looking to support Windows as a secondary platform, even just as a cross-compilation target. However, with a bit more work we can do better. Imagine doing the above with a non-trivial program. GCC doesn’t know which functions are part of the API and which are not. Obviously static functions should not be exported, but what about non-static functions visible between translation units (i.e. object files)?

For instance, suppose square.c also has this function which is not part of its API but may be called by another translation unit.

void internal_func(void) {}

Now when I build and list the exports:

$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square

On the other side, when I build main.c how does it know which functions are imported from a DLL and which will be found in another translation unit? GCC makes it work regardless, but it can generate more efficient code if it knows at compile time (vs. link time).

On Windows both are solved by adding __declspec notation on both sides. In square.c the exports are marked as dllexport:

__declspec(dllexport)
long square(long x)
{
    return x * x;
}

void internal_func(void) {}

In the header, it’s marked as an import:

__declspec(dllimport)
long square(long);

The mere presence of dllexport tells the linker to only export those functions marked as exports, and so internal_func disappears from the exports list. Convenient!

On the import side, during compilation of the original program, GCC assumed square wasn’t an import and generated a local function call. When the linker later resolved the symbol to the DLL, it generated a trampoline to fill in as that local function (like a PLT). With dllimport, GCC knows it’s an imported function and so doesn’t go through a trampoline.
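As a rough mental model of those two call paths, here is a portable C sketch. It does not reproduce what GCC and ld literally emit. All names are hypothetical: imp_square stands in for the import table slot that the loader fills in, and square_thunk stands in for the linker-generated trampoline.

```c
/* The function as it exists inside the DLL. */
static long dll_square(long x)
{
    return x * x;
}

/* Stand-in for the import table slot. On Windows the loader would
 * store the function's real address here when the DLL is loaded. */
static long (*imp_square)(long) = dll_square;

/* Without dllimport the compiler emits a call to an ordinary function,
 * so the linker supplies a trampoline that jumps through the slot. */
static long square_thunk(long x)
{
    return imp_square(x);
}

/* Call site compiled without dllimport: direct call, then a jump. */
long call_without_dllimport(long x)
{
    return square_thunk(x);
}

/* Call site compiled with dllimport: one indirect call via the slot. */
long call_with_dllimport(long x)
{
    return imp_square(x);
}
```

Both paths compute the same result. The difference is only the extra hop through the trampoline.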

While generally unnecessary for the GNU toolchain, it’s good hygiene to use __declspec. It’s also mandatory when using MSVC, in case you care about that as well.

MSVC

Mingw-w64-compiled DLLs will work with LoadLibrary out of the box, which is sufficient in many cases, such as for dynamically-loaded plugins. For example (loadlib.c):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE h = LoadLibrary("square.dll");
    long (*square)(long) = GetProcAddress(h, "square");
    printf("%ld\n", square(2));
}

Compiled with MSVC cl (via vcvars.bat):

$ cl /nologo loadlib.c
$ ./loadlib
4

However, the MSVC linker, unlike Binutils ld, cannot link directly with DLLs. It requires an import library. Conventionally this matches the DLL name but has a .lib extension — square.lib in this case. The Mingw-w64 ecosystem conventionally uses .dll.a, as in square.dll.a, in order to distinguish it from a static library, but it’s the same format. The most convenient way to get an import library is to ask GCC to generate one at link-time via --out-implib:

$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c

Back to cl, just add square.lib as another input. You don’t actually need square.dll present at link time.

$ cl /nologo /Os main.c square.lib
$ ./main
4

What if you already have the DLL and you just need an import library? GNU Binutils’ dlltool can do this, though not without help. It cannot generate an import library from a DLL alone since it requires a .def file enumerating the exports. (Why?) What luck that we have a tool for this!

$ ./exports.sh square.dll >square.def
$ dlltool --input-def square.def --output-lib square.lib

Reversing directions

Going the other way, building a DLL with MSVC and linking it with Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it requires that all exports are tagged with dllexport. The /LD option (case sensitive) is just like GCC’s -shared.

$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4

cl outputs three files: square.dll, square.lib, and square.exp. The last can be discarded, and the second will be needed if linking with MSVC, but as before, Mingw-w64 requires only the first.

This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at least for C interfaces that don’t share CRT objects.
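That caveat about CRT objects deserves an example. A DLL and its caller may each use a different C runtime, so memory from one side's malloc should not be handed to the other side's free, and the same goes for FILE pointers. One common way to stay within that limit is for the library to export both the allocating function and the matching release function. Here is a minimal sketch under that assumption. The names square_table_new and square_table_free are hypothetical and not part of the library above, and in a real DLL each would also be marked as an export.

```c
#include <stdlib.h>

/* Hypothetical API: returns a table of the first n squares, allocated
 * by the library's own C runtime, or NULL if n <= 0 or out of memory. */
long *square_table_new(int n)
{
    if (n <= 0) {
        return NULL;
    }
    long *table = malloc(sizeof(*table) * (size_t)n);
    for (int i = 0; table && i < n; i++) {
        table[i] = (long)i * i;
    }
    return table;
}

/* Hypothetical API: releases a table with the same runtime that
 * allocated it. Callers use this instead of their own free(). */
void square_table_free(long *table)
{
    free(table);
}
```

The caller never needs to know which runtime is behind the table, so it works the same whichever toolchain built each side.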

Tying it all together

If your program is designed to be portable, those __declspec will get in the way. That can be tidied up with some macros, but even better, those macros can be used to control ELF symbol visibility so that the library has good hygiene on, say, Linux as well.

The strategy will be to mark all API functions with SQUARE_API and expand that to whatever is necessary at the time. When building a library, it will expand to dllexport, or default visibility on unix-likes. When consuming a library it will expand to dllimport, or nothing outside of Windows. The new square.h:

#ifndef SQUARE_H
#define SQUARE_H

#if defined(SQUARE_BUILD)
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllexport)
#  elif defined(__ELF__)
#    define SQUARE_API __attribute__ ((visibility ("default")))
#  else
#    define SQUARE_API
#  endif
#else
#  if defined(_WIN32)
#    define SQUARE_API __declspec(dllimport)
#  else
#    define SQUARE_API
#  endif
#endif

SQUARE_API
long square(long);

#endif

The new square.c:

#define SQUARE_BUILD
#include "square.h"

SQUARE_API
long square(long x)
{
    return x * x;
}

main.c remains the same. When compiling on unix-like systems, add -fvisibility=hidden to hide all symbols by default so that this macro can reveal them.

$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4

Makefile ideas

While Mingw-w64 hides a lot of the differences between Windows and unix-like systems, when it comes to dynamic libraries it can only do so much, especially if you care about import libraries. If I were maintaining a dynamic library — unlikely since I strongly prefer embedding or static linking — I’d probably just use different Makefiles per toolchain and target. Aside from the SQUARE_API type of macros, the source code can fortunately remain fairly agnostic about it.

Here’s what I might use as NMakefile for MSVC nmake:

CC     = cl /nologo
CFLAGS = /Os

all: main.exe square.dll square.lib

main.exe: main.c square.h square.lib
	$(CC) $(CFLAGS) main.c square.lib

square.dll: square.c square.h
	$(CC) /LD $(CFLAGS) square.c

square.lib: square.dll

clean:
	-del /f main.exe square.dll square.lib square.exp

Usage:

nmake /nologo /f NMakefile

For w64devkit and cross-compiling, Makefile.w64, which includes import library generation for the sake of MSVC consumers:

CC      = cc
CFLAGS  = -Os
LDFLAGS = -s
LDLIBS  =

all: main.exe square.dll square.lib

main.exe: main.c square.dll square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c square.dll $(LDLIBS)

square.dll: square.c square.h
	$(CC) -shared -Wl,--out-implib,$(@:dll=lib) \
	    $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

square.lib: square.dll

clean:
	rm -f main.exe square.dll square.lib

Usage:

make -f Makefile.w64

And a Makefile for everyone else:

CC      = cc
CFLAGS  = -Os -fvisibility=hidden
LDFLAGS = -s
LDLIBS  =

all: main libsquare.so

main: main.c libsquare.so square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c ./libsquare.so $(LDLIBS)

libsquare.so: square.c square.h
	$(CC) -shared $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

clean:
	rm -f main libsquare.so

Now that I have this article, I’m glad I won’t have to figure this all out again next time I need it!

A guide to Windows application development using w64devkit

2021-03-11T01:40:31Z

There’s a trend of building services where a monolithic application is better suited, or using JavaScript and Python then being stumped by their troublesome deployment story. This leads to solutions like bundling an entire web browser with an application, or using containers to circumscribe a sprawling dependency tree made of mystery meat.

My small development distribution for Windows, w64devkit, is my own little way of pushing back against this trend where it affects me most. Following in the footsteps of projects like Handmade Hero and Making a Video Game from Scratch, this is my guide to no-nonsense software development using my development kit. It’s an overview of the tooling and development workflow, and I’ve tried not to assume too much knowledge of the reader. Being a guide rather than a manual, it is incomplete on its own, and I link to substantial external resources to fill in the gaps. The guide is capped with a small game I wrote entirely using my development kit, serving as a demonstration of what sorts of things are not only possible, but quite reasonably attainable.

Game repository: https://github.com/skeeto/asteroids-demo
Guide to source: Understanding Asteroids

Initial setup

Of course you cannot use the development kit if you don’t have it yet. Go to the releases section and download the latest release. It will be a .zip file named w64devkit-x.y.z.zip where x.y.z is the version.

You will need to unzip the development kit before using it. Windows has built-in support for .zip files, so you can either right-click to access “Extract All…” or navigate into it as a folder then drag-and-drop the w64devkit directory somewhere outside the .zip file. It doesn’t care where it’s unzipped (aka it’s “portable”), so put it wherever is convenient: your desktop, user profile directory, a thumb drive, etc. You can move it later if you change your mind just so long as you’re not actively running it. If you decide you don’t need it anymore then delete it.

Entering the development environment

There is a w64devkit.exe in the unzipped w64devkit directory. This is the easiest way to enter the development environment, and will not require system configuration changes. This program puts the kit’s programs in the PATH environment variable then runs a Bourne shell — the standard unix shell. Aside from the text editor, this is the primary interface for developing software. In time you may even extend this environment with your own tools.

If you want an additional “terminal” window, run w64devkit.exe again. If you use it a lot, you may want to create a shortcut and even pin it to your task bar.

Whether on Windows or unix-like systems, when you type a command into the system shell it uses the PATH environment variable to locate the actual program to run for that command. In practice, the PATH variable is a concatenation of multiple directories, and the shell searches these directories in order. On unix-like systems, PATH elements are separated by colons. However, Windows uses colons to delimit drive letters, so its PATH elements are separated by semicolons.

# Prepending to PATH on unix
PATH="$HOME/bin:$PATH"

# Prepending to PATH on Windows (w64devkit)
PATH="$HOME/bin;$PATH"
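To make the search order and the separator issue concrete, here is a minimal C sketch that picks out one element of a PATH-style list. The function name path_element is hypothetical. A real shell does more, such as handling empty elements and trying executable extensions.

```c
#include <stddef.h>
#include <string.h>

/* Copies the index-th element (counting from zero) of a PATH-style list
 * into buf. sep is ':' on unix-like systems and ';' on Windows.
 * Returns 1 on success, or 0 if there is no such element or it does
 * not fit in buf along with its null terminator. */
int path_element(const char *path, char sep, int index,
                 char *buf, size_t len)
{
    for (int i = 0;; i++) {
        const char *end = strchr(path, sep);
        size_t n = end ? (size_t)(end - path) : strlen(path);
        if (i == index) {
            if (n >= len) {
                return 0;
            }
            memcpy(buf, path, n);
            buf[n] = 0;
            return 1;
        }
        if (!end) {
            return 0;  /* ran out of elements */
        }
        path = end + 1;
    }
}
```

Splitting "C:/w64devkit/bin;C:/Windows" on ';' yields the two directories in search order. Splitting the same string on ':' would instead yield a lone "C" as the first element, which is why Windows cannot use the unix separator.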

For more advanced users: Rather than use w64devkit.exe, you could “Edit environment variables for your account” and manually add w64devkit’s bin directory to your PATH, making the tools generally available everywhere on your system. If you’ve gone this route, you can start a Bourne shell at any time with sh -l. (The -l option requests a login shell.)

Also borrowed from the unix world is the concept of a home directory, specified by the HOME environment variable. By default this will be your user profile directory, typically C:/Users/$USER. Login shells always start in the home directory. This directory is often indicated by tilde (~), and many programs automatically expand a leading tilde to the home directory.

Shell basics

The shell is a command interpreter. It’s named such because it was originally a shell around the operating system kernel — the user interface to the kernel. Your system’s graphical interface — Windows Explorer, or Explorer.exe — is really just a kind of shell, too. That shell is oriented around the mouse and graphics. This is fine for some tasks, but a keyboard-oriented command shell is far better suited for development tasks. It’s more efficient, but more importantly its features are composable: Complex operations and processes can be constructed from simple, easy-to-understand tools. Embrace it!

In the shell you can navigate between directories with cd, make directories with mkdir, remove files with rm, run regular expression text searches with grep, etc. Run busybox to see a listing of the available standard commands. Unfortunately there are no manual pages, but you can access basic usage information for any command with busybox CMD --help.

Windows’ standard command shell is cmd.exe. Unfortunately this shell is terrible and exists mostly for legacy compatibility. The intended replacement is PowerShell for users who regularly use a shell. However, PowerShell is fundamentally broken, does virtually everything incorrectly, and manages to be even worse than cmd.exe. Besides, sticking to POSIX shell conventions significantly improves build portability, and unix tool knowledge is transferable to basically every other operating system.

Unix’s standard shell was the Bourne shell, sh. The shells in use today are Bourne shell clones with a superset of its features. The most popular interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist shell) has become popular for non-interactive use (scripting). The shell included with w64devkit is the BusyBox fork of the Almquist shell (ash), closely related to dash. The Almquist shell has almost no non-interactive features beyond the standard Bourne shell, and so as far as scripts are concerned can be regarded as a plain Bourne shell clone. That’s why I typically refer to it by the name sh.

However, BusyBox’s Almquist shell has interactive features much like Bash, and Bash users should be quite comfortable. It’s not just tab-completion but a slew of Emacs-like keybindings:

Ctrl-r: search backwards in history
Ctrl-s: search forwards in history
Ctrl-p: previous command (Up)
Ctrl-n: next command (Down)
Ctrl-a: cursor to the beginning of line (Home)
Ctrl-e: cursor to the end of line (End)
Alt-b: cursor back one word
Alt-f: cursor forward one word
Ctrl-l: clear the screen
Alt-d: delete word after the cursor
Ctrl-w: delete the word before the cursor
Ctrl-k: delete to the end of the line
Ctrl-u: delete to the beginning of the line
Ctrl-f: cursor forward one character (Right)
Ctrl-b: cursor backward one character (Left)
Ctrl-d: delete character under the cursor (Delete)
Ctrl-h: delete character before the cursor (Backspace)

Take special note of Ctrl-r, which is the most important and powerful shortcut of the bunch. Frequent use is a good habit. Don’t mash the up arrow to search through the command history.

Special note for Cygwin and MSYS2 users: the shell is aware of Windows paths and does not present a virtual unix file system scheme. This has important consequences for scripting, both good and bad. The shell even supports backslash as a directory separator, though you should of course prefer forward slashes.

Shell customization

Login shells (-l) evaluate the contents of ~/.profile on startup. This is your chance to customize the shell configuration, such as setting environment variables or defining aliases and functions. For instance, if you wanted the prompt to show the working directory in green you’d set PS1 in your ~/.profile:

PS1="$(printf '\x1b[32;1m\\w\x1b[0m$ ')"

If you find yourself using the same command sequences or set of options again and again, you might consider putting those commands into a script, and then installing that script somewhere on your PATH so that you can run it as a new command. First make a directory to hold your scripts, say in ~/bin:

mkdir ~/bin

In ~/.profile prepend it to your PATH:

PATH="$HOME/bin;$PATH"

If you don’t want to start a fresh shell to try it out, then load the new configuration in your current shell:

source ~/.profile

Suppose you keep getting the tar switches mixed up and you’d like to just have an untar command that does the right thing. Create a file named untar or untar.sh in ~/bin with these contents:

#!/bin/sh
set -e
tar -xaf "$@"

Now a command like untar something.tar.gz will extract the archive contents.

To learn more about Bourne shell scripting, the POSIX shell command language specification is a good reference. All of the features listed in that document are available to your shell scripts.

Text editing

The development kit includes the powerful and popular text editor Vim. It takes effort to learn, but is well worth the investment. It’s packed with features, but since you only need a small number of them on a regular basis it’s not as daunting as it might appear. Using Vim effectively, you will write and edit text so much more quickly than before. That includes not just code, but prose: READMEs, documentation, etc.

(The catch: Non-modal editing will forever feel frustratingly inefficient. That’s not because you will become unpracticed at it, or even have trouble code switching between input styles, but because you’ll now be aware how bad it is. Ignorance is bliss.)

Vim includes its own tutorial for absolute beginners which you can access with the vimtutor command. It will run in the console window and guide you through the basics in about half an hour. Do not be afraid to return to the tutorial at any time since this is the stuff you need to know by heart.

When it comes time to actually use Vim to write code, you can continue writing code via the terminal interface (vim), or you can run the graphical interface (gvim). The latter is recommended since it has some nice quality-of-life features, but it’s not strictly necessary. When starting the GUI, put an ampersand (&) on the command so that it runs in the background. For instance this brings up the editor with two files open but leaves the shell running in the foreground so you can continue using it while you edit:

gvim main.c Makefile &

Vim’s defaults are good but imperfect. Before getting started with actually editing code you should establish at least the following minimal configuration in ~/_vimrc. (To understand these better, use :help to jump to the built-in documentation.)

set hidden encoding=utf-8 shellslash
filetype plugin indent on
syntax on

The graphical interface defaults to a white background. Many people prefer “dark mode” when editing code, so inverting this is simply a matter of choosing a dark color scheme. Vim comes with a handful of color schemes, around half of which have dark backgrounds. Use :colorscheme to change it, and put it in your ~/_vimrc to persist it.

colorscheme slate

The default graphical interface includes a menu bar and tool bar. There are better ways to accomplish all these operations, none of which require touching the mouse, so consider removing all that junk:

set guioptions=ac

Finally, since the development kit is oriented around C and C++, here’s my own entire Vim configuration for C which makes it obey my own style:

set cinoptions+=t0,l1,:0 cinkeys-=0#

Once you’re comfortable with the basics, the best next step is to read Practical Vim: Edit Text at the Speed of Thought by Drew Neil. It’s an opinionated guide to Vim that instills good habits. If you want something cost-free to whet your appetite, check out Seven habits of effective text editing.

Writing an application

We’ve established a shell and text editor. Next is the development workflow for writing an actual application. Ultimately you will invoke a compiler from within Vim, which will parse compiler messages and take you directly to the parts of your source code that need attention. Before we get that far, let’s start with the basics.

The classic example is the “hello world” program, which we’ll suppose is in a file called hello.c:

#include <stdio.h>

int main(void)
{
    puts("Hello, world!");
}

While this development kit provides a version of the GNU compiler, gcc, this guide mostly speaks of it in terms of the generic unix C compiler name, cc. Unix-like systems install cc as an alias for the system’s default C compiler, and w64devkit is no exception.

cc -o hello.exe hello.c

This command creates hello.exe from hello.c. Since this is not (yet?) on your PATH, you must invoke it via a path name (i.e. the command must include a slash), since otherwise the shell will search for it via the PATH variable. Typically this means putting ./ in front of the program name, meaning “run the program in the current directory”. As a convenience you do not need to include the .exe extension:

./hello

Unlike the untar shell script from before, this hello.exe is entirely independent of w64devkit. You can share it with anyone running Windows and they’ll be able to execute it. There’s a little bit of runtime embedded in the executable, but the bulk of the runtime is in the operating system itself. I want to highlight this point because most programming languages don’t work like this, or at least doing so is unnatural with lots of compromises. The users of your software do not need to install a runtime or other supporting software. They just run the executable you give them!

That executable is probably pretty small, less than 50kB, which is basically a miracle by today’s standards. Sure, it’s hardly doing anything right now, but you can add a whole lot more functionality without that executable getting much bigger. In fact, it’s entirely unoptimized right now and could be even smaller. Passing the -Os flag tells the compiler to optimize for size and the -s flag tells the linker to strip out unneeded information.

cc -Os -s -o hello.exe hello.c

That cuts the program down to around a third of its previous size. If necessary you can still do even better than this, but that’s outside the scope of this guide.

So far the program could still be valid enough to compile but contain obvious mistakes. The compiler can warn about many of these mistakes, and so it’s always worth enabling these warnings. This requires two flags: -Wall (“all” warnings) and -Wextra (extra warnings).

cc -Wall -Wextra -o hello.exe hello.c

When you’re working on a program, you often don’t want optimization enabled since it makes it more difficult to debug. However, some warnings aren’t fired unless optimization is enabled. Fortunately there’s an optimization level to resolve this, -Og (optimize for debugging). Combine this with -g3 to embed debug information in the program. This will be handy later.

cc -Wall -Wextra -Og -g3 -o hello.exe hello.c

These are the compiler flags you typically want to enable while developing your software. When you distribute it, you’d use either -Os -s (optimize for size) or -O3 -s (optimize for speed).

Makefiles

I mentioned running the compiler from Vim. This isn’t done directly but via a special build script called a Makefile. You invoke the make program from Vim, which invokes the compiler as above. The simplest Makefile would look like this, in a file literally named Makefile:

hello.exe: hello.c
	cc -Wall -Wextra -Og -g3 -o hello.exe hello.c

This tells make that the file named hello.exe is derived from another file called hello.c, and the tab-indented line is the recipe for doing so. Running the make command will run the compiler command if and only if hello.exe is missing or hello.c is newer than hello.exe.
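The decision make takes for this rule can be sketched in a few lines of C. This is only an illustration of the idea, not how any particular make is implemented. It assumes the POSIX stat function and whole-second modification times, and the name needs_rebuild is hypothetical. Note that a target that does not exist yet also counts as out of date.

```c
#include <sys/stat.h>

/* Returns 1 if the target must be rebuilt: it is missing, or the
 * prerequisite has a later modification time. Returns 0 if the target
 * is up to date, or -1 if the prerequisite itself is missing, since
 * then there is nothing to build from. */
int needs_rebuild(const char *target, const char *prereq)
{
    struct stat t, p;
    if (stat(prereq, &p) != 0) {
        return -1;
    }
    if (stat(target, &t) != 0) {
        return 1;
    }
    return p.st_mtime > t.st_mtime;
}
```

For the Makefile above, the recipe runs exactly when needs_rebuild("hello.exe", "hello.c") would return 1.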

To run make from Vim, use the :make command inside Vim. It will not only run make but also capture its output in an internal buffer called the quickfix list. If there is any warning or error, Vim will jump to it. Use :cn (next) and :cp (prev) to move between issues and correct them, or :cc to re-display the current issue. When you’re done fixing the issues, run :make again to start the cycle over.

Try that now by changing the printed message and recompiling from within Vim. Intentionally create an error (bad syntax, too many arguments, etc.) and see what happens.

Makefiles are a powerful and conventional way to build C and C++ software. Since the development kit includes the standard set of unix utilities, it’s very easy to write portable Makefiles that work across a variety of operating systems and environments. Your software isn’t necessarily tied to Windows just because you’re using a Windows-based development environment. If you want to learn how Makefiles work and how to use them effectively, read A Tutorial on Portable Makefiles. From here on I’ll assume you’ve read that tutorial.

Ultimately I’d probably write my “hello world” Makefile like so:

.POSIX:
CC      = cc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =
EXE     = .exe

hello$(EXE): hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)

When building a release, optimize for size or speed:

make CFLAGS=-Os LDFLAGS=-s

This is very much a Windows-first style of Makefile, but still allows it to be comfortably used on other systems. On Linux this make invocation strips away the .exe extension:

make EXE=

For a Windows-second Makefile, remove the line with EXE = .exe. This allows EXE to come from the environment. So, for instance, I already define the EXE environment variable in my w64devkit ~/.profile:

export EXE=.exe

On Linux running make does the right thing, as does running make on Windows. No special configuration required.

If my software is truly limited to Windows, I’m likely still interested in supporting cross-compilation. A common convention for GNU toolchains is a CROSS Makefile macro. For example:

.POSIX:
CROSS   =
CC      = $(CROSS)gcc
CFLAGS  = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS  =

hello.exe: hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)

On Windows I just run make, but on Linux I’d set CROSS appropriately.

make CROSS=x86_64-w64-mingw32-

Navigating

What happens if you’re working on a larger program and you need to jump to the definition of a function, macro, or variable? It would be tedious to use grep all the time to find definitions. The development kit includes a solid implementation of ctags for building a tags database that lists the locations of various kinds of definitions, and Vim knows how to read this database. Most often you’ll want to run it recursively like so:

ctags -R

You can of course do this from Vim, too: :!ctags -R

With the cursor over an identifier, press CTRL-] to jump to a definition for that name. Use :tn and :tp to move between different definitions (e.g. when the name is overloaded). Or if you have a tag in mind rather than a name listed in the buffer, use the :tag command to jump by name. Vim maintains a tag stack and jump list for going back and forth, like the backward and forward buttons in a browser.

Debugging

I had mentioned that the -g3 option embeds extra information in the executable. This is for debuggers, and the development kit includes the GNU Debugger, gdb, to help you debug your programs. To use it, invoke GDB on your executable:

gdb hello.exe

From here you can set breakpoints and such, then run the program with start or run, then step through it line by line. See Beej’s Quick Guide to GDB for a guide. During development, always run your program through GDB, and never exit GDB. See also: Assertions should be more debugger-oriented.

Learning C and C++

So far this guide hasn’t actually assumed any C knowledge. One of the best ways to learn C is by reading the highly-regarded The C Programming Language and doing the exercises. Alternatively, cost-free options are Beej’s Guide to C Programming and Modern C (more advanced). You can use the development kit to go through any of these.

I’ve focused on C, but everything above also applies to C++. To learn C++, A Tour of C++ is a safe bet.

Demonstration

To illustrate how much you can do with nothing more than this 76MB development kit, here’s a taste in the form of a weekend project: an Asteroids Clone for Windows. That’s the game in the video at the top of this guide.

The development kit doesn’t include Git so you’d need to install it separately in order to clone the repository, but you could at least skip that and download a .zip snapshot of the source. It has no third-party dependencies yet it includes hardware-accelerated graphics, real-time sound mixing, and gamepad input. Building a larger and more complex game is much less about tooling and more about time and skill. That’s what I mean about w64devkit being (almost) everything you need.

Well-behaved alias commands on Windows

2021-02-08T20:32:45Z

Since its inception I’ve faced a dilemma with w64devkit, my all-in-one Mingw-w64 toolchain and development environment distribution for Windows. A major goal of the project is no installation: unzip anywhere and it’s ready to go as-is. However, full functionality requires alias commands, particularly for BusyBox applets, and the usual solutions are neither available nor viable. It seemed that an installer was needed to assemble this last puzzle piece. This past weekend I finally discovered a tidy and complete solution that solves this problem for good.

That solution is a small C source file, alias.c. This article is about why it’s necessary and how it works.

Hard and symbolic links

Some alias commands are for convenience, such as a cc alias for gcc so that build systems need not assume any particular C compiler. Others are essential, such as an sh alias for “busybox sh” so that it’s available as a shell for make. These aliases are usually created with links, hard or symbolic. A GCC installation might include (roughly) a symbolic link created like so:

ln -s gcc cc

BusyBox looks at its argv[0] on startup, and if it names an applet (ls, sh, awk, etc.), it behaves like that applet. Typically BusyBox aliases are installed as hard links to the original binary, and there’s even a busybox --install to set these up. Both kinds of aliases are cheap and effective.

ln busybox sh
ln busybox ls
ln busybox awk
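The argv[0] dispatch is easy to picture in C. What follows is a minimal sketch and not BusyBox's actual code: the applet table, the applet_name and dispatch helpers, and the stub applets returning distinct numbers are all hypothetical.

```c
#include <string.h>

// Stub applets for illustration; each returns a distinct value.
static int applet_sh(void)  { return 1; }
static int applet_ls(void)  { return 2; }
static int applet_awk(void) { return 3; }

struct applet {
    const char *name;
    int (*run)(void);
};

static const struct applet applets[] = {
    {"sh", applet_sh}, {"ls", applet_ls}, {"awk", applet_awk},
};

// Copy the applet name out of argv[0] into dst (cap must be nonzero):
// drop any directory prefix and a trailing ".exe".
char *applet_name(char *dst, size_t cap, const char *argv0)
{
    const char *base = argv0;
    for (const char *p = argv0; *p; p++) {
        if (*p == '/' || *p == '\\') base = p + 1;
    }
    size_t len = strlen(base);
    if (len > 4 && !strcmp(base + len - 4, ".exe")) len -= 4;
    if (len >= cap) len = cap - 1;  // truncate overly long names
    memcpy(dst, base, len);
    dst[len] = 0;
    return dst;
}

// Run the applet named by argv[0], or return -1 if there is none.
int dispatch(const char *argv0)
{
    char name[32];
    applet_name(name, sizeof(name), argv0);
    for (size_t i = 0; i < sizeof(applets)/sizeof(*applets); i++) {
        if (!strcmp(name, applets[i].name)) return applets[i].run();
    }
    return -1;
}
```

A hard link named sh and one named ls share the same bytes on disk, yet dispatch() sends each to a different applet purely from the name it was invoked under.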

Unfortunately links are not supported by .zip files on Windows. They’d need to be created by a dedicated installer. As a result, I’ve strongly recommended that users run “busybox --install” at some point to establish the BusyBox alias commands. While w64devkit works without them, it works better with them. Still, that’s an installation step!

An alternative option is to simply include a full copy of the BusyBox binary for each applet — all 150 of them — simulating hard links. BusyBox is small, around 4kB per applet on average, but it’s not quite that small. Since the .zip format doesn’t use block compression — files are compressed individually — this duplication will appear in the .zip itself. My 573kB BusyBox build duplicated 150 times would double the distribution size and increase the installation footprint by 25%. It’s not worth the cost.

Since .zip is so limited, perhaps I should use a different distribution format that supports links. However, another w64devkit goal is making no assumptions about what other tools are installed. Windows natively supports .zip, even if that support isn’t so great (poor performance, low composability, missing features, etc.). With nothing more than the w64devkit .zip on a fresh, offline Windows installation, you can begin efficiently developing professional, native applications in under a minute.

Scripts as aliases

With links off the table, the next best option is a shell script. On unix-like systems shell scripts are an effective tool for creating complex alias commands. Unlike links, they can manipulate the argument list. For instance, w64devkit includes a c99 alias to invoke the C compiler configured to use the C99 standard. To do this with a shell script:

#!/bin/sh
exec cc -std=c99 "$@"

This prepends -std=c99 to the argument list and passes through the rest untouched via the Bourne shell’s special case "$@". Because I used exec, the shell process becomes the compiler in place. The shell doesn’t hang around in the background. It’s just gone. This is really quite elegant and powerful.

The closest available on Windows is a .bat batch file. However, like some other parts of DOS and Windows, the Batch language was designed as though its designer once glimpsed someone using a unix shell, perhaps looking over their shoulder, then copied some of the ideas without understanding them. As a result, it’s not nearly as useful or powerful. Here’s the Batch equivalent:

@cc -std=c99 %*

The @ is necessary because Batch prints its commands by default (Bourne shell’s -x option), and @ disables it. Windows lacks the concept of exec(3), so the Batch file interpreter, cmd.exe, continues running alongside the compiler. A little wasteful but that hardly matters. What does matter though is that cmd.exe doesn’t behave itself! If you, say, press Ctrl+C to cancel compilation, you will get the infamous “Terminate batch job (Y/N)?” prompt which interferes with other programs running in the same console. The so-called “batch” script isn’t a batch job at all: It’s interactive.

I tried to use Batch files for BusyBox applets, but this issue came up constantly and made this approach impractical. Nearly all BusyBox applets are non-interactive, and lots of things break when they aren’t. Worst of all, you can easily end up with layers of cmd.exe clobbering each other to ask if they should terminate. It was frustrating.

The prompt is hardcoded in cmd.exe and cannot be disabled. Since so much depends on cmd.exe remaining exactly the way it is, Microsoft will never alter this behavior either. After all, that’s why they made PowerShell a new, separate tool.

Speaking of PowerShell, could we use that instead? Unfortunately not:

1. It’s installed by default on Windows, but is not necessarily enabled. One of my own use cases for w64devkit involves systems where PowerShell is disabled by policy. A common policy is it can be used interactively but not run scripts (“Running scripts is disabled on this system”).
2. PowerShell is not a first class citizen on Windows, and will likely never be. Even under the friendliest policy it’s not normally possible to put a PowerShell script on the PATH and run it by name. (I’m sure there are ways to make this work via system-wide configuration, but that’s off the table.)
3. Everything in PowerShell is broken. For example, it does not support input redirection with files, and instead you must use the cat-like command, Get-Content, to pipe file contents. However, Get-Content translates its input and quietly damages your data. There is no way to disable this “feature” in the version of PowerShell that ships with Windows, meaning it cannot accomplish the simplest of tasks. This is just one of many ways that PowerShell is broken beyond usefulness.

Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts are still not first class citizens since Windows doesn’t know what to do with them. Fixing this would require system-wide configuration, antithetical to the philosophy of the project.

Solution: compiled shell “scripts”

My working solution is inspired by an insanely clever hack used by my favorite media player, mpv. The Windows build is strange at first glance, containing two binaries, mpv.exe (large) and mpv.com (tiny). Is that COM as in an old-school 16-bit DOS binary? No, that’s just a trick that works around a Windows limitation.

The Windows technology is broken up into subsystems. Console programs run in the Console subsystem. Graphical programs run in the Windows subsystem. The original WSL was a subsystem. Unfortunately this design means that a program must statically pick a subsystem, hardcoded into the binary image. The program cannot select a subsystem dynamically. For example, this is why Java installations have both java.exe and javaw.exe, and Emacs has emacs.exe and runemacs.exe. Different binaries for different subsystems.
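To see what “hardcoded into the binary image” means, the subsystem is a 16-bit field in the PE optional header, at the same offset for both PE32 and PE32+ images. Here is a minimal sketch that reads it from an image already loaded into memory. The function name pe_subsystem is hypothetical, and it validates only the two signatures, which is far less than a real PE loader checks.

```c
#include <stddef.h>
#include <stdint.h>

// Subsystem values defined by the PE format.
enum { PE_SUBSYSTEM_GUI = 2, PE_SUBSYSTEM_CUI = 3 };

// Return the Subsystem field of a PE image, or -1 if the buffer does
// not look like one.
int pe_subsystem(const unsigned char *img, size_t len)
{
    if (len < 0x40 || img[0] != 'M' || img[1] != 'Z') return -1;

    // e_lfanew: little-endian file offset of the "PE\0\0" signature
    uint32_t pe = (uint32_t)img[0x3c]       | (uint32_t)img[0x3d] <<  8 |
                  (uint32_t)img[0x3e] << 16 | (uint32_t)img[0x3f] << 24;

    // signature (4) + COFF header (20) + offset of Subsystem in the
    // optional header (68) + the field itself (2)
    if (pe > len || len - pe < 4 + 20 + 68 + 2) return -1;
    if (img[pe] != 'P' || img[pe+1] != 'E' || img[pe+2] || img[pe+3]) {
        return -1;
    }

    const unsigned char *sub = img + pe + 4 + 20 + 68;
    return sub[0] | sub[1] << 8;
}
```

A Console program reports PE_SUBSYSTEM_CUI and a graphical one reports PE_SUBSYSTEM_GUI, and nothing at run time can change the answer.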

On Linux, a program that wants to do graphics just talks to the Xorg server or Wayland compositor. It can dynamically choose to be a terminal application or a graphical application. Or even both at once. This is exactly the behavior of mpv, and it faces a dilemma on Windows: With subsystems, how can it be both?

The trick is based on the environment variable PATHEXT which tells Windows how to prioritize executables with the same base name but different file extensions. If I type mpv and it finds both mpv.exe and mpv.com, which binary will run? It will be the first listed in PATHEXT, and by default that starts with:

PATHEXT=.COM;.EXE;.BAT;...

So it will run mpv.com, which is actually a plain old PE+ .exe in disguise. The Windows subsystem mpv.exe gets the shortcut and file associations while Console subsystem mpv.com catches command line invocations and serves as console liaison as it invokes the real mpv.exe. Ingenious!
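The priority rule can be modeled as a small lookup. This is a hypothetical sketch of the ordering within a single directory, not the actual Windows search code: pathext_pick and its parameters are invented for illustration, and found[] stands in for the extensions that exist on disk for the typed base name.

```c
#include <ctype.h>
#include <stddef.h>

// Case-insensitive match of ext against the n-byte PATHEXT entry.
static int ext_equal(const char *entry, size_t n, const char *ext)
{
    size_t i = 0;
    for (; i < n && ext[i]; i++) {
        if (toupper((unsigned char)entry[i]) !=
            toupper((unsigned char)ext[i])) {
            return 0;
        }
    }
    return i == n && !ext[i];
}

// Walk the semicolon-separated PATHEXT in order and return the index
// in found[] of the first extension that is present, or -1 if none.
int pathext_pick(const char *pathext, const char **found, int nfound)
{
    const char *p = pathext;
    while (*p) {
        size_t n = 0;
        while (p[n] && p[n] != ';') n++;  // length of this entry
        for (int i = 0; i < nfound; i++) {
            if (ext_equal(p, n, found[i])) return i;
        }
        p += n;
        if (*p == ';') p++;
    }
    return -1;
}
```

With both .exe and .com present, the default ordering selects .com, which is the whole trick.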

I realized I can pull a similar trick to create command aliases — not the .com trick, but the miniature flagger program. If only I could compile each of those Batch files to tiny, well-behaved .exe files so that they wouldn’t rely on the badly-behaved cmd.exe…

Tiny C programs

Years ago I wrote about tiny, freestanding Windows executables. That research paid off here since that’s exactly what I want. The alias command program need only manipulate its command line, invoke another program, then wait for it to finish. This doesn’t require the C library, just a handful of kernel32.dll calls. My alias command programs can be so small that it would no longer matter that I have 150 of them, and I get complete control over their behavior.

To compile, I use -nostdlib and -ffreestanding to disable all system libraries, -lkernel32 to pull that one back in, -Os (optimize for size), and -s (strip) all to make the result as small as possible.

I don’t want to write a little program for each alias command. Instead I’ll use a couple of C defines, EXE and CMD, to inject the target command at compile time. So this Batch file:

@target arg1 arg2 %*

Is equivalent to this alias compilation:

gcc -DEXE="target.exe" -DCMD="target arg1 arg2" \
    -s -Os -nostdlib -ffreestanding -o alias.exe alias.c -lkernel32

The EXE string is the actual module name, so the .exe extension is required. The CMD string replaces the first complete token of the command line string (think argv[0]) and may contain arbitrary additional arguments (e.g. -std=c99). Both are handled as wide strings (L"...") since the alias program uses the wide Win32 API in order to be fully transparent. Though unfortunately at this time it makes no difference: All currently aliased programs use the “ANSI” API since the underlying C and C++ standard libraries only use the ANSI API. (As far as I know, nobody has ever written fully-functional C and C++ standard libraries for Windows, not even Microsoft.)

You might wonder why the heck I’m gluing strings together for the arguments. These will need to be parsed (word split, etc.) by someone else, so shouldn’t I construct an argv array instead? That’s not how it works on Windows: Programs receive a flat command string and are expected to parse it themselves following the format specification. When you write a C program, the C runtime does this for you to provide the usual argv array.

This is upside down. The caller creating the process already has arguments split into an argv array — or something like it — but Win32 requires the caller to encode the argv array as a string following a special format so that the recipient can immediately decode it. Why marshal rather than pass structured data in the first place? Why does Win32 only supply a decoder (CommandLineToArgvW) and not an encoder (e.g. the missing ArgvToCommandLine)? Hey, I don’t make the rules; I just have to live with them.
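For a sense of what the missing encoder involves, here is a minimal sketch of quoting a single argument so that the usual decoder recovers it unchanged. It is not part of alias.c: quote_arg is a hypothetical name, it works on narrow strings for brevity, and it assumes the widely documented Microsoft C runtime convention where backslashes are literal unless they precede a double quote. It does not protect against cmd.exe metacharacters.

```c
#include <stddef.h>
#include <string.h>

// Write the encoded form of arg into dst. Returns the number of bytes
// written, excluding the terminator, or 0 if cap is too small.
size_t quote_arg(char *dst, size_t cap, const char *arg)
{
    size_t len = strlen(arg);
    // Worst case: every byte doubles, plus two quotes and a terminator.
    if (cap < 2*len + 3) return 0;

    if (len && !strpbrk(arg, " \t\"")) {
        memcpy(dst, arg, len + 1);  // nothing special: copy as-is
        return len;
    }

    size_t n = 0;
    dst[n++] = '"';
    for (size_t i = 0; ; i++) {
        size_t slashes = 0;
        while (arg[i] == '\\') { slashes++; i++; }
        if (!arg[i]) {
            // Trailing backslashes precede the closing quote: double them.
            for (size_t k = 0; k < 2*slashes; k++) dst[n++] = '\\';
            break;
        } else if (arg[i] == '"') {
            // Double the backslashes, then escape the quote itself.
            for (size_t k = 0; k < 2*slashes + 1; k++) dst[n++] = '\\';
            dst[n++] = '"';
        } else {
            // Backslashes not followed by a quote are literal.
            for (size_t k = 0; k < slashes; k++) dst[n++] = '\\';
            dst[n++] = arg[i];
        }
    }
    dst[n++] = '"';
    dst[n] = 0;
    return n;
}
```

A full encoder would apply this to each element of argv and join the results with spaces. The alias program avoids all of this by passing the tail of its own command line through untouched.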

You can look at the original source for the details, but the summary is that I supply my own xstrlen(), xmemcpy(), and partial Win32 command line parser — just enough to identify the first token, even if that token is quoted. It glues the strings together, calls CreateProcessW, waits for it to exit (WaitForSingleObject), retrieves the exit code (GetExitCodeProcess), and exits with the same status. (The stuff that comes for free with exec(3).)
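The string handling half of that summary can be sketched portably. This is not the code from alias.c: skip_first_token and make_alias_cmdline are hypothetical names, the strings are narrow rather than wide, and the token rule is an assumption (a double quote toggles quoting, and the first token ends at the first space or tab outside quotes).

```c
#include <stddef.h>
#include <string.h>

// Return a pointer just past the first token of a command line.
const char *skip_first_token(const char *cmdline)
{
    int quoted = 0;
    const char *p = cmdline;
    for (; *p; p++) {
        if (*p == '"') {
            quoted = !quoted;  // entering or leaving a quoted region
        } else if (!quoted && (*p == ' ' || *p == '\t')) {
            break;             // unquoted whitespace ends the token
        }
    }
    return p;
}

// Build cmd followed by everything after the first token of cmdline.
// Returns 1 on success, 0 if dst is too small.
int make_alias_cmdline(char *dst, size_t cap, const char *cmd,
                       const char *cmdline)
{
    const char *rest = skip_first_token(cmdline);
    size_t a = strlen(cmd);
    size_t b = strlen(rest);
    if (a + b + 1 > cap) return 0;
    memcpy(dst, cmd, a);
    memcpy(dst + a, rest, b + 1);  // includes the terminator
    return 1;
}
```

The remaining arguments are never parsed or re-quoted, so whatever the user typed reaches the target program byte for byte.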

This all compiles to a 4kB executable, mostly padding, which is small enough for my purposes. These compress to an acceptable 1kB each in the .zip file. Smaller would be nicer, but this would require at minimum a custom linker script, and even smaller would require hand-crafted assembly.

This lingering issue solved, w64devkit now works better than ever. The alias.c source is included in the kit in case you need to make any of your own well-behaved alias commands.

w64devkit: (Almost) Everything You Need

2020-09-25T00:04:11Z

This article was discussed on Hacker News.

This past May I put together my own C and C++ development distribution for Windows called w64devkit. The entire release weighs under 80MB and requires no installation. Unzip and run it in-place anywhere. It’s also entirely offline. It will never automatically update, or even touch the network. In mere seconds any Windows system can become a reliable development machine. (To further increase reliability, disconnect it from the internet.) Despite its simple nature and small packaging, w64devkit is almost everything you need to develop any professional desktop application, from a command line utility to a AAA game.

I don’t mean this in some useless Turing-complete sense, but in a practical, get-stuff-done sense. It’s much more a matter of know-how than of tools or libraries. So then what is this “almost” about?

The distribution does not have WinAPI documentation. It’s notoriously difficult to obtain and, besides, unfriendly to redistribution. It’s essential for interfacing with the operating system and difficult to work without. Even a dead tree reference book would suffice.
Depending on what you’re building, you may still need specialized tools. For instance, game development requires tools for editing art assets.
There is no formal source control system. Git is excluded per the issues noted in the announcement, and my next option, Quilt, has similar limitations. However, diff and patch are included, and are sufficient for a kind of old-school, patch-based source control. I’ve used it successfully when dogfooding w64devkit in a fresh Windows installation.

Everything else

As I said in my announcement, w64devkit includes a powerful text editor that fulfills all text editing needs, from code to documentation. The editor includes a tutorial (vimtutor) and complete, built-in manual (:help) in case you’re not yet familiar with it.

What about navigation? Use the included ctags to generate a tags database (ctags -R), then jump instantly to any definition at any time. No need for that Language Server Protocol rubbish. This does not mean you must laboriously type identifiers as you work. Use built-in completion!

Build system? That’s also covered, via a Windows-aware unix-like environment that includes make. Learning how to use it is a breeze. Software is by its nature unavoidably complicated, so don’t make it more complicated than necessary.

What about debugging? Use the debugger, GDB. Performance problems? Use the profiler, gprof. Inspect compiler output either by asking for it (-S) or via the disassembler (objdump -d). No need to go online for the Godbolt Compiler Explorer, as slick as it is. If the compiler output is insufficient, use SIMD intrinsics. In the worst case there are two different assemblers available. Real time graphics? Use an operating system API like OpenGL, DirectX, or Vulkan.

w64devkit really is nearly everything you need in a single, no nonsense, fully-offline package! It’s difficult to emphasize this point as much as I’d like. When interacting with the broader software ecosystem, I often despair that software development has lost its way. This distribution is my way of carving out an escape from some of the insanity. As a C and C++ toolchain, w64devkit by default produces lean, sane, trivially-distributable, offline-friendly artifacts. All runtime components in the distribution are static link only, so no need to distribute DLLs with your application either.

Customize the distribution, own the toolchain

While most users would likely stick to my published releases, building w64devkit is a two-step process with a single build dependency, Docker. Anyone can easily customize it for their own needs. Don’t care about C++? Toss it to shave 20% off the distribution. Need to tune the runtime for a specific microarchitecture? Tweak the compiler flags.

One of the intended strengths of open source is users can modify software to suit their needs. With w64devkit, you own the toolchain itself. It is one of your dependencies after all. Unfortunately the build initially requires an internet connection even when working from source tarballs, but at least it’s a one-time event.

If you choose to take on dependencies, and you build those dependencies using w64devkit, all the better! You can tweak them to your needs and choose precisely how they’re built. You won’t be relying on the goodwill of internet randos nor the generosity of a free package registry.

Customization examples

Building existing software using w64devkit is probably easier than expected, particularly since much of it has already been “ported” to MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure scripts. They never work in w64devkit despite having everything they technically need. So other than that, here’s a demonstration of building some popular software.

One of my coworkers uses his own version of PuTTY patched to play more nicely with Emacs. If you wanted to do the same, grab the source tarball, unpack it using the provided tools, then in the unpacked source:

$ make -C windows -f Makefile.mgw

You’ll have a custom-built putty.exe, as well as the other tools. If you have any patches, apply those first!

Would you like to embed an extension language in your application? Lua is a solid choice, in part because it’s such a well-behaved dependency. After unpacking the source tarball:

$ make PLAT=mingw

This produces a complete Lua compiler, runtime, and library. It’s not even necessary to use the Makefile, as it’s nearly as simple as “cc *.c” — painless to integrate or embed into any project.

Do you enjoy NetHack? Perhaps you’d like to try a few of the custom patches. This one is a little more complicated, but I was able to build NetHack 3.6.6 like so:

$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"

NetHack has a bug necessitating -fcommon. If you have any patches, apply them with patch before the last step. I won’t belabor it here, but with just a little more effort I was also able to produce a NetHack binary with curses support via PDCurses — statically-linked of course.

How about my archive encryption tool, Enchive? The one that even works with 16-bit DOS compilers. It requires nothing special at all!

$ make

w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM. This means you can modify and recompile these tools without going through the Docker build. Sadly busybox-w32 cannot host itself, though it’s close. I’d love if w64devkit could fully host itself, and so Docker — and therefore an internet connection and such — would only be needed to bootstrap, but unfortunately that’s not realistic given the state of the GNU components.

Offline and reliable

Software development has increasingly become dependent on a constant internet connection. Robust, offline tooling and development is undervalued.

Consider: Does your current project depend on an external service? Do you pay for this service to ensure that it remains up? If you pull your dependencies from a repository, how much do you trust those who maintain the packages? Do you even know their names? What would be your project’s fate if that service went down permanently? It will someday, though hopefully only after your project is dead and forgotten. If you have the ability to work permanently offline, then you already have happy answers to all these questions.

w64devkit: a Portable C and C++ Development Kit for Windows

2020-05-15T03:43:04Z

This article was discussed on Hacker News.

As a computer engineer, my job is to use computers to solve important problems. Ideally my solutions will be efficient, and typically that means making the best use of the resources at hand. Quite often these resources are machines running Windows and, despite my misgivings about the platform, there is much to be gained by properly and effectively leveraging it.

Sometimes targeting Windows while working from another platform is sufficient, but other times I must work on the platform itself. There are various options available for C development, and I’ve finally formalized my own development kit: w64devkit.

For most users, the value is in the 78MiB .zip available in the “Releases” on GitHub. This (relatively) small package includes a state-of-the-art C and C++ compiler (latest GCC), a powerful text editor, debugger, a complete x86 assembler, and miniature unix environment. It’s “portable” in that there’s no installation. Just unzip it and start using it in place. With w64devkit, it literally takes a few seconds on any Windows to get up and running with a fully-featured, fully-equipped, first-class development environment.

The development kit is cross-compiled entirely from source using Docker, though Docker is not needed to actually use it. The repository is just a Dockerfile and some documentation. The only build dependency is Docker itself. It’s also easy to customize it for your own personal use, or to audit and build your own if, for whatever reason, you didn’t trust my distribution. This is in stark contrast to Windows builds of most open source software where the build process is typically undocumented, under-documented, obtuse, or very complicated.

From script to Docker

Publishing this is not necessarily a commitment to always keep w64devkit up to date, but this Dockerfile is derived from (and replaces) a shell script I’ve been using continuously for over two years now. In this period, every time GCC has made a release, I’ve built myself a new development kit, so I’m already in the habit.

I’ve been using Docker on and off for about 18 months now. It’s an oddball in that it’s something I learned on the job rather than my own time. I formed an early impression that still basically holds: The main purpose of Docker is to contain and isolate misbehaved software to improve its reliability. Well-behaved, well-designed software benefits little from containers.

My unusual application of Docker here is no exception. Most software builds are needlessly complicated and fragile, especially Autoconf-based builds. Ironically, the worst configure scripts I’ve dealt with come from GNU projects. They waste time on superfluous checks (“Does your compiler define size_t?”) then produce a build that doesn’t work anyway because you’re doing something slightly unusual. Worst of all, despite my best efforts, the build will be contaminated by the state of the system doing the build.

My original build script was fragile by extension. It would work on one system, but not another due to some subtle environment change — a slightly different system header that reveals a build system bug (example in GCC), or the system doesn’t have a file at a certain hard-coded absolute path that shouldn’t be hard-coded. Converting my script to a Dockerfile locks these problems in place and makes builds much more reliable and repeatable. The misbehavior is contained and isolated by Docker.

Unfortunately it’s not completely contained. In each case I use make’s -j option to parallelize the build since otherwise it would take hours. Some of the builds have subtle race conditions, and some bad luck in timing can cause a build to fail. Docker is good about picking up where it left off, so it’s just a matter of trying again.

In one case a build failed because Bison and flex were not installed even though they’re not normally needed. Some dependency isn’t expressed correctly, and unlucky ordering leads to an unused .y file having the wrong timestamp. Ugh. I’ve had this happen a lot more in Docker than out, probably because file system operations are slow inside Docker and it creates greater timing variance.

Other tools

The README explains some of my decisions, but I’ll summarize a few here:

Git. Important and useful, so I’d love to have it. But it has a weird installation (many .zip-unfriendly symlinks) tightly-coupled with msys2, and its build system does not support cross-compilation. I’d love to see a clean, straightforward rewrite of Git in a single, appropriate implementation language. Imagine installing the latest Git with go get git-scm.com/git. (Update: libgit2 is working on it!)
Bash. It’s a much nicer interactive shell than BusyBox-w32 ash. But the build system doesn’t support cross-compilation, and I’m not sure it supports Windows without some sort of compatibility layer anyway.
Emacs. Another powerful editor. But the build system doesn’t support cross-compilation. It’s also way too big.
Go. Tempting to toss it in, but Go already does this all correctly and effectively. It simply doesn’t require a specialized distribution. It’s trivial to manage a complete Go toolchain with nothing but Go itself on any system. People may say its language design comes from the 1970s, but the tooling is decades ahead of everyone else.

Alternatives

For a long, long time Cygwin filled this role for me. However, I never liked its bulky nature, the complete opposite of portable. Cygwin processes always felt second-class on Windows, particularly in that it has its own view of the file system compared to other Windows processes. They could never fully cooperate. I also don’t like that there’s no toolchain for cross-compiling with Cygwin as a target — e.g. compile Cygwin binaries from Linux. Finally it’s been essentially obsoleted by WSL which matches or surpasses it on every front.

There’s msys and msys2, which are a bit lighter. However, I’m still in an isolated, second-class environment with weird path translation issues. These tools do have important uses, and it’s the only way to compile most open source software natively on Windows. For those builds that don’t support cross-compilation, it’s the only path for producing Windows builds. It’s just not what I’m looking for when developing my own software.

Update: llvm-mingw is an eerily similar project using Docker the same way, but instead builds LLVM.

Using Docker for other builds

I also converted my GnuPG build script to a Dockerfile. Of course I don’t plan to actually use GnuPG on Windows. I just need it for passphrase2pgp, which I test against GnuPG. This tests the Windows build.

In the future I may extend this idea to a few other tools I don’t intend to include with w64devkit. If you have something in mind, you could use my Dockerfiles as a kind of starter template.

How to Read UTF-8 Passwords on the Windows Console

2020-05-04T02:14:34Z

This article was discussed on Hacker News.

Suppose you’re writing a command line program that prompts the user for a password or passphrase, and Windows is one of the supported platforms (even very old versions). This program uses UTF-8 for its string representation, as it should, and so ideally it receives the password from the user encoded as UTF-8. On most platforms this is, for the most part, automatic. However, on Windows finding the correct answer to this problem is a maze where all the signs lead towards dead ends. I recently navigated this maze and found the way out.

I knew it was possible because my passphrase2pgp tool has been using the golang.org/x/crypto/ssh/terminal package, which gets it very nearly perfect. Though they were still fixing subtle bugs as recently as 6 months ago.

The first step is to ignore just about everything you find online, because it’s either wrong or it’s solving a slightly different problem. I’ll discuss the dead ends later and focus on the solution first. Ultimately I want to implement this on Windows:

// Display prompt then read zero-terminated, UTF-8 password.
// Return password length with terminator, or zero on error.
int read_password(char *buf, int len, const char *prompt);

I chose int for the length rather than size_t because it’s a password and should not even approach INT_MAX.

The correct way

For the impatient: complete, working, ready-to-use example

On a unix-like system, the program would:

open(2) the special /dev/tty file for reading and writing
write(2) the prompt
tcgetattr(3) and tcsetattr(3) to disable ECHO
read(2) a line of input
Restore the old terminal attributes with tcsetattr(3)
close(2) the file

A great advantage of this approach is that it doesn’t depend on standard input and standard output. Either or both can be redirected elsewhere, and this function still interacts with the user’s terminal. The Windows version will have the same advantage.
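The unix-like steps above can be sketched as follows. This is a minimal POSIX illustration with the same signature as read_password, not the complete example linked earlier. Treating a line that does not fit in the buffer as an error is an assumption of this sketch.

```c
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

// Display prompt then read a zero-terminated password from the
// controlling terminal. Returns the password length including the
// terminator, or zero on error.
int read_password(char *buf, int len, const char *prompt)
{
    if (len < 1) return 0;

    int fd = open("/dev/tty", O_RDWR);
    if (fd < 0) return 0;

    if (write(fd, prompt, strlen(prompt)) < 0) {
        close(fd);
        return 0;
    }

    // Save the current attributes, then disable echo.
    struct termios old, noecho;
    if (tcgetattr(fd, &old) != 0) {
        close(fd);
        return 0;
    }
    noecho = old;
    noecho.c_lflag &= ~ECHO;
    if (tcsetattr(fd, TCSAFLUSH, &noecho) != 0) {
        close(fd);
        return 0;
    }

    // In canonical mode the whole line arrives at once.
    ssize_t n = read(fd, buf, len);

    // Restore the terminal, and emit the newline that was not echoed.
    tcsetattr(fd, TCSAFLUSH, &old);
    if (write(fd, "\n", 1) < 0) n = -1;
    close(fd);

    // Error, end of input, or a line too long for the buffer.
    if (n < 1 || buf[n-1] != '\n') return 0;
    buf[n-1] = 0;  // replace the newline with the terminator
    return (int)n;
}
```

Because it talks to /dev/tty directly, it keeps working when standard input or standard output is a pipe or a file.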

Despite some tempting shortcuts that don’t work, the steps on Windows are basically the same but with different names. There are a couple subtleties and extra steps. I’ll be ignoring errors in my code snippets below, but the complete example has full error handling.

Create console handles

Instead of /dev/tty, the program opens two files: CONIN$ and CONOUT$ using CreateFileA(). Note: The “A” stands for ANSI, as opposed to “W” for wide (Unicode). This refers to the encoding of the file name, not to how the file contents are encoded. CONIN$ is opened for both reading and writing because write permissions are needed to change the console’s mode.

HANDLE hi = CreateFileA(
    "CONIN$",
    GENERIC_READ | GENERIC_WRITE,
    0,
    0,
    OPEN_EXISTING,
    0,
    0
);
HANDLE ho = CreateFileA(
    "CONOUT$",
    GENERIC_WRITE,
    0,
    0,
    OPEN_EXISTING,
    0,
    0
);

Print the prompt

To write the prompt, call WriteConsoleA() on the output handle. On its own, this assumes the prompt is plain ASCII (e.g. "password: "), not UTF-8 (e.g. "contraseña: "):

WriteConsoleA(ho, prompt, strlen(prompt), 0, 0);

If the prompt may contain UTF-8 data, perhaps because it displays a username or isn’t in English, you have two options:

Convert the prompt to UTF-16 and call WriteConsoleW() instead.
Use SetConsoleOutputCP() with CP_UTF8 (65001). This is a global (to the console) setting and should be restored when done.

Disable echo

Next use GetConsoleMode() and SetConsoleMode() to disable echo. The console usually has ENABLE_PROCESSED_INPUT already set, which tells the console to handle CTRL-C and such, but I set it explicitly just in case. I also set ENABLE_LINE_INPUT so that the user can use backspace and so that the entire line is delivered at once.

DWORD orig = 0;
GetConsoleMode(hi, &orig);

DWORD mode = orig;
mode |= ENABLE_PROCESSED_INPUT;
mode |= ENABLE_LINE_INPUT;
mode &= ~ENABLE_ECHO_INPUT;
SetConsoleMode(hi, mode);

There are reports that ENABLE_LINE_INPUT limits reads to 254 bytes, but I was unable to reproduce it. My full example can read huge passwords without trouble.

The old mode is saved in orig so that it can be restored later.

Read the password

Here’s where you have to pay the piper. As of the date of this article, the Windows API offers no method for reading UTF-8 input from the console. Give up on that hope now. If you use the “ANSI” functions to read input under any configuration, they will do the usual Windows thing of silently mangling your input.

So you must use the UTF-16 API, ReadConsoleW(), and then encode it yourself. Fortunately Win32 provides a UTF-8 encoder, WideCharToMultiByte(), which will even handle surrogate pairs for all those people who like putting PILE OF POO (U+1F4A9) in their passwords:

SIZE_T wbuf_len = (len - 1 + 2)*sizeof(WCHAR);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread;
ReadConsoleW(hi, wbuf, len - 1 + 2, &nread, 0);
wbuf[nread-2] = 0;  // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0);
SecureZeroMemory(wbuf, wbuf_len);
HeapFree(GetProcessHeap(), 0, wbuf);

I use SecureZeroMemory() to erase the UTF-16 version of the password before freeing the buffer. The + 2 in the allocation is for the CRLF line ending that will later be chopped off. The error handling version checks that the input did indeed end with CRLF. Otherwise it was truncated (too long).
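To make the encoding step less of a black box, here is a portable sketch of the kind of UTF-16 to UTF-8 conversion described above, including the surrogate pair case. It is not how Win32 implements WideCharToMultiByte(), and the function name utf16_to_utf8() is hypothetical. This sketch rejects unpaired surrogates outright, which may differ from what the Win32 function does with them.

```c
#include <stdint.h>

/* Encode n UTF-16 code units as zero-terminated UTF-8. Returns the
 * output length including the terminator, or zero if dst is too small
 * or src contains an unpaired surrogate. */
int utf16_to_utf8(const uint16_t *src, int n, char *dst, int len)
{
    int o = 0;
    if (len < 1) {
        return 0;
    }
    for (int i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (c >= 0xdc00 && c <= 0xdfff) {
            return 0;  /* low surrogate without a high surrogate */
        }
        if (c >= 0xd800 && c <= 0xdbff) {
            if (i + 1 == n || src[i+1] < 0xdc00 || src[i+1] > 0xdfff) {
                return 0;  /* high surrogate without a low surrogate */
            }
            /* Combine the pair into one code point above U+FFFF */
            c = 0x10000 + ((c - 0xd800) << 10) + (src[++i] - 0xdc00u);
        }

        int need = 1 + (c >= 0x80) + (c >= 0x800) + (c >= 0x10000);
        if (o + need >= len) {
            return 0;  /* no room for this character and the terminator */
        }
        switch (need) {
        case 1:
            dst[o++] = (char)c;
            break;
        case 2:
            dst[o++] = (char)(0xc0 | (c >> 6));
            dst[o++] = (char)(0x80 | (c & 0x3f));
            break;
        case 3:
            dst[o++] = (char)(0xe0 | (c >> 12));
            dst[o++] = (char)(0x80 | ((c >> 6) & 0x3f));
            dst[o++] = (char)(0x80 | (c & 0x3f));
            break;
        case 4:
            dst[o++] = (char)(0xf0 | (c >> 18));
            dst[o++] = (char)(0x80 | ((c >> 12) & 0x3f));
            dst[o++] = (char)(0x80 | ((c >> 6) & 0x3f));
            dst[o++] = (char)(0x80 | (c & 0x3f));
            break;
        }
    }
    dst[o++] = 0;
    return o;
}
```

PILE OF POO arrives from the console as the two code units 0xD83D 0xDCA9 and leaves as the four bytes F0 9F 92 A9. Like read_password(), the return value counts the terminator, so zero is free to signal an error.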

Clean up

Finally print a newline since the user-typed one wasn’t echoed, restore the old console mode, close the console handles, and return the final encoded length:

WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;

The error checking version doesn’t check for errors from any of these functions since either they cannot fail, or there’s nothing reasonable to do in the event of an error.

Dead ends

If you look around the Win32 API you might notice SetConsoleCP(). A reasonable person might think that setting the “code page” to UTF-8 (CP_UTF8) might configure the console to encode input in UTF-8. The good news is Windows will no longer mangle your input as before. The bad news is that it will be mangled differently.

You might think you can use the CRT function _setmode() with _O_U8TEXT on the FILE * connected to the console. This does nothing useful. (The only use for _setmode() is with _O_BINARY, to disable braindead character translation on standard input and output.) The best you’ll be able to do with the CRT is the same sort of wide character read using non-standard functions, followed by conversion to UTF-8.

CredUICmdLinePromptForCredentials() promises to be both a mouthful of a function name, and a prepacked solution to this problem. It only delivers on the first. This function seems to have broken some time ago and nobody at Microsoft noticed, probably because nobody has ever used this function. I couldn’t find a working example, nor a use in any real application. When I tried to use it, I got a nonsense error code and it never worked. There’s a GUI version of this function that does work, and it’s a viable alternative for certain situations, though not mine.

At my most desperate, I hoped ENABLE_VIRTUAL_TERMINAL_PROCESSING would be a magical switch. On Windows 10 it magically enables some ANSI escape sequences. The documentation in no way suggests it would work, and I confirmed by experimentation that it does not. Pity.

I spent a lot of time searching down these dead ends until finally settling on ReadConsoleW() above. I hoped it would be more automatic, but I’m glad I have at least some solution figured out.

Fibers: the Most Elegant Windows API

2019-03-28T22:26:05Z

This article was discussed on Hacker News.

The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.

That’s why, when I recently revisited the Fibers API, I was pleasantly surprised. It’s one of the exceptions — much cleaner than the optional, deprecated, and now obsolete POSIX equivalent. It’s not quite an apples-to-apples comparison since the POSIX version is slightly more powerful, and more complicated as a result. I’ll cover the difference in this article.

For the last part of this article, I’ll walk through an async/await framework built on top of fibers. The framework allows coroutines in C programs to await on arbitrary kernel objects.

Fiber Async/await Demo

Fibers

Windows fibers are really just stackful, symmetric coroutines. From a different point of view, they’re cooperatively scheduled threads, which is the source of the analogous name, fibers. They’re symmetric because all fibers are equal, and no fiber is the “main” fiber. If any fiber returns from its start routine, the program exits. (Older versions of Wine will crash when this happens, but it was recently fixed.) It’s equivalent to the process’ main thread returning from main(). The initial fiber is free to create a second fiber, yield to it, then the second fiber destroys the first.

For now I’m going to focus on the core set of fiber functions. There are some additional capabilities I’m going to ignore, including support for fiber local storage. The important functions are just these five:

void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void  SwitchToFiber(void *fiber);
bool  ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void  DeleteFiber(void *fiber);

To emphasize its simplicity, I’ve shown them here with more standard prototypes than seen in their formal documentation. That documentation uses the clunky Windows API typedefs still burdened with its 16-bit heritage, e.g. LPVOID being a “long pointer” from the segmented memory of the 8086.

Fibers are represented using opaque, void pointers. Maybe that’s a little too simple since it’s easy to misuse in C, but I like it. The return values for CreateFiber() and ConvertThreadToFiber() are void pointers since these both create fibers.

The fiber start routine returns nothing and takes a void “user pointer”. That’s nearly what I’d expect, except that it would probably make more sense for a fiber to return int, which is more in line with main / WinMain / mainCRTStartup / WinMainCRTStartup. As I said, when any fiber returns from its start routine, it’s like returning from the main function, so it should probably have returned an integer.

A fiber may delete itself, which is the same as exiting the thread. However, a fiber cannot yield (e.g. SwitchToFiber()) to itself. That’s undefined behavior.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void
coup(void *king)
{
    puts("Long live the king!");
    DeleteFiber(king);
    ConvertFiberToThread(); /* seize the main thread */
    /* ... */
}

int
main(void)
{
    void *king = ConvertThreadToFiber(0);
    void *pretender = CreateFiber(0, coup, king);
    SwitchToFiber(pretender);
    abort(); /* unreachable */
}

Only fibers can yield to fibers, but when the program starts up, there are no fibers. At least one thread must first convert itself into a fiber using ConvertThreadToFiber(), which returns the fiber object that represents itself. It takes one argument analogous to the last argument of CreateFiber(), except that there’s no start routine to accept it. The process is reversed with ConvertFiberToThread().

Fibers don’t belong to any particular thread and can be scheduled on any thread if properly synchronized. Obviously one should never yield to the same fiber in two different threads at the same time.

Contrast with POSIX

The equivalent on POSIX systems was context switching. It’s also stackful and symmetric, but it has just three important functions: getcontext(3), makecontext(3), and swapcontext(3).

int  getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int  swapcontext(ucontext_t *oucp, const ucontext_t *ucp);

These are roughly equivalent to GetCurrentFiber(), CreateFiber(), and SwitchToFiber(). There is no need for ConvertThreadToFiber(), nor its inverse ConvertFiberToThread(), since threads can context switch without preparation. There’s also no DeleteFiber() because the resources are managed by the program itself. That’s where POSIX contexts are a little bit more powerful.

The first argument to CreateFiber() is the desired stack size, with zero indicating the default stack size. The stack is allocated and freed by the operating system. The downside is that the caller doesn’t have a choice in managing the lifetime of this stack and how it’s allocated. If you’re frequently creating and destroying coroutines, those stacks are constantly being allocated and freed.

In makecontext(3), the caller allocates and supplies the stack. Freeing that stack is equivalent to destroying the context. A program that frequently creates and destroys contexts can maintain a stack pool or otherwise more efficiently manage their allocation. This makes it more powerful, but it also makes it a little more complicated. It would be hard to remember how to do all this without a careful reading of the documentation:

/* Create a context */
ucontext_t ctx;
getcontext(&ctx);  /* initialize first, then customize */
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;
makecontext(&ctx, proc, 0);

/* Destroy a context */
free(ctx.uc_stack.ss_sp);

Note how makecontext(3) is variadic (...), passing its arguments on to the start routine of the context. This seems like it might be better than a user pointer. Unfortunately it’s not, since those arguments are strictly limited to integers.
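A common workaround for the integers-only restriction is to split a pointer into two halves and reassemble it in the start routine. The sketch below shows the idea on a system that provides the ucontext functions (e.g. glibc). It assumes unsigned is 32 bits and pointers are no wider than 64 bits, and the names proc(), run_once(), and the 64KiB stack size are made up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t caller;

/* Start routine: reassemble the pointer from its two halves. */
static void proc(unsigned hi, unsigned lo)
{
    int *counter = (int *)(uintptr_t)((uint64_t)hi << 32 | lo);
    ++*counter;
    /* Returning switches to uc_link, which is the caller below. */
}

/* Run proc() once on its own stack. Returns 1 on success, else 0. */
int run_once(int *counter)
{
    size_t size = 1 << 16;  /* arbitrary stack size for this sketch */
    uint64_t p = (uintptr_t)counter;
    ucontext_t ctx;
    int ok;

    if (getcontext(&ctx) == -1) {
        return 0;
    }
    ctx.uc_stack.ss_sp = malloc(size);
    if (!ctx.uc_stack.ss_sp) {
        return 0;
    }
    ctx.uc_stack.ss_size = size;
    ctx.uc_link = &caller;

    /* Two int-sized arguments carry the one pointer. */
    makecontext(&ctx, (void (*)(void))proc, 2,
                (unsigned)(p >> 32), (unsigned)p);
    ok = swapcontext(&caller, &ctx) == 0;

    free(ctx.uc_stack.ss_sp);  /* destroying the context */
    return ok;
}
```

Compare this with CreateFiber(), where the same pointer would simply be passed as the last argument.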

Ultimately I like the fiber API better. The first time I tried it out, I could guess my way through it without looking closely at the documentation.

Async / await with fibers

Why was I looking at the Fiber API? I’ve known about coroutines for years but I didn’t understand how they could be useful. Sure, the function can yield, but what other coroutine should it yield to? It wasn’t until I was recently bit by the async/await bug that I finally saw a “killer feature” that justified their use. Generators come pretty close, though.

Windows fibers are a coroutine primitive suitable for async/await in C programs, where it can also be useful. To prove that it’s possible, I built async/await on top of fibers in 95 lines of code.

The alternatives are to use a third-party coroutine library or to do it myself with some assembly programming. However, having it built into the operating system is quite convenient! It’s unfortunate that it’s limited to Windows. Ironically, though, everything I wrote for this article, including the async/await demonstration, was originally written on Linux using Mingw-w64 and tested using Wine. Only after I was done did I even try it on Windows.

Before diving into how it works, there’s a general concept about the Windows API that must be understood: All kernel objects can be in either a signaled or unsignaled state. The API provides functions that block on a kernel object until it is signaled. The two important ones are WaitForSingleObject() and WaitForMultipleObjects(). The latter behaves very much like poll(2) in POSIX.

Usually the signal is tied to some useful event, like a process or thread exiting, the completion of an I/O operation (i.e. asynchronous overlapped I/O), a semaphore being incremented, etc. It’s a generic way to wait for some event. However, instead of blocking the thread, wouldn’t it be nice to await on the kernel object? In my aio library for Emacs, the fundamental “wait” object was a promise. For this API it’s a kernel object handle.

So, the await function will take a kernel object, register it with the scheduler, then yield to the scheduler. The scheduler — which is a global variable, so there’s only one scheduler per process — looks like this:

struct {
    void *main_fiber;
    HANDLE handles[MAXIMUM_WAIT_OBJECTS];
    void *fibers[MAXIMUM_WAIT_OBJECTS];
    void *dead_fiber;
    int count;
} async_loop;

While fibers are symmetric, coroutines in my async/await implementation are not. One fiber is the scheduler, main_fiber, and the other fibers always yield to it.

There is an array of kernel object handles, handles, and an array of fibers. The elements in these arrays are paired with each other, but it’s convenient to store them separately, as I’ll show soon. fibers[0] is waiting on handles[0], and so on.

The array is a fixed size, MAXIMUM_WAIT_OBJECTS (64), because there’s a hard limit on the number of fibers that can wait at once. This pathetically small limitation is an unfortunate, hard-coded restriction of the Windows API. It kills most practical uses of my little library. Fortunately there’s no limit on the number of handles we might want to wait on, just the number of co-existing fibers.

When a fiber is about to return from its start routine, it yields one last time and registers itself on the dead_fiber member. The scheduler will delete this fiber as soon as it’s given control. Fibers never truly return since that would terminate the program.

With this, the await function, async_await(), is pretty simple. It registers the handle with the scheduler, then yields to the scheduler fiber.

void
async_await(HANDLE h)
{
    async_loop.handles[async_loop.count] = h;
    async_loop.fibers[async_loop.count] = GetCurrentFiber();
    async_loop.count++;
    SwitchToFiber(async_loop.main_fiber);
}

Caveat: The scheduler destroys this handle with CloseHandle() after it signals, so don’t try to reuse it. This made my demonstration simpler, but it might be better to not do this.

A fiber can exit at any time. Such an exit is inserted implicitly before a fiber actually returns:

void
async_exit(void)
{
    async_loop.dead_fiber = GetCurrentFiber();
    SwitchToFiber(async_loop.main_fiber);
}

The start routine given to async_start() is actually wrapped in the real start routine. This is how async_exit() is injected:

struct fiber_wrapper {
    void (*func)(void *);
    void *arg;
};

static void
fiber_wrapper(void *arg)
{
    struct fiber_wrapper *fw = arg;
    fw->func(fw->arg);
    async_exit();
}

int
async_start(void (*func)(void *), void *arg)
{
    if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
        return 0;
    } else {
        struct fiber_wrapper fw = {func, arg};
        SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
        return 1;
    }
}

The library provides a single awaitable function, async_sleep(). It creates a “waitable timer” object, starts the countdown, and returns it. (Notice how SetWaitableTimer() is a typically-ugly Win32 function with excessive parameters.)

HANDLE
async_sleep(double seconds)
{
    HANDLE promise = CreateWaitableTimer(0, 0, 0);
    LARGE_INTEGER t;
    t.QuadPart = (long long)(seconds * -10000000.0);
    SetWaitableTimer(promise, &t, 0, 0, 0, 0);
    return promise;
}
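The constant in async_sleep() comes from the units of the due time: SetWaitableTimer() counts in 100-nanosecond intervals, ten million per second, and a negative value means "relative to now" rather than an absolute time. Isolated into a hypothetical helper, the arithmetic is:

```c
/* Convert seconds into a relative due time for SetWaitableTimer():
 * a count of 100-nanosecond intervals, negated to mean "from now".
 * The helper name is hypothetical. */
long long relative_due_time(double seconds)
{
    return (long long)(seconds * -10000000.0);
}
```

So a 1.5 second sleep is requested as a due time of -15000000.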

A more realistic example would be overlapped I/O. For example, you’d open a file (CreateFile()) in overlapped mode, then when you, say, read from that file (ReadFile()) you create an event object (CreateEvent()), populate an overlapped I/O structure with the event, offset, and length, then finally await on the event object. The fiber will be resumed when the operation is complete.

Side note: Unfortunately overlapped I/O doesn’t work correctly for files, and many operations can’t be done asynchronously, like opening files. When it comes to files, you’re better off using dedicated threads as libuv does instead of overlapped I/O. You can still await on these operations. You’d just await on the signal from the thread doing synchronous I/O, not from overlapped I/O.

The most complex part is the scheduler, and it’s really not complex at all:

void
async_run(void)
{
    while (async_loop.count) {
        /* Wait for next event */
        DWORD nhandles = async_loop.count;
        HANDLE *handles = async_loop.handles;
        DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);

        /* Remove event and fiber from waiting array */
        void *fiber = async_loop.fibers[r];
        CloseHandle(async_loop.handles[r]);
        async_loop.handles[r] = async_loop.handles[nhandles - 1];
        async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
        async_loop.count--;

        /* Run the fiber */
        SwitchToFiber(fiber);

        /* Destroy the fiber if it exited */
        if (async_loop.dead_fiber) {
            DeleteFiber(async_loop.dead_fiber);
            async_loop.dead_fiber = 0;
        }
    }
}

This is why the handles are in their own array. The array can be passed directly to WaitForMultipleObjects(). The return value indicates which handle was signaled. The handle is closed, the entry removed from the scheduler, and then the fiber is resumed.
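The removal step is a swap-with-last on two parallel arrays. Here is that step in isolation as a portable sketch, with int standing in for HANDLE. The struct and function names are hypothetical, not part of the library.

```c
#define CAPACITY 64  /* stands in for MAXIMUM_WAIT_OBJECTS */

struct waitlist {
    int   handles[CAPACITY];  /* stand-ins for HANDLE values */
    void *fibers[CAPACITY];   /* fibers[i] waits on handles[i] */
    int   count;
};

/* Remove entry i by overwriting it with the last entry. Returns the
 * fiber that was paired with the removed handle. Order is not kept,
 * but the handles stay densely packed at the front of the array. */
void *waitlist_remove(struct waitlist *w, int i)
{
    void *fiber = w->fibers[i];
    w->count--;
    w->handles[i] = w->handles[w->count];
    w->fibers[i]  = w->fibers[w->count];
    return fiber;
}
```

Removal takes constant time, and the handles array never has gaps, so it can always be handed to WaitForMultipleObjects() as-is along with the current count.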

That WaitForMultipleObjects() is what limits the number of fibers. It’s not possible to wait on more than 64 handles at once! This is hard-coded into the API. How? A return value of 64 is an error code, and changing this would break the API. Remember what I said about being locked into bad design decisions of the past?

To be fair, WaitForMultipleObjects() was a doomed API anyway, just like select(2) and poll(2) in POSIX. It scales very poorly since the entire array of objects being waited on must be traversed on each call. That’s terribly inefficient when waiting on large numbers of objects. This sort of problem is solved by interfaces like kqueue (BSD), epoll (Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this particular problem well — awaiting on kernel objects — so I couldn’t use it.

When the awaiting fiber count is zero and the scheduler has control, all fibers must have completed and there’s nothing left to do. However, the caller can schedule more fibers and then restart the scheduler if desired.

That’s all there is to it. Have a look at demo.c to see how the API looks in some trivial examples. On Linux you can see it in action with make check. On Windows, you just need to compile it, then run it like a normal program. If there was a better function than WaitForMultipleObjects() in the Windows API, I would have considered turning this demonstration into a real library.

Blast from the Past: Borland C++ on Windows 98

2018-04-13T20:01:31Z

My first exposure to C and C++ was a little over 20 years ago. I remember it being some version of Borland C++, either 4.x or 5.x, running on Windows 95. I didn’t have a mentor, so I did the best I could slowly working through what was probably a poorly written beginner C++ book, typing out the examples and exercises with little understanding. Since I didn’t learn much from the experience, there was a 7 or 8 year gap before I’d revisit C and C++ in college.

I thought it would be interesting to revisit this software, to reevaluate it from a far more experienced perspective. Keep in mind that C++ wasn’t even standardized yet, and the most recent C standard was from 1989. Given this, what was it like to be a professional software developer using a Borland toolchain on Windows 20 years ago? Was it miserable, made bearable only by ignorance of how much better the tooling could be? Or maybe it actually wasn’t so bad, and these tools are better than I expect?

Ultimately my conclusion is that it’s a little bit of both. There are some significant capability gaps compared to today, but the core toolchain itself is actually quite reasonable, especially for the mid 1990s.

The setup

Before getting into the evaluation, let’s discuss how I got it all up and running. While it’s technically possible to run Windows 95 on a modern x86-64 machine thanks to the architecture’s extreme backwards compatibility, it’s more compatible, simpler, and safer to virtualize it. Most importantly, I can emulate older hardware that will have better driver support.

Despite that early start in Windows all those years ago, I’m primarily a Linux user. The premier virtualization solution on Linux these days is KVM, a kernel module that turns Linux into a hypervisor and makes efficient use of hardware virtualization extensions. Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m using QEmu (with KVM disabled), a hardware emulator closely associated with KVM. Since it doesn’t take advantage of hardware virtualization extensions, it will be slower. This is fine since my goal is to emulate slow, 20+ year old hardware anyway.

There’s very little practical difference between Windows 95 and Windows 98. Since Windows 98 runs a lot smoother virtualized, I decided to go with that instead. This will be perfectly sufficient for my toolchain evaluation.

Software

To get started, I’ll need an installer for Windows 98. I thought this would be difficult to find, but there’s a copy available on the Internet Archive. I don’t know how “legitimate” this is, but it works. Since it’s running in a virtual machine without network access, I also don’t really care if this copy is somehow infected with malware.

Internet Archive: Windows 98 Second Edition

Also on the Internet Archive is a complete copy of Borland C++ 5.02, with the same caveats of legitimacy. It works, which is good enough for my purposes.

Internet Archive: Borland C++ 5.02

Thank you Internet Archive!

Hardware

I’ve got my software, now to set up the virtualized hardware. First I create a drive image:

$ qemu-img create -f qcow2 win98.img 8G

I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a virtual hard drive with modern sizes would probably break the installer. This sort of issue is a common theme among old software, where there may be complaints about negative available disk space due to signed integer overflow.

I decided to give the machine 256MB of memory (-m 256). This is also a little excessive, but I wanted to be sure memory didn’t limit Borland’s capabilities. This amount of memory is close to the upper bound, and going much beyond will likely cause problems with Windows 98.

For the CPU I settled on a Pentium (-cpu pentium). My original goal was to go a little simpler with a 486 (-cpu 486), but the Windows 98 installer kept crashing when I tried this.

I experimented with different configurations for the network card, but I couldn’t get anything to work. So I’ve disabled networking (-net none). The only reason I’d want this is that it would be easier to move files in and out of the virtual machine.

Finally, here’s how I ran QEmu. The last two lines are only needed when installing.

$ qemu-system-x86_64 \
    -localtime \
    -cpu pentium \
    -no-acpi \
    -no-hpet \
    -m 256 \
    -hda win98.img \
    -soundhw sb16 \
    -vga cirrus \
    -net none \
    -cdrom "Windows 98 Second Edition.iso" \
    -boot d

Installation

Installation is just a matter of following the instructions. You’ll need that product key listed on the Internet Archive site.

That copy of Borland is just a big .zip file. This presents two problems.

Without network access, I’ll need to figure out how to get this inside the virtual machine.
This version of Windows doesn’t come with software to unzip this file. I’d need to find and install an unzip tool first.

Fortunately I can kill two birds with one stone by converting that .zip archive into a .iso and mounting it in the virtual machine.

unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"

Then in the QEmu console (C-A-2) I attach it:

change ide1-cd0 borland.iso

This little trick of generating .iso files and mounting them is how I will be moving all the other files into the virtual machine.

Borland C++

The first thing I did was play around with the Borland IDE. This is what I would have been using 20 years ago.

Despite being Borland C++, I’m personally most interested in its ANSI C compiler. As I already pointed out, this software pre-dates C++’s standardization, and a lot has changed over the past two decades. On the other hand, C hasn’t really changed all that much. The 1999 update to the C standard (i.e. “C99”) was big and important, but otherwise little has changed. The biggest drawback of working without C99 is the lack of “declare anywhere” variables, including in for-loop initializers. Otherwise it’s the same as writing C today.

To test drive the IDE, I made a couple of test projects, built and ran them with different options, and poked around with the debugger. The debugger is actually pretty decent, especially for the 1990s. It can be operated via the IDE or standalone, so I could use it without firing up the IDE and making a project.

The toolchain includes an assembler, and I can inspect the compiler’s assembly output. To nobody’s surprise this is Intel-flavored assembly, which is very welcome. Imagining myself as a software developer in the mid 1990s, this means I can see exactly what the compiler’s doing as well as write some of the performance sensitive parts in assembly if necessary.

The built-in editor is the worst part of the IDE, which is unfortunate since it really spoils the whole experience. It’s easy to jump between warnings and errors, it has incremental search, and it has good syntax highlighting. But these are the only positive things I can say about it. If I had to work with this editor full-time, I’d spend my days pretty irritated.

Switch to command line tools

Like with the debugger, the Borland people did a good job modularizing their development tools. As part of the installation process, all of the Borland command line tools are added to the system PATH (reminder: this is a single-user system). This includes compiler, linker, assembler, debugger, and even an incomplete implementation of make.

With this, I can essentially pretend the IDE doesn’t exist and replace that crummy editor with something better: Vim.

The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3, released in 2010. I download those binaries, trim a few things from my .vimrc, and smuggle it all into my virtual machine via a virtual CD. I’ve now got a powerful text editor in Windows 98 and my situation has drastically improved.

Since I hardly use features added since Vim 7.3, this feels right at home to me. I can invoke the build from Vim, and it can populate the quickfix list from Borland’s output, so I could actually be fairly productive in these circumstances! I’m honestly really impressed with how well this all works together.

At this point I only have two significant annoyances:

Borland’s command line tools belong to that category of irritating programs that print their version banner on every invocation. There’s not even a command line switch to turn this off. All this noise is quickly tiresome. The Visual Studio toolchain does the same thing by default, though it can be turned off (-nologo). I dislike that some GNU tools also commit this sin, but at least GNU limits this to interactive programs.
The Windows/DOS command shell and console is even worse than it is today. I didn’t think that was possible. This is back when it was still genuinely DOS and not just pretending to be (e.g. in NT). The worst part by far is the lack of command history. There’s no using the up-arrow to get previous commands. There’s no tab completion. Forward slash is not a substitute for backslash in paths. If I wanted to improve my productivity, replacing this console and shell would be the first priority.

Update: In an email, Aristotle Pagaltzis informed me that Windows 98 comes with DOSKEY.COM, which provides command history for COMMAND.COM. Alternatively there’s Enhanced DOSKEY.com, an open source, alternative implementation that also provides tab completion for commands and filenames. This makes the console a lot more usable (and, honestly, in some ways better than the modern defaults).

Building Enchive with Borland

Last year I wrote a backup encryption tool called Enchive, and I still use it regularly. One of my design goals was high portability since it may be needed to decrypt something important in the distant future. It should be as bit-rot-proof as possible. In software, the best way to future-proof is to past-proof.

If I had a time machine that could send source code back in time, and I sent Enchive to a competent developer 20 years ago, would they be able to compile it and run it? If the answer is yes, then that means Enchive already has 20 years of future-proofing built into it.

To accomplish this, Enchive is 3,300 lines of strict ANSI C, 1989-style, with no dependencies other than the C standard library and a handful of operating system functions — e.g. functionality not in the C standard library. In practice, any ANSI C compiler targeting either POSIX, or Windows 95 or later, should be able to compile it.

My Windows 98 virtual machine includes an ANSI C compiler, and can be used to simulate this time machine. I generated an “amalgamation” build (make amalgamation) — essentially a concatenation of all the source files — and sent this into the virtual machine. Before Borland was able to compile it, I needed to make three small changes.

First, Enchive includes stdint.h to get fixed-width integers needed for the encryption routines. This header comes from C99, and C89 has no equivalent. I anticipated this problem from the beginning and made it easy for the person performing the build to correct it. This header is included exactly once, in config.h, and this is placed at the top of the amalgamation build. The include only needs to be replaced with a handful of manual typedefs. For Borland that looks like this:

typedef unsigned char    uint8_t;
typedef unsigned short   uint16_t;
typedef unsigned long    uint32_t;
typedef unsigned __int64 uint64_t;

typedef long             int32_t;
typedef __int64          int64_t;

#define INT8_C(n)   (n)
#define INT16_C(n)  (n)
#define INT32_C(n)  (n##L)
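For anyone who would rather not hard-code the base types, C89 does provide limits.h, which is enough to select a 32-bit type and check it at compile time. The following is only a sketch of that alternative, not what Enchive does. The name my_uint32 is hypothetical (chosen so it cannot collide with a real stdint.h), and it assumes 8-bit bytes.

```c
#include <limits.h>

/* Pick a 32-bit unsigned type using only C89 facilities. */
#if UINT_MAX == 0xffffffffUL
typedef unsigned int  my_uint32;  /* hypothetical name */
#elif ULONG_MAX == 0xffffffffUL
typedef unsigned long my_uint32;
#else
#  error "no 32-bit unsigned type found"
#endif

/* Compile-time check: if the width is wrong the array size is -1,
 * which is a compile error even for a 1989-style compiler. */
typedef char my_uint32_is_4_bytes[sizeof(my_uint32) == 4 ? 1 : -1];
```

On a 32-bit compiler such as Borland’s, either branch would yield a 32-bit type. On a typical 64-bit Linux toolchain only the first does, since long is 64 bits there.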

Second, in more recent versions of Windows, GetFileAttributes() can return the value INVALID_FILE_ATTRIBUTES. Checking for an error that cannot happen is harmless, but this value isn’t defined in Borland’s SDK. I only had to eliminate that check.

Third, the CryptGenRandom() interface isn’t defined in Borland’s SDK. This is used by Enchive to generate keys. MSDN reports this function wasn’t available until Windows XP, but it’s definitely there in Windows 98, exported by ADVAPI32.dll. I’m able to call it, though it always reports an error. Perhaps it’s been disabled in this version due to cryptographic export restrictions?

Regardless of what’s wrong, I ripped this out and replaced it with a fatal error. This version of Enchive can’t generate new keys — unless derived from a passphrase — nor encrypt files, including the use of a protection key to encrypt the secret key. However, it can decrypt files, which is the important part that needs to be future-proofed.

With these three changes, which took me about 10 minutes to sort out, Enchive builds and runs, and it correctly decrypts files I encrypted on Linux. So Enchive has at least 20 years of past-proofing! The screenshot at the top of this article shows it running successfully in an MS-DOS console window.

What’s wrong? What’s missing?

I mentioned that there were some gaps. The most obvious is the lack of the standard POSIX utilities, especially a decent shell. I don’t know if any had been ported to Windows in the mid 1990s. But that could be solved one way or another without too much trouble, even if it meant doing some of that myself.

No, the biggest capability I’d miss, and which wouldn’t be easily obtained, is Git, or at least a decent source control system. I really don’t want to work without proper source control. Git’s support for Windows is second tier, and the port to modern Windows is already a bit of a hack. Getting it to run in Windows 98 would probably be a challenge, especially if I had to compile it with Borland.

The other major issue is the lack of stability. In this experiment, I’ve been seeing this screen a lot:

I remember Windows crashing a lot back in those days, and it certainly had a bad reputation for being unstable, but this is far worse than I remembered. While the hardware emulator may be somewhat at fault here, keep in mind that I never installed third party drivers. Most of these crashes are Windows’ fault. I found I can reliably bring the whole system down with a single GetProcAddress() call on a system DLL. The only way I can imagine this instability was so tolerated back then was general ignorance that computing could be so much better.

I was tempted to write this article in Vim on Windows 98, but all this crashing made me too nervous. I didn’t want some stupid filesystem corruption to wipe out my work. Too risky.

A better alternative

If I was stuck working in Windows 98 — or was at least targeting it as a platform — but had access to a modern tooling ecosystem, could I do better than Borland? Yes! Programs built by Mingw-w64 can be run even as far back as Windows 95.

Now, there’s a catch. I thought it would be this simple:

$ i686-w64-mingw32-gcc -Os hello.c

But when I brought the resulting binary into the virtual machine it crashed when I ran it: illegal instruction. Turns out it contained a conditional move (cmov) which is an instruction not available until the Pentium Pro (686). The “pentium” emulation is just a 586.

I tried to disable cmov by picking the specific architecture:

$ i686-w64-mingw32-gcc -march=pentium -Os hello.c

This still didn’t work because the statically-linked part of the CRT contained the cmov. I’d have to recompile that as well.

I could have switched the QEmu options to “upgrade” to a Pentium Pro, but remember that my goal was really the 486. Fortunately this was easy to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number of times before, so I knew it wouldn’t be difficult.

I could go step by step, but it’s all fairly well documented in the Mingw-w64 “howto-build” document. I used GCC 7.3 (the latest version), and for the target I picked “i486-w64-mingw32”. When it was done I could compile binaries on Linux to run in my Windows 98 virtual machine:

$ i486-w64-mingw32-gcc -Os hello.c

This should enable quite a bit of modern software to run inside my virtual machine if I so wanted. I didn’t actually try this (yet?), but, to take this concept all the way, I could use this cross-compiler to cross-compile Mingw-w64 itself to run inside the virtual machine, directly replacing Borland C++.

And the only thing I’d miss about Borland is its debugger.

Initial Evaluation of the Windows Subsystem for Linux

2017-11-30T21:03:53Z

Recently I had my first experiences with the Windows Subsystem for Linux (WSL), evaluating its potential as an environment for getting work done. This subsystem, introduced to Windows 10 in August 2016, allows Windows to natively run x86 and x86-64 Linux binaries. It’s essentially the counterpart to Wine, which allows Linux to natively run Windows binaries.

WSL interfaces with Linux programs only at the kernel level, servicing system calls the same way the Linux kernel would. The subsystem’s main job is translating Linux system calls into NT requests. There’s a series of articles about its internals if you’re interested in learning more.

I was honestly impressed by how well this all works, especially since Microsoft has long had an affinity for producing flimsy imitations (Windows console, PowerShell, Arial, etc.). WSL’s design allows Microsoft to dump an Ubuntu system wholesale inside Windows — and, more recently, other Linux distributions — bypassing a bunch of annoying issues, particularly in regards to glibc.

WSL processes can exec(2) Windows binaries, which then run under their appropriate subsystem, similar to binfmt on Linux. In theory this nice interop should allow for some automation Linux-style even for Windows’ services and programs. More on that later.

There are some notable issues, though.

Lack of device emulation

No soundcard devices are exposed to the subsystem, so Linux programs can’t play sound. There’s a hack to talk PulseAudio with a Windows process that can access the sound hardware, but that’s about it. Generally there’s not much reason to be playing media or games under WSL, but this can be an annoyance if you’re, say, writing software that synthesizes audio.

Really, there’s almost no device emulation at all and /proc is pretty empty. You won’t see hard drives or removable media under /dev, nor will you see USB devices like webcams and joysticks. A lot of the useful things you might do on a Linux system aren’t available under WSL.

No Filesystem in Userspace (FUSE)

Microsoft hasn’t implemented any of the system calls for FUSE, so don’t expect to use your favorite userspace filesystems. The biggest loss for me is sshfs, which I use frequently.

If FUSE was supported, it would be interesting to see how the rest of Windows interacts with these mounted filesystems, if at all.

Fragile services

Services running under WSL are flaky. The big issue is that when the initial WSL shell process exits, all WSL processes are killed and the entire subsystem is torn down. This includes any services that are running. That’s certainly surprising to anyone with experience running services on any kind of unix system. This is probably the worst part of WSL.

While systemd is the standard for Linux these days and may even be “installed” in the WSL virtual filesystem, it’s not actually running and you can’t use systemctl to interact with services. Services can only be controlled the old fashioned way, and, per above, that initial WSL console window has to remain open while services are running.

That’s a bit of a damper if you’re intending to spend a lot of time remotely SSHing into your Windows 10 system. So yes, it’s trivial to run an OpenSSH server under WSL, but it won’t feel like a proper system service.

Limited graphics support

WSL doesn’t come with an X server, so you have to supply one separately (Xming, etc.) that runs outside WSL, as a normal Windows process. WSL processes can connect to that server (DISPLAY) allowing you to run most Linux graphical software.

However, this means there’s no hardware acceleration. There will be no GLX extensions available. If your goal is to run the Emacs or Vim GUIs, that’s not a big deal, but it might matter if you were interested in running a browser under WSL. It also means it’s not a suitable environment for developing software using OpenGL.

Filesystem woes

The filesystem manages to be both one of the smallest issues as well as one of the biggest.

Filename translation

On the small issue side is filename translation. Under most Linux filesystems — and even more broadly for unix — a filename is just a bytestring. They’re not necessarily UTF-8 or any other particular encoding, and that’s partly why filenames are case-sensitive — the meaning of case depends on the encoding.

However, Windows uses a pseudo-UTF-16 scheme for filenames, incompatible with bytestrings. Since WSL lives within a Windows filesystem, there must be some bijection between bytestring filenames and pseudo-UTF-16 filenames. It will also have to reject filenames that can’t be mapped. WSL does both.

I couldn’t find any formal documentation about how filename translation works, but most of it can be reverse engineered through experimentation. In practice, Linux filenames are UTF-8 encoded strings, and WSL’s translation takes advantage of this. Filenames are decoded as UTF-8 and re-encoded as UTF-16 for Windows. Any byte that doesn’t decode as valid UTF-8 is silently converted to REPLACEMENT CHARACTER (U+FFFD), and decoding continues from the next byte.
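
The decoding rule I observed can be sketched in C. This is only my reading of the behavior, not WSL’s actual code: the function names are invented, and treating overlong forms and encoded surrogates as invalid is an assumption on my part.

```c
#include <stddef.h>

#define REPLACEMENT 0xFFFDUL

/* Decode one code point from s[0..len), len > 0. Stores it in *cp and
 * returns the number of bytes consumed. An invalid sequence consumes
 * exactly one byte and yields U+FFFD, so decoding resumes at the very
 * next byte. */
static size_t utf8_decode_one(const unsigned char *s, size_t len,
                              unsigned long *cp)
{
    /* Smallest code point allowed for each sequence length. */
    static const unsigned long min[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    unsigned long c;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else                            { *cp = REPLACEMENT; return 1; }

    if (n > len) { *cp = REPLACEMENT; return 1; }
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = REPLACEMENT; return 1; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Assumption: overlong forms, surrogates, and values past U+10FFFF
     * are all treated as invalid. */
    if (c < min[n] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
        *cp = REPLACEMENT;
        return 1;
    }
    *cp = c;
    return n;
}

/* Decode a whole filename. Returns the number of code points written
 * to out, which is at most cap. */
size_t utf8_decode(const unsigned char *s, size_t len,
                   unsigned long *out, size_t cap)
{
    size_t n = 0;
    while (len > 0 && n < cap) {
        size_t used = utf8_decode_one(s, len, &out[n++]);
        s += used;
        len -= used;
    }
    return n;
}
```

Feeding it the bytes a, 0xFF, b yields U+0061, U+FFFD, U+0062. Note that the one-byte names 0xFF and 0xFE both decode to a lone U+FFFD, so two distinct Linux filenames land on the same translated name.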

I wonder if there are security consequences for different filenames silently mapping to the same underlying file.

Exercise for the reader: How is an unmatched surrogate half from Windows translated to WSL, where it doesn’t have a UTF-8 equivalent? I haven’t tried this yet.

Even for valid UTF-8, there are many bytes that most Linux filesystems allow in filenames that Windows does not. This ranges from simple things like ASCII backslash and colon — special components of Windows’ paths — to unusual characters like newlines, escape, and other ASCII control characters. There are two different ways these are handled:

The C drive is available under /mnt/c, and WSL processes can access regular Windows files under this “mountpoint.” Attempting to access filenames with invalid characters under this mountpoint always results in ENOENT: “No such file or directory.”
Outside of /mnt/c is WSL territory, and Windows processes aren’t supposed to touch these files. This allows for more freedom when translating filenames. REPLACEMENT CHARACTER is still used for invalid UTF-8 sequences, but the forbidden characters, including backslashes, are all permitted. They’re translated to #XXXX where X is hexadecimal for the normally invalid character. For example, a:b becomes a#003Ab.
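
The #XXXX escaping is simple enough to sketch. This is a guess at the shape of the transformation based on my experiments, not the real implementation: escape_name and needs_escape are made-up names, and the exact set of escaped characters is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Assumed set of characters to escape: ASCII control characters plus a
 * guess at the usual Windows-forbidden punctuation. The real WSL list
 * may differ. */
static int needs_escape(unsigned char c)
{
    return c < 0x20 || strchr("\\:*?\"<>|", c) != 0;
}

/* Writes the escaped form of name into out, which has room for cap
 * bytes including the terminator. Returns 0 on success, -1 if out is
 * too small. */
int escape_name(const char *name, char *out, size_t cap)
{
    size_t n = 0;
    for (; *name; name++) {
        unsigned char c = (unsigned char)*name;
        if (needs_escape(c)) {
            if (cap - n < 6)            /* "#XXXX" plus terminator */
                return -1;
            sprintf(out + n, "#%04X", (unsigned)c);
            n += 5;
        } else {
            if (cap - n < 2)            /* one byte plus terminator */
                return -1;
            out[n++] = (char)c;
        }
    }
    if (n >= cap)
        return -1;
    out[n] = 0;
    return 0;
}
```

With this, escape_name("a:b", ...) produces a#003Ab, matching what I saw. I haven’t checked what WSL does with a literal # in a filename, so the sketch passes it through unchanged.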

While WSL doesn’t let you get away with all the crazy, ill-advised filenames that Linux allows, it’s still quite reasonable. Since Windows and Linux filenames aren’t entirely compatible, there’s going to be some trade-off no matter how this translation is done.

Filesystem performance

On the other hand, filesystem performance is abysmal, and I doubt the subsystem is to blame. This isn’t a surprise to anyone who’s used moderately-sized Git repositories on Windows, where the large numbers of loose files bring things to a crawl. This has been a Windows issue for years, and that’s even before you start plugging in the typical “security” services (virus scanners, whitelists, etc.) that are usually present on a Windows system and make this even worse.

To test out WSL, I went around my normal business compiling tools and making myself at home, just as I would on Linux. Doing nearly anything in WSL was noticeably slower than doing the same on Linux on the exact same hardware. I didn’t run any benchmarks, but I’d expect to see around an order of magnitude difference on average for filesystem operations. Building LLVM and Clang took a couple hours rather than the typical 20 minutes.

I don’t expect this issue to get fixed anytime soon, and it’s probably always going to be a notable limitation of WSL.

So is WSL useful?

One of my hopes for WSL appears to be unfeasible. I thought it might be a way to avoid porting software from POSIX to Win32. I could just supply Windows users with the same Linux binary and they’d be fine. However, WSL requires switching Windows into a special “developer mode,” putting it well out of reach of the vast majority of users, especially considering the typical corporate computing environment that will lock this down. In practice, WSL is only useful to developers. I’m sure this is no accident. (Developer mode is no longer required as of October 2017.)

Mostly I see WSL as a Cygwin killer. Unix is my IDE and, on Windows, Cygwin has been my preferred go to for getting a solid unix environment for software development. Unlike WSL, Cygwin processes can make direct Win32 calls, which is occasionally useful. But, in exchange, WSL will overall be better equipped. It has native Linux tools, including a better suite of debugging tools — even better than you get in Windows itself — Valgrind, strace, and properly-working GDB (always been flaky in Cygwin). WSL is not nearly as good as actual Linux, but it’s better than Cygwin if you can get access to it.

How to Write Portable C Without Complicating Your Build

2017-03-30T04:06:58Z

Suppose you’re writing a non-GUI C application intended to run on a number of operating systems: Linux, the various BSDs, macOS, classical unix, and perhaps even something as exotic as Windows. It might sound like a rather complicated problem. These operating systems have slightly different interfaces (or very different in one case), and they run different variants of the standard unix tools — a problem for portable builds.

With some up-front attention to detail, this is actually not terribly difficult. Unix-like systems are probably the least diverse and least buggy they’ve ever been. Writing portable code is really just a matter of coding to the standards and ignoring extensions unless absolutely necessary. Knowing what’s standard and what’s extension is the tricky part, but I’ll explain how to find this information.

You might be tempted to reach for an overly complicated solution such as GNU Autoconf. Sure, it creates a configure script with the familiar, conventional interface. This has real value. But do you really need to run a single-threaded gauntlet of hundreds of feature/bug tests for things that sometimes worked incorrectly in some weird unix variant back in the 1990s? On a machine with many cores (parallel build, -j), this may very well be the slowest part of the whole build process.

For example, the configure script for Emacs checks that the compiler supplies stdlib.h, string.h, and getenv — things that were standardized nearly 30 years ago. It also checks for a slew of POSIX functions that have been standard since 2001.

There’s a much easier solution: Document that the application requires, say, C99 and POSIX.1-2001. It’s the responsibility of the person building the application to supply these implementations, so there’s no reason to waste time testing for it.

How to code to the standards

Suppose there’s some function you want to use, but you’re not sure if it’s standard or an extension. Or maybe you don’t know what standard it comes from. Luckily the man pages document this stuff very well, especially on Linux. Check the friendly “CONFORMING TO” section. For example, look at getenv(3). Here’s what that section has to say:

CONFORMING TO
    getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.

    secure_getenv() is a GNU extension.

This says this function comes from the original C standard. It’s always available on anything that claims to be a C implementation. The man page also documents secure_getenv(), which is a GNU extension: to be avoided in anything intended to be portable.

What about sleep(3)?

CONFORMING TO
    POSIX.1-2001.

This function isn’t part of standard C, but it’s available on any system claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the program needs to run on an operating system not implementing this POSIX standard (i.e. Windows), you’ll need to call an alternative function, probably inside a different #if .. #endif branch. More on this in a moment.

If you’re coding to POSIX, you must define the _POSIX_C_SOURCE feature test macro to the standard you intend to use prior to any system header includes:

A POSIX-conforming application should ensure that the feature test macro _POSIX_C_SOURCE is defined before inclusion of any header.

For example, to properly access POSIX.1-2001 functions in your application, define _POSIX_C_SOURCE to 200112L. With this defined, it’s safe to assume access to all of C and everything from that standard of POSIX. You can do this at the top of your sources, but I personally like the tidiness of a global config.h that gets included before everything.
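
As a rough illustration, here is what the top of such a config.h-style preamble might look like, followed by a function that relies on it. The page_size() helper is a made-up example; sysconf() and _SC_PAGESIZE are both part of POSIX.1-2001.

```c
/* config.h-style preamble: this must come before any system header,
 * since the headers consult the macro when they are first included. */
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200112L
#endif

#include <unistd.h>

/* Hypothetical helper: returns the memory page size in bytes, or -1 if
 * the system cannot report it. */
long page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

The #ifndef guard lets the person building the program request a different level from the command line (for example with -D_POSIX_C_SOURCE=200809L) without a redefinition warning.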

How to create a portable build

So you’ve written clean, portable C to the standards. How do you build this application? The natural choice is make. It’s available everywhere and it’s part of POSIX.

Again, the tricky part is teasing apart the standard from the extension. I’m a long-time sinner in this regard, having far too often written Makefiles that depend on GNU Make extensions. This is a real pain when building programs on systems without the GNU utilities. I’ve been making amends (and finding some bugs as a result).

No implementation makes the division clear in its documentation, and especially don’t bother looking at the GNU Make manual. Your best resource is the standard itself. If you’re already familiar with make, coding to the standard is largely a matter of unlearning the various extensions you know.

Outside of some hacks, this means you don’t get conditionals (if, else, etc.). With some practice, both with sticking to portable code and writing portable Makefiles, you’ll find that you don’t really need them. Following the macro conventions will cover most situations. For example:

CC: the C compiler program
CFLAGS: flags to pass to the C compiler
LDFLAGS: flags to pass to the linker (via the C compiler)
LDLIBS: libraries to pass to the linker

You don’t need to do anything weird with the assignments. The user invoking make can override them easily. For example, here’s part of a Makefile:

CC     = c99
CFLAGS = -Wall -Wextra -Os

But the user wants to use clang, and their system needs to explicitly link -lsocket (e.g. Solaris). The user can override the macro definitions on the command line:

$ make CC=clang LDLIBS=-lsocket

The same rules apply to the programs you invoke from the Makefile. Read the standards documents and ignore your system’s man pages so as to avoid accidentally using an extension. It’s especially valuable to learn the Bourne shell language and avoid any accidental bashisms in your Makefiles and scripts. The dash shell is good for testing your scripts.

Makefiles conforming to the standard will, unfortunately, be more verbose than those taking advantage of a particular implementation. If you know how to code Bourne shell — which is not terribly difficult to learn — then you might even consider hand-writing a configure script to generate the Makefile (a la metaprogramming). This gives you a more flexible language with conditionals, and, being generated, redundancy in the Makefile no longer matters.

As someone who frequently dabbles with BSD systems, my life has gotten a lot easier since learning to write portable Makefiles and scripts.

But what about Windows

It’s the elephant in the room and I’ve avoided talking about it so far. If you want to build with Visual Studio’s command line tools — something I do on occasion — build portability goes out the window. Visual Studio has nmake.exe, which nearly conforms to POSIX make. However, without the standard unix utilities and with the completely foreign compiler interface for cl.exe, there’s absolutely no hope of writing a Makefile portable to this situation.

The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the unix utilities, though it has the problem of linking against msvcrt.dll. Another option is a separate Makefile dedicated to nmake.exe and the Visual Studio toolchain. Good luck defining a correctly working “clean” target with del.exe.

My preferred approach lately is an amalgamation build (as seen in Enchive): Carefully concatenate all the application’s sources into one giant source file. First concatenate all the headers in the right order, followed by all the C files. Use sed to remove any local includes. You can do this all on a unix system with the nice utilities, then point cl.exe at the amalgamation for the Visual Studio build. It’s not very useful for actual development (i.e. you don’t want to edit the amalgamation), but that’s what MinGW-w64 resolves.

What about all those POSIX functions? You’ll need to find Win32 replacements on MSDN. I prefer to do this by abstracting those operating system calls. For example, compare POSIX sleep(3) and Win32 Sleep().

#if defined(_WIN32)
#include <windows.h>

void
my_sleep(int s)
{
    Sleep(s * 1000);  // TODO: handle overflow, maybe
}

#else /* __unix__ */
#include <unistd.h>

void
my_sleep(int s)
{
    sleep(s);  // TODO: fix signal interruption
}
#endif

Then the rest of the program calls my_sleep(). There’s another example in the OpenMP article with pwrite(2) and WriteFile(). This demonstrates that supporting a bunch of different unix-like systems is really easy, but introducing Windows portability adds a disproportionate amount of complexity.
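
For what it’s worth, the signal interruption TODO on the POSIX side has a fairly standard answer: sleep() returns the number of seconds left unslept when a signal handler cuts it short, so the wrapper can loop until nothing remains. A minimal sketch, where my_sleep_full is a hypothetical name and the int return value is my own addition:

```c
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200112L
#endif
#include <unistd.h>

/* Sleep for s seconds even if signal handlers run in the meantime.
 * sleep() returns the unslept seconds when interrupted, so keep calling
 * it until nothing is left. Returns 0, or -1 for a negative argument. */
int my_sleep_full(int s)
{
    unsigned left;
    if (s < 0)
        return -1;
    left = (unsigned)s;
    while (left > 0)
        left = sleep(left);
    return 0;
}
```

Since sleep() only reports whole seconds, each interruption can shift the total duration by a fraction of a second. That’s fine for a coarse delay, but anything needing precision probably wants a different primitive.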

Caveat: paths and filenames

There’s one major complication with filenames for applications portable to Windows. In the unix world, filenames are null-terminated bytestrings. Typically these are Unicode strings encoded as UTF-8, but it’s not necessarily so. The kernel just sees bytestrings. A bytestring doesn’t necessarily have a formal Unicode representation, which can be a problem for languages that want filenames to be Unicode strings (also).

On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up being neither. They’re really null-terminated unsigned 16-bit integer arrays. It’s almost UTF-16 except that Windows allows unpaired surrogates. This means Windows filenames also don’t have a formal Unicode representation, but in a completely different way than unix. Some heroic efforts have gone into working around this issue.
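
To make the “almost UTF-16” part concrete, here’s a small sketch that detects the case Windows permits but UTF-16 forbids. The function name is invented, and unsigned short stands in for the 16-bit units of a Windows filename, which assumes a 16-bit short.

```c
#include <stddef.h>

/* Returns 1 if the zero-terminated 16-bit string contains a surrogate
 * without its partner, meaning it is not valid UTF-16. Returns 0 if
 * every surrogate is properly paired. */
int has_unpaired_surrogate(const unsigned short *s)
{
    size_t i;
    for (i = 0; s[i]; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {         /* high half */
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 1;                               /* no low half */
            i++;                                        /* skip low half */
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {  /* stray low half */
            return 1;
        }
    }
    return 0;
}
```

A filename for which this returns 1 has no UTF-8 equivalent, so a portable program must decide up front whether to reject it, mangle it, or carry it around in some private encoding.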

As a result, it’s highly non-trivial to correctly support all possible filenames on both systems in the same program, especially when they’re passed as command line arguments.

Summary

The key points are:

Document the standards your application requires and strictly stick to them.
Ignore the vendor documentation if it doesn’t clearly delineate extensions.

This was all a discussion of non-GUI applications, and I didn’t really touch on libraries. Many libraries are simple to access in the build (just add it to LDLIBS), but some libraries — GUIs in particular — are particularly complicated to manage portably and will require a more complex solution (pkg-config, CMake, Autoconf, etc.).

OpenMP and pwrite()

2017-03-01T21:22:24Z

The most common way I introduce multi-threading to small C programs is with OpenMP (Open Multi-Processing). It’s typically used as compiler pragmas to parallelize computationally expensive loops — iterations are processed by different threads in some arbitrary order.

Here’s an example that computes the frames of a video in parallel. Despite being computed out of order, each frame is written in order to a large buffer, then written to standard output all at once at the end.

size_t size = sizeof(struct frame) * num_frames;
struct frame *output = malloc(size);
float beta = DEFAULT_BETA;

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    float theta = compute_theta(i);
    compute_frame(&output[i], theta, beta);
}

write(STDOUT_FILENO, output, size);
free(output);

Adding OpenMP to this program is much simpler than introducing low-level threading semantics with, say, Pthreads. With care, there’s often no need for explicit thread synchronization. It’s also fairly well supported by many vendors, even Microsoft (up to OpenMP 2.0), so a multi-threaded OpenMP program is quite portable without #ifdef.

There’s real value in this pragma API: The above example would still compile and run correctly even when OpenMP isn’t available. The pragma is ignored and the program just uses a single core like it normally would. It’s a slick fallback.
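
Here’s a smaller, self-contained sketch of that fallback. The function names are invented for illustration. Built with OpenMP enabled (e.g. gcc -fopenmp) the loop is split across threads, and built without it the very same source is an ordinary serial loop.

```c
/* Sum of i*i for 0 <= i < n. The reduction clause gives each thread a
 * private partial sum and combines them at the end. Without OpenMP the
 * pragma is ignored and the loop simply runs on one thread. */
long sum_squares(int n)
{
    long sum = 0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += (long)i * i;
    return sum;
}

/* Reports whether the compiler had OpenMP enabled: the standard
 * _OPENMP macro is only defined in that case. */
int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

Either way sum_squares(10) is 285. Some compilers warn about the unrecognized pragma when OpenMP is disabled, but it’s only a warning.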

When a program really does require synchronization there’s omp_lock_t (mutex lock) and the expected set of functions to operate on them. This doesn’t have the nice fallback, so I don’t like to use it. Instead, I prefer #pragma omp critical. It nicely maintains the OpenMP-unsupported fallback.

/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    struct frame *frame = malloc(sizeof(*frame));
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    #pragma omp critical
    {
        write(STDOUT_FILENO, frame, sizeof(*frame));
    }
    free(frame);
}

This would append the output to some output file in an arbitrary order. The critical section prevents interleaving of outputs.

There are a couple of problems with this example:

Only one thread can write at a time. If the write takes too long, other threads will queue up behind the critical section and wait.
The output frames will be out of order, which is probably inconvenient for consumers. If the output is seekable this can be solved with lseek(), but that only makes the critical section even more important.

There’s an easy fix for both, and it eliminates the need for a critical section: POSIX pwrite().

ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

It’s like write() but has an offset parameter. Unlike lseek() followed by a write(), multiple threads and processes can, in parallel, safely write to the same file descriptor at different file offsets. The catch is that the output must be a file, not a pipe.

#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
    size_t size = sizeof(struct frame);
    struct frame *frame = malloc(size);
    float theta = compute_theta(i);
    compute_frame(frame, theta, beta);
    pwrite(STDOUT_FILENO, frame, size, size * i);
    free(frame);
}

There’s no critical section, the writes can interleave, and the output is in order.
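
To see the ordering property in isolation, without OpenMP or frames, consider this little sketch. It’s a made-up demo, not code from the real program: it writes fixed-size records to a temporary file in reverse order with pwrite(), then reads the file back to confirm the records sit in index order.

```c
#ifndef _XOPEN_SOURCE
#  define _XOPEN_SOURCE 600  /* pwrite()/pread() are XSI in POSIX.1-2001 */
#endif
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NRECORDS 4
#define RECSIZE  8

/* Writes records 3, 2, 1, 0 (in that order) at their own offsets, then
 * reads the whole file back. Returns 1 if the file content is in index
 * order, 0 on any failure. */
int pwrite_demo(void)
{
    FILE *tmp = tmpfile();
    char record[RECSIZE];
    char readback[NRECORDS * RECSIZE];
    char expect[NRECORDS * RECSIZE];
    int fd, i, ok;

    if (!tmp)
        return 0;
    fd = fileno(tmp);

    for (i = NRECORDS - 1; i >= 0; i--) {
        memset(record, 'A' + i, RECSIZE);
        memset(expect + i * RECSIZE, 'A' + i, RECSIZE);
        if (pwrite(fd, record, RECSIZE, (off_t)i * RECSIZE) != RECSIZE) {
            fclose(tmp);
            return 0;
        }
    }

    ok = pread(fd, readback, sizeof(readback), 0) == (ssize_t)sizeof(readback)
      && memcmp(readback, expect, sizeof(readback)) == 0;
    fclose(tmp);
    return ok;
}
```

Neither pwrite() nor pread() touches the file position, which is exactly why no lock is needed around them.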

If you’re concerned about standard output not being seekable (it often isn’t), keep in mind that it will work just fine when invoked like so:

$ ./compute_frames > frames.ppm

Windows Portability

I talked about OpenMP being really portable, then used POSIX functions. Fortunately the Win32 WriteFile() function has an “overlapped” parameter that works just like pwrite(). Typically rather than call either directly, I’d wrap the write like so:

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

static int
write_frame(struct frame *f, int i)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;
    OVERLAPPED offset = {.Offset = sizeof(*f) * i};
    return WriteFile(out, f, sizeof(*f), &written, &offset);
}

#else /* POSIX */
#include <unistd.h>

static int
write_frame(struct frame *f, int i)
{
    size_t count = sizeof(*f);
    size_t offset = sizeof(*f) * i;
    return pwrite(STDOUT_FILENO, f, count, offset) == (ssize_t)count;
}
#endif

Except for switching to write_frame(), the OpenMP part remains untouched.

Real World Example

Here’s an example in a real program:

julia.c

Notice because of pwrite() there’s no piping directly into ppmtoy4m:

$ ./julia > output.ppm
$ ppmtoy4m -F 60:1 < output.ppm > output.y4m
$ x264 -o output.mp4 output.y4m

output.mp4

Asynchronous Requests from Emacs Dynamic Modules

2017-02-14T02:30:00Z

A few months ago I had a discussion with Vladimir Kazanov about his Orgfuse project: a Python script that exposes an Emacs Org-mode document as a FUSE filesystem. It permits other programs to navigate the structure of an Org-mode document through the standard filesystem APIs. I suggested that, with the new dynamic modules in Emacs 25, Emacs itself could serve a FUSE filesystem. In fact, support for FUSE services in general could be a package of its own.

So that’s what he did: Elfuse. It’s an old joke that Emacs is an operating system, and here it is handling system calls.

However, there’s a tricky problem to solve, an issue also present in my joystick module. Both modules handle asynchronous events — filesystem requests or joystick events — but Emacs runs the event loop and owns the main thread. The external events somehow need to feed into the main event loop. It’s even more difficult with FUSE because FUSE also wants control of its own thread for its own event loop. This requires Elfuse to spawn a dedicated FUSE thread and negotiate a request/response hand-off.

When a filesystem request or joystick event arrives, how does Emacs know to handle it? The simple and obvious solution is to poll the module from a timer.

struct queue requests;

emacs_value
Frequest_next(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
    emacs_value next = Qnil;
    queue_lock(requests);
    if (queue_length(requests) > 0) {
        void *request = queue_pop(requests, env);
        next = env->make_user_ptr(env, fin_empty, request);
    }
    queue_unlock(requests);
    return next;
}

And then ask Emacs to check the module every, say, 10ms:

(defun request--poll ()
  (let ((next (request-next)))
    (when next
      (request-handle next))))

(run-at-time 0 0.01 #'request--poll)

Blocking directly on the module’s event pump with Emacs’ thread would prevent Emacs from doing important things like, you know, being a text editor. The timer allows it to handle its own events uninterrupted. It gets the job done, but it’s far from perfect:

It imposes an arbitrary latency to handling requests. Up to the poll period could pass before a request is handled.
Polling the module 100 times per second is inefficient. Unless you really enjoy recharging your laptop, that’s no good.

The poll period is a sliding trade-off between latency and battery life. If only there was some mechanism to, ahem, signal the Emacs thread, informing it that a request is waiting…

SIGUSR1

Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals, which is exactly the mechanism we need. The interface is a “key” binding on special-event-map, the keymap that handles these kinds of events. When the signal arrives, Emacs queues it up for the main event loop.

(define-key special-event-map [sigusr1]
  (lambda ()
    (interactive)
    (request-handle (request-next))))

The module blocks on its own thread on its own event pump. When a request arrives, it queues the request, rings the bell for Emacs to come handle it (raise()), and waits on a semaphore. For illustration purposes, assume the module reads requests from and writes responses to a file descriptor, like a socket.

int event_fd = /* ... */;
struct request request;
sem_init(&request.sem, 0, 0);

for (;;) {
    /* Blocking read for request event */
    read(event_fd, &request.event, sizeof(request.event));

    /* Put request on the queue */
    queue_lock(requests);
    queue_push(requests, &request);
    queue_unlock(requests);
    raise(SIGUSR1);  // TODO: Should raise() go inside the lock?

    /* Wait for Emacs */
    while (sem_wait(&request.sem))
        ;

    /* Reply with Emacs' response */
    write(event_fd, &request.response, sizeof(request.response));
}

The sem_wait() is in a loop because signals will wake it up prematurely. In fact, it may even wake up due to its own signal on the line before. This is the only way this particular use of sem_wait() might fail, so there’s no need to check errno.

If there are multiple module threads making requests to the same global queue, the lock is necessary to protect the queue. The semaphore is only for blocking the thread until Emacs has finished writing its particular response. Each thread has its own semaphore.

When Emacs is done writing the response, it releases the module thread by incrementing the semaphore. It might look something like this:

emacs_value
Frequest_complete(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
    struct request *request = env->get_user_ptr(env, args[0]);
    if (request)
        sem_post(&request->sem);
    return Qnil;
}

The top-level handler dispatches to the specific request handler, calling request-complete above when it’s done.

(defun request-handle (next)
  (condition-case e
      (cl-ecase (request-type next)
        (:open  (request-handle-open  next))
        (:close (request-handle-close next))
        (:read  (request-handle-read  next)))
    (error (request-respond-as-error next e)))
  (request-complete next))

This SIGUSR1+semaphore mechanism is roughly how Elfuse currently processes requests.

Making it work on Windows

Windows doesn’t have signals. This isn’t a problem for Elfuse since Windows doesn’t have FUSE either. Nor does it matter for Joymacs since XInput isn’t event-driven and always requires polling. But someday someone will need this mechanism for a dynamic module on Windows.

Fortunately there’s a solution: input language change events, WM_INPUTLANGCHANGE. It’s also on special-event-map:

(define-key special-event-map [language-change]
  (lambda ()
    (interactive)
    (request-handle (request-next))))

Instead of raise() (or pthread_kill()), broadcast the window event with PostMessage(). Outside of invoking the language-change key binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t belong to any particular window. We don’t really want to change the input language, after all.

PostMessageA(HWND_BROADCAST, WM_INPUTLANGCHANGE, 0, 0);

Naturally you’ll also need to replace the POSIX threading primitives with the Windows versions (CreateThread(), CreateSemaphore(), etc.). With a bit of abstraction in the right places, it should be pretty easy to support both POSIX and Windows in these asynchronous dynamic module events.
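As a rough sketch of that abstraction, the "ring the bell" step could hide behind a single function with one implementation per platform. The name notify_emacs is hypothetical. The Windows branch assumes linking against user32, and I have only sketched it here, not exercised it inside a real module.

```c
#ifdef _WIN32
#include <windows.h>

/* Wake Emacs via the language-change event (link with user32).
 * Returns 0 on success, -1 on failure. */
static int notify_emacs(void)
{
    return PostMessageA(HWND_BROADCAST, WM_INPUTLANGCHANGE, 0, 0) ? 0 : -1;
}
#else
#include <signal.h>

/* Wake Emacs via SIGUSR1, as in the POSIX module loop.
 * Returns 0 on success, -1 on failure. */
static int notify_emacs(void)
{
    return raise(SIGUSR1) == 0 ? 0 : -1;
}
#endif
```

The module thread would then call notify_emacs() after pushing a request, and the Lisp side binds whichever special event matches the platform.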

How to Read and Write Other Process Memory

2016-09-03T21:53:26Z

I recently put together a little game memory cheat tool called MemDig. It can find the address of a particular game value (score, lives, gold, etc.) after being given that value at different points in time. With the address, it can then modify that value to whatever is desired.

I’ve been using tools like this going back 20 years, but I never tried to write one myself until now. There are many memory cheat tools to pick from these days, the most prominent being Cheat Engine. These tools use the platform’s debugging API, so of course any good debugger could do the same thing, though a debugger won’t be specialized appropriately (e.g. locating the particular address and locking its value).

My motivation was bypassing an in-app purchase in a single player Windows game. I wanted to convince the game I had made the purchase when, in fact, I hadn’t. Once I had it working successfully, I ported MemDig to Linux since I thought it would be interesting to compare. I’ll start with Windows for this article.

Windows

Only three Win32 functions are needed, and you could almost guess at how they work: OpenProcess(), ReadProcessMemory(), and WriteProcessMemory().

It’s very straightforward ~~and, for this purpose, is probably the simplest API for any platform~~ (see update).

As you probably guessed, you first need to open the process, given its process ID (integer). You’ll need to select the desired access rights as a bit set. To read memory, you need the PROCESS_VM_READ and PROCESS_QUERY_INFORMATION rights. To write memory, you need the PROCESS_VM_WRITE and PROCESS_VM_OPERATION rights. Alternatively you could just ask for all rights with PROCESS_ALL_ACCESS, but I prefer to be precise.

DWORD access = PROCESS_VM_READ |
               PROCESS_QUERY_INFORMATION |
               PROCESS_VM_WRITE |
               PROCESS_VM_OPERATION;
HANDLE proc = OpenProcess(access, FALSE, pid);

And then to read or write:

void *addr; // target process address
SIZE_T written;
ReadProcessMemory(proc, addr, &value, sizeof(value), &written);
// or
WriteProcessMemory(proc, addr, &value, sizeof(value), &written);

Don’t forget to check the return value and verify that written matches the requested size. Finally, don’t forget to close the process handle when you’re done.

CloseHandle(proc);

That’s all there is to it. For the full cheat tool you’d need to find the mapped regions of memory, via VirtualQueryEx. It’s not as simple, but I’ll leave that for another article.

Linux

Unfortunately there’s no standard, cross-platform debugging API for unix-like systems. Most have a ptrace() system call, though each works a little differently. Note that ptrace() is not part of POSIX, but appeared in System V Release 4 (SVr4) and BSD, then copied elsewhere. The following will all be specific to Linux, though the procedure is similar on other unix-likes.

In typical Linux fashion, if it involves other processes, you use the standard file API on the /proc filesystem. Each process has a directory under /proc named as its process ID. In this directory is a virtual file called “mem”, which is a file view of that process’ entire address space, including unmapped regions.

char file[64];
sprintf(file, "/proc/%ld/mem", (long)pid);
int fd = open(file, O_RDWR);

The catch is that while you can open this file, you can’t actually read or write on that file without attaching to the process as a debugger. You’ll just get EIO errors. To attach, use ptrace() with PTRACE_ATTACH. This asynchronously delivers a SIGSTOP signal to the target, which has to be waited on with waitpid().

You could select the target address with lseek(), but it’s cleaner and more efficient just to do it all in one system call with pread() and pwrite(). I’ve left out the error checking, but the return value of each function should be checked:

ptrace(PTRACE_ATTACH, pid, 0, 0);
waitpid(pid, NULL, 0);

off_t addr = ...; // target process address
pread(fd, &value, sizeof(value), addr);
// or
pwrite(fd, &value, sizeof(value), addr);

ptrace(PTRACE_DETACH, pid, 0, 0);

The process will (and must) be stopped during this procedure, so do your reads/writes quickly and get out. The kernel will deliver the writes to the other process’ virtual memory.

Like before, don’t forget to close.

close(fd);
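Putting the read side together with the error checks I left out, a minimal sketch might look like this. The helper name proc_mem_read is made up for illustration. For any process other than the caller's own, it assumes the caller has already attached with PTRACE_ATTACH and waited for the stop. A process can normally read its own mem file without attaching, which makes the helper easy to try out.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy len bytes from address addr in process pid
 * into buf by way of /proc/<pid>/mem. Returns 0 on success, -1 if the
 * file could not be opened or the read came up short. */
static int proc_mem_read(pid_t pid, uintptr_t addr, void *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/mem", (long)pid);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* One system call: no separate lseek() needed */
    ssize_t n = pread(fd, buf, len, (off_t)addr);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

A write helper would be the same shape with O_WRONLY and pwrite().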

To find the mapped regions in the real cheat tool, you would read and parse the virtual text file /proc/pid/maps. I don’t know if I’d call this stringly-typed method elegant — the kernel converts the data into string form and the caller immediately converts it right back — but that’s the official API.
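To give a feel for that round trip through text, here is a small sketch that parses the leading fields of one maps line with sscanf(3). The struct and function names are hypothetical, and it reads only the address range and permission string, ignoring the offset, device, inode, and path fields that follow.

```c
#include <stdio.h>

struct region {
    unsigned long start, end;  /* [start, end) in the target's address space */
    char perms[5];             /* e.g. "rw-p", NUL-terminated */
};

/* Parse the "start-end perms" prefix of one /proc/<pid>/maps line.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_maps_line(const char *line, struct region *r)
{
    return sscanf(line, "%lx-%lx %4s", &r->start, &r->end, r->perms) == 3;
}
```

A scanner would call this on each line read with fgets(3) and skip regions whose permissions lack "r" or "w" as appropriate.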

Update: Konstantin Khlebnikov has pointed out the process_vm_readv() and process_vm_writev() system calls, available since Linux 3.2 (January 2012) and glibc 2.15 (March 2012). These system calls do not require ptrace(), nor does the remote process need to be stopped. They’re equivalent to ReadProcessMemory() and WriteProcessMemory(), except there’s no requirement to first “open” the process.

Automatic Deletion of Incomplete Output Files

2016-08-07T02:00:37Z

Conventionally, a program that creates an output file will delete its incomplete output should an error occur while writing the file. It’s risky to leave behind a file that the user may rightfully confuse for a valid file. They might not have noticed the error.

For example, compression programs such as gzip, bzip2, and xz when given a compressed file as an argument will create a new file with the compression extension removed. They write to this file as the compressed input is being processed. If the compressed stream contains an error in the middle, the partially-completed output is removed.

There are exceptions of course, such as programs that download files over a network. The partial result has value, especially if the transfer can be continued from where it left off. The convention is to append another extension, such as “.part”, to indicate a partial output.

The straightforward solution is to always delete the file as part of error handling. A non-interactive program would report the error on standard error, delete the file, and exit with an error code. However, there are at least two situations where error handling would be unable to operate: unhandled signals (usually including a segmentation fault) and power failures. A partial or corrupted output file will be left behind, possibly looking like a valid file.

A common, more complex approach is to name the file differently from its final name while being written. If written successfully, the completed file is renamed into place. This is already required for durable replacement, so it’s basically free for many applications. In the worst case, where the program is unable to clean up, the obviously incomplete file is left behind only wasting space.
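A minimal POSIX sketch of that rename approach follows. The helper name and the ".tmp" suffix are my own choices for illustration. It does not retry on EINTR, and it skips the fsync() of the containing directory that fully durable replacement would also want.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to "<path>.tmp", then rename it
 * over path. Returns 0 on success; on failure returns -1 and removes
 * the temporary file. */
static int write_via_rename(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd == -1)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        len -= (size_t)w;
    }

    /* Flush to disk before the final name points at the data */
    int ok = fsync(fd) == 0;
    ok = (close(fd) == 0) && ok;
    if (!ok || rename(tmp, path) == -1) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

If the program dies before rename(), only the obviously temporary "<path>.tmp" is left behind.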

Looking to be more robust, I had the following misguided idea: Rely completely on the operating system to perform cleanup in the case of a failure. Initially the file would be configured to be automatically deleted when the final handle is closed. This takes care of all abnormal exits, and possibly even power failures. The program can just exit on error without deleting the file. Once written successfully, the automatic-delete indicator is cleared so that the file survives.

The target application for this technique supports both Linux and Windows, so I would need to figure it out for both systems. On Windows, there’s the flag FILE_FLAG_DELETE_ON_CLOSE. I’d just need to find a way to clear it. On POSIX, the file would be unlinked while being written, and linked into the filesystem on success. The latter turns out to be a lot harder than I expected.

Solution for Windows

I’ll start with Windows since the technique actually works fairly well here — ignoring the usual, dumb Win32 filesystem caveats. This is a little surprising, since it’s usually Win32 that makes these things far more difficult than they should be.

The primary Win32 function for opening and creating files is CreateFile. There are many options, but the key is FILE_FLAG_DELETE_ON_CLOSE. Here’s how an application might typically open a file for output.

DWORD access = GENERIC_WRITE;
DWORD create = CREATE_ALWAYS;
DWORD flags = FILE_FLAG_DELETE_ON_CLOSE;
HANDLE f = CreateFile("out.tmp", access, 0, 0, create, flags, 0);

This special flag asks Windows to delete the file as soon as the last handle to the file object is closed. Notice I said file object, not file, since these are different things. The catch: This flag is a property of the file object, not the file, and cannot be removed.

However, the solution is simple. Create a new link to the file so that it survives deletion. This even works for files residing on a network share.

CreateHardLink("out", "out.tmp", 0);
CloseHandle(f);  // deletes out.tmp file

The gotcha is that the underlying filesystem must be NTFS. FAT32 doesn’t support hard links. Unfortunately, since FAT32 remains the least common denominator and is still widely used for removable media, depending on the application, your users may expect support for saving files to FAT32. A workaround is probably required.

Solution for Linux

This is where things really fall apart. It’s just barely possible on Linux, it’s messy, and it’s not portable anywhere else. There’s no way to do this for POSIX in general.

My initial thought was to create a file then unlink it. Unlike the situation on Windows, files can be unlinked while they’re currently open by a process. These files are finally deleted when the last file descriptor (the last reference) is closed. Unfortunately, using unlink(2) to remove the last link to a file prevents that file from being linked again.

Instead, the solution is to use the relatively new (since Linux 3.11), Linux-specific O_TMPFILE flag when creating the file. Instead of a filename, this variation of open(2) takes a directory and creates an unnamed, temporary file in it. These files are special in that they’re permitted to be given a name in the filesystem at some future point.

For this example, I’ll assume the output is relative to the current working directory. If it’s not, you’ll need to open an additional file descriptor for the parent directory, and also use openat(2) to avoid possible race conditions (since paths can change from under you). The number of ways this can fail is already rapidly multiplying.

int fd = open(".", O_TMPFILE|O_WRONLY, 0600);

The catch is that only a handful of filesystems support O_TMPFILE. It’s like the FAT32 problem above, but worse. You could easily end up in a situation where it’s not supported, and will almost certainly require a workaround.

Linking a file from a file descriptor is where things get messier. The file descriptor must be linked with linkat(2) from its name on the /proc virtual filesystem, constructed as a string. The following snippet comes straight from the Linux open(2) manpage.

char buf[64];
sprintf(buf, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, buf, AT_FDCWD, "out", AT_SYMLINK_FOLLOW);

Even on Linux, /proc isn’t always available, such as within a chroot or a container, so this part can fail as well. In theory there’s a way to do this with the Linux-specific AT_EMPTY_PATH and avoid /proc, but I couldn’t get it to work.

// Note: this doesn't actually work for me.
linkat(fd, "", AT_FDCWD, "out", AT_EMPTY_PATH);

Given the poor portability (even within Linux), the number of ways this can go wrong, and that a workaround is definitely needed anyway, I’d say this technique is worthless. I’m going to stick with the tried-and-true approach for this one.

Appending to a File from Multiple Processes

2016-08-03T16:17:44Z

Suppose you have multiple processes appending output to the same file without explicit synchronization. These processes might be working in parallel on different parts of the same problem, or these might be threads blocked individually reading different external inputs. There are two concerns that come into play:

1) The append must be atomic such that it doesn’t clobber previous appends by other threads and processes. For example, suppose a write requires two separate operations: first moving the file pointer to the end of the file, then performing the write. There would be a race condition should another process or thread intervene in between with its own write.

2) The output will be interleaved. The primary solution is to design the data format as atomic records, where the ordering of records is unimportant — like rows in a relational database. This could be as simple as a text file with each line as a record. The concern is then ensuring records are written atomically.

This article discusses processes, but the same applies to threads when directly dealing with file descriptors.

Appending

The first concern is solved by the operating system, with one caveat. On POSIX systems, opening a file with the O_APPEND flag will guarantee that writes always safely append.

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

However, this says nothing about interleaving. Two processes successfully appending to the same file will result in all their bytes in the file in order, but not necessarily contiguously.

The caveat is that not all filesystems are POSIX-compatible. Two famous examples are NFS and the Hadoop Distributed File System (HDFS). On these networked filesystems, appends are simulated and subject to race conditions.

On POSIX systems, fopen(3) with the a flag will use O_APPEND, so you don’t necessarily need to use open(2). On Linux this can be verified for any language’s standard library with strace.

#include <stdio.h>

int main(void)
{
    fopen("/dev/null", "a");
    return 0;
}

And the result of the trace:

$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
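When calling open(2) directly, the same flag is passed by hand. The following is a sketch with a hypothetical helper, append_record, that issues each record as a single write(2) on an O_APPEND descriptor. Reopening the file per record is only to keep the example short.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one NUL-terminated record to path using
 * a single write(2) on an O_APPEND descriptor. Returns 0 if the whole
 * record was written, -1 otherwise. */
static int append_record(const char *path, const char *record)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd == -1)
        return -1;

    size_t len = strlen(record);
    ssize_t n = write(fd, record, len);
    int r = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) == -1)
        r = -1;
    return r;
}
```

A real program would open the file once and keep the descriptor around.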

For Win32, the equivalent is the FILE_APPEND_DATA access right, and similarly only applies to “local files.”

Interleaving and Pipes

The interleaving problem has two layers, and gets more complicated the more correct you want to be. Let’s start with pipes.

On POSIX, a pipe is unseekable and doesn’t have a file position, so appends are the only kind of write possible. When writing to a pipe (or FIFO), writes of at most the system-defined PIPE_BUF bytes are guaranteed to be atomic and non-interleaving.

Write requests of PIPE_BUF bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than PIPE_BUF bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, […]

The minimum value for PIPE_BUF for POSIX systems is 512 bytes. On Linux it’s 4kB, and on other systems it’s as high as 32kB. As long as each record is at most 512 bytes, a simple write(2) will do. None of this depends on a filesystem since no files are involved.

If PIPE_BUF bytes isn’t enough, the POSIX writev(2) can be used to atomically write up to IOV_MAX buffers of PIPE_BUF bytes. The minimum value for IOV_MAX is 16, but is typically 1024. This means the maximum safe atomic write size for pipes, and therefore the largest record size, for a perfectly portable program is 8kB (16✕512). On Linux it’s 4MB.
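As a small illustration of the writev(2) call itself, here is a sketch that sends a record stored in separate pieces with one system call, so the pieces don't have to be copied into a single buffer first. The helper name write_record2 and the key/value record layout are hypothetical, and the record here is far below any PIPE_BUF limit.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write key, value, and a trailing newline to fd
 * with a single writev(2). Returns 0 if every byte was written,
 * -1 otherwise. */
static int write_record2(int fd, const char *key, const char *value)
{
    struct iovec iov[3];
    iov[0].iov_base = (void *)key;   iov[0].iov_len = strlen(key);
    iov[1].iov_base = (void *)value; iov[1].iov_len = strlen(value);
    iov[2].iov_base = (void *)"\n";  iov[2].iov_len = 1;

    size_t total = iov[0].iov_len + iov[1].iov_len + iov[2].iov_len;
    ssize_t n = writev(fd, iov, 3);
    return n == (ssize_t)total ? 0 : -1;
}
```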

That’s all at the system call level. There’s another layer to contend with: buffered I/O in your language’s standard library. Your program may pass data in appropriately-sized pieces for atomic writes to the I/O library, but it may be undoing your hard work, concatenating all these writes into a buffer, splitting apart your records. For this part of the article, I’ll focus on single-threaded C programs.

Suppose you’re writing a simple space-separated format with one line per record.

int foo, bar;
float baz;
while (condition) {
    // ...
    printf("%d %d %f\n", foo, bar, baz);
}

Whether or not this works depends on how stdout is buffered. C standard library streams (FILE *) have three buffering modes: unbuffered, line buffered, and fully buffered. Buffering is configured through setbuf(3) and setvbuf(3), and the initial buffering state of a stream depends on various factors. For buffered streams, the default buffer is at least BUFSIZ bytes, itself at least 256 (C99 §7.19.2¶7). Note: threads share this buffer.

Since each record in the above program easily fits inside 256 bytes, if stdout is a line buffered pipe then this program will interleave correctly on any POSIX system without further changes.

If instead your output is comma-separated values (CSV) and your records may contain new line characters, there are two approaches. In each, the record must still be no larger than PIPE_BUF bytes.

Unbuffered pipe: construct the record in a buffer (e.g. with sprintf(3)) and output the entire buffer in a single fwrite(3). While I believe this will always work in practice, it’s not guaranteed by the C specification, which defines fwrite(3) as a series of fputc(3) calls (C99 §7.19.8.2¶2).
Fully buffered pipe: set a sufficiently large stream buffer and follow each record with a fflush(3). Unlike fwrite(3) on an unbuffered stream, the specification says the buffer will be “transmitted to the host environment as a block” (C99 §7.19.3¶3), so this should be perfectly correct on any POSIX system.

If your situation is more complicated than this, you’ll probably have to bypass your standard library buffered I/O and call write(2) or writev(2) yourself.

Practical Application

If interleaving writes to a pipe stdout sounds contrived, here’s the real life scenario: GNU xargs with its --max-procs (-P) option to process inputs in parallel.

xargs -n1 -P$(nproc) myprogram < inputs.txt | cat > outputs.csv

The | cat ensures the output of each myprogram process is connected to the same pipe rather than to the same file.

A non-portable alternative to | cat, especially if you’re dispatching processes and threads yourself, is the splice(2) system call on Linux. It efficiently moves the output from the pipe to the output file without an intermediate copy to userspace. GNU Coreutils’ cat doesn’t use this.

Win32 Pipes

On Win32, anonymous pipes have no semantics regarding interleaving. Named pipes have per-client buffers that prevent interleaving. However, the pipe buffer size is unspecified, and requesting a particular size is only advisory, so it comes down to trial and error, though the unstated limits should be comparatively generous.

Interleaving and Files

Suppose instead of a pipe we have an O_APPEND file on POSIX. Common wisdom states that the same PIPE_BUF atomic write rule applies. While this often works, especially on Linux, this is not correct. The POSIX specification doesn’t require it and there are systems where it doesn’t work.

If you know the particular limits of your operating system and filesystem, and you don’t care much about portability, then maybe you can get away with interleaving appends. For full portability, pipes are required.

On Win32, writes on local files up to the underlying drive’s sector size (typically 512 bytes to 4kB) are atomic. Otherwise the only options are deprecated Transactional NTFS (TxF), or manually synchronizing your writes. All in all, it’s going to take more work to get correct.

Conclusion

My true use case for mucking around with clean, atomic appends is to compute giant CSV tables in parallel, with the intention of later loading into a SQL database (i.e. SQLite) for analysis. A more robust and traditional approach would be to write results directly into the database as they’re computed. But I like the platform-neutral intermediate CSV files — good for archival and sharing — and the simplicity of programs generating the data — concerned only with atomic write semantics rather than calling into a particular SQL database API.

Four Ways to Compile C for Windows

2016-06-13T04:13:25Z

Update 2020: If you’re on Windows, just use w64devkit. It’s my own toolchain distribution, and it’s the best option available. Everything you need is in one package.

I primarily work on and develop for unix-like operating systems — Linux in particular. However, when it comes to desktop applications, most potential users are on Windows. Rather than develop on Windows, which I’d rather avoid, I’ll continue developing, testing, and debugging on Linux while keeping portability in mind. Unfortunately every option I’ve found for building Windows C programs has some significant limitations. These limitations advise my approach to portability and restrict the C language features used by the program for all platforms.

As of this writing I’ve identified four different practical ways to build C applications for Windows. This information will definitely become further and further out of date as this article ages, so if you’re visiting from the future take a moment to look at the date. Except for LLVM shaking things up recently, development tooling on unix-like systems has had the same basic form for the past 15 years (i.e. dominated by GCC). While Visual C++ has been around for more than two decades, the tooling on Windows has seen more churn by comparison.

Before I get into the specifics, let me point out a glaring problem common to all four: Unicode arguments and filenames. Microsoft jumped the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all worlds, being a variable length encoding (surrogate pairs), backwards incompatible (unlike UTF-8), and having byte-order issues (BOM). Most Win32 functions that accept strings generally come in two flavors, ANSI and UTF-16. The standard, portable C library functions wrap the ANSI-flavored functions. This means portable C programs can’t interact with Unicode filenames. (Update 2021: Now they can.) They must call the non-portable, Windows-specific versions. This includes main itself, which is only handed ANSI-truncated arguments.

Compare this to unix-like systems, which generally adopted UTF-8, but rather as a convention than as a hard rule. The operating system doesn’t know or care about Unicode. Program arguments and filenames are just zero-terminated bytestrings. Implicitly decoding these as UTF-8 would be a mistake anyway. What happens when the encoding isn’t valid?

This doesn’t have to be a problem on Windows. A Windows standard C library could connect to Windows’ Unicode-flavored functions and encode to/from UTF-8 as needed, allowing portable programs to maintain the bytestring illusion. It’s only that none of the existing standard C libraries do it this way.

Mingw-w64

Of course my first natural choice is MinGW, specifically the Mingw-w64 fork. It’s GCC ported to Windows. You can continue relying on GCC-specific features when you need them. It’s got all the core language features up through C11, plus the common extensions. It’s probably packaged by your Linux distribution of choice, making it trivial to cross-compile programs and libraries from Linux — and with Wine you can even execute them on x86. Like regular GCC, it outputs GDB-friendly DWARF debugging information, so you can debug applications with GDB.

If I’m using Mingw-w64 on Windows, ~~I prefer to do so from inside Cygwin~~. Since it provides a complete POSIX environment, it maximizes portability for the whole tool chain. This isn’t strictly required.

However, it has one big flaw. Unlike unix-like systems, Windows doesn’t supply a system standard C library. That’s the compiler’s job. But Mingw-w64 doesn’t have one. Instead it links against msvcrt.dll, ~~which isn’t officially supported by Microsoft. It just happens to exist on modern Windows installations. Since it’s not supported,~~ it’s way out of date and doesn’t support much of C99. A lot of these problems are patched over by the compiler, ~~but if you’re relying on Mingw-w64, you still have to stick to some C89 library features, such as limiting yourself to the C89 printf specifiers~~.

Update: Mārtiņš Možeiko has pointed out __USE_MINGW_ANSI_STDIO, an undocumented feature that fixes the printf family. I now use this by default in all of my Mingw-w64 builds. It fixes most of the formatted output issues, except that it’s incompatible with the format function attribute. (Update 2021: Mingw-w64 now does the right thing out of the box.)

Another problem is that position-independent code generation is broken, and so ASLR is not an option. This means binaries produced by Mingw-w64 are less secure than they should be. There are also a number of subtle code generation bugs that might arise if you’re doing something unusual. (Update 2021: Mingw-w64 makes PIE mandatory.)

Visual C++

The behemoth usually considered in this situation is Visual Studio and the Visual C++ build tools. I strongly prefer open source development tools, and Visual Studio is obviously the least open source option, but at least it’s cost-free these days. Now, I have absolutely no interest in Visual Studio, but fortunately the Visual C++ compiler and associated build tools can be used standalone, supporting both C and C++.

Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that batch file in a cmd.exe console and the Visual C++ command line build tools will be made available in that console and in any programs executed from it (your editor). It includes the compiler (cl.exe), linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe), and more. It also includes a mostly POSIX-complete make called nmake.exe. All these tools are noisy and print a copyright banner on every invocation, so get used to passing -nologo every time, which suppresses some of it.

When I said behemoth, I meant it. In my experience it literally takes hours (unattended) to install Visual Studio 2015. The good news is you don’t actually need it all anymore. The build tools are available standalone. While it’s still a larger and slower installation process than it really should be, it’s much more reasonable to install. It’s good enough that I’d even say I’m comfortable relying on it for Windows builds. (Update: The build tools are unfortunately no longer standalone.)

That being said, it’s not without its flaws. Microsoft has never announced any plans to support C99. They only care about C++, with C as a second class citizen. Since C++11 incorporated most of C99 and Microsoft supports C++11, Visual Studio 2015 supports most of C99. The only things missing as far as I can tell are variable length arrays (VLAs), complex numbers, and C99’s array parameter declarators, since none of these were adopted by C++. Some C99 features are considered extensions (as they would be for C89), so you’ll also get warnings about them, which can be disabled.

The command line interface (option flags, intermediates, etc.) isn’t quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so you’ll need separate Makefiles, or you’ll need to use a build system that generates Visual C++ Makefiles.

~~Debugging is a major problem.~~ (Update 2022: It’s actually quite good once you know how to do it.) Visual C++ outputs separate .pdb program database files, which aren’t usable from GDB. Visual Studio has a built-in debugger, though it’s not included in the standalone Visual C++ build tools. ~~I’m still searching for a decent debugging solution for this scenario. I tried WinDbg, but I can’t stand it.~~ (Update 2022: RemedyBG is amazing.)

In general the output code performance is on par with GCC and Clang, so you’re not really gaining or losing performance with Visual C++.

Clang

Unsurprisingly, Clang has been ported to Windows. It’s like Mingw-w64 in that you get the same features and interface across platforms.

Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead it relies directly on the official Windows SDK. You’ll basically need to install the Visual C++ build tools as if you were going to build with Visual C++. This means no practical cross-platform builds and you’re still relying on the proprietary Microsoft toolchain. In the past you even had to use Microsoft’s linker, but LLVM now provides its own.

It generates GDB-friendly DWARF debug information (in addition to CodeView) so in theory you can debug with GDB again. I haven’t given this a thorough evaluation yet.

Pelles C

Finally there’s Pelles C. It’s cost-free but not open source. It’s a reasonable, small install that includes a full IDE with an integrated debugger and command line tools. It has its own C library and Win32 SDK with the most complete C11 support around. It also supports OpenMP 3.1. All in all it’s pretty nice and is something I wouldn’t be afraid to rely upon for Windows builds.

Like Visual C++, it has a couple of “povars” batch files to set up the right environment, which includes a C compiler, linker, assembler, etc. The compiler interface mostly mimics cl.exe, though there are far fewer code generation options. The make program, pomake.exe, mimics nmake.exe, but is even less POSIX-complete. The compiler’s output code performance is also noticeably poorer than GCC, Clang, and Visual C++. It’s definitely a less mature compiler.

It outputs CodeView debugging information, so GDB is of no use. The best solution is to simply use the debugger built into the IDE, which can be invoked directly from the command line. You don’t normally need to code from within the IDE just to use the debugger.

Like Visual C++, it’s Windows only, so cross-compilation isn’t really in the picture.

If performance isn’t of high importance, and you don’t require specific code generation options, then Pelles C is a nice choice for Windows builds.

Other Options

I’m sure there are a few other options out there, and I’d like to hear about them so I can try them out. I focused on these since they’re all cost free and easy to download. If I have to register or pay, then it’s not going to beat these options.

Mapping Multiple Memory Views in User Space

2016-04-10T21:59:16Z

Modern operating systems run processes within virtual memory using a piece of hardware called a memory management unit (MMU). The MMU contains a page table that defines how virtual memory maps onto physical memory. The operating system is responsible for maintaining this page table, mapping and unmapping virtual memory to physical memory as needed by the processes it’s running. If a process accesses a page that is not currently mapped, it will trigger a page fault and the execution of the offending thread will be paused until the operating system maps that page.

This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.

Some useful applications of this feature include:

An extremely fast, large memory “copy” by mapping the source memory overtop the destination memory.
Trivial interoperability between code instrumented with baggy bounds checking [PDF] and non-instrumented code. A few bits of each pointer are reserved to tag the pointer with the size of its memory allocation. For compactness, the stored size is rounded up to a power of two, making it “baggy.” Instrumented code checks this tag before making a possibly-unsafe dereference. Normally, instrumented code would need to clear (or set) these bits before dereferencing or before passing it to non-instrumented code. Instead, the allocation could be mapped simultaneously at each location for every possible tag, making the pointer valid no matter its tag bits.
Two responses to my last post on hotpatching suggested that, instead of modifying the instruction directly, memory containing the modification could be mapped over top of the code. I would copy the code to another place in memory, safely modify it in private, switch the page protections from write to execute (both for W^X and for other hardware limitations), then map it over the target. Restoring the original behavior would be as simple as unmapping the change.
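To make the baggy bounds item a little more concrete, here is a minimal sketch of computing such a size tag. It assumes the tag stores the base-2 logarithm of the rounded-up allocation size; the names baggy_tag and baggy_size are hypothetical and do not come from the paper or any library.

```c
#include <stddef.h>

// Smallest e such that (1 << e) >= size, i.e. ceil(log2(size)).
// size must be nonzero. Saturates at the top bit for huge sizes.
unsigned baggy_tag(size_t size)
{
    unsigned e = 0;
    while (e < sizeof(size_t) * 8 - 1 && ((size_t)1 << e) < size)
        e++;
    return e;
}

// The "baggy" allocation size that a tag stands for.
size_t baggy_size(unsigned tag)
{
    return (size_t)1 << tag;
}
```

A 5-byte allocation gets tag 3 and is treated as 8 bytes, so only a few pointer bits are needed to record the bound.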

Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.

POSIX Memory Mapping

On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).

int shm_open(const char *name, int oflag, mode_t mode);

The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.

OpenBSD introduced shm_mkstemp(3) to solve this problem, but it’s not widely available. On Linux, as of this writing, the O_TMPFILE flag may or may not provide a fix (it’s undocumented).

The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.

int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit
shm_unlink("/example");

The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.

size_t size = sizeof(uint32_t);
ftruncate(fd, size);

Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don’t plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).

int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
close(fd);

At this point both a and b have different addresses but point (via the page table) to the same physical memory. Changes to one are reflected in the other. So this:

*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);

Will print out something like:

0x6ffffff0000 0x6fffffe0000 0xdeafbeef

It’s also possible to do all this only with open(2) and mmap(2) by mapping the same file twice, but you’d need to worry about where to put the file, where it’s going to be backed, and the operating system will have certain obligations about syncing it to storage somewhere. Using POSIX shared memory is simpler and faster.
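For comparison, here is a minimal sketch of that file-backed variant with error checking. The function name file_alias_check and the /tmp location are assumptions for illustration; mkstemp(3) stands in for open(2) so that the file name is unique.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

// Map one unlinked temporary file twice and check that the views alias.
// Returns 1 if a store through one view is visible through the other,
// 0 if not, and -1 if any setup step fails.
int file_alias_check(void)
{
    char path[] = "/tmp/alias-XXXXXX";  // hypothetical location
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);  // the descriptor keeps the file alive

    size_t size = sizeof(uint32_t);
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }

    int prot = PROT_READ | PROT_WRITE;
    uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    close(fd);

    int result = -1;
    if (a != MAP_FAILED && b != MAP_FAILED) {
        *a = 0xdeafbeef;
        result = a != b && *b == 0xdeafbeef;
    }
    if (a != MAP_FAILED)
        munmap(a, size);
    if (b != MAP_FAILED)
        munmap(b, size);
    return result;
}
```

Whether the pages ever reach a disk depends on what backs /tmp, which is exactly the kind of question the shared memory route avoids.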

Windows Memory Mapping

Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFile (or MapViewOfFileEx to choose the address).

First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.

size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
                             NULL,
                             PAGE_READWRITE,
                             0, size,
                             NULL);

There’s no truncate step because the space is allocated at creation time via the two-part size argument.

Then, just like mmap:

uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
CloseHandle(h);

If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as an additional argument.

From here on it’s the same as above.

Generalizing the API

Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.

int  memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);

Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.

It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.

Pass the same pointers back to memory_alias_unmap to free the mappings. When called correctly it cannot fail, so there’s no return value.
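Callers that pick their own addresses need page-aligned values. As a rough sketch for POSIX systems, a size can be rounded up to a whole number of pages like this; page_round_up is a hypothetical helper and not part of memalias.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

// Round n up to the next multiple of the system page size.
// Returns 0 if the page size is unavailable or the result would
// overflow (and also for n == 0).
size_t page_round_up(size_t n)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return 0;
    size_t mask = (size_t)page - 1;  // page sizes are powers of two
    if (n > SIZE_MAX - mask)
        return 0;
    return (n + mask) & ~mask;
}
```

This helps when laying out several views back to back, since each view then starts on a page boundary.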

The full source is here: memalias.c

POSIX

Starting with the simpler of the two functions, the POSIX implementation looks like so:

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}

The complex part is creating the mapping:

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%ld,%p)",
             __FUNCTION__, (long)getpid(), (void *)addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}

The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.

Otherwise it just walks the array setting up the mappings.

Windows

The Windows version is very similar.

void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}

Since Windows tracks the size internally, it’s unneeded and ignored.

int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL,
                                 PAGE_READWRITE,
                                 0, size,
                                 NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);
    return 0;
}

In the future I’d like to find some unique applications of these multiple memory views.

Calling the Native API While Freestanding

2016-02-28T23:47:22Z

When developing minimal, freestanding Windows programs, it’s obviously beneficial to take full advantage of dynamic libraries that are already linked rather than duplicate that functionality in the application itself. Every Windows process automatically, and involuntarily, has kernel32.dll and ntdll.dll loaded into its process space before it starts. As discussed previously, kernel32.dll provides the Windows API (Win32). The other, ntdll.dll, provides the Native API for user space applications, and is the focus of this article.

The Native API is a low-level API, a foundation for the implementation of the Windows API and various components that don’t use the Windows API (drivers, etc.). It includes a runtime library (RTL) suitable for replacing important parts of the C standard library, unavailable to freestanding programs. Very useful for a minimal program.

Unfortunately, using the Native API is a bit of a minefield. Not all of the documented Native API functions are actually exported by ntdll.dll, making them inaccessible both for linking and GetProcAddress(). Some are exported, but not documented as such. Others are documented as exported but are not documented when (which release of Windows). If a particular function wasn’t exported until Windows 8, I don’t want to use it when supporting Windows 7.

This is further complicated by the Microsoft Windows SDK, where many of these functions are just macros that alias C runtime functions. Naturally, MinGW closely follows suit. For example, in both cases, here is how the Native API function RtlCopyMemory is “declared.”

#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))

This is certainly not useful for freestanding programs, though it has a significant benefit for hosted programs: The C compiler knows the semantics of memcpy() and can properly optimize around it. Any C compiler worth its salt will replace a small or aligned, fixed-sized memcpy() or memmove() with the equivalent inlined code. For example:

    char buffer0[16];
    char buffer1[16];
    // ...
    memmove(buffer0, buffer1, 16);
    // ...

On x86_64 (GCC 4.9.3, -Os), this memmove() call is replaced with two instructions. This isn’t possible when calling an opaque function in a non-standard dynamic library. The side effects could be anything.

    movaps  xmm0, [rsp + 48]
    movaps  [rsp + 32], xmm0

These Native API macro aliases are what have allowed certain Wine issues to slip by unnoticed for years. Very few user space applications actually call Native API functions, even when addressed directly by name in the source. The development suite is pulling a bait and switch.

Like last time I danced at the edge of the compiler, this has caused headaches in my recent experimentation with freestanding executables. The MinGW headers assume that the programs including them will link against a C runtime. Dirty hack warning: To work around it, I have to undo the definition in the MinGW headers and make my own. For example, to use the real RtlMoveMemory():

#include <windows.h>

#undef RtlMoveMemory
__declspec(dllimport)
void RtlMoveMemory(void *, const void *, size_t);

Anywhere where I might have previously used memmove() I can instead use RtlMoveMemory(). Or I could trivially supply my own wrapper:

void *
memmove(void *d, const void *s, size_t n)
{
    RtlMoveMemory(d, s, n);
    return d;
}

As of this writing, the same approach is not reliable with RtlCopyMemory(), the cousin to memcpy(). As far as I can tell, it was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June 2015). Use RtlMoveMemory() instead. The overlap-handling overhead is negligible compared to the function call overhead anyway.

As a side note: one reason besides minimalism for not implementing your own memmove() is that it can’t be implemented efficiently in a conforming C program. According to the language specification, your implementation of memmove() would not be permitted to compare its pointer arguments with <, >, <=, or >=. That would lead to undefined behavior when pointing to unrelated objects (ISO/IEC 9899:2011 §6.5.8¶5). The simplest legal approach is to allocate a temporary buffer, copy the source buffer into it, then copy it into the destination buffer. However, buffer allocation may fail — i.e. NULL return from malloc() — introducing a failure case to memmove(), which isn’t supposed to fail.

Update July 2016: Alex Elsayed pointed out a solution to the memmove() problem in the comments. In short: iterate over the buffers bytewise (char *) using equality (==) tests to check for an overlap. In theory, a compiler could optimize away the loop and make it efficient.
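A minimal sketch of that idea follows. The name memmove_eq is hypothetical, and this is one reading of the suggestion rather than the exact code from the comments: scan the source range with == to learn whether the destination starts inside it, then pick the copy direction.

```c
#include <stddef.h>

// memmove() using only equality comparisons between the two buffers.
void *memmove_eq(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    // Does dst point into [src, src+n)? Only == is used, which is
    // defined even when the pointers refer to unrelated objects.
    int dst_inside_src = 0;
    for (size_t i = 0; i < n; i++) {
        if (s + i == d) {
            dst_inside_src = 1;
            break;
        }
    }

    if (dst_inside_src) {
        // A forward copy would overwrite source bytes before reading them.
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

The scan costs an extra pass over the source range unless the compiler manages to collapse it into a range check.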

I keep mentioning Wine because I’ve been careful to ensure my applications run correctly with it. So far it’s worked perfectly with both Windows API and Native API functions. Thanks to the hard work behind the Wine project, despite being written sharply against the Windows API, these tiny programs remain relatively portable (x86 and ARM). It’s a good fit for graphical applications (games), but I would never write a command line application like this. The command line has always been a second class citizen on Windows.

Now that I’ve got these Native API issues sorted out, I’ve significantly expanded the capabilities of my tiny, freestanding programs without adding anything to their size. Functions like RtlUnicodeToUTF8N() and RtlUTF8ToUnicodeN() will surely be handy.

Small, Freestanding Windows Executables

2016-01-31T22:53:03Z

Update: This is old and was updated in 2023!

Recently I’ve been experimenting with freestanding C programs on Windows. Freestanding refers to programs that don’t link, either statically or dynamically, against a standard library (i.e. libc). This is typical for operating systems and similar, bare metal situations. Normally a C compiler can make assumptions about the semantics of functions provided by the C standard library. For example, the compiler will likely replace a call to a small, fixed-size memmove() with move instructions. Since a freestanding program would supply its own, it may have different semantics.

My usual go to for C/C++ on Windows is Mingw-w64, which has greatly suited my needs the past couple of years. It’s packaged on Debian, and, when combined with Wine, allows me to fully develop Windows applications on Linux. Being GCC, it’s also great for cross-platform development since it’s essentially the same compiler as the other platforms. The primary difference is the interface to the operating system (POSIX vs. Win32).

However, it has one glaring flaw inherited from MinGW: it links against msvcrt.dll, an ancient version of the Microsoft C runtime library that currently ships with Windows. Besides being dated and quirky, it’s not an official part of Windows and never has been, despite its inclusion with every release since Windows 95. Mingw-w64 doesn’t have a C library of its own, instead patching over some of the flaws of msvcrt.dll and linking against it.

Since so much depends on msvcrt.dll despite its unofficial nature, it’s unlikely Microsoft will ever drop it from future releases of Windows. However, if strict correctness is a concern, we must ask Mingw-w64 not to link against it. An alternative would be PlibC, though the LGPL licensing is unfortunate. Another is Cygwin, which is a very complete POSIX environment, but is heavy and GPL-encumbered.

Sometimes I’d prefer to be more direct: skip the C standard library altogether and talk directly to the operating system. On Windows that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only links against system DLLs.

Linux vs. Windows

The most important benefit of a standard library like libc is a portable, uniform interface to the host system. So long as the standard library suits its needs, the same program can run anywhere. Without it, the program needs an implementation of each host-specific interface.

On Linux, operating system requests at the lowest level are made directly via system calls. This requires a bit of assembly language for each supported architecture (int 0x80 on x86, syscall on x86-64, swi on ARM, etc.). The POSIX functions of the various Linux libc implementations are built on top of this mechanism.

For example, here’s a function for a 1-argument system call on x86-64.

long
syscall1(long n, long arg)
{
    long result;
    __asm__ volatile (
        "syscall"
        : "=a"(result)
        : "a"(n), "D"(arg)
        : "rcx", "r11", "memory"  // syscall clobbers rcx and r11
    );
    return result;
}

Then exit() is implemented on top. Note: A real libc would do cleanup before exiting, like calling registered atexit() functions.

#include <sys/syscall.h>  // defines SYS_exit

void
exit(int code)
{
    syscall1(SYS_exit, code);
}

The situation is simpler on Windows. Its low level system calls are undocumented and unstable, changing across even minor updates. The formal, stable interface is through the exported functions in kernel32.dll. In fact, kernel32.dll is essentially a standard library on its own (making the term “freestanding” in this case dubious). It includes functions usually found only in user-space, like string manipulation, formatted output, font handling, and heap management (similar to malloc()). It’s not POSIX, but it has analogs to much of the same functionality.

Program Entry

The standard entry for a C program is main(). However, this is not the application’s true entry. The entry is in the C library, which does some initialization before calling your main(). When main() returns, it performs cleanup and exits. Without a C library, programs don’t start at main().

On Linux the default entry is the symbol _start. Its prototype would look like so:

void _start(void);

Returning from this function leads to a segmentation fault, so it’s up to your application to perform the exit system call rather than return.

On Windows, the entry depends on the type of application. The two relevant subsystems today are the console and windows subsystems. The former is for console applications (duh). These programs may still create windows and such, but must always have a controlling console. The latter is primarily for programs that don’t run in a console, though they can still create an associated console if they like. In Mingw-w64, give -mconsole (default) or -mwindows to the linker to choose the subsystem.

The default entry for each is slightly different.

int WINAPI mainCRTStartup(void);
int WINAPI WinMainCRTStartup(void);

Unlike Linux’s _start, Windows programs can safely return from these functions, similar to main(), hence the int return. The WINAPI macro means the function may have a special calling convention, depending on the platform.

On any system, you can choose a different entry symbol or address using the --entry option to the GNU linker.

Disabling libgcc

One problem I’ve run into is Mingw-w64 generating code that calls __chkstk_ms() from libgcc. I believe this is a long-standing bug, since -ffreestanding should prevent these sorts of helper functions from being used. The workaround I’ve found is to disable the stack probe and pre-commit the whole stack.

-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000

Alternatively you could link against libgcc (statically) with -lgcc, but, again, I’m going for a tiny executable.

A freestanding example

Here’s an example of a Windows “Hello, World” that doesn’t use a C library.

#include <windows.h>

int WINAPI
mainCRTStartup(void)
{
    char msg[] = "Hello, world!\n";
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(stdout, msg, sizeof(msg) - 1, (DWORD[]){0}, NULL);  // omit the NUL
    return 0;
}

To build it:

x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
    -nostdlib -ffreestanding -mconsole -Os \
    -mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
    -o example.exe example.c \
    -lkernel32

Notice I manually linked against kernel32.dll. The stripped final result is only 4kB, mostly PE padding. There are techniques to trim this down even further, but for a substantial program it wouldn’t make a significant difference.

From here you could create a GUI by linking against user32.dll and gdi32.dll (both also part of Win32) and calling the appropriate functions. I already ported my OpenGL demo to a freestanding .exe, dropping GLFW and directly using Win32 and WGL. It’s much less portable, but the final .exe is only 4kB, down from the original 104kB (static linking against GLFW).

I may go this route for the upcoming 7DRL 2016 in March.

Goblin-COM 7DRL 2015

2015-03-15T21:56:12Z

Yesterday I completed my third entry to the annual Seven Day Roguelike (7DRL) challenge (previously: 2013 and 2014). This year’s entry is called Goblin-COM.

Download/Source: Goblin-COM
Telnet play (no saves): telnet gcom.nullprogram.com
Video review by Akhier Dragonheart

As with previous years, the ideas behind the game are not all that original. The goal was to be a fantasy version of classic X-COM with an ANSI terminal interface. You are the ruler of a fledgling human nation that is under attack by invading goblins. You hire heroes, operate squads, construct buildings, and manage resource income.

The inspiration this year came from watching BattleBunny play OpenXCOM, an open source clone of the original X-COM. It had its major 1.0 release last year. Like the early days of OpenTTD, it currently depends on the original game assets. But also like OpenTTD, it surpasses the original game in every way, so there’s no reason to bother running the original anymore. I’ve also recently been watching One F Jef play Silent Storm, which is another turn-based squad game with a similar combat simulation.

As in X-COM, the game is broken into two modes of play: the geoscape (strategic) and the battlescape (tactical). Unfortunately I ran out of time and didn’t get to the battlescape part, though I’d like to add it in the future. What’s left is a sort-of city-builder with some squad management. You can hire heroes and send them out in squads to eliminate goblins, but rather than dropping to the battlescape, battles always auto-resolve in your favor. Despite this, the game still has a story, a win state, and a lose state. I won’t say what they are, so you have to play it for yourself!

Terminal Emulator Layer

My previous entries were HTML5 games, but this entry is a plain old standalone application. C has been my preferred language for the past few months, so that’s what I used. Both UTF-8-capable ANSI terminals and the Windows console are supported, so it should be perfectly playable on any modern machine. Note, though, that some of the poorer-quality terminal emulators that you’ll find in your Linux distribution’s repositories (rxvt and its derivatives) are not Unicode-capable, which means they won’t work with G-COM.

I didn’t make use of ncurses, instead opting to write my own terminal graphics engine. That’s because I wanted a single, small binary that was easy to build, and I didn’t want to mess around with PDCurses. I’ve also been studying the Win32 API lately, so writing my own terminal platform layer would be rather easy to do anyway.

I experimented with a number of terminal emulators — LXTerminal, Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator — but the least capable “terminal” by far is the Windows console, so it was the one to dictate the capabilities of the graphics engine. Some ANSI terminals are capable of 256 colors, bold, underline, and strikethrough fonts, but a highly portable API is basically limited to 16 colors (RGBCMYKW with two levels of intensity) for each of the foreground and background, and no other special text properties.

ANSI terminals also have a concept of a default foreground color and a default background color. Most applications that output color (git, grep, ls) leave the background color alone and are careful to choose neutral foreground colors. G-COM always sets the background color, so that the game looks the same no matter what the default colors are. Also, the Windows console doesn’t really have default colors anyway, even if I wanted to use them.

I put in partial support for Unicode because I wanted to use interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported Unicode for a long time now, but since they added it too early, they’re locked into the outdated UTF-16. For me this wasn’t too bad, because few computers, Linux included, are equipped to render characters outside of the Basic Multilingual Plane anyway, so there’s no need to deal with surrogate pairs. This is especially true for the Windows console, which can only render a very small set of characters: another limit on my graphics engine. Internally individual codepoints are handled as uint16_t and strings are handled as UTF-8.

I said partial support because, in addition to the above, it has no support for combining characters, or any other situation where a codepoint takes up something other than one space in the terminal. This requires lookup tables and dealing with pitfalls, but since I get to control exactly which characters were going to be used I didn’t need any of that.
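As an illustration of what BMP-only decoding can look like, here is a minimal sketch that reads one code point from a UTF-8 string into a uint16_t. The function name utf8_decode_bmp is hypothetical, and this is not G-COM's actual decoder.

```c
#include <stdint.h>

// Decode one BMP code point (U+0000..U+FFFF) from the NUL-terminated
// UTF-8 string at s. Stores it in *cp and returns the number of bytes
// consumed (1 to 3), or 0 for invalid input, overlong forms, surrogates,
// and 4-byte sequences (code points outside the BMP).
int utf8_decode_bmp(const unsigned char *s, uint16_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
        *cp = (uint16_t)((s[0] & 0x1f) << 6 | (s[1] & 0x3f));
        return *cp >= 0x80 ? 2 : 0;
    }
    if ((s[0] & 0xf0) == 0xe0 && (s[1] & 0xc0) == 0x80 &&
        (s[2] & 0xc0) == 0x80) {
        *cp = (uint16_t)((s[0] & 0x0f) << 12 | (s[1] & 0x3f) << 6 |
                         (s[2] & 0x3f));
        if (*cp < 0x800 || (*cp >= 0xd800 && *cp <= 0xdfff))
            return 0;
        return 3;
    }
    return 0;
}
```

Because every accepted code point fits in 16 bits, the same value can be handed to the Windows console as a single UTF-16 unit with no surrogate handling.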

In spite of the limitations, I’m really happy with the graphical results. The waves are animated continuously, even while the game is paused, and it looks great. Here’s GNOME Terminal’s rendering, which I think looked the best by default.

I’ll talk about how G-COM actually communicates with the terminal in another article. The interface between the game and the graphics engine is really clean (device.h), so it would be an interesting project to write a back end that renders the game to a regular window, no terminal needed.

Color Directive

I came up with a format directive to help me colorize everything. It runs in addition to the standard printf directives. Here’s an example,

panel_printf(&panel, 1, 1, "Really save and quit? (Rk{y}/Rk{n})");

The color is specified by two characters, and the text it applies to is wrapped in curly brackets. There are eight colors to pick from: RGBCMYKW. That covers all the binary values for red, green, and blue. To specify an “intense” (bright) color, capitalize it. That means the Rk{...} above makes the wrapped text bright red.

Nested directives are also supported. (And, yes, that K means “high intensity black,” a.k.a. dark gray. A w means “low intensity white,” a.k.a. light gray.)

panel_printf(p, x, y++, "Kk{♦}    wk{Rk{B}uild}     Kk{♦}");

And it mixes with the normal printf directives:

panel_printf(p, 1, y++, "(Rk{m}) Yk{Mine} [%s]", cost);
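To show how such a directive could be processed, here is a minimal sketch that expands it into ANSI escape sequences. It is not G-COM's implementation. It assumes the first letter is the foreground and the second the background, that text outside any directive is "wk", and that printf formatting has already been applied; colorize and its helpers are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Map a color letter to 0-7 (normal) or 8-15 (bright), or -1 if invalid.
static int color_index(char c)
{
    static const char normal[] = "krgybmcw";  // ANSI order, black to white
    static const char bright[] = "KRGYBMCW";
    const char *p;
    if (c == 0) return -1;
    if ((p = strchr(normal, c))) return (int)(p - normal);
    if ((p = strchr(bright, c))) return (int)(p - bright) + 8;
    return -1;
}

// Append n bytes, silently truncating to leave room for a terminator.
static size_t append(char *buf, size_t cap, size_t len,
                     const char *s, size_t n)
{
    for (size_t i = 0; i < n && len + 1 < cap; i++)
        buf[len++] = s[i];
    return len;
}

// Append an SGR sequence: 30-37/90-97 foreground, 40-47/100-107 background.
static size_t append_color(char *buf, size_t cap, size_t len, int fg, int bg)
{
    char esc[16];
    int n = snprintf(esc, sizeof(esc), "\x1b[%d;%dm",
                     fg < 8 ? 30 + fg : 82 + fg,
                     bg < 8 ? 40 + bg : 92 + bg);
    return append(buf, cap, len, esc, (size_t)n);
}

// Expand color directives in fmt into buf. Returns the output length.
size_t colorize(char *buf, size_t cap, const char *fmt)
{
    int stack[16][2];    // enclosing colors for nested directives
    int depth = 0;
    int fg = 7, bg = 0;  // assumed default: "wk"
    size_t len = 0;
    for (const char *p = fmt; *p; p++) {
        int f = color_index(p[0]);
        int b = f < 0 ? -1 : color_index(p[1]);
        if (b >= 0 && p[2] == '{' && depth < 16) {
            stack[depth][0] = fg;
            stack[depth][1] = bg;
            depth++;
            fg = f;
            bg = b;
            len = append_color(buf, cap, len, fg, bg);
            p += 2;  // skip the letters; the loop increment skips '{'
        } else if (*p == '}' && depth > 0) {
            depth--;
            fg = stack[depth][0];
            bg = stack[depth][1];
            len = append_color(buf, cap, len, fg, bg);
        } else {
            len = append(buf, cap, len, p, 1);
        }
    }
    if (cap > 0)
        buf[len] = 0;
    return len;
}
```

Closing a directive re-emits the enclosing colors instead of resetting the terminal, which is what makes nesting work. A real implementation would also need a way to escape literal braces.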

Single Binary

The GNU linker has a really nice feature for linking arbitrary binary data into your application. I used this to embed my assets into a single binary so that the user doesn’t need to worry about any sort of data directory or anything like that. Here’s what the make rule would look like:

$(LD) -r -b binary -o $@ $^

The -r specifies that output should be relocatable — i.e. it can be fed back into the linker later when linking the final binary. The -b binary says that the input is just an opaque binary file (“plain” text included). The linker will create three symbols for each input file:

_binary_filename_start
_binary_filename_end
_binary_filename_size

You can then access these from your C program like so:

extern const char _binary_filename_txt_start[];

I used this to embed the story texts, and I’ve used it in the past to embed images and textures. If you were to link zlib, you could easily compress these assets, too. I’m surprised this sort of thing isn’t done more often!

Dumb Game Saves

To save time, and because it doesn’t really matter, saves are just memory dumps. I took another page from Handmade Hero and allocate everything in a single, contiguous block of memory. With one exception, there are no pointers, so the entire block is relocatable. When references are needed, it’s done via integers into the embedded arrays. This allows it to be cleanly reloaded in another process later. As a side effect, it also means there are no dynamic allocations (malloc()) while the game is running. Here’s roughly what it looks like.

typedef struct game {
    uint64_t map_seed;
    map_t *map;
    long time;
    float wood, gold, food;
    long population;
    float goblin_spawn_rate;
    invader_t invaders[16];
    squad_t squads[16];
    hero_t heroes[128];
    game_event_t events[16];
} game_t;

The map pointer is that one exception, but that’s because it’s generated fresh after loading from the map_seed. Saving and loading is trivial (error checking omitted) and very fast.

void
game_save(game_t *game, FILE *out)
{
    fwrite(game, sizeof(*game), 1, out);
}

game_t *
game_load(FILE *in)
{
    game_t *game = malloc(sizeof(*game));
    fread(game, sizeof(*game), 1, in);
    game->map = map_generate(game->map_seed);
    return game;
}

The data isn’t important enough to bother with rename+fsync durability. I’ll risk the data if it makes savescumming that much harder!

The downside to this technique is that saves are generally not portable across architectures (particularly where endianness differs), and may not even be portable between different platforms on the same architecture. I only needed to persist a single game state on the same machine, so this wouldn’t be a problem.
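For reference, the error checking omitted from the save and load functions above might look like the following sketch. It uses a reduced, hypothetical stand-in struct (mini_game_t) rather than the real game_t, and leaves out the map regeneration step.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Reduced stand-in for the article's game_t (hypothetical fields).
typedef struct {
    uint64_t map_seed;
    long time;
    float gold;
} mini_game_t;

// Returns 1 on success, 0 on a short write or flush failure.
int mini_game_save(const mini_game_t *game, FILE *out)
{
    return fwrite(game, sizeof(*game), 1, out) == 1 && fflush(out) == 0;
}

// Returns a heap-allocated game, or NULL on allocation failure or if
// the stream holds less than one full struct.
mini_game_t *mini_game_load(FILE *in)
{
    mini_game_t *game = malloc(sizeof(*game));
    if (game && fread(game, sizeof(*game), 1, in) != 1) {
        free(game);
        game = NULL;
    }
    return game;
}
```

A truncated save then shows up as a NULL return instead of a half-initialized game state.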

Final Results

I’m definitely going to be reusing some of this code in future projects. The G-COM terminal graphics layer is nifty, and I already like it better than ncurses, whose API I’ve always thought was kind of ugly and old-fashioned. I like writing terminal applications.

Just like the last couple of years, the final game is a lot simpler than I had planned at the beginning of the week. Most things take longer to code than I initially expect. I’m still enjoying playing it, which is a really good sign. When I play, I’m having enough fun to deliberately delay the end of the game so that I can sprawl my nation out over the island and generate crazy income.
