nullprogram.com/blog/2019/03/28/
This article was discussed on Hacker News.
The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly,
and lacking good taste. Microsoft has done a pretty commendable job with
backwards compatibility, but the trade-off is that the API is filled to
the brim with historical cruft. Every hasty, poor design over the
decades is carried forward forever, and, in many cases, even built upon,
which essentially doubles down on past mistakes. POSIX certainly has its
own ugly corners, but those are the exceptions. In the Windows API,
elegance is the exception.
That’s why, when I recently revisited the Fibers API, I was
pleasantly surprised. It’s one of the exceptions — much cleaner than the
optional, deprecated, and now obsolete POSIX equivalent. It’s
not quite an apples-to-apples comparison since the POSIX version is
slightly more powerful, and more complicated as a result. I’ll cover the
difference in this article.
For the last part of this article, I’ll walk through an async/await
framework build on top of fibers. The framework allows coroutines in C
programs to await on arbitrary kernel objects.
Fiber Async/await Demo
Fibers
Windows fibers are really just stackful, symmetric coroutines.
From a different point of view, they’re cooperatively scheduled threads,
which is the source of the analogous name, fibers. They’re symmetric
because all fibers are equal, and no fiber is the “main” fiber. If any
fiber returns from its start routine, the program exits. (Older versions
of Wine will crash when this happens, but it was recently fixed.) It’s
equivalent to the process’ main thread returning from main()
. The
initial fiber is free to create a second fiber, yield to it, then the
second fiber destroys the first.
For now I’m going to focus on the core set of fiber functions. There are
some additional capabilities I’m going to ignore, including support for
fiber local storage. The important functions are just these five:
void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void SwitchToFiber(void *fiber);
bool ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void DeleteFiber(void *fiber);
To emphasize its simplicity, I’ve shown them here with more standard
prototypes than seen in their formal documentation. That documentation
uses the clunky Windows API typedefs still burdened with its 16-bit
heritage — e.g. LPVOID
being a “long pointer” from the segmented memory
of the 8086:
Fibers are represented using opaque, void pointers. Maybe that’s a little
too simple since it’s easy to misuse in C, but I like it. The return
values for CreateFiber()
and ConvertThreadToFiber()
are void pointers
since these both create fibers.
The fiber start routine returns nothing and takes a void “user pointer”.
That’s nearly what I’d expect, except that it would probably make more
sense for a fiber to return int
, which is more in line with
main
/ WinMain
/ mainCRTStartup
/ WinMainCRTStartup
. As I said,
when any fiber returns from its start routine, it’s like returning from
the main function, so it should probably have returned an integer.
A fiber may delete itself, which is the same as exiting the thread.
However, a fiber cannot yield (e.g. SwitchToFiber()
) to itself. That’s
undefined behavior.
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
void
coup(void *king)
{
puts("Long live the king!");
DeleteFiber(king);
ConvertFiberToThread(); /* seize the main thread */
/* ... */
}
int
main(void)
{
void *king = ConvertThreadToFiber(0);
void *pretender = CreateFiber(0, coup, king);
SwitchToFiber(pretender);
abort(); /* unreachable */
}
Only fibers can yield to fibers, but when the program starts up, there
are no fibers. At least one thread must first convert itself into a
fiber using ConvertThreadToFiber()
, which returns the fiber object
that represents itself. It takes one argument analogous to the last
argument of CreateFiber()
, except that there’s no start routine to
accept it. The process is reversed with ConvertFiberToThread()
.
Fibers don’t belong to any particular thread and can be scheduled on any
thread if properly synchronized. Obviously one should never yield to the
same fiber in two different threads at the same time.
Contrast with POSIX
The equivalent POSIX systems was context switching. It’s also stackful
and symmetric, but it has just three important functions:
getcontext(3)
, makecontext(3)
, and
swapcontext
.
int getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int swapcontext(ucontext_t *oucp, const ucontext_t *ucp);
These are roughly equivalent to GetCurrentFiber()
,
CreateFiber()
, and SwitchToFiber()
. There is no need for
ConvertFiberToThread()
since threads can context switch without
preparation. There’s also no DeleteFiber()
because the resources are
managed by the program itself. That’s where POSIX contexts are a little
bit more powerful.
The first argument to CreateFiber()
is the desired stack size, with
zero indicating the default stack size. The stack is allocated and freed
by the operating system. The downside is that the caller doesn’t have a
choice in managing the lifetime of this stack and how it’s allocated. If
you’re frequently creating and destroying coroutines, those stacks are
constantly being allocated and freed.
In makecontext(3)
, the caller allocates and supplies the stack. Freeing
that stack is equivalent to destroying the context. A program that
frequently creates and destroys contexts can maintain a stack pool or
otherwise more efficiently manage their allocation. This makes it more
powerful, but it also makes it a little more complicated. It would be hard
to remember how to do all this without a careful reading of the
documentation:
/* Create a context */
ucontext_t ctx;
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;
getcontext(&ctx);
makecontext(&ctx, proc, 0);
/* Destroy a context */
free(ctx.uc_stack.ss_sp);
Note how makecontext(3)
is variadic (...
), passing its arguments on
to the start routine of the context. This seems like it might be better
than a user pointer. Unfortunately it’s not, since those arguments are
strictly limited to integers.
Ultimately I like the fiber API better. The first time I tried it out, I
could guess my way through it without looking closely at the
documentation.
Async / await with fibers
Why was I looking at the Fiber API? I’ve known about coroutines for
years but I didn’t understand how they could be useful. Sure, the
function can yield, but what other coroutine should it yield to? It
wasn’t until I was recently bit by the async/await bug that I
finally saw a “killer feature” that justified their use. Generators come
pretty close, though.
Windows fibers are a coroutine primitive suitable for async/await in C
programs, where it can also be useful. To prove that it’s
possible, I built async/await on top of fibers in 95 lines of code.
The alternatives are to use a third-party coroutine library or to
do it myself with some assembly programming. However, having it
built into the operating system is quite convenient! It’s unfortunate
that it’s limited to Windows. Ironically, though, everything I wrote for
this article, including the async/await demonstration, was originally
written on Linux using Mingw-w64 and tested using Wine. Only
after I was done did I even try it on Windows.
Before diving into how it works, there’s a general concept about the
Windows API that must be understood: All kernel objects can be in
either a signaled or unsignaled state. The API provides functions that
block on a kernel object until it is signaled. The two important ones
are WaitForSingleObject()
and WaitForMultipleObjects()
.
The latter behaves very much like poll(2)
in POSIX.
Usually the signal is tied to some useful event, like a process or
thread exiting, the completion of an I/O operation (i.e. asynchronous
overlapped I/O), a semaphore being incremented, etc. It’s a generic way
to wait for some event. However, instead of blocking the thread,
wouldn’t it be nice to await on the kernel object? In my aio
library for Emacs, the fundamental “wait” object was a promise. For this
API it’s a kernel object handle.
So, the await function will take a kernel object, register it with the
scheduler, then yield to the scheduler. The scheduler — which is a
global variable, so there’s only one scheduler per process — looks like
this:
struct {
void *main_fiber;
HANDLE handles[MAXIMUM_WAIT_OBJECTS];
void *fibers[MAXIMUM_WAIT_OBJECTS];
void *dead_fiber;
int count;
} async_loop;
While fibers are symmetric, coroutines in my async/await implementation
are not. One fiber is the scheduler, main_fiber
, and the other fibers
always yield to it.
There is an array of kernel object handles, handles
, and an array of
fibers
. The elements in these arrays are paired with each other, but
it’s convenient to store them separately, as I’ll show soon. fibers[0]
is waiting on handles[0]
, and so on.
The array is a fixed size, MAXIMUM_WAIT_OBJECTS
(64), because there’s
a hard limit on the number of fibers that can wait at once. This
pathetically small limitation is an unfortunate, hard-coded restriction
of the Windows API. It kills most practical uses of my little library.
Fortunately there’s no limit on the number of handles we might want to
wait on, just the number of co-existing fibers.
When a fiber is about to return from its start routine, it yields one
last time and registers itself on the dead_fiber
member. The scheduler
will delete this fiber as soon as it’s given control. Fibers never
truly return since that would terminate the program.
With this, the await function, async_await()
, is pretty simple. It
registers the handle with the scheduler, then yields to the scheduler
fiber.
void
async_await(HANDLE h)
{
async_loop.handles[async_loop.count] = h;
async_loop.fibers[async_loop.count] = GetCurrentFiber();
async_loop.count++;
SwitchToFiber(async_loop.main_fiber);
}
Caveat: The scheduler destroys this handle with CloseHandle()
after it
signals, so don’t try to reuse it. This made my demonstration simpler,
but it might be better to not do this.
A fiber can exit at any time. Such an exit is inserted implicitly before
a fiber actually returns:
void
async_exit(void)
{
async_loop.dead_fiber = GetCurrentFiber();
SwitchToFiber(async_loop.main_fiber);
}
The start routine given to async_start()
is actually wrapped in the
real start routine. This is how async_exit()
is injected:
struct fiber_wrapper {
void (*func)(void *);
void *arg;
};
static void
fiber_wrapper(void *arg)
{
struct fiber_wrapper *fw = arg;
fw->func(fw->arg);
async_exit();
}
int
async_start(void (*func)(void *), void *arg)
{
if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
return 0;
} else {
struct fiber_wrapper fw = {func, arg};
SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
return 1;
}
}
The library provides a single awaitable function, async_sleep()
. It
creates a “waitable timer” object, starts the countdown, and returns it.
(Notice how SetWaitableTimer()
is a typically-ugly Win32 function with
excessive parameters.)
HANDLE
async_sleep(double seconds)
{
HANDLE promise = CreateWaitableTimer(0, 0, 0);
LARGE_INTEGER t;
t.QuadPart = (long long)(seconds * -10000000.0);
SetWaitableTimer(promise, &t, 0, 0, 0, 0);
return promise;
}
A more realistic example would be overlapped I/O. For example, you’d
open a file (CreateFile()
) in overlapped mode, then when you, say,
read from that file (ReadFile()
) you create an event object
(CreateEvent()
), populate an overlapped I/O structure with the event,
offset, and length, then finally await on the event object. The fiber
will be resumed when the operation is complete.
Side note: Unfortunately overlapped I/O doesn’t work correctly for
files, and many operations can’t be done asynchronously, like
opening files. When it comes to files, you’re better off using
dedicated threads as libuv does instead of overlapped I/O.
You can still await on these operations. You’d just await on the signal
from the thread doing synchronous I/O, not from overlapped I/O.
The most complex part is the scheduler, and it’s really not complex at
all:
void
async_run(void)
{
while (async_loop.count) {
/* Wait for next event */
DWORD nhandles = async_loop.count;
HANDLE *handles = async_loop.handles;
DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);
/* Remove event and fiber from waiting array */
void *fiber = async_loop.fibers[r];
CloseHandle(async_loop.handles[r]);
async_loop.handles[r] = async_loop.handles[nhandles - 1];
async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
async_loop.count--;
/* Run the fiber */
SwitchToFiber(fiber);
/* Destroy the fiber if it exited */
if (async_loop.dead_fiber) {
DeleteFiber(async_loop.dead_fiber);
async_loop.dead_fiber = 0;
}
}
}
This is why the handles are in their own array. The array can be passed
directly to WaitForMultipleObjects()
. The return value indicates which
handle was signaled. The handle is closed, the entry removed from the
scheduler, and then the fiber is resumed.
That WaitForMultipleObjects()
is what limits the number of fibers.
It’s not possible to wait on more than 64 handles at once! This is
hard-coded into the API. How? A return value of 64 is an error code, and
changing this would break the API. Remember what I said about being
locked into bad design decisions of the past?
To be fair, WaitForMultipleObjects()
was a doomed API anyway, just
like select(2)
and poll(2)
in POSIX. It scales very poorly since the
entire array of objects being waited on must be traversed on each call.
That’s terribly inefficient when waiting on large numbers of objects.
This sort of problem is solved by interfaces like kqueue (BSD), epoll
(Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this
particular problem well — awaiting on kernel objects — so I
couldn’t use it.
When the awaiting fiber count is zero and the scheduler has control, all
fibers must have completed and there’s nothing left to do. However, the
caller can schedule more fibers and then restart the scheduler if
desired.
That’s all there is to it. Have a look at demo.c
to see how
the API looks in some trivial examples. On Linux you can see it in
action with make check
. On Windows, you just need to compile
it, then run it like a normal program. If there was a better
function than WaitForMultipleObjects()
in the Windows API, I would
have considered turning this demonstration into a real library.