nullprogram.com/blog/2023/05/31/
I love fast builds, and for years I’ve been bothered by the build penalty
for translation units including windows.h. This header has an enormous
number of definitions and declarations and so, for C programs, it tends to
dominate the build time of those translation units. Most programs,
especially systems software, need only a tiny portion of it. For example,
when compiling u-config with GCC, two thirds of the debug build was
spent processing windows.h just for 4 types, 16 definitions, and 16
prototypes.
To give a sense of the numbers, here’s empty.c, which does nothing but
include windows.h.
With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):
$ gcc -E empty.c | grep -vc '^$'
82041
With w64devkit this takes my system ~450ms to compile with GCC:
$ time gcc -c empty.c
real 0m 0.45s
user 0m 0.00s
sys 0m 0.00s
Compiling an actually empty source file takes ~10ms, so it really is
spending practically all that time processing headers. MSVC is a faster
compiler, and this extends to processing an even larger windows.h that
crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:
$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real 0m 0.12s
user 0m 0.09s
sys 0m 0.01s
That’s just low enough to be tolerable, but I’d like the situation with
GCC to be better. Defining WIN32_LEAN_AND_MEAN reduces the number of
included headers, which has a significant effect:
$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real 0m 0.30s
user 0m 0.00s
sys 0m 0.00s
$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real 0m 0.07s
user 0m 0.01s
sys 0m 0.01s
The official solution is precompiled headers. Put all the system header
includes, or similar, into a dedicated header, then compile that
header into a special format. For example, headers.h:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
Then main.c includes windows.h through this header:
#include "headers.h"
int mainCRTStartup(void)
{
    return 0;
}
If I ask GCC to compile headers.h itself, it produces headers.h.gch. When a
source includes headers.h, GCC first searches for an appropriate .gch. Not
only must the name match, but so must all the definitions at the moment of
inclusion: headers.h should always be the first included header, otherwise
it may not work. Now when I compile main.c:
$ time gcc -c main.c
real 0m 0.04s
user 0m 0.00s
sys 0m 0.00s
Much better! MSVC has a conventional name for this header recognizable to
every Visual Studio user: stdafx.h. It works a bit differently, and I’ve
never used it myself, but I trust it has similar results.
Precompiled headers require some extra steps that vary by toolchain. Can
we do better? That depends on your definition of “better!”
Artisan, handcrafted prototypes
As mentioned, systems software tends to need only a few declarations:
open, read, write, stat, etc. What if I wrote these out manually? A bit
tedious, but it doesn’t require special precompiled header handling. It
also creates some new possibilities. To illustrate, a CRT-free
“hello world” program:
#include <windows.h>
int mainCRTStartup(void)
{
    HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
    char message[] = "Hello, world!\n";
    DWORD len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}
This takes my system half a second to compile — quite long to produce just
26 assembly instructions:
$ time cc -nostartfiles -o hello.exe hello.c
real 0m 0.50s
user 0m 0.00s
sys 0m 0.00s
$ ./hello.exe
Hello, world!
The program requires prototypes only for GetStdHandle and WriteFile, a
definition for STD_OUTPUT_HANDLE, and some typedefs. Starting with the
easy stuff, the definition and types look like this:
#define STD_OUTPUT_HANDLE ((DWORD)-11)
typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;
By the way, here’s a cheat code for quickly finding preprocessor
definitions, faster than looking them up elsewhere:
$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)
Did you catch the pattern? It’s -10 - fd, where fd is the conventional
unix file descriptor number: a kind of mnemonic.
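The mnemonic can be captured in a tiny helper. This is only a sketch: std_handle_id is a hypothetical name, not part of the Windows API, and it merely computes the constant GetStdHandle expects for a given unix-style descriptor number.

```c
// Hypothetical helper (not part of the Windows API): map a conventional
// unix file descriptor number (0, 1, 2) to the matching STD_*_HANDLE value.
int std_handle_id(int fd)
{
    return -10 - fd;  // 0 -> -10 (input), 1 -> -11 (output), 2 -> -12 (error)
}
```

With a helper like this, a call site could pass std_handle_id(1) where it would otherwise spell out STD_OUTPUT_HANDLE.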
Prototypes are a little trickier, especially if you care about 32-bit. The
Windows API uses the “stdcall” calling convention, which is distinct from
the “cdecl” calling convention on x86, though the same on x64. Of course,
you must already be aware of this merely using the API, as your own
callbacks must usually be stdcall themselves. Further, API functions are
DLL imports and should be declared as such. Putting it together,
here’s GetStdHandle:
__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);
This works with both Mingw-w64 and MSVC. MSVC requires __stdcall between
the return type and function name, so don’t get clever about it. If you
only care about GCC then you can declare both at once using attributes:
HANDLE GetStdHandle(DWORD)
    __attribute__((dllimport,stdcall));
I like to hide all this behind a macro, with a “table” of all my imports
listed just below:
#define W32(r) __declspec(dllimport) r __stdcall
W32(HANDLE) GetStdHandle(DWORD);
W32(BOOL) WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);
In WriteFile you may have noticed I’m taking shortcuts. The “official”
definition uses an ugly pointer typedef, LPCVOID, instead of pointer
syntax, but I skipped that type definition. I also replaced the last
argument, an OVERLAPPED pointer, with a generic pointer. I only need to
pass null. I can keep sanding it down to something more ergonomic:
W32(int) WriteFile(void *, void *, int, int *, void *);
That’s how I typically write these prototypes. I dropped the const
because it doesn’t help me. I used signed sizes because I like them better
and it’s what I’m usually holding at the call site. But doesn’t
changing the signedness potentially break compatibility? It makes no
difference to any practical ABI: It’s passed the same way. In general,
signedness is a matter for operators, and only some of them, mainly
comparisons (<, >, etc.) and division. It’s a similar story for
pointers starting with the 32-bit era, so I can choose whatever pointer
types are convenient.
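To make the signedness point concrete, here is a small sketch in portable C. The function names are made up for illustration and nothing here comes from the Windows API. It assumes the usual two's complement targets, where an int and an unsigned holding the same value share a bit pattern; addition then produces the same bits either way, while comparison and division do not.

```c
#include <string.h>

// Illustrative names only; none of these are Windows API functions.

// Returns 1 if an int and its unsigned conversion share a bit pattern,
// which is why a callee cannot tell which of the two was passed.
int same_bits(int s)
{
    unsigned u = (unsigned)s;
    return memcmp(&s, &u, sizeof(s)) == 0;
}

// Addition produces the same bits either way (inputs must not overflow).
int add_agrees(int a, int b)
{
    return (unsigned)(a + b) == (unsigned)a + (unsigned)b;
}

// Comparison is one of the operators where signedness matters.
int less_signed(int a, int b)   { return a < b; }
int less_unsigned(int a, int b) { return (unsigned)a < (unsigned)b; }

// Division is another: the same input bits give different quotients.
int      div_signed(int a, int b)   { return a / b; }
unsigned div_unsigned(int a, int b) { return (unsigned)a / (unsigned)b; }
```

For instance, less_signed(-1, 1) is true while less_unsigned(-1, 1) is false, even though the arguments reach both functions as identical bits.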
In general, I can do anything I want so long as I know my compiler will
produce an appropriate function call. These are not standard functions,
like printf or memcpy, which are implemented in part by the compiler
itself, but foreign functions. It’s no different than teaching an
FFI how to make a call. This is also, in essence, how OpenGL and
Vulkan work, with applications defining the API for themselves.
Considering all this, my new hello world:
#define W32(r) __declspec(dllimport) r __stdcall
W32(void *) GetStdHandle(int);
W32(int) WriteFile(void *, void *, int, int *, void *);
int mainCRTStartup(void)
{
    void *stdout = GetStdHandle(-10 - 1);
    char message[] = "Hello, world!\n";
    int len;
    return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}
You know, there’s a kind of beauty to a program that requires no external
definitions. It builds quickly and produces a binary bit-for-bit identical
to the original:
$ time cc -nostartfiles -o hello.exe main.c
real 0m 0.04s
user 0m 0.00s
sys 0m 0.00s
$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real 0m 0.03s
user 0m 0.00s
sys 0m 0.00s
I’ve also been using this to patch over API rough edges. For example,
WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus
takes OVERLAPPED. These types are explicitly compatible, and only
defined separately for annoying technical reasons. I must use the same
overlapped object with both APIs at once, meaning I would normally need
ugly pointer casts on my Winsock calls, or vice versa with I/O completion
ports. But because I’m writing all these definitions myself, I can define
a common overlapped structure for both!
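As a rough sketch of that idea, the declarations below share one structure between the two calls. The names Overlapped and WsaBuf are hypothetical stand-ins; the field layout and parameter lists follow the documented OVERLAPPED, WSABUF, WSARecvFrom, and GetQueuedCompletionStatus, simplified in the same spirit as the WriteFile prototype earlier. The non-Windows fallback for W32 is an assumption added only so the declarations parse on other systems.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_WIN32)
#  define W32(r) __declspec(dllimport) r __stdcall
#else
#  define W32(r) r  // fallback so the declarations parse off Windows
#endif

// Hypothetical common structure with the documented OVERLAPPED layout:
// two pointer-sized fields, a 64-bit offset in two halves, an event handle.
typedef struct {
    uintptr_t internal;
    uintptr_t internal_high;
    uint32_t  offset;
    uint32_t  offset_high;
    void     *event;
} Overlapped;

// Stand-in for WSABUF: a 32-bit length followed by a buffer pointer.
typedef struct {
    uint32_t len;
    char    *buf;
} WsaBuf;

// Both prototypes accept the same Overlapped type, so call sites need no casts.
W32(int) WSARecvFrom(uintptr_t, WsaBuf *, int, int *, int *,
                     void *, int *, Overlapped *, void *);
W32(int) GetQueuedCompletionStatus(void *, int *, uintptr_t *,
                                   Overlapped **, int);
```

Since the layout is written out by hand, it is worth checking the structure's size and field offsets against the official definition once, for each target, before relying on it.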
Perhaps you’re worried that this would be too fragile. Well, as a legacy
software aficionado, I enjoy building and running my programs on old
platforms. So far these programs still work properly going back
30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag,
it’s always been a bug (now long fixed) in the old operating system, not
in my programs or these prototypes. So, in effect, this technique has
worked well for the past 30 years!
Writing out these definitions is a bit of a chore, but after paying that
price I’ve been quite happy with the results. I will likely continue doing
it in the future, at least for non-graphical applications.