nullprogram.com/blog/2023/02/15/
Seven years ago I wrote about “freestanding” Windows executables.
After an additional seven years of practical experience both writing and
distributing such programs, half using a custom-built toolchain,
it’s time to revisit these cabalistic incantations and otherwise scant
details. I’ve tweaked my older article over the years as I’ve learned, but
this is a full replacement and does not assume you’ve read it. The “why”
has been covered and the focus will be on the “how”. Both the GNU
and MSVC toolchains will be considered.
I no longer call these “freestanding” programs since that term is, at
best, inaccurate. In fact, we will be actively avoiding GCC
features associated with that label. Instead I call these CRT-free
programs, where CRT stands for the C runtime, the Windows-oriented term
for libc. This term communicates both intent and scope.
Entry point
You should already know that main is not the program’s entry point, but a
C application’s entry point. The CRT provides the entry point, where it
initializes the CRT, including parsing command line options, then calls
the application’s main. The real entry point doesn’t have a name.
It’s just the address of the function to be called by the loader without
arguments.
You might naively assume you could continue using the name main and tell
the linker to use it as the entry point. You would be wrong. Avoid the
name main! It has a special meaning in C and gets special treatment. Using
it without a conventional CRT will confuse your tools and may cause build
issues.
While you can use almost any other name you like, the conventional names
are mainCRTStartup (console subsystem) and WinMainCRTStartup (windows
subsystem). It’s easy to remember: Append CRTStartup to the name you’d
use in a normal CRT-linking application. I strongly recommend using these
names because it reduces friction. Your tools are already familiar with
them, so you won’t need to do anything special.
int mainCRTStartup(void); // console subsystem
int WinMainCRTStartup(void); // windows subsystem
The MSVC linker documentation says the entry point uses the __stdcall
calling convention. Ignore this and do not use __stdcall for your entry
point! Since entry points take no arguments, there is no practical
difference from the __cdecl calling convention, so it does not actually
matter. Rather, the goal is to avoid __stdcall function decorations.
In particular, the GNU linker --entry option does not understand them,
nor can it find decorated entry points on its own. If you use __stdcall,
then the 32-bit GNU linker will silently (!) choose the beginning of your
.text section as the entry point.
If you’re using C++, then of course you will also need to use extern "C"
so that it’s not name-mangled. Otherwise the results are similarly bad.
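As a minimal sketch, assuming the same source file might be compiled as either C or C++, the linkage specification can be guarded with the standard __cplusplus macro so that a C compiler never sees it:

```c
// Sketch only: __cplusplus is defined by C++ compilers, so the linkage
// specification below is invisible when this file is compiled as C.
#ifdef __cplusplus
extern "C"
#endif
int mainCRTStartup(void)
{
    return 0;
}
```

Either way the function keeps the unmangled name mainCRTStartup, which is the name the linker looks for.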
If using -fwhole-program, you will need to mark your entry point as
externally visible for GCC so that it knows it’s an entry point. While
linkers are familiar with conventional entry point names, GCC the
compiler is not. Normally you do not need to worry about this.
__attribute((externally_visible)) // for -fwhole-program
int mainCRTStartup(void)
{
    return 0;
}
The entry point returns int. If there are no other threads then the
process will exit with the returned value as its exit status. In practice
this is only useful for console programs. Windows subsystem programs have
threads started automatically, without warning, and it’s almost certain
your main thread is not the last thread. You probably want to use
ExitProcess or even TerminateProcess instead of returning. The latter
exits more abruptly and can avoid issues with certain subsystems, like
DirectSound, not shutting down gracefully: It doesn’t even let them try.
int WinMainCRTStartup(void)
{
    // ...
    TerminateProcess(GetCurrentProcess(), 0);
}
Compilation
Starting with the GNU toolchain, you have two ways to get into “CRT-free
mode”: -nostartfiles and -nostdlib. The former is more dummy-proof,
and it’s what I use in build documentation. The latter can be more
complicated, but when it succeeds you get guarantees about the result. I
use it in build scripts I intend to run myself, which I want to fail if
they don’t do exactly what I expect. To illustrate, consider this trivial
program:
#include <windows.h>
int mainCRTStartup(void)
{
    ExitProcess(0);
}
This program uses ExitProcess from kernel32.dll. Compiling is easy:
$ cc -nostartfiles example.c
The -nostartfiles prevents it from linking the CRT entry point, but it
still implicitly passes other “standard” linker flags, including libraries
-lmingw32 and -lkernel32. Programs can use kernel32.dll functions
without explicitly linking that DLL. But, hey, isn’t -lmingw32 the CRT,
the thing we’re avoiding? It is, but it wasn’t actually linked because the
program didn’t reference it.
$ objdump -p a.exe | grep -Fi .dll
DLL Name: KERNEL32.dll
However, -nostdlib does not pass any of these libraries, so you need to
do so explicitly.
$ cc -nostdlib example.c -lkernel32
The MSVC toolchain behaves a little like -nostartfiles, not linking a
CRT unless you need it, semi-automatically. However, you’ll need to list
kernel32.dll and tell it which subsystem you’re using.
$ cl example.c /link /subsystem:console kernel32.lib
However, MSVC has a handy little feature to list these arguments in the
source file.
#ifdef _MSC_VER
#pragma comment(linker, "/subsystem:console")
#pragma comment(lib, "kernel32.lib")
#endif
This information must go somewhere, and I prefer the source file rather
than a build script. Then anyone can point MSVC at the source without
worrying about options.
I try to make all my Windows programs so simply built.
Stack probes
On Windows, it’s expected that stacks will commit dynamically. That is,
the stack is merely reserved address space, and it’s only committed when
the stack actually grows into it. This made sense 30 years ago as a memory
saving technique, but today it no longer makes sense. However, programs
are still built to use this mechanism.
To function properly, programs must touch each stack page for the first
time in order. Normally that’s not an issue, but if your stack frame
exceeds the page size, there’s a chance it might step over a page. When a
function has a large stack frame, GCC inserts a call to a “stack probe” in
libgcc that touches its pages in the prologue. It’s not unlike stack
clash protection.
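To make the idea concrete, here is a rough sketch of what a probe amounts to. It is not the libgcc routine: probe_pages and PROBE_STEP are hypothetical names, the 4KiB page size is an assumption, and it walks an ordinary buffer rather than the real stack. It touches one byte per page, highest address first, the way a stack that grows downward would be committed:

```c
#include <stddef.h>

enum { PROBE_STEP = 4096 };  // assumption: 4KiB pages

// Touch one byte in every page-sized step of [base, base+size), starting
// from the highest address and working down, like a stack growing downward.
// Returns the number of touches made.
static size_t probe_pages(volatile char *base, size_t size)
{
    size_t touched = 0;
    size_t off = size;
    while (off > 0) {
        off = off > PROBE_STEP ? off - PROBE_STEP : 0;
        base[off] |= 0;  // read-modify-write forces a real memory access
        touched++;
    }
    return touched;
}
```

For a frame spanning three pages it makes three touches, one per page, so no page is stepped over. The real probe only matters for frames large enough to skip a page.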
For example, if I have a 4KiB local variable:
int mainCRTStartup(void)
{
    char buf[1<<12] = {0};
    return 0;
}
When I compile with -nostdlib:
$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'
It’s trying to link the stack probe from libgcc. You can disable this
behavior with -mno-stack-arg-probe.
$ cc -mno-stack-arg-probe -nostdlib example.c
Or you can just link -lgcc to provide a definition:
$ cc -nostdlib example.c -lgcc
Had you used -nostartfiles, you wouldn’t have noticed because it passes
-lgcc automatically. It’s “dummy-proof” because this sort of issue goes
away before it comes up, though for the same reason it’s harder to tell
exactly what went into a program.
If you disable the probe altogether — my preference — you’ve only solved
the linker problem, but the underlying stack commit problem remains and
your program may crash. You can solve that by telling the linker to ask
the loader to commit a larger stack up front rather than grow it at run
time. Say, 2MiB:
$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c
Of course, I wish that this was simply the default behavior because it’s
far more sensible! A much better option is to avoid large stack frames in
the first place. Allocate locals larger than, say, 1KiB in a scratch arena
instead of on the stack.
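A minimal sketch of that idea, assuming a simple bump allocator over a caller-provided block of memory. Arena, arena_alloc, and fill are hypothetical names for illustration, not part of any library:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical bump allocator: names and layout are illustrative only.
typedef struct {
    char *beg;  // next free byte
    char *end;  // one past the last usable byte
} Arena;

// Carve size bytes, aligned to align (a power of two), out of the arena.
// Returns a null pointer when the arena cannot satisfy the request.
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (size < 0 || size > a->end - a->beg - pad) {
        return 0;
    }
    void *p = a->beg + pad;
    a->beg += pad + size;
    return p;
}

// Takes its 4KiB buffer from the arena instead of declaring
// "char buf[1<<12]" as a local. Returns 1 on success, 0 if out of memory.
static int fill(Arena scratch)
{
    char *buf = arena_alloc(&scratch, 1<<12, 1);
    if (!buf) {
        return 0;
    }
    buf[0] = 'x';
    return 1;
}
```

In a CRT-free program the backing memory could be a static array or a block obtained from VirtualAlloc. Because fill receives the arena by copy, its allocation is implicitly released when it returns.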
MSVC doesn’t have libgcc of course, but it still generates stack probes
both for growing the stack and for security checks. The latter requires
kernel32.dll, so if I compile the same program with MSVC, I get a bunch
of linker failures:
$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...
Using /Gs1000000000 turns off the stack probes, /GS- turns off the
checks, and /stack commits a larger stack:
$ cl /GS- /Gs1000000000 example.c /link \
     /subsystem:console /stack:0x200000,0x200000
Though, as before, better to avoid large stack frames in the first place.
Built-in functions… ugh
The three major C and C++ compilers — GCC, MSVC, Clang — share a common,
evil weakness: “built-in” functions. No matter what, they each assume
you will supply definitions for standard string functions at link time,
particularly memset and memcpy. They do this no matter how many
“seriously now, do not use standard C functions” options you pass. When
you don’t link a CRT, you may need to define them yourself.
With GCC there’s a catch: it will transform your memset definition (that
is, the code inside a function named memset) into a call to itself. After
all, it looks an awful lot like memset! This typically manifests as an
infinite loop. Use -fno-builtin to prevent GCC from mis-compiling
built-in functions.
Even with -fno-builtin, both GCC and Clang will continue inserting calls
to built-in functions elsewhere. For example, making an especially large
local variable (and using volatile to prevent it from being optimized
out):
int mainCRTStartup(void)
{
    volatile char buf[1<<14] = {0};
    return 0;
}
As of this writing, the latest GCC and Clang will generate a memset call
despite -fno-builtin:
$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...
To be absolutely pure, you will need to address this in just about any
non-trivial program. On the other hand, -nostartfiles will grab a
definition from msvcrt.dll for you:
$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
DLL Name: msvcrt.dll
To be clear, this is a completely legitimate and pragmatic route! You
get the benefits of both worlds: the CRT is still out of the way, but
there’s also no hassle from misbehaving compilers. If this sounds like a
good deal, then do it! (For on-lookers feeling smug: there is no such
easy, general solution for this problem on Linux.)
When you write your own definitions, I suggest putting each definition in
its own section so that they can be discarded via -Wl,--gc-sections when
unused:
__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
    // ...
}
So far, for all three compilers, I’ve only needed to provide definitions
for memset and memcpy.
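For reference, a minimal sketch of what the bodies might look like, using plain byte-at-a-time loops. The names my_memset and my_memcpy are placeholders so the sketch can be compiled and tested alongside a hosted libc. In a CRT-free build they would take the standard names, carry the section attributes shown above, and be compiled with -fno-builtin so GCC does not turn the loops back into calls to themselves:

```c
#include <stddef.h>

// Placeholder name: a CRT-free build would name this memset.
void *my_memset(void *d, int c, size_t n)
{
    unsigned char *dst = d;
    for (size_t i = 0; i < n; i++) {
        dst[i] = (unsigned char)c;
    }
    return d;  // like the standard function, return the destination
}

// Placeholder name: a CRT-free build would name this memcpy.
// As with the standard function, the regions must not overlap.
void *my_memcpy(void *d, const void *s, size_t n)
{
    unsigned char *dst = d;
    const unsigned char *src = s;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return d;
}
```

These favor simplicity over speed. That is usually fine, since the compiler only falls back on them for the occasional large initialization or copy.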
Stack alignment on 32-bit x86
GCC expects a 16-byte aligned stack and generates code accordingly. Such
is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However,
the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal
with it, there will likely be unaligned loads. Some may not be valid (e.g.
SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a
function attribute for this:
__attribute((force_align_arg_pointer))
int mainCRTStartup(void)
{
    // ...
}
GCC will now align the stack in this function’s prologue. Adjustment is
only necessary at entry points, as GCC will maintain alignment through its
own frames. This includes all entry points, not just the program entry
point, particularly thread start functions. Rule of thumb for i686 GCC:
If WINAPI or __stdcall appears in a definition, the stack likely
requires alignment.
__attribute((force_align_arg_pointer))
DWORD WINAPI mythread(void *arg)
{
    // ...
}
It’s harmless to use this attribute on x64. The prologue will just be a
smidge larger. If you’re worried about it, use #ifdef __i686__ to limit
it to 32-bit builds.
Putting it all together
If I’ve written a graphical application with WinMainCRTStartup, used
large stack frames, marked my entry point as externally visible, plan to
support 32-bit builds, and defined a couple of needed string functions, my
optimal entry point may look something like:
#ifdef __GNUC__
__attribute((externally_visible))
#endif
#ifdef __i686__
__attribute((force_align_arg_pointer))
#endif
int WinMainCRTStartup(void)
{
    // ...
}
Then my “optimize all the things” release build may look something like:
$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows \
     -fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32
Or with MSVC:
$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows
Or if I’m taking it easy maybe just:
$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c
Or with MSVC (linker flags in source):