nullprogram.com/blog/2024/02/05/
If you’ve spent much time developing with Mingw-w64 you’ve likely seen the
symbol ___chkstk_ms
, perhaps in an error message. It’s a little piece of
runtime provided by GCC via libgcc which ensures enough of the stack is
committed for the caller’s stack frame. The “function” uses a custom ABI
and is implemented in assembly. So is the subject of this article, a
slightly improved implementation soon to be included in w64devkit as
libchkstk (-lchkstk
).
The MSVC toolchain has an identical (x64) or similar (x86) function named
__chkstk
. We’ll discuss that as well, and w64devkit will include x86 and
x64 implementations, useful when linking with MSVC object files. The new
x86 __chkstk
in particular is also better than the MSVC definition.
A note on spelling: ___chkstk_ms
is spelled with three underscores, and
__chkstk
is spelled with two. On x86, cdecl
functions are
decorated with a leading underscore, and so may be rendered, e.g. in error
messages, with one fewer underscore. The true name is undecorated, and the
raw symbol name is identical on x86 and x64. Further complicating matters,
libgcc defines a ___chkstk
with three underscores. As far as I can tell,
this spelling arose from confusion regarding name decoration, but nobody’s
noticed for the past 28 years. libgcc’s x64 ___chkstk
is obviously and
badly broken, so I’m sure nobody has ever used it anyway, not even by
accident thanks to the misspelling. I’ll touch on that below.
When referring to a particular instance, I will use a specific spelling.
Otherwise the term “chkstk” refers to the family. If you’d like to skip
ahead to the source for libchkstk: libchkstk.S
.
A gradually committed stack
The header of a Windows executable lists two stack sizes: a reserve size
and an initial commit size. The first is the largest the main thread
stack can grow, and the second is the amount committed when the
program starts. A program gradually commits stack pages as needed up to
the reserve size. Binutils objdump
option -p
lists the sizes. Typical
output for a Mingw-w64 program:
$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve 0000000000200000
SizeOfStackCommit 0000000000001000
The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB
initially committed. With the Binutils linker, ld
, you can set them at
link time using --stack
. Via gcc
, use -Xlinker
. For example, to
reserve an 8MiB stack and commit half of it:
$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...
MSVC link.exe
similarly has /stack
.
The purpose of this mechanism is to avoid paying the commit charge for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (-mno-stack-arg-probe
, /Gs2147483647
). This
requires link-time control of the main module, which isn’t always an
option, like when supplying a DLL for someone else to run.
The program grows the stack by touching the singular guard page
mapped between the committed and uncommitted portions of the stack. This
action triggers a page fault, and the default fault handler commits the
guard page and maps a new guard page just below. In other words, the stack
grows one page at a time, in order.
In most cases nothing special needs to happen. The guard page mechanism is
transparent and in the background. However, if a function stack frame
exceeds the page size then there’s a chance that it might leap over the
guard page, crashing the program. To prevent this, compilers insert a
chkstk call in the function prologue. Before local variable allocation,
chkstk walks down the stack — that is, towards lower addresses — nudging
the guard page with each step. (As a side effect it provides stack clash
protection — the only security aspect of chkstk.) For example:
void callee(char *);
void example(void)
{
char large[1<<20];
callee(large);
}
Compiled with 64-bit gcc -O
:
example:
movl $1048616, %eax
call ___chkstk_ms
subq %rax, %rsp
leaq 32(%rsp), %rcx
call callee
addq $1048616, %rsp
ret
I used GCC, but this is practically identical to the code generated by
MSVC and Clang. Note the call to ___chkstk_ms
in the function prologue
before allocating the stack frame (subq
). Also note that it sets eax
.
As a volatile register, this would normally accomplish nothing because
it’s done just before a function call, but recall that ___chkstk_ms
has
a custom ABI. That’s the argument to chkstk. Further note that it uses
rax
on the return. That’s not the value returned by chkstk, but rather
that x64 chkstk preserves all registers.
Well, maybe. The official documentation says that registers r10 and r11
are volatile, but that information conflicts with Microsoft’s own
implementation. Just in case, I choose a conservative interpretation that
all registers are preserved.
Implementing chkstk
In a high level language, chkstk might look something like so:
// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
volatile char frame[frame_size]; // NOTE: variable-length array
for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
frame[i] = 0; // touch the guard page
}
}
This wouldn’t work for a number of reasons, but if it did, volatile
would serve two purposes. First, forcing the side effect to occur. The
second is more subtle: The loop must happen in exactly this order, from
high to low. Without volatile
, loop iterations would be independent — as
there are no dependencies between iterations — and so a compiler could
reverse the loop direction.
The store can happen anywhere within the guard page, so it’s not necessary
to align frame
to the page. Simply touching at least one byte per page
is enough. This is essentially the definition of libgcc ___chkstk_ms
.
How many iterations occur? In example
above, the stack frame will be
around 1MiB (220). With pages of 4KiB (212) that’s
256 iterations. The loop happens unconditionally, meaning every function
call requires 256 iterations of this loop. Wouldn’t it be better if the
loop ran only as needed, i.e. the first time? MSVC x64 __chkstk
skips
iterations if possible, and the same goes for my new ___chkstk_ms
. Much
like the command line string, the low address of the current
thread’s guard page is accessible through the Thread Information
Block (TIB). A chkstk can cheaply query this address, only looping
during initialization or so. (In contrast to Linux, a thread’s
stack is fundamentally managed by the operating system.)
Taking that into account, an improved algorithm:
- Push registers that will be used
- Compute the low address of the new stack frame (F)
- Retrieve the low address of the committed stack (C)
- Go to 7
- Subtract the page size from C
- Touch memory at C
- If C > F, go to 5
- Pop registers to restore them and return
A little unusual for an unconditional forward jump in pseudo-code, but
this closely matches my assembly. The loop causes page faults, and it’s
the slow, uncommon path. The common, fast path never executes 5–6. I’d
also chose smaller instructions in order to keep the function small and
reduce instruction cache pressure. My x64 implementation as of this
writing:
___chkstk_ms:
push %rax // 1.
push %rcx // 1.
neg %rax // 2. rax = frame low address
add %rsp, %rax // 2. "
mov %gs:(0x10), %rcx // 3. rcx = stack low address
jmp 1f // 4.
0: sub $0x1000, %rcx // 5.
test %eax, (%rcx) // 6. page fault (very slow!)
1: cmp %rax, %rcx // 7.
ja 0b // 7.
pop %rcx // 8.
pop %rax // 8.
ret // 8.
I’ve labeled each instruction with its corresponding pseudo-code. Step 6
is unusual among chkstk implementations: It’s not a store, but a load,
still sufficient to fault the page. That test
instruction is just two
bytes, and unlike other two-byte options, doesn’t write garbage onto the
stack — which would be allowed — nor use an extra register. I searched
through single byte instructions that can page fault, all of which involve
implicit addressing through rdi
or rsi
, but they increment rdi
or
rsi
, and would would require another instruction to correct it.
Because of the return address and two push
operations, the low stack
frame address is technically too low by 24 bytes. That’s fine. If this
exhausts the stack, the program is really cutting it close and the stack
is too small anyway. I could be more precise — which, as we’ll soon see,
is required for x86 __chkstk
— but it would cost an extra instruction
byte.
On x64, ___chkstk_ms
and __chkstk
have identical semantics, so name it
__chkstk
— which I’ve done in libchkstk — and it works with MSVC. The
only practical difference between my chkstk and MSVC __chkstk
is that
mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking
the optimization, is libgcc ___chkstk_ms
, weighing 50 bytes, or in
practice, due to an unfortunate Binutils default of padding sections, 64
bytes.
I’m no assembly guru, and I bet this can be even smaller without hurting
the fast path, but this is the best I could come up with at this time.
Update: Stefan Kanthak, who has extensively explored this
topic, points out that large stack frame requests might overflow
my low frame address calculation at (3), effectively disabling the probe.
Such requests might occur from alloca calls or variable-length arrays
(VLAs) with untrusted sizes. As far as I’m concerned, such programs are
already broken, but it only cost a two-byte instruction to deal with it. I
have not changed this article, but the source in w64devkit has been
updated.
32-bit chkstk
On x86 ___chkstk_ms
has identical semantics to x64. Mine is a copy-paste
of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC
was ahead of the curve on this design.
However, x86 __chkstk
is bonkers. It not only commits the stack, but
also allocates the stack frame. That is, it returns with a different stack
pointer. The return pointer is initially inside the new stack frame, so
chkstk must retrieve it and return by other means. It must also precisely
compute the low frame address.
__chkstk:
push %ecx // 1.
neg %eax // 2.
lea 8(%esp,%eax), %eax // 2.
mov %fs:(0x08), %ecx // 3.
jmp 1f // 4.
0: sub $0x1000, %ecx // 5.
test %eax, (%ecx) // 6. page fault (very slow!)
1: cmp %eax, %ecx // 7.
ja 0b // 7.
pop %ecx // 8.
xchg %eax, %esp // ?. allocate frame
jmp *(%eax) // 8. return
The main differences are:
eax
is treated as volatile, so it is not saved
- The low frame address is precisely computed with
lea
(2)
- The frame is allocated at step (?) by swapping F and the stack pointer
- Post-swap F now points at the return address, so jump through it
MSVC x86 __chkstk
does not query the TIB (3), and so unconditionally
runs the loop. So there’s an advantage to my implementation besides size.
libgcc x86 ___chkstk
has this behavior, and so it’s also a suitable
__chkstk
aside from the misspelling. Strangely, libgcc x64 ___chkstk
also allocates the stack frame, which is never how chkstk was supposed
to work on x64. I can only conclude it’s never been used.
Optimization in practice
Does the skip-the-loop optimization matter in practice? Consider a
function using a large-ish, stack-allocated array, perhaps to process
environment variables or long paths, each of which max out
around 64KiB.
_Bool path_contains(wchar_t *name, wchar *path)
{
wchar_t var[1<<15];
GetEnvironmentVariableW(name, var, countof(var));
// ... search for path in var ...
}
int64_t getfilesize(char *path)
{
wchar_t wide[1<<15];
MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
// ... look up file size via wide path ...
}
void example(void)
{
if (path_contains(L"PATH", L"c:\\windows\\system32")) {
// ...
}
int64_t size = getfilesize("π.txt");
// ...
}
Each call to these functions with such large local arrays is also a call
to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely
detectable in a benchmark. If the function touches the file system, which
is likely when processing paths, then chkstk doesn’t matter at all. My
starting example had a 1MiB array, or 256 chkstk iterations. That starts
to become measurable, though it’s also pushing the limits. At that point
you ought to be using a scratch arena.
So ultimately after writing an improved ___chkstk_ms
I could only
measure a tiny difference in contrived programs, and none in any real
application. Though there’s still one more benefit I haven’t yet
mentioned…
My original motivation for this project wasn’t the optimization — which I
didn’t even discover until after I had started — but licensing. I hate
software licenses, and the tools I’ve written for w64devkit
are dedicated to the public domain. Both source and binaries (as
distributed). I can do so because I don’t link runtime components,
not even libgcc. Not even header files. Every byte of code in those
binaries is my work or the work of my collaborators.
Every once in awhile ___chkstk_ms
rears its ugly head, and I have to
make a decision. Do I re-work my code to avoid it? Do I take the reigns of
the linker and disable stack probes? I haven’t necessarily allocated a
large local array: A bit of luck with function inlining can combine
several smaller stack frames into one that’s just large enough to require
chkstk.
Since libgcc falls under the GCC Runtime Library Exception, if it’s
linked into my program through an “Eligible Compilation Process” — which I
believe includes w64devkit — then the GPL-licensed functions embedded in
my binary are legally siloed and the GPL doesn’t infect the rest of the
program. These bits are still GPL in isolation, and if someone were to
copy them out of the program then they’d be normal GPL code again. In
other words, it’s not a 100% public domain binary if libgcc was linked!
(If some FSF lawyer says I’m wrong, then this is an escape hatch through
which anyone can scrub the GPL from GCC runtime code, and then ignore the
runtime exception entirely.)
MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose chkstk.obj
, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC. At least not
without special care which nobody does. (Don’t worry, I won’t tell.)
How to use libchkstk
To avoid libgcc entirely you need -nostdlib
. Otherwise it’s implicitly
offered to the linker, and you’d need to manually check if it picked up
code from libgcc. If ld
complains about a missing chkstk, use -lchkstk
to get a definition. If you use -lchkstk
when it’s not needed, nothing
happens, so it’s safe to always include.
I also recently added a libmemory to w64devkit, providing tiny,
public domain definitions of memset
, memcpy
, memmove
, memcmp
, and
strlen
. All compilers fabricate calls to these five functions even if
you don’t call them yourself, which is how they were selected. (Not
because I like them. I really don’t.). If a -nostdlib
build
complains about these, too, then add -lmemory
.
$ gcc -nostdlib ... -lchkstk -lmemory
In MSVC the equivalent option is /nodefaultlib
, after which you may see
missing chkstk errors, and perhaps more. libchkstk.a
is compatible with
MSVC, and link.exe
doesn’t care that the extension is .a
rather than
.lib
, so supply it at link time. Same goes for libmemory.a
if you need
any of those, too.
$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
While I despise licenses, I still take them seriously in the software I
distribute. With libchkstk I have another tool to get it under control.
Big thanks to Felipe Garcia for reviewing and correcting mistakes in this
article before it was published!