As usual, it helps to begin with a concrete example of the problem. The
following is a conventional .pc file much like you’d find on your own system:
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: Example
Version: 1.0
Description: An example .pc file
Cflags: -I${includedir}
Libs: -L${libdir} -lexample
It begins by defining the library’s installation prefix from which it
derives additional paths, which are finally used in the package fields
that generate build flags (Cflags, Libs). If I run u-config against this configuration:
$ pkg-config --cflags --libs example
-I/usr/include -L/usr/lib -lexample
Typically prefix is populated by the library’s build system, which knows where the library is to be installed. In some situations that’s not possible, and there is no opportunity to set prefix to a meaningful path. In that case, pkg-config can automatically override it (--define-prefix) with a path relative to the .pc file, making the installation relocatable. This works quite well on Windows, where it’s the default:
$ pkg-config --cflags --libs example
-IC:/Users/me/example/include -LC:/Users/me/example/lib -lexample
This just works… so long as the path does not contain spaces. Otherwise it risks being split into separate fields. The .pc format supports quoting to control how such output is escaped. Regions between quotes are escaped in the output so that they retain their spaces when field-split by the shell. If a .pc file author is careful, they’d write it with quotes:
Cflags: -I"${includedir}"
Libs: -L"${libdir}" -lexample
The paths are carefully placed within quoted regions so that they come out properly:
$ pkg-config --cflags example
-IC:/Program\ Files/example/include
Almost nobody writes their .pc files this way! The convention is not to quote. My original solution was to implicitly wrap prefix in quotes on assignment, which fixes the vast majority of .pc files. That effectively looks like this in the “virtual” .pc file:
prefix="C:/Program Files/example"
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
So the important region is quoted, its spaces preserved. However, the occasional library author actively supporting Windows inevitably runs into this problem, and their system’s pkg-config implementation does not quote prefix. They soon figure out explicit quoting and apply it, which then undermines u-config’s implicit quoting. The quotes essentially cancel out:
"${includedir}" -> ""C:/Program Files/example"/include"
The quoted regions are inverted and nothing happens. Though this is a small minority, the libraries that do this and the ones you’re likely to use on Windows are correlated. I was stumped: How to support quoted and unquoted .pc files simultaneously?
I recently had the thought: What if u-config somehow tracked which spans of a string were paths? prefix would begin as a path span, tracked through macro-expansion and concatenation. Soon after that I realized it’s even simpler: Encode the spaces in a path as a value other than space, but also a value that cannot appear in the input. Recall that certain octets can never appear in UTF-8 text: the 8 values whose highest 5 bits are set. Such an octet would be the first of a 5-octet, or longer, code point, but those are forbidden.
11111xxx
When paths enter the macro system, special characters are encoded as one of these 8 values. They’re converted back to their original ASCII values during output encoding, escaped. It doesn’t interact with the pkg-config quoting mechanism, so there’s no quote cancellation. Both quoting cases are supported equally.
For example, if space is mapped onto \xff (255), then:
in: C:/Program Files/foo -> C:/Program\xffFiles/foo
out: C:/Program\xffFiles/foo -> C:/Program\ Files/foo
Which prints the same regardless of ${includedir} or "${includedir}". Problem solved!
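The round trip can be sketched in portable C. These helper names (path_encode, path_decode) are hypothetical stand-ins, not u-config’s actual functions, and a real implementation must handle more than spaces:

```c
// Hypothetical sketch: encode spaces as 0xFF, an octet that can never
// appear in UTF-8 input, then escape them on the way out.

static void path_encode(char *dst, const char *src)
{
    for (; *src; src++, dst++) {
        *dst = (*src == ' ') ? (char)0xff : *src;
    }
    *dst = 0;
}

static void path_decode(char *dst, const char *src)
{
    for (; *src; src++) {
        if ((unsigned char)*src == 0xff) {
            *dst++ = '\\';   // escape the decoded space
            *dst++ = ' ';
        } else {
            *dst++ = *src;
        }
    }
    *dst = 0;
}
```

Unencoded bytes pass straight through both functions, which is the whole point: only bytes that entered as part of a path are escaped on output.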
That’s not the only complication. Outputs may deliberately include shell
metacharacters, though typically these are Makefile fragments. For
example, the default value of ${pc_top_builddir} is $(top_builddir), which make will later expand. While these characters are special to a shell, and certainly special to make, they must not be escaped.
What if a path contains these characters? The pkg-config quoting mechanism
won’t help. It’s only concerned with spaces, and $(...) prints the same quoted or not. As before, u-config must track provenance — whether or not such characters originated from a path.
If $PKG_CONFIG_TOP_BUILD_DIR is set, then pc_top_builddir is set to this environment variable, useful when the result isn’t processed by make. In this case it’s a path, and $(...) ought to be escaped. Even without $ it must be quoted, because the parentheses would still invoke a subshell. But who would put parentheses in a path? Lo and behold!
C:/Program Files (x86)/example
Again, extending UTF-8 solves this as well: Encode $, (, and ) in paths using three of those forbidden octets, and escape them on the way out, allowing unencoded instances to go straight through.
in: C:/Program\xffFiles\xff\xfdx86\xfe/example
out: C:/Program\ Files\ \(x86\)/example
This makes pc_top_builddir straightforward: default to a raw string, otherwise a path-encoded environment variable (note: s8 is a string type and upsert is a hash map):
s8 top_builddir = s8("$(top_builddir)");
if (envvar_set) {
    top_builddir = s8pathencode(envvar, perm);
}
*upsert(&global, s8("pc_top_builddir"), perm) = top_builddir;
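For context, s8 and upsert are u-config internals. A minimal counted-string type in the same spirit might look like this sketch of my own; u-config’s exact definitions differ:

```c
#include <stddef.h>
#include <stdint.h>

// A counted string: pointer plus length, no null terminator required.
typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} s8;

// Wrap a C string literal, measuring its length at compile time.
// (A function-like macro does not collide with the typedef name.)
#define s8(lit) (s8){(uint8_t *)lit, sizeof(lit)-1}
```

Because lengths are carried explicitly, substrings and concatenations never need to scan for a terminator, which is what makes span-oriented tricks like path encoding cheap.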
For a particularly wild case, consider deliberately using a uname -m command substitution to construct a path, i.e. the path contains the target machine architecture (i686, x86_64, etc.):
Cflags: -I${prefix}/$(uname -m)/include
(Not that I condone such nonsense. This is merely a reality of real-world .pc files.) With prefix automatically set as above, this will print:
-IC:/Program\ Files\ \(x86\)/example/$(uname -m)/include
Path parentheses are escaped because they came from a path, but the command substitution passes through because it came from the .pc source. Quite cool!

You’ve probably seen ___chkstk_ms before, perhaps in an error message. It’s a little piece of runtime provided by GCC via libgcc which ensures enough of the stack is committed for the caller’s stack frame. The “function” uses a custom ABI and is implemented in assembly. So is the subject of this article, a slightly improved implementation soon to be included in w64devkit as libchkstk (-lchkstk).
The MSVC toolchain has an identical (x64) or similar (x86) function named __chkstk. We’ll discuss that as well, and w64devkit will include x86 and x64 implementations, useful when linking with MSVC object files. The new x86 __chkstk in particular is also better than the MSVC definition.
A note on spelling: ___chkstk_ms is spelled with three underscores, and __chkstk is spelled with two. On x86, cdecl functions are decorated with a leading underscore, and so may be rendered, e.g. in error messages, with one fewer underscore. The true name is undecorated, and the raw symbol name is identical on x86 and x64. Further complicating matters, libgcc defines a ___chkstk with three underscores. As far as I can tell, this spelling arose from confusion regarding name decoration, but nobody’s noticed for the past 28 years. libgcc’s x64 ___chkstk is obviously and badly broken, so I’m sure nobody has ever used it anyway, not even by accident thanks to the misspelling. I’ll touch on that below.
When referring to a particular instance, I will use a specific spelling. Otherwise the term “chkstk” refers to the family. If you’d like to skip ahead to the source for libchkstk: libchkstk.S.
The header of a Windows executable lists two stack sizes: a reserve size
and an initial commit size. The first is the largest the main thread
stack can grow, and the second is the amount committed when the
program starts. A program gradually commits stack pages as needed up to
the reserve size. Binutils objdump option -p lists the sizes. Typical output for a Mingw-w64 program:
$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve 0000000000200000
SizeOfStackCommit 0000000000001000
The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB initially committed. With the Binutils linker, ld, you can set them at link time using --stack. Via gcc, use -Xlinker. For example, to reserve an 8MiB stack and commit half of it:
$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...
MSVC link.exe similarly has /stack.
The purpose of this mechanism is to avoid paying the commit charge for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (-mno-stack-arg-probe, /Gs2147483647). This requires link-time control of the main module, which isn’t always an option, like when supplying a DLL for someone else to run.
The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.
In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:
void callee(char *);

void example(void)
{
    char large[1<<20];
    callee(large);
}
Compiled with 64-bit gcc -O:
example:
        movl    $1048616, %eax
        call    ___chkstk_ms
        subq    %rax, %rsp
        leaq    32(%rsp), %rcx
        call    callee
        addq    $1048616, %rsp
        ret
I used GCC, but this is practically identical to the code generated by MSVC and Clang. Note the call to ___chkstk_ms in the function prologue before allocating the stack frame (subq). Also note that it sets eax. As a volatile register, this would normally accomplish nothing because it’s done just before a function call, but recall that ___chkstk_ms has a custom ABI. That’s the argument to chkstk. Further note that it uses rax on the return. That’s not the value returned by chkstk, but rather that x64 chkstk preserves all registers.
Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.
In a high level language, chkstk might look something like so:
// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size];  // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0;  // touch the guard page
    }
}
This wouldn’t work for a number of reasons, but if it did, volatile would serve two purposes. First, forcing the side effect to occur. The second is more subtle: The loop must happen in exactly this order, from high to low. Without volatile, loop iterations would be independent — as there are no dependencies between iterations — and so a compiler could reverse the loop direction.
The store can happen anywhere within the guard page, so it’s not necessary
to align frame to the page. Simply touching at least one byte per page is enough. This is essentially the definition of libgcc ___chkstk_ms.
How many iterations occur? In example above, the stack frame will be around 1MiB (2^20). With pages of 4KiB (2^12) that’s 256 iterations. The loop happens unconditionally, meaning every such function call requires 256 iterations of this loop. Wouldn’t it be better if the loop ran only as needed, i.e. the first time? MSVC x64 __chkstk skips iterations if possible, and the same goes for my new ___chkstk_ms. Much like the command line string, the low address of the current thread’s guard page is accessible through the Thread Information Block (TIB). A chkstk can cheaply query this address, only looping during initialization or so. (In contrast to Linux, a thread’s stack is fundamentally managed by the operating system.)
Taking that into account, an improved algorithm:
1. Save the registers it will use
2. Compute the new frame’s low address
3. Load the committed stack’s low address from the TIB
4. Jump to step 7
5. Move down one page
6. Touch the page, committing it (slow!)
7. If still above the frame’s low address, go to step 5
8. Restore registers and return
A little unusual for an unconditional forward jump in pseudo-code, but this closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes steps 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:
___chkstk_ms:
        push %rax               // 1.
        push %rcx               // 1.
        neg  %rax               // 2. rax = frame low address
        add  %rsp, %rax         // 2. "
        mov  %gs:(0x10), %rcx   // 3. rcx = stack low address
        jmp  1f                 // 4.
0:      sub  $0x1000, %rcx      // 5.
        test %eax, (%rcx)       // 6. page fault (very slow!)
1:      cmp  %rax, %rcx         // 7.
        ja   0b                 // 7.
        pop  %rcx               // 8.
        pop  %rax               // 8.
        ret                     // 8.
I’ve labeled each instruction with its corresponding pseudo-code. Step 6 is unusual among chkstk implementations: It’s not a store, but a load, still sufficient to fault the page. That test instruction is just two bytes, and unlike other two-byte options, doesn’t write garbage onto the stack — which would be allowed — nor use an extra register. I searched through single-byte instructions that can page fault, all of which involve implicit addressing through rdi or rsi, but they increment rdi or rsi, and would require another instruction to correct it.
Because of the return address and two push operations, the low stack frame address is technically too low by 24 bytes. That’s fine. If this exhausts the stack, the program is really cutting it close and the stack is too small anyway. I could be more precise — which, as we’ll soon see, is required for x86 __chkstk — but it would cost an extra instruction byte.
On x64, ___chkstk_ms and __chkstk have identical semantics, so name it __chkstk — which I’ve done in libchkstk — and it works with MSVC. The only practical difference between my chkstk and MSVC __chkstk is that mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking the optimization, is libgcc ___chkstk_ms, weighing 50 bytes, or in practice, due to an unfortunate Binutils default of padding sections, 64 bytes.
I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.
Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (3), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.
On x86, ___chkstk_ms has identical semantics to x64. Mine is a copy-paste of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC was ahead of the curve on this design.
However, x86 __chkstk is bonkers. It not only commits the stack, but also allocates the stack frame. That is, it returns with a different stack pointer. The return pointer is initially inside the new stack frame, so chkstk must retrieve it and return by other means. It must also precisely compute the low frame address.
__chkstk:
        push %ecx               // 1.
        neg  %eax               // 2.
        lea  8(%esp,%eax), %eax // 2.
        mov  %fs:(0x08), %ecx   // 3.
        jmp  1f                 // 4.
0:      sub  $0x1000, %ecx      // 5.
        test %eax, (%ecx)       // 6. page fault (very slow!)
1:      cmp  %eax, %ecx         // 7.
        ja   0b                 // 7.
        pop  %ecx               // 8.
        xchg %eax, %esp         // ?. allocate frame
        jmp  *(%eax)            // 8. return
The main differences are:
- eax is treated as volatile, so it is not saved
- The low frame address is computed precisely by the lea (2), accounting for the saved register and return address
- The frame is allocated (xchg) before returning via an indirect jump (8)
MSVC x86 __chkstk does not query the TIB (3), and so unconditionally runs the loop. So there’s an advantage to my implementation besides size. libgcc x86 ___chkstk has this behavior, and so it’s also a suitable __chkstk aside from the misspelling. Strangely, libgcc x64 ___chkstk also allocates the stack frame, which is never how chkstk was supposed to work on x64. I can only conclude it’s never been used.
Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which maxes out around 64KiB.
_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }
    int64_t size = getfilesize("π.txt");
    // ...
}
Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.
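To put numbers on those iteration counts, a toy calculation (my own illustration, not part of libchkstk) assuming 4KiB pages:

```c
#include <stddef.h>

// Number of pages a chkstk loop must probe for a given frame size,
// assuming 4KiB (1<<12) pages.
static ptrdiff_t chkstk_probes(ptrdiff_t frame_size)
{
    return frame_size >> 12;
}
```

A 64KiB frame (two 32K-element wchar_t arrays above each occupy 64KiB) costs 16 probes; the 1MiB frame from the opening example costs 256.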
So ultimately, after writing an improved ___chkstk_ms, I could only measure a tiny difference in contrived programs, and none in any real application. Though there’s still one more benefit I haven’t yet mentioned…
My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.
Every once in a while ___chkstk_ms rears its ugly head, and I have to make a decision. Do I re-work my code to avoid it? Do I take the reins of the linker and disable stack probes? I haven’t necessarily allocated a large local array: A bit of luck with function inlining can combine several smaller stack frames into one that’s just large enough to require chkstk.
Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!
(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)
MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose chkstk.obj
, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC. At least not
without special care which nobody does. (Don’t worry, I won’t tell.)
To avoid libgcc entirely you need -nostdlib. Otherwise it’s implicitly offered to the linker, and you’d need to manually check if it picked up code from libgcc. If ld complains about a missing chkstk, use -lchkstk to get a definition. If you use -lchkstk when it’s not needed, nothing happens, so it’s safe to always include.
I also recently added a libmemory to w64devkit, providing tiny, public domain definitions of memset, memcpy, memmove, memcmp, and strlen. All compilers fabricate calls to these five functions even if you don’t call them yourself, which is how they were selected. (Not because I like them. I really don’t.) If a -nostdlib build complains about these, too, then add -lmemory.
$ gcc -nostdlib ... -lchkstk -lmemory
In MSVC the equivalent option is /nodefaultlib, after which you may see missing chkstk errors, and perhaps more. libchkstk.a is compatible with MSVC, and link.exe doesn’t care that the extension is .a rather than .lib, so supply it at link time. Same goes for libmemory.a if you need any of those, too.
$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.
Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!
The assert macro in typical C implementations leaves a lot to be desired, as do raise and abort, so I’ve suggested alternative definitions that behave better under debuggers:
#define assert(c) while (!(c)) __builtin_trap()
#define assert(c) while (!(c)) __builtin_unreachable()
#define assert(c) while (!(c)) *(volatile int *)0 = 0
Each serves a slightly different purpose but still has the most important property: Immediately halt the program directly on the defect. None have an occasionally useful secondary property: Optionally allow the program to continue through the defect. If the program reaches the body of any of these macros then there is no reliable continuation. Even manually nudging the instruction pointer over the assertion isn’t enough. Compilers assume that the program cannot continue through the condition and generate code accordingly.
The MSVC ecosystem has a solution for this on x86: int3. The portable name is __debugbreak, a name I’ve borrowed elsewhere.
#define assert(c) do if (!(c)) __debugbreak(); while (0)
On x86 it inserts an int3 instruction, which fires an interrupt, trapping in the attached debugger, or otherwise abnormally terminating the program. Because it’s an interrupt, it’s expected that the program might continue. It even leaves the instruction pointer on the next instruction. As of this writing, GCC has no matching intrinsic, but Clang recently added __builtin_debugtrap. In GCC you need some less portable inline assembly: asm ("int3").
However, regardless of how you get an int3 into your program, GDB does not currently understand it. The problem is that feature I mentioned: The instruction pointer does not point at the int3 but at the next instruction. This confuses GDB, causing it to break in the wrong places, possibly even in the wrong scope. For example:
for (int i = 0; i < n; i++) {
    // ...
    int3_assert(...);
}
With int3 at the very end of the loop, GDB will break at the top of the next loop iteration, because that’s where the instruction pointer lands by the time GDB is involved. It’s a similar story when placed at the end of a function, leaving GDB to break in the caller. To resolve this, we need the instruction pointer to still be “inside” the breakpoint after the interrupt fires. Easy! Add a nop:
#define breakpoint() asm ("int3; nop")
This behaves beautifully, eliminating all the problems GDB has with a plain int3. Not only is this a solid basis for a continuable assertion, it’s also useful as a fast conditional breakpoint, where conventional conditional breakpoints are far too slow.
for (int i = 0; i < 1000000000; i++) {
    if (/* rare condition */) breakpoint();
    // ...
}
Could GDB handle int3 better? Yes! Visual Studio, for instance, does not require the nop instruction. As far as I know there is no ARM equivalent compatible with GDB (or even LLDB). The closest instruction, brk #0x1, does not behave as needed.
GDB’s built-in user interface understands three classes of breakpoint positions: symbols, context-free line numbers, and absolute addresses. When you set some breakpoints and (re)start a program under GDB, each kind of breakpoint is handled differently:
1. Resolve each symbol, placing a breakpoint on its run-time address.
2. Map each file+lineno tuple to a run-time address, and place a breakpoint on that address. If the line does not exist (i.e. the file is shorter), skip it.
3. Place breakpoints exactly on each absolute address. If it’s not a mapped address, don’t start the program.
The first is the best case because it adapts to program changes. Modify the code, recompile, and the breakpoint generally remains where you want it.
The third is the least useful. These breakpoints rarely survive across rebuilds, and sometimes not even across reruns.
The second sits in the middle between useful and useless. If you edit the source file containing the breakpoint — likely, because you placed the breakpoint there for a reason — chances are high that the line number is no longer correct. Instead it drifts, requiring manual replacement. This is tedious, and GDB ought to do better. Think that’s unreasonable? The Visual Studio debugger tracks breakpoints through external code edits quite effectively! GDB front ends tend to handle it better, especially when they’re also the code editor and so directly observe all edits.
As a workaround we can get the first kind by temporarily naming a line number. This requires editing the source, but remember, the very reason we need it is because the source in question is actively changing. How to name a line? C and C++ labels give a name to program position:
void example(double *nums, int n, ...)
{
    for (int i = 0; i < n; i++) {
        loop:  // named position at the start of the loop
        // ...
    }
}
The name loop is local to example, but the qualified example:loop is a global name, as suitable as any other symbol. I could, say, reliably trace the progress of this loop despite changes to its position in the source.
(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]
One downside is dealing with -Wunused-label (enabled by -Wall), and so I’ve considered disabling the warning in my defaults. Update: Matthew Fernandez pointed out that the unused label attribute eliminates the warning, solving my problem:
for (int i = 0; i < n; i++) {
    loop: __attribute((unused))
    // ...
}
More often I use an assembly label, usually named b for convenience:
for (int i = 0; i < n; i++) {
    asm ("b:");
    // ...
}
Like int3, sometimes it’s necessary to give it a nop so that GDB has something on which to break. “Enabling” it at any time is quick:
(gdb) b b
Because it’s not .globl, it’s a local symbol, and I can place up to one per translation unit, all covered by the same GDB breakpoint item (less useful than it sounds). I haven’t actually checked, but I probably use dprintf with such named lines more often than actual breakpoints.
If you have similar tips and tricks of your own, I’d like to learn about them!
Users of mature C libraries conventionally get to choose how memory is allocated — that is, when it cannot be avoided entirely. The C standard never laid down a convention — perhaps for the better — so each library re-invents an allocator interface. Not all are created equal, and most repeat a few fundamental mistakes. Often the interface is merely a token effort, to check off that it’s “supported” without actual consideration to its use. This article describes the critical features of a practical allocator interface, and demonstrates why they’re important.
Before diving into the details, here’s the checklist for library authors:
The standard library allocator keeps its state in global variables. This makes for a simple interface, but comes with significant performance and complexity costs. These costs likely motivate custom allocator use in the first place, in which case slavishly duplicating the standard interface is essentially the worst possible option. Unfortunately this is typical:
#define LIB_MALLOC malloc
#define LIB_FREE free
I could observe the library’s allocations, and I could swap in a library functionality equivalent to the standard library allocator — jemalloc, mimalloc, etc. — but that’s about it. Better than nothing, I suppose, but only just so. Function pointer callbacks are slightly better:
typedef struct {
    void *(*malloc)(size_t);
    void (*free)(void *);
} allocator;

session *session_new(..., allocator);
At least I could use different allocators at different times, and there are even tricks to bind a context pointer to the callback. It also works when the library is dynamically linked.
Either case barely qualifies as custom allocator support, and they’re useless when it matters most. Only a small ingredient is needed to make these interfaces useful: a context pointer.
// NOTE: Better, but still not great
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void (*free)(void *, void *ctx);
    void *ctx;
} allocator;
Users can choose from where the library will allocate at a given time. It liberates the allocator from global variables (or janky workarounds) and multithreading woes. The default can still hook up to the standard library through stubs that fit these interfaces.
static void *lib_malloc(size_t size, void *ctx)
{
    (void)ctx;
    return malloc(size);
}

static void lib_free(void *ptr, void *ctx)
{
    (void)ctx;
    free(ptr);
}

static allocator lib_allocator = {lib_malloc, lib_free, 0};
Note that the context pointer came after the “standard” arguments. All things being equal, “extra” arguments should go after standard ones. But don’t sweat it! In the most common calling conventions this allows stub implementations to be merely an unconditional jump. It’s as though the stubs are a kind of subtype of the original functions.
lib_malloc:
        jmp malloc
lib_free:
        jmp free
Typically the decision is completely arbitrary, and so this minutia tips the balance.
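As a quick illustration of what the context pointer buys — a toy example of my own, not from any particular library — an allocation counter needs no global state whatsoever:

```c
#include <stdlib.h>

// The context-pointer interface from above.
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void  *ctx;
} allocator;

typedef struct {
    size_t live;  // outstanding allocations
} counter;

static void *counting_malloc(size_t size, void *ctx)
{
    counter *c = ctx;
    void *p = malloc(size);
    if (p) c->live++;  // count only successful allocations
    return p;
}

static void counting_free(void *ptr, void *ctx)
{
    counter *c = ctx;
    if (ptr) c->live--;
    free(ptr);
}
```

Two libraries handed two different counter contexts are tracked independently, even across threads, which the global-state interfaces at the top simply cannot express.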
So what’s the big deal? It means we can trivially plug in, say, a tiny arena allocator. To demonstrate, consider this fictional string set and partial JSON API, each of which supports a custom allocator. For simplicity — I’m attempting to balance substance and brevity — they share an allocator interface. (Note: Because subscripts and sizes should be signed, and we’re now breaking away from the standard library allocator, I will use ptrdiff_t for the rest of the examples.)
typedef struct {
    void *(*malloc)(ptrdiff_t, void *ctx);
    void (*free)(void *, void *ctx);
    void *ctx;
} allocator;

typedef struct set set;
set *set_new(allocator *);
set *set_free(set *);
bool set_add(set *, char *);

typedef struct json json;
json *json_load(char *buf, ptrdiff_t len, allocator *);
json *json_free(json *);
ptrdiff_t json_length(json *);
json *json_subscript(json *, ptrdiff_t i);
json *json_getfield(json *, char *field);
double json_getnumber(json *);
char *json_getstring(json *);
set and json objects retain a copy of the allocator object for all allocations made through that object. Given nothing, they default to the standard library using the pass-through definitions above. Used together with the standard library allocator:
typedef struct {
    double sum;
    bool ok;
} sum_result;
sum_result sum_unique(char *buf, ptrdiff_t len)
{
    sum_result r = {0};
    json *namevals = json_load(buf, len, 0);
    if (!namevals) {
        return r;  // parse error
    }
    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        json_free(namevals);
        return r;  // not an array
    }
    set *seen = set_new(0);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            set_free(seen);
            json_free(namevals);
            return r;  // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    set_free(seen);
    json_free(namevals);
    r.ok = 1;
    return r;
}
Given this JSON input:
[
{"name": "foo", "value": 123},
{"name": "bar", "value": 456},
{"name": "foo", "value": 1000}
]
it would return 579.0
. Because it’s using standard library allocation, it
must carefully clean up before returning. There’s also no out-of-memory
handling because, in practice, programs typically do not get to observe
and respond to the standard allocator running out of memory.
We can improve and simplify it with an arena allocator:
typedef struct {
char *beg;
char *end;
jmp_buf *oom;
} arena;
void *arena_malloc(ptrdiff_t size, void *ctx)
{
arena *a = ctx;
ptrdiff_t available = a->end - a->beg;
ptrdiff_t alignment = -size & 15;
if (size > available-alignment) {
longjmp(*a->oom, 1);
}
return a->end -= size + alignment;
}
void arena_free(void *ptr, void *ctx)
{
// nothing to do (yet!)
}
I’m allocating from the end rather than the beginning because it will make a later change simpler. Applying that to the function:
sum_result sum_unique(char *buf, ptrdiff_t len, arena scratch)
{
    sum_result r = {0};
    allocator a = {0};
    a.malloc = arena_malloc;
    a.free = arena_free;
    a.ctx = &scratch;
    json *namevals = json_load(buf, len, &a);
    if (!namevals) {
        return r; // parse error
    }
    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        return r; // not an array
    }
    set *seen = set_new(&a);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name = json_getfield(element, "name");
        json *value = json_getfield(element, "value");
        if (!name || !value) {
            return r; // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    r.ok = 1;
    return r;
}
Calls to set_free
and json_free
are no longer necessary because the
arena automatically frees these on any return, in O(1). I almost feel bad
the library authors bothered to write them! It also handles allocation
failure without introducing it to sum_unique
. We may even deliberately
restrict the memory available to this function — perhaps because the input
is untrusted, and we want to quickly abort denial-of-service attacks — by
giving it a small arena, relying on out-of-memory to reject pathological
inputs.
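To make the memory-capping idea concrete, here’s a self-contained sketch (repeating the arena definitions above, with the `longjmp` call spelled out). The sizes and counts are arbitrary: a burst of allocations either fits the budget or trips the out-of-memory policy.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Same allocator as above, repeated for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Returns 1 if `count` allocations of `size` bytes fit in a `cap`-byte
// arena, 0 if the out-of-memory policy (longjmp) fired instead.
int fits(ptrdiff_t cap, int count, ptrdiff_t size)
{
    char *mem = malloc(cap);
    jmp_buf oom;
    arena a = {mem, mem + cap, &oom};
    int volatile ok = 0;
    if (!setjmp(oom)) {
        for (int i = 0; i < count; i++) {
            arena_malloc(size, &a);
        }
        ok = 1;
    }
    free(mem);
    return ok;
}
```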
There are so many possibilities unlocked by the context pointer.
When an application frees an object it always has the original, requested allocation size on hand. After all, it’s a necessary condition to use the object correctly. In the simplest case it’s the size of the freed object’s type: a static quantity. If it’s an array, then it’s a multiple of the tracked capacity: a dynamic quantity. In any case the size is either known statically or tracked dynamically by the application.
Yet free()
does not accept a size, meaning that the allocator must track
the information redundantly! That’s a needless burden on custom
allocators, and with a bit of care a library can lift it.
This was noticed in C++, and WG21 added sized deallocation in
C++14. It’s now the default on two of the three major implementations (and
probably not the two you’d guess). In other words, object size is so
readily available that it can mostly be automated away. Notable exception:
operator new[]
and operator delete[]
with trivial destructors. With
non-trivial destructors, operator new[]
must track the array length for
its own purposes on top of libc bookkeeping. In other words, array
allocations have their size stored in at least three different places!
That means the “free” interface should look like this:
void lib_free(void *ptr, ptrdiff_t len, void *ctx);
And calls inside the library might look like:
lib_free(p, sizeof(*p), ctx);
lib_free(a, sizeof(*a)*len, ctx);
Now that arena_free
has size information, it can free an allocation if
it was the most recent:
void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
arena *a = ctx;
if (ptr == a->end) {
ptrdiff_t alignment = -size & 15;
a->end += size + alignment;
}
}
If the library allocates short-lived objects to compute some value, then discards in reverse order, the memory can be reused. The arena doesn’t have to do anything special. The library merely needs to share its knowledge with the allocator.
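Here’s that reuse behavior as a self-contained check (arena definitions repeated from above): after freeing the most recent allocation, the very next allocation lands in exactly the same spot.

```c
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Repeated from above for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Sized free: releases the allocation only if it was the most recent.
void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    if (ptr == a->end) {
        ptrdiff_t alignment = -size & 15;
        a->end += size + alignment;
    }
}

// LIFO reuse: the second allocation lands where the first one was.
int reuses_memory(void)
{
    static char mem[256];
    jmp_buf oom;
    arena a = {mem, mem + sizeof(mem), &oom};
    if (setjmp(oom)) {
        return 0;
    }
    void *p = arena_malloc(100, &a);
    arena_free(p, 100, &a);
    void *q = arena_malloc(100, &a);
    return p == q;
}
```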
Beyond arena allocation, an allocator could use the size to locate the allocation’s size class and, say, push it onto a freelist of its size class. Size-class freelists compose well with arenas, and an implementation is short and simple when the caller of “free” communicates object size.
Another idea: During testing, use a debug allocator that tracks object size and validates the reported size against its own bookkeeping. This can help catch mistakes sooner.
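A minimal sketch of such a debug allocator, assuming a 16-byte header is acceptable overhead. The `dbghdr`, `dbg_malloc`, and `dbg_free` names are hypothetical: each allocation is prefixed with its requested size, and “free” asserts that the caller reported the same size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical debug allocator: prefix each allocation with its
// requested size, then validate the size reported to "free" against
// it. The 16-byte header preserves worst-case alignment.
typedef struct {
    ptrdiff_t size;
    char      pad[8];
} dbghdr;

void *dbg_malloc(ptrdiff_t size, void *ctx)
{
    (void)ctx;
    dbghdr *h = malloc(sizeof(dbghdr) + size);
    if (!h) {
        return 0;
    }
    h->size = size;
    return h + 1;
}

void dbg_free(void *ptr, ptrdiff_t size, void *ctx)
{
    (void)ctx;
    if (ptr) {
        dbghdr *h = (dbghdr *)ptr - 1;
        assert(h->size == size);  // caller misreported the size?
        free(h);
    }
}
```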
Resizing an allocation requires a lot from an allocator, and it should be avoided if possible. At the very least it cannot be done at all without knowing the original allocation size. An allocator can’t simply no-op it like it can with “free.” With the standard library interface, allocators have no choice but to redundantly track object sizes when “realloc” is required.
So, just as with “free,” the allocator should be given the old object size!
void *lib_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx);
At the very least, an allocator could implement “realloc” with “malloc”
and memcpy
:
void *arena_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx)
{
assert(new > old);
void *r = arena_malloc(new, ctx);
return memcpy(r, ptr, old);
}
Of the three checklist items, this is the most neglected. Exercise for the
reader: The last-allocated object can instead be resized in place using
memmove
. If this is frequently expected, allocate from the front, adjust
arena_free
as needed, and extend the allocation in place as discussed in a
previous addendum, without any copying.
Let’s examine real world examples to see how well they fit the checklist. First up is uthash, a popular, easy-to-use, intrusive hash table:
#define uthash_malloc(sz) my_malloc(sz)
#define uthash_free(ptr, sz) my_free(ptr)
No “realloc” so it trivially checks (3). It optionally provides the old size to “free” which checks (2). However it misses (1) which is the most important, greatly limiting its usefulness.
Next is the venerable zlib. It has function pointers with these
prototypes on its z_stream
object.
void *zlib_malloc(void *ctx, unsigned items, unsigned size);
void zlib_free(void *ctx, void *ptr);
The context pointer checks (1), and I can confirm from experience that it’s genuinely useful with a custom allocator. No “realloc” so it passes (3) automatically. It misses (2), but in practice this hardly matters: It allocates everything up front, and frees at the very end, meaning a no-op “free” is quite sufficient.
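Plugging an arena into this interface might look like the following sketch. The adapter names are mine, and it targets the prototype shapes shown above rather than zlib.h itself; in real use the adapters would be assigned to a z_stream’s allocation fields.

```c
#include <setjmp.h>
#include <stddef.h>

typedef struct {
    char *beg;
    char *end;
    jmp_buf *oom;
} arena;

// Repeated from earlier for self-containment.
void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

// Hypothetical adapters matching the callback shapes above. "free" is
// a no-op: the arena releases everything at once when the stream is
// done.
void *zlib_arena_malloc(void *ctx, unsigned items, unsigned size)
{
    return arena_malloc((ptrdiff_t)items * size, ctx);
}

void zlib_arena_free(void *ctx, void *ptr)
{
    (void)ctx;
    (void)ptr;
}
```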
Finally there’s the Lua programming language with this economical, single-function interface:
typedef void *(*lua_Alloc)(void *ctx, void *ptr, size_t old, size_t new);
It packs all three allocator functions into one function. It includes a context pointer (1), a free size (2), and two realloc sizes (3). It’s a simple allocator’s best friend!
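For illustration, a lua_Alloc-shaped function over a bump arena might look like this sketch. Returning null on exhaustion defers the error to the caller, which, as I understand the contract, is what Lua expects; no longjmp is needed.

```c
#include <stddef.h>
#include <string.h>

// A minimal bump arena for this sketch; no out-of-memory jump because
// the lua_Alloc contract reports failure by returning null.
typedef struct {
    char *beg;
    char *end;
} arena;

// All three operations in one function: new==0 frees (a no-op for an
// arena), ptr==0 allocates, and otherwise it reallocates by copying.
void *arena_lua_alloc(void *ctx, void *ptr, size_t old, size_t new)
{
    arena *a = ctx;
    if (new == 0) {
        return 0;                 // free: nothing to do
    }
    size_t pad = -new & 15;
    if (new + pad > (size_t)(a->end - a->beg)) {
        return 0;                 // out of memory: caller handles it
    }
    void *r = a->end -= new + pad;
    if (ptr) {
        memcpy(r, ptr, old < new ? old : new);  // realloc: copy over
    }
    return r;
}
```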
This has been a ground-breaking year for my C skills, and paradigm shifts in my technique have provoked me to reconsider my habits and coding style. It’s been my largest personal style change in years, so I’ve decided to take a snapshot of its current state and my reasoning. These changes have produced significant productive and organizational benefits, so while most is certainly subjective, it likely includes a few objective improvements. I’m not saying everyone should write C this way, and when I contribute code to a project I follow their local style. This is about what works well for me.
Starting with the fundamentals, I’ve been using short names for primitive
types. The resulting clarity was more than I had expected, and it’s made
my code more enjoyable to review. These names appear frequently throughout
a program, so conciseness pays. Also, now that I’ve gone without, _t
suffixes are more visually distracting than I had realized.
typedef uint8_t u8;
typedef char16_t c16;
typedef int32_t b32;
typedef int32_t i32;
typedef uint32_t u32;
typedef uint64_t u64;
typedef float f32;
typedef double f64;
typedef uintptr_t uptr;
typedef char byte;
typedef ptrdiff_t size;
typedef size_t usize;
Some people prefer an s
prefix for signed types. I prefer i
, plus as
you’ll see, I have other designs for s
. For sizes, isize
would be more
consistent, and wouldn’t hog the identifier, but signed sizes are the
way and so I want them in a place of privilege. usize
is niche,
mainly for interacting with external interfaces where it might matter.
b32
is a “32-bit boolean” and communicates intent. I could use _Bool
,
but I’d rather stick to a natural word size and stay away from its weird
semantics. To beginners it might seem like “wasting memory” by using a
32-bit boolean, but in practice that’s never the case. It’s either in a
register (return value, local variable) or would be padded anyway (struct
field). When it actually matters, I pack booleans into a flags
variable,
and a 1-byte boolean is rarely important.
While UTF-16 might seem niche, it’s a necessary evil when dealing with
Win32, so c16
(“16-bit character”) has made a frequent appearance. I
could have based it on uint16_t
, but putting the name char16_t
in its
“type hierarchy” communicates to debuggers, particularly GDB, that for
display purposes these variables hold character data. Officially Win32
uses a type named wchar_t
, but I like being explicit about UTF-16.
u8
is for octets, usually UTF-8 data. It’s distinct from byte
, which
represents raw memory and is a special aliasing type. In theory these
can be distinct types with differing semantics, though I’m not aware of
any implementation that does so (yet?). For now it’s about intent.
What about systems that don’t support fixed width types? That’s academic,
and far too much time has been wasted worrying about it. That includes
time wasted on typing out int_fast32_t
and similar nonsense. Virtually
no existing software would actually work correctly on such systems — I’m
certain nobody’s testing it after all — so it seems nobody else cares
either.
I don’t intend to use these names in isolation, such as in code snippets
(outside of this article). If I did, examples would require the typedefs
to give readers the complete context. That’s not worth extra explanation.
Even in the most recent articles I’ve used ptrdiff_t
instead of size
.
Next, some “standard” macros:
#define countof(a) (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
While I still prefer ALL_CAPS
for constants, I’ve adopted lowercase for
function-like macros because it’s nicer to read. They don’t have the same
namespace problems as other macro definitions: I can have a macro named
new()
and also variables and fields named new
because they don’t look
like function calls.
For GCC and Clang, my favorite assert
macro now looks like this:
#define assert(c) while (!(c)) __builtin_unreachable()
It has useful properties beyond the usual benefits:
It does not require separate definitions for debug and release builds. Instead it’s controlled by the presence of Undefined Behavior Sanitizer (UBSan), which is already present/absent in these circumstances. That includes fuzz testing.
libubsan
provides a diagnostic printout with a file and line number.
In release builds it turns into a practical optimization hint.
To enable assertions in release builds, put UBSan in trap mode with
-fsanitize-trap
and then enable at least -fsanitize=unreachable
. In
theory this can also be done with -funreachable-traps
, but as of this
writing it’s been broken for the past few GCC releases.
No const
. It serves no practical role in optimization, and I cannot
recall an instance where it caught, or would have caught, a mistake. I
held out for awhile as prototype documentation, but on reflection I found
that good parameter names were sufficient. Dropping const
has made me
noticeably more productive by reducing cognitive load and eliminating
visual clutter. I now believe its inclusion in C was a costly mistake.
(One small exception: I still like it as a hint to place static tables in
read-only memory closer to the code. I’ll cast away the const
if needed.
This is only of minor importance.)
Literal 0
for null pointers. Short and sweet. This is not new, but a
style I’ve used for about 7 years now, and it has appeared all over my
writing since. There are some theoretical edge cases where it may cause
defects, and lots of ink has been spilled on the subject, but
after a couple hundred thousand lines of code I’ve yet to see it happen.
restrict
when necessary, but better to organize code so that it’s not,
e.g. don’t write to “out” parameters in loops, or don’t use out parameters
at all (more on that momentarily). I don’t bother with inline
because I
compile everything as one translation unit anyway.
typedef
all structures. I used to shy away from it, but eliminating the
struct
keyword makes code easier to read. If it’s a recursive structure,
use a forward declaration immediately above so that such fields can use
the short name:
typedef struct map map;
struct map {
map *child[4];
// ...
};
Declare all functions static
except for entry points. Again, with
everything compiled as a single translation unit there’s no reason to do
otherwise. It was probably a mistake for C not to default to static
,
though I don’t have a strong opinion on the matter. With the clutter
eliminated through short types, no const
, no struct
, etc. functions
fit comfortably on the same line as their return type. I used to break
them apart so that the function name began on its own line, but that’s no
longer necessary.
In my writing I sometimes omit static
to simplify, and because outside
the context of a complete program it’s mostly irrelevant. However, I will
use it below to emphasize this style.
For awhile I capitalized type names as that effectively put them in a kind of namespace apart from variables and functions, but I eventually stopped. I may try this idea in different way in the future.
One of my most productive changes this year has been the total rejection of null terminated strings — another of those terrible mistakes — and the embrace of this basic string type:
#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
u8 *data;
size len;
} s8;
I’ve used a few names for it, but this is my favorite. The s
is for
string, and the 8
is for UTF-8 or u8
. The s8
macro (sometimes just
spelled S
) wraps a C string literal, making a s8
string out of it. A
s8
is handled like a fat pointer, passed and returned by copy.
s8
makes for a great function prefix, unlike str
, all of which are
reserved. Some examples:
static s8 s8span(u8 *, u8 *);
static b32 s8equals(s8, s8);
static size s8compare(s8, s8);
static u64 s8hash(s8);
static s8 s8trim(s8);
static s8 s8clone(s8, arena *);
Then when combined with the macro:
if (s8equals(tagname, s8("body"))) {
// ...
}
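For completeness, here are minimal sketches of two of those prototypes, with the supporting typedefs repeated (and `lengthof` inlined) so the block stands alone. These are illustrations, not necessarily how I’d write them in a real program.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef ptrdiff_t size;

typedef struct {
    u8  *data;
    size len;
} s8;

// Simplified lengthof, equivalent to countof(s)-1 for string literals.
#define lengthof(s) ((size)(sizeof(s) - 1))
#define s8(s)       (s8){(u8 *)s, lengthof(s)}

// Wrap the half-open range [beg, end) as a string.
static s8 s8span(u8 *beg, u8 *end)
{
    s8 r = {0};
    r.data = beg;
    r.len  = end - beg;
    return r;
}

// Content equality; empty strings compare equal regardless of pointer.
static b32 s8equals(s8 a, s8 b)
{
    return a.len == b.len && (!a.len || !memcmp(a.data, b.data, a.len));
}
```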
You might be tempted to use a flexible array member to pack the size and array together as one allocation. Tried it. Its inflexibility is totally not worth whatever benefits it might have. Consider, for instance, how you’d create such a string out of a literal, and how it would be used.
A few times I’ve thought, “This program is simple enough that I don’t need
a string type for this data.” That thought is nearly always wrong. Having
it available helps me think more clearly, and makes for simpler programs.
(C++ got it only a few years ago with std::string_view
and std::span
.)
It has a natural UTF-16 counterpart, s16
:
#define s16(s) (s16){u##s, lengthof(u##s)}
typedef struct {
c16 *data;
size len;
} s16;
I’m not entirely sold on gluing u
to the literal in the macro, versus
writing it out on the string literal.
Another change has been preferring structure returns instead of out parameters. It’s effectively a multiple value return, though without destructuring. A great organizational change. For example, this function returns two values, a parse result and a status:
typedef struct {
i32 value;
b32 ok;
} i32parsed;
static i32parsed i32parse(s8);
Worried about the “extra copying?” Have no fear, because in practice
calling conventions turn this into a hidden, restrict
-qualified out
parameter — if it’s not inlined such that any return value overhead would
be irrelevant anyway. With this return style I’m less tempted to use
in-band signals like special null returns to indicate errors, which is
less clear.
It’s also led to a style of defining a zero-initialized return value at
the top of the function, i.e. ok
is false, and then use it for all
return
statements. On error, it can bail out with an immediate return.
The success path sets ok
to true before the return.
static i32parsed i32parse(s8 s)
{
i32parsed r = {0};
for (size i = 0; i < s.len; i++) {
u8 digit = s.data[i] - '0';
// ...
if (overflow) {
return r;
}
r.value = r.value*10 + digit;
}
r.ok = 1;
return r;
}
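A filled-in version of that sketch, with the elided digit and overflow checks made explicit, and the supporting typedefs repeated for self-containment. The exact checks are my own; the article leaves them as `// ...`.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t   u8;
typedef int32_t   b32;
typedef int32_t   i32;
typedef ptrdiff_t size;

typedef struct { u8 *data; size len; } s8;
typedef struct { i32 value; b32 ok;  } i32parsed;

#define s8(s) (s8){(u8 *)s, (size)(sizeof(s) - 1)}

static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    if (!s.len) {
        return r;  // empty input
    }
    for (size i = 0; i < s.len; i++) {
        u8 digit = s.data[i] - '0';
        if (digit > 9) {
            return r;  // not a digit
        }
        if (r.value > (INT32_MAX - digit)/10) {
            return r;  // would overflow i32
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
```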
Aside from static data, I’ve also moved away from initializers except the
conventional zero initializer. (Notable exception: s8
and s16
macros.)
This includes designated initializers. Instead I’ve been initializing with
assignments. For example, this buffered output “constructor”:
typedef struct {
u8 *buf;
i32 len;
i32 cap;
i32 fd;
b32 err;
} u8buf;
static u8buf newu8buf(arena *perm, i32 cap, i32 fd)
{
u8buf r = {0};
r.buf = new(perm, u8, cap);
r.cap = cap;
r.fd = fd;
return r;
}
I like how this reads, but it also eliminates a cognitive burden: The assignments are separated by sequence points, giving them an explicit order. It doesn’t matter here, but in other cases it does:
example e = {
.name = randname(&rng),
.age = randage(&rng),
.seat = randseat(&rng),
};
There are 6 possible values for e
from the same seed. I like no longer
thinking about these possibilities.
Prefer __attribute
to __attribute__
. The __
suffix is excessive and
unnecessary.
__attribute((malloc, alloc_size(2, 4)))
For Win32 systems programming, which typically only requires a modest
number of declarations and definitions, rather than include windows.h
,
write the prototypes out by hand using custom types. It reduces
build times, declutters namespaces, and interfaces more cleanly with the
program (no more DWORD
/BOOL
/ULONG_PTR
, but u32
/b32
/uptr
).
#define W32(r) __declspec(dllimport) r __stdcall
W32(void) ExitProcess(u32);
W32(i32) GetStdHandle(u32);
W32(byte *) VirtualAlloc(byte *, usize, u32, u32);
W32(b32) WriteConsoleA(uptr, u8 *, u32, u32 *, void *);
W32(b32) WriteConsoleW(uptr, c16 *, u32, u32 *, void *);
For inline assembly, treat the outer parentheses like braces, put a space
before the opening parenthesis, just like if
, and start each constraint
line with its colon.
static u64 rdtscp(void)
{
u32 hi, lo;
asm volatile (
"rdtscp"
: "=d"(hi), "=a"(lo)
:
: "cx", "memory"
);
return (u64)hi<<32 | lo;
}
There’s surely a lot more to my style than this, but unlike the above,
those details haven’t changed this year. To see most of the mentioned
items in action in a small program, see wordhist.c
, one of my
testing grounds for hash-tries, or for a slightly larger program,
asmint.c
, a mini programming language implementation.
Unlike a hash map or linked list, a dynamic array — a data buffer with a
size that varies during run time — is more difficult to square with arena
allocation. They’re contiguous by definition, and we cannot resize objects
in the middle of an arena, i.e. realloc
. So while convenient, they come
with trade-offs. At least until they stop growing, dynamic arrays are more
appropriate for shorter-lived, temporary contexts, where you would use a
scratch arena. On average they consume about twice the memory of a fixed
array of the same size.
As before, I begin with a motivating example of its use. The guts of the
generic dynamic array implementation are tucked away in a push()
macro,
which is essentially the entire interface.
typedef struct {
int32_t *data;
ptrdiff_t len;
ptrdiff_t cap;
} int32s;
int32s fibonacci(int32_t max, arena *perm)
{
static int32_t init[] = {0, 1};
int32s fib = {0};
fib.data = init;
fib.len = fib.cap = countof(init);
for (;;) {
int32_t a = fib.data[fib.len-2];
int32_t b = fib.data[fib.len-1];
if (a+b > max) {
return fib;
}
*push(&fib, perm) = a + b;
}
}
Anyone familiar with Go will quickly notice a pattern: int32s
looks an
awful lot like a Go slice. That was indeed my inspiration, and
there is enough context that you could infer similar semantics. I
will even call these “slice headers.” Initially I tried a design based on
stretchy buffers, but I didn’t like the macros nor the ergonomics.
I wouldn’t write a fibonacci
this way in practice, but it’s useful for
highlighting certain features. Of particular note:
The dynamic array initially wraps a static array, yet I can append to it as though it were a dynamic allocation. If I don’t append at all, it still works. (Though of course the caller then shouldn’t modify the elements.)
push()
operates on any object which is slice-shaped. That is, it has
a pointer field named data
, a ptrdiff_t
length field named len
, a
ptrdiff_t
capacity field named cap
, and all in that order.
push()
evaluates to a pointer to the newly-pushed element. In my
example I immediately dereference and assign a value.
An element is zero-initialized the first time it’s pushed. I say “first
time” because you can truncate an array by reducing len
, and “pushing”
afterward will simply reveal the original elements.
The name int32s
is intended to evoke plurality. I’ll use this
convention again in a moment.
The arena passed to push()
is only used if the array needs to grow.
The new backing array will be allocated out of this arena regardless of
the original backing array.
Resizes always change the backing array address, and the old array remains valid. This is also just like slices in Go.
Despite the name perm
, I expect it points to the caller’s scratch
arena. It’s “permanent” only relative to the fibonacci
call. Otherwise
I might build the array in a scratch arena, then create a final copy in
a permanent arena.
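The build-in-scratch-then-copy pattern from that last point might be sketched like this: the clone is trimmed so its capacity equals its length, since it will never grow again. The bump allocator here is a stand-in for the article’s alloc(), without its overflow checks or OOM policy.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *beg;
    char *end;
} arena;

typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

// Stand-in bump allocator: no overflow check or OOM policy here.
static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    void *r = a->beg + pad;
    a->beg += pad + size*count;
    return memset(r, 0, size*count);
}

// Copy a scratch-built array into a permanent arena, exact fit.
static int32s int32s_clone(int32s s, arena *perm)
{
    int32s r = {0};
    r.len = r.cap = s.len;  // trimmed: it will never grow again
    r.data = alloc(perm, sizeof(*r.data), _Alignof(int32_t), s.len);
    if (s.len) {
        memcpy(r.data, s.data, sizeof(*r.data)*s.len);
    }
    return r;
}
```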
For a slightly more realistic example: rendering triangles. Suppose we need data in array format for OpenGL, but we don’t know the number of vertices ahead of time. A dynamic array is convenient, especially if we discard the array as soon as OpenGL is done with it. We could build up entire scenes like this for each display frame.
typedef struct {
GLfloat x, y, z;
} GLvert;
typedef struct {
GLvert *data;
ptrdiff_t len;
ptrdiff_t cap;
} GLverts;
void renderobj(char *buf, ptrdiff_t len, arena scratch)
{
GLverts vs = {0};
objparser parser = newobjparser(buf, len);
for (...) {
*push(&vs, &scratch) = nextvert(&parser);
}
glVertexPointer(3, GL_FLOAT, 0, vs.data);
glDrawArrays(GL_TRIANGLES, 0, vs.len);
}
As before, GLverts
is slice-shaped. This time it’s zero-initialized,
which is a valid empty dynamic array. As with maps, that means any object
with such a field comes with a ready-to-use empty dynamic array. Putting
it together, here’s an example that gradually appends vertices to named
dynamic arrays, randomly accessed by string name:
typedef struct map map;
struct map {
    map *child[4];
    str name;
    GLverts verts;
};
GLverts *upsert(map **, str, arena *); // from the last article
map *example(..., arena *perm)
{
    map *m = 0;
    for (...) {
        str name = ...;
        GLvert v = ...;
        GLverts *vs = upsert(&m, name, perm);
        *push(vs, perm) = v;
    }
    return m;
}
That’s what Go would call map[str][]vert
, but allocated entirely out of
an arena. Ever thought C could do this so simply and conveniently? The
memory allocator (~15 lines), map (~30 lines), dynamic array (~30 lines),
constructors (0 lines), and destructors (0 lines) that power this total to
~75 lines of zero-dependency code!
I despise macro abuse, and programs substantially implemented in macros are annoying. They’re difficult to understand and debug. A good dynamic array implementation will require a macro, and one of my goals was to keep it as simple and minimal as possible. The macro’s job is to: determine whether the array is full and, if so, grow it through a supporting function; pass the element size (via sizeof
) to that function; and evaluate to a pointer to the newly-pushed element. Here’s what I came up with:
#define push(s, arena) \
((s)->len >= (s)->cap \
? grow(s, sizeof(*(s)->data), arena), \
(s)->data + (s)->len++ \
: (s)->data + (s)->len++)
The macro will be used as an expression, so it cannot use statements like
if
. The condition is therefore a ternary operator. If it’s full, it
calls the supporting grow
function. In either case, it computes the
result from data
. In particular, note that the grow
branch uses a
comma operator to sequence growth before pointer derivation, as grow
will change the value of data
as a side effect.
To be generic, the grow
function uses memcpy
-based type punning:
static void grow(void *slice, ptrdiff_t size, arena *a)
{
struct {
void *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
replica.cap = replica.cap ? replica.cap : 1;
ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
replica.cap *= 2;
if (replica.len) {
memcpy(data, replica.data, size*replica.len);
}
replica.data = data;
memcpy(slice, &replica, sizeof(replica));
}
The slice header is copied over a local replica, avoiding conflicts with strict aliasing. This is the archetype slice header. It still requires that different pointers have identical memory representation. That’s virtually always true, and certainly true anywhere I’d use an arena.
If the capacity was zero, it behaves as though it was one, and so, through
doubling, zero-capacity arrays become capacity-2 arrays on the first push.
It’s better to let alloc
— whose definition, you may recall, included an
overflow check — handle size overflow so that it can invoke the out of
memory policy, so instead of doubling cap
, which would first require an
overflow check, it doubles the object size. This is a small constant
(i.e. from sizeof
), so doubling it is always safe.
Copying over old data includes a special check for zero-length inputs,
because, quite frustratingly, memcpy
does not accept null even
when the length is zero. I check for zero length instead of null so that
it’s more sensitive to defects. If the pointer is null with a non-zero
length, it will trip Undefined Behavior Sanitizer, or at least crash the
program, rather than silently skip copying.
Finally the updated replica is copied over the original slice header,
updating it with the new data
pointer and capacity. The original backing
array is untouched but is no longer referenced through this slice header.
Old slice headers will continue to function with the old backing array,
such as when the arena is reset to a point where the dynamic array was
smaller.
int32s vals = {0};
*push(&vals, &scratch) = 1; // resize: cap=2
*push(&vals, &scratch) = 2;
*push(&vals, &scratch) = 3; // resize: cap=4
{
arena tmp = scratch; // scoped arena
int32s extended = vals;
*push(&extended, &tmp) = 4;
*push(&extended, &tmp) = 5; // resize: cap=8
example(extended);
}
// vals still works, cap=4, extension freed
In practice, a dynamic array leaves behind old backing arrays whose total size adds up to just shy of the current capacity. For example, if the current capacity is 16, the old arrays have sizes 2+4+8 = 14.
If you’re worried about misuse, such as slice header fields being in the
wrong order, a couple of assertions can quickly catch such mistakes at run
time, typically under the lightest of testing. In fact, I planned for this
by using the more-sensitive len>=cap
instead of just len==cap
, so that
it would direct execution towards assertions in grow
:
assert(replica.len >= 0);
assert(replica.cap >= 0);
assert(replica.len <= replica.cap);
This also demonstrates another benefit of signed sizes: Exactly half the range is invalid and so defects tend to quickly trip these assertions.
Alignment is unfortunately fixed, and I picked a “safe” value of 16. In my
new()
macro I used _Alignof
to pass type information to alloc
. Due
to an oversight, unlike sizeof
, _Alignof
cannot be applied
to expressions, and so it cannot be used in dynamic arrays. GCC and Clang
support _Alignof
on expressions just like sizeof
, as it’s such an
obvious idea, but Microsoft chose to strictly follow the oversight in the
standard. To support MSVC, I’ve deliberately limited the capabilities of
push
. If that doesn’t matter, fixing it is easy:
--- a/example.c
+++ b/example.c
@@ -2,3 +2,3 @@
((s)->len >= (s)->cap \
- ? grow(s, sizeof(*(s)->data), arena), \
+ ? grow(s, sizeof(*(s)->data), _Alignof(*(s)->data), arena), \
(s)->data + (s)->len++ \
@@ -6,3 +6,3 @@
-static void grow(void *slice, ptrdiff_t size, arena *a)
+static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
@@ -16,3 +16,2 @@
replica.cap = replica.cap ? replica.cap : 1;
- ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
Though while you’re at it, if you’re already using extensions you might
want to switch push
to a statement expression so that the slice
header s
does not get evaluated more than once — i.e. so that upsert()
in my example above could be used inside the push()
expression.
#define push(s, a) ({ \
typeof(s) s_ = (s); \
typeof(a) a_ = (a); \
if (s_->len >= s_->cap) { \
grow(s_, sizeof(*s_->data), _Alignof(*s_->data), a_); \
} \
s_->data + s_->len++; \
})
So far this approach to dynamic arrays has been useful on a number of occasions, and I’m quite happy with the results. As with arena-friendly hash maps, I’ve no doubt they’ll become a staple in my C programs.
Dennis Schön suggests checking whether the array ends exactly at the
arena’s next allocation point and, if so, extending the array into the
arena in place. grow()
already has the necessary information on hand, so it needs only the
additional check:
static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
struct {
char *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
if (!replica.data) {
replica.cap = 1;
replica.data = alloc(a, 2*size, align, replica.cap);
} else if (a->beg == replica.data + size*replica.cap) {
alloc(a, size, 1, replica.cap);
} else {
void *data = alloc(a, 2*size, align, replica.cap);
memcpy(data, replica.data, size*replica.len);
replica.data = data;
}
replica.cap *= 2;
memcpy(slice, &replica, sizeof(replica));
}
Because that’s yet another check for null, I’ve split it out into an independent case.
Not quite as simple as before, but it improves the most common case.
I’ve written before about MSI hash tables, a simple, very fast map that can be quickly implemented from scratch as needed, tailored to the problem at hand. The trade off is that one must know the upper bound a priori in order to size the base array. Scaling up requires resizing the array — an impedance mismatch with arena allocation. Search trees scale better, as there’s no underlying array, but tree balancing tends to be finicky and complex, unsuitable to rapid, on-demand implementation. We want the ease of an MSI hash table with the scaling of a tree.
I’ll motivate the discussion with example usage. Suppose we have an array of pointer+length strings, as defined last time:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps str
to
bool
, and it is represented as type strmap
and one insert+lookup
function, upsert
.
// Insert/get bool value for given str key.
bool *upsert(strmap **, str key, arena *);
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
strmap *seen = 0;
while (count < len) {
bool *b = upsert(&seen, strings[count], &scratch);
if (*b) {
// previously seen (discard)
strings[count] = strings[--len];
} else {
// newly-seen (keep)
count++;
*b = 1;
}
}
return count;
}
In particular, note:
A null pointer is an empty hash map and initialization is trivial. As discussed in the last article, one of my arena allocation principles is default zero-initialization. Put together, that means any data structure containing a map comes with a ready-to-use, empty map.
The map is allocated out of the scratch arena so it’s automatically freed upon any return. It’s as care-free as garbage collection.
The map directly uses strings in the input array as keys, without making copies or worrying about ownership. Arenas own objects, not references. If I wanted to carve out some fixed keys ahead of time, I could even insert static strings.
upsert returns a pointer to a value, that is, a pointer into the map. This is not strictly required, but it usually makes for a simple interface. When an entry is new, this value will be false (zero-initialized).
So, what is this wonderful data structure? Here’s the basic shape:
typedef struct {
hashmap *child[4];
keytype key;
valtype value;
} hashmap;
The child and key fields are essential to the map. Adding a child to any data structure turns it into a hash map over whatever field you choose as the key. In other words, a hash-trie can serve as an intrusive hash map. In several programs I’ve combined intrusive lists and hash maps to create an insert-ordered hash map. Going the other direction, omitting value turns it into a hash set. (Which is what unique really needs!)
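To make the insert-ordered idea concrete, here is a minimal sketch combining the hash-trie with an intrusive linked list. Everything here is my own illustration, not code from the article: the names (imap, imaps, iupsert, hashint) are hypothetical, keys are ints for brevity, and the arena/alloc definitions are condensed from later in the article.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}
#define new(a, t) (t *)alloc(a, sizeof(t), _Alignof(t), 1)

typedef struct imap imap;
struct imap {
    imap    *child[4];  // hash-trie branches
    imap    *next;      // intrusive list in insertion order
    int32_t  key;
    int32_t  value;
};

typedef struct {
    imap  *root;  // trie root
    imap  *head;  // first-inserted node
    imap **tail;  // append position for the order list
} imaps;

static uint64_t hashint(int32_t x)
{
    return (uint64_t)(x + 0x9e3779b9) * 1111111111111111111u;
}

int32_t *iupsert(imaps *s, int32_t key, arena *perm)
{
    imap **m = &s->root;
    for (uint64_t h = hashint(key); *m; h <<= 2) {
        if ((*m)->key == key) {
            return &(*m)->value;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, imap);
    (*m)->key = key;
    imap **t = s->tail ? s->tail : &s->head;
    *t = *m;              // append to the order list
    s->tail = &(*m)->next;
    return &(*m)->value;
}
```

Iterating head through next visits keys in the order they were first inserted, regardless of hash order.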
As you probably guessed, this hash-trie is a 4-ary tree. It can easily be 2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but 4-ary strikes a good balance, if a bit bulky. In the example above, keytype would be str and valtype would be bool. The most general form of upsert looks like this:
valtype *upsert(hashmap **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return &(*m)->value;
}
m = &(*m)->child[h>>62];
}
if (!perm) {
return 0;
}
*m = new(perm, hashmap);
(*m)->key = key;
return &(*m)->value;
}
This will take some unpacking. The first argument is a pointer to a pointer. That’s the destination for any newly-allocated element. As it travels down the tree, this points into the parent’s child array. If it points to null, then it’s an empty tree which, by definition, does not contain the key.
We need two “methods” for keys: hash and equals. The hash function should return a uniformly distributed integer. As is usually the case, less uniform fast hashes generally do better than highly-uniform slow hashes. For hash maps under ~100K elements a 32-bit hash is fine, but larger maps should use a 64-bit hash state and result. Hash collisions revert to linear, linked-list performance and, per the birthday paradox, that will happen often with 32-bit hashes on large hash maps.
If you’re worried about pathological inputs, add a seed parameter to upsert and hash. Or maybe even use the address m as a seed. The specifics depend on your security model. It’s not an issue for most hash maps, so I don’t demonstrate it here.
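As a minimal sketch of that idea (the seeded variant and its name are my own addition, not part of the article’s interface), the seed can simply perturb the hash’s initial state:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Hypothetical seeded variant of the article's FNV-style hash: the seed
// perturbs the initial state, so the same key hashes differently per
// map (or per process), frustrating crafted-collision inputs.
uint64_t hash_seeded(str s, uint64_t seed)
{
    uint64_t h = 0x100 ^ seed;
    for (ptrdiff_t i = 0; i < s.len; i++) {
        h ^= s.data[i];
        h *= 1111111111111111111u;
    }
    return h;
}
```

Because each step (XOR with a byte, multiply by an odd constant) is a bijection on 64-bit state, distinct seeds are guaranteed to produce distinct hashes for the same key.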
The top two bits of the hash are used to select a branch. These tend to be higher quality for multiplicative hash functions. At each level two bits are shifted out. This is what gives it its name: a trie of the hash bits. Though it’s un-trie-like in the way it deposits elements at the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a time.
I initially tried a Multiplicative Congruential Generator (MCG) to select the next branch at each trie level, instead of bit shifting, but NRK noticed it was consistently slower than shifting.
While deletion could be handled using tombstones, heavy use of deletes would not work well. After all, the underlying allocator is an arena. The combination of uniformly distributed branching and no deletion means that rebalancing is unnecessary. This is what grants it its simplicity!
If no arena is provided, upsert reverts to a pure lookup and returns null when the key is not found. This allows one function to flexibly serve both modes. In unique, pure lookups are unneeded, so this condition could be skipped in its strmap.
Sometimes it’s useful to return the entire hashmap object itself rather than an internal pointer, particularly when it’s intrusive. Use whichever works best for the situation. Regardless, exploit zero-initialization to detect newly-allocated elements when possible.
In some cases we may deep copy the key into the arena before inserting it into the map. The provided key may be a temporary (e.g. the result of sprintf) which the map outlives, and the caller doesn’t want to allocate a longer-lived key unless it’s needed. It’s all part of tailoring the map to the problem, which we can do because it’s so short and simple!
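Such a copy helper might look like the following sketch. The name strcopy is mine, and the arena/alloc definitions are condensed from the companion article; only the insertion path of upsert would change.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}

typedef struct {
    uint8_t  *data;
    ptrdiff_t len;
} str;

// Copy the key's bytes into the arena so the map outlives the caller's
// temporary buffer. In upsert, only the insertion path changes:
//     (*m)->key = strcopy(perm, key);   // instead of (*m)->key = key
str strcopy(arena *a, str s)
{
    str r = s;
    r.data = alloc(a, 1, 1, s.len);
    if (s.len) memcpy(r.data, s.data, s.len);
    return r;
}
```

Lookups never copy; the allocation happens only when a new node is created, which is exactly when a long-lived key is needed.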
Putting it all together, unique could look like the following, with strmap/upsert renamed to strset/ismember:
uint64_t hash(str s)
{
uint64_t h = 0x100;
for (ptrdiff_t i = 0; i < s.len; i++) {
h ^= s.data[i];
h *= 1111111111111111111u;
}
return h;
}
bool equals(str a, str b)
{
return a.len==b.len && !memcmp(a.data, b.data, a.len);
}
typedef struct {
strset *child[4];
str key;
} strset;
bool ismember(strset **m, str key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return 1;
}
m = &(*m)->child[h>>62];
}
*m = new(perm, strset);
(*m)->key = key;
return 0;
}
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
for (strset *seen = 0; count < len;) {
if (ismember(&seen, strings[count], &scratch)) {
strings[count] = strings[--len];
} else {
count++;
}
}
return count;
}
The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using an intrusive linked list on strset.
As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes these hash-tries a bit heavier than average. It’s not an issue for smaller hash maps, but has practical consequences for huge hash maps.
In an attempt to address this, I experimented with relative pointers (example: markov.c). That is, instead of pointers I use signed integers whose value indicates an offset relative to itself. Because relative pointers can only refer to nearby memory, a custom allocator is imperative, and arenas fit the bill perfectly. Range can be extended by exploiting memory alignment. In particular, 32-bit relative pointers can reference up to 8GiB in either direction. Zero is reserved to represent a null pointer, and relative pointers cannot refer to themselves.
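A minimal sketch of this encoding, under the assumption that both the field and its target are at least 4-byte aligned (the helper names rp_set/rp_get are hypothetical, not from the article):

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t relptr;  // self-relative offset, stored in 4-byte units

// Store target as an offset from the field's own address. Zero is
// reserved for null, which is safe because a field never targets itself.
void rp_set(relptr *field, void *target)
{
    if (!target) {
        *field = 0;
    } else {
        ptrdiff_t d = (char *)target - (char *)field;
        *field = (relptr)(d >> 2);  // 4-byte units extend range to +/-8GiB
    }
}

// Decode back into a native pointer, or null for a zero offset.
void *rp_get(relptr *field)
{
    return *field ? (char *)field + ((ptrdiff_t)*field << 2) : 0;
}
```

Since offsets are relative to the field itself, a whole arena of such structures can be relocated and the offsets remain valid, which is the position-independence property described next.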
As a bonus, data structures built out of relative pointers are position independent. A collection of them — perhaps even a whole arena — can be dumped out to, say, a file, loaded back at a different position, then continue to operate as-is. Very cool stuff.
Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are guaranteed to work. No object is ever too far away. It’s a compounding effect, too. Smaller map nodes means a larger number of them are in reach of a relative pointer. Also very cool.
However, as far as I know, no generally available programming language implementation supports this concept well enough to put into practice. You could implement relative pointers with language extension facilities, such as C++ operator overloads, but no tools will understand them — a major bummer. You can no longer use a debugger to examine such structures, and it’s just not worth that cost. If only arena allocation was more popular…
For the finale, let’s convert upsert into a concurrent, lock-free hash map. That is, multiple threads can call upsert concurrently on the same map. Each must still have its own arena, probably per-thread arenas, so there’s no implicit locking for allocation.
The structure itself requires no changes! Instead we need two atomic operations: atomic load (acquire) and atomic compare-and-exchange (acquire/release). They operate only on child array elements and the tree root. To illustrate I will use GCC atomics, also supported by Clang.
valtype *upsert(map **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key);; h <<= 2) {
map *n = __atomic_load_n(m, __ATOMIC_ACQUIRE);
if (!n) {
if (!perm) {
return 0;
}
arena rollback = *perm;
map *new = new(perm, map, 1);
new->key = key;
int pass = __ATOMIC_RELEASE;
int fail = __ATOMIC_ACQUIRE;
if (__atomic_compare_exchange_n(m, &n, new, 0, pass, fail)) {
return &new->value;
}
*perm = rollback;
}
if (equals(n->key, key)) {
return &n->value;
}
m = n->child + (h>>62);
}
}
First an atomic load retrieves the current node. If there is no such node, then attempt to insert one using atomic compare-and-exchange. The ABA problem is not an issue thanks again to lack of deletion: Once set, a pointer never changes. Before allocating a node, take a snapshot of the arena so that the allocation can be reverted on failure. If another thread got there first, continue tumbling down the tree as though a null was never observed.
On compare-and-swap failure, it turns into an acquire load, just as it began. On success, it’s a release store, synchronizing with acquire loads on other threads.
The key field does not require atomics because it’s synchronized by the compare-and-swap. That is, the assignment happens before the node is inserted, and keys do not change after insertion. The same goes for any zeroing done by the arena.
Loads and stores through the returned pointer are the caller’s responsibility. These likely require further synchronization. If valtype is a shared counter then an atomic increment is sufficient. In other cases, upsert should probably be modified to accept an initial value to be assigned alongside the key, so that the entire key/value pair is inserted atomically. Alternatively, break it into two steps. The details depend on the needs of the program.
On small trees there will be much contention near the root during inserts. Fortunately, a contentious tree will not stay small for long! The hash function will spread threads around a large tree, generally keeping them off each other’s toes.
A complete demo you can try yourself: concurrent-hash-trie.c. It returns a value pointer like above, and store/load is synchronized by the thread join. Each thread is given a per-thread subarena allocated out of the main arena, and the final tree is built from these subarenas.
For a practical example: a multithreaded rainbow table to find hash function collisions. Threads are synchronized solely through atomics in the shared hash-trie.
A complete fast, concurrent, lock-free hash map in under 30 lines of C sounds like a sweet deal to me!
Over the past year I’ve refined my approach to arena allocation. With practice, it’s effective, simple, and fast; typically as easy to use as garbage collection but without the costs. Depending on need, an allocator can weigh just 7–25 lines of code — perfect when lacking a runtime. With the core details of my own technique settled, now is a good time to document and share lessons learned. This is certainly not the only way to approach arena allocation, but these are practices I’ve worked out to simplify programs and reduce mistakes.
An arena is a memory buffer and an offset into that buffer, initially zero. To allocate an object, grab a pointer at the offset, advance the offset by the size of the object, and return the pointer. There’s a little more to it, such as ensuring alignment and availability. We’ll get to that. Objects are not freed individually. Instead, groups of allocations are freed at once by restoring the offset to an earlier value. Without individual lifetimes, you don’t need to write destructors, nor do your programs need to walk data structures at run time to take them apart. You also no longer need to worry about memory leaks.
A minority of programs inherently require, at least in part, general purpose allocation that linear allocation cannot fulfill. This includes, for example, most programming language runtimes. If you like arenas, avoid accidentally creating such a situation through an over-flexible API that allows callers to assume you have general purpose allocation underneath.
To get warmed up, here’s my style of arena allocation in action, showing off multiple features:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
typedef struct {
strlist *next;
str item;
} strlist;
typedef struct {
str head;
str tail;
} strpair;
// Defined elsewhere
void towidechar(wchar_t *, ptrdiff_t, str);
str loadfile(wchar_t *, arena *);
strpair cut(str, uint8_t);
strlist *getlines(str path, arena *perm, arena scratch)
{
int max_path = 1<<15;
wchar_t *wpath = new(&scratch, wchar_t, max_path);
towidechar(wpath, max_path, path);
strpair pair = {0};
pair.tail = loadfile(wpath, perm);
strlist *head = 0;
strlist **tail = &head;
while (pair.tail.len) {
pair = cut(pair.tail, '\n');
*tail = new(perm, strlist, 1);
(*tail)->item = pair.head;
tail = &(*tail)->next;
}
return head;
}
Take note of these details, each discussed at length later:
getlines takes two arenas, “permanent” and “scratch”. The former is for objects that will be returned to the caller. The latter is for temporary objects whose lifetime ends when the function returns. They have stack lifetimes just like local variables.
Objects are not explicitly freed. Instead, all allocations from a scratch arena are implicitly freed upon return. This would include error return paths automatically.
The scratch arena is passed by copy — i.e. a copy of the “header” not the memory region itself. Allocating only changes the local copy, and so cannot survive the return. The semantics are obvious to callers, so they’re less likely to get mixed up.
While wpath could be an automatic local variable, it’s relatively large for the stack, so it’s allocated out of the scratch arena. A scratch arena safely permits large, dynamic allocations that would never be safe on the stack. In other words, a sane alloca! The same goes for variable-length arrays (VLAs). A scratch arena means you’ll never be tempted to use either of these terrible ideas.
The second parameter to new is a type, so it’s obviously a macro. As you will see momentarily, this is not some complex macro magic, just a convenience one-liner. There is no implicit cast, and you will get a compiler diagnostic if the type is incorrect.
Despite all the allocation, there is not a single sizeof operator nor size computation. That’s because size computations are a major source of defects. That job is handled by specialized code.
Allocation failures are not communicated by a null return. Lifting this burden greatly simplifies programs. Instead such errors are handled non-locally by the arena.
All allocations are zero-initialized by default. This makes for simpler, less error-prone programs. When that’s too expensive, this can become an opt-out without changing the default.
See also u-config.
An arena suitable for most cases can be this simple:
typedef struct {
char *beg;
char *end;
} arena;
void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
ptrdiff_t available = a->end - a->beg - padding;
if (available < 0 || count > available/size) {
abort(); // one possible out-of-memory policy
}
void *p = a->beg + padding;
a->beg += padding + count*size;
return memset(p, 0, count*size);
}
Yup, just a pair of pointers! When allocating, all sizes are signed just as they ought to be. Unsigned sizes are another historically common source of defects, and offer no practical advantages in return.
The align parameter allows the arena to handle any unusual alignments, something that’s surprisingly difficult to do with libc. Its usefulness is hard to appreciate until it’s conveniently available.
The uintptr_t business may look unusual if you’ve never come across it before. To align beg, we need to compute the number of bytes to advance the address (padding) until the alignment evenly divides the address. The modulo with align computes the number of bytes since the last alignment boundary:
extra = addr % align
We can’t operate numerically on an address like this, so in the code we first convert to uintptr_t. Alignment is always a power of two, which notably excludes zero, so there’s no worrying about division by zero. That also means we can compute the modulo by subtracting one and masking with AND:
extra = addr & (align - 1)
However, we want the number of bytes to advance to the next alignment, which is the inverse:
padding = -addr & (align - 1)
Add the uintptr_t cast and you have the code in alloc.
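The identity can be checked directly with a few concrete addresses (a standalone sanity check of my own; the helper name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed to advance addr to the next multiple of align, where
// align is a power of two. Zero when addr is already aligned.
ptrdiff_t padding_for(uintptr_t addr, uintptr_t align)
{
    return (ptrdiff_t)(-addr & (align - 1));
}
```

For example, address 17 with 8-byte alignment needs 7 bytes of padding to reach 24, while address 16 needs none.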
The if tests whether there’s enough memory and simultaneously checks for overflow in size*count. If either fails, it invokes the out-of-memory policy, which in this case is abort. I strongly recommend, at least when testing, always having something in place that, at minimum, aborts when allocation fails, even when you think it cannot happen. It’s easy to use more memory than you anticipate, and you want a reliable signal when it happens.
An alternative policy is to longjmp to a “handler”, which with GCC and Clang doesn’t even require runtime support. In that case add a jmp_buf to the arena:
typedef struct {
char *beg;
char *end;
void **jmp_buf;
} arena;
void *alloc(...)
{
// ...
if (/* out of memory */) {
__builtin_longjmp(a->jmp_buf, 1);
}
// ...
}
bool example(..., arena scratch)
{
void *jmp_buf[5];
if (__builtin_setjmp(jmp_buf)) {
return 0;
}
scratch.jmp_buf = jmp_buf;
// ...
return 1;
}
example returns failure to the caller if it runs out of memory, without needing to check individual allocations and, thanks to the implicit free of scratch arenas, without needing to clean up. If callees receiving the scratch arena don’t set their own jmp_buf, they’ll return here, too. In a real program you’d probably wrap the setjmp setup in a macro.
Suppose zeroing is too expensive or unnecessary in some cases. Add a flag to opt out:
void *alloc(..., int flags)
{
// ...
return flags&NOZERO ? p : memset(p, 0, total);
}
Similarly, perhaps there’s a critical moment where you’re holding a non-memory resource (lock, file handle), or you don’t want allocation failure to be fatal. In either case, it’s important that the out-of-memory policy isn’t invoked. You could request a “soft” failure with another flag, and then do the usual null pointer check:
void *alloc(..., int flags)
{
// ...
if (/* out of memory */) {
if (flags & SOFTFAIL) {
return 0;
}
abort();
}
// ...
}
Most non-trivial programs will probably have at least one of these flags.
In case it wasn’t obvious, allocating an arena is simple:
arena newarena(ptrdiff_t cap)
{
arena a = {0};
a.beg = malloc(cap);
a.end = a.beg ? a.beg+cap : 0;
return a;
}
Or make a direct allocation from the operating system, e.g. mmap or VirtualAlloc. Typically arena lifetime is the whole program, so you don’t need to worry about freeing it. (Since you’re using arenas, you can also turn off any memory leak checkers while you’re at it.)
If you need more arenas then you can always allocate smaller ones out of the first! In multi-threaded applications, each thread may have at least its own scratch arena.
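Carving a smaller arena out of an existing one might look like this sketch (subarena is my name for it, and the arena/alloc definitions are condensed from the ones above):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *beg, *end; } arena;

static void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
    ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
    if (count > (a->end - a->beg - pad)/size) abort();
    void *p = a->beg + pad;
    a->beg += pad + count*size;
    return memset(p, 0, count*size);
}

// Carve cap bytes out of the parent. The child owns that region; the
// parent's offset has moved past it, so they never overlap.
arena subarena(arena *parent, ptrdiff_t cap)
{
    arena a = {0};
    a.beg = alloc(parent, 1, 16, cap);  // 16-byte align suits any object
    a.end = a.beg + cap;
    return a;
}
```

This is how the per-thread scratch arenas mentioned above can be handed out: carve one child per thread at startup, and each thread allocates from its own child without synchronization.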
The new macro
I’ve shown alloc, but few parts of the program should be calling it directly. Instead they have a macro to automatically handle the details. I call mine new, though of course if you’re writing C++ you’ll need to pick another name (make? PushStruct?):
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
The cast is an extra compile-time check, especially useful for avoiding mistakes in levels of indirection. It also keeps normal code from directly using the sizeof operator, which is easy to misuse. If you added a flags parameter, pass in zero for this common case. Keep in mind that the goal of this macro is to make common allocation simple and robust. Often you’ll allocate single objects, and so the count is 1. If you think that’s ugly, you could make a variadic version of new that fills in common defaults. In fact, that’s partly why I put count last!
#define new(...) newx(__VA_ARGS__,new4,new3,new2)(__VA_ARGS__)
#define newx(a,b,c,d,e,...) e
#define new2(a, t) (t *)alloc(a, sizeof(t), alignof(t), 1, 0)
#define new3(a, t, n) (t *)alloc(a, sizeof(t), alignof(t), n, 0)
#define new4(a, t, n, f) (t *)alloc(a, sizeof(t), alignof(t), n, f)
Not quite so simple, but it optionally makes for more streamlined code:
thing *t = new(perm, thing);
thing *ts = new(perm, thing, 1000);
char *buf = new(perm, char, len, NOZERO);
Side note: If sizeof should be avoided, what about array lengths? That’s part of the problem! Hardly ever do you want the size of an array, but rather the number of elements. That includes char arrays, where this happens to be the same number. So instead, define a countof macro that uses sizeof to compute the value you actually want. I like to have this whole collection:
#define sizeof(x) (ptrdiff_t)sizeof(x)
#define countof(a) (sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
Yes, you can convert sizeof into a macro like this! It won’t expand recursively and bottoms out as an operator. countof also, of course, produces a less error-prone signed count so users don’t fumble around with size_t. lengthof statically produces the length of a null-terminated string.
char msg[] = "hello world";
write(fd, msg, lengthof(msg));
#define MSG "hello world"
write(fd, MSG, lengthof(MSG));
alloc with attributes
At least for GCC and Clang, we can further improve alloc with three function attributes:
__attribute((malloc, alloc_size(2, 4), alloc_align(3)))
void *alloc(...);
malloc indicates that the pointer returned by alloc does not alias any existing object. This enables some significant optimizations that are otherwise blocked, most often by breaking potential loop-carried dependencies.
alloc_size tracks the allocation size for compile-time diagnostics and run-time assertions (__builtin_object_size). This generally requires a non-zero optimization level. In other words, you will get compiler warnings about some out-of-bounds accesses of arena objects, and with Undefined Behavior Sanitizer you’ll get run-time bounds checking. It’s a great complement to fuzzing.
In theory alloc_align may also allow better code generation, but I’ve yet to observe such a case. Consider it optional and low-priority. I mention it only for completeness.
How large an arena should you allocate? The simple answer: As much as is necessary for the program to successfully complete. Usually the cost of untouched arena memory is low or even zero. Most programs should probably have an upper limit, at which point they assume something has gone wrong. Arenas allow this case to be handled gracefully, simplifying recovery and paving the way for continued operation.
While a sufficient answer for most cases, it’s unsatisfying. There’s a common assumption that programs should increase their memory usage as much as needed and let the operating system respond if it’s too much. However, if you’ve ever tried this yourself, you probably noticed that mainstream operating systems don’t handle it well. The typical results are system instability — thrashing, drivers crashing — possibly necessitating a reboot.
If you insist on this route, on 64-bit hosts you can reserve a gigantic virtual address space and gradually commit memory as needed. On Linux that means leaning on overcommit by allocating the largest arena possible at startup, which will automatically commit through use. Use MADV_FREE to decommit.
On Windows, VirtualAlloc handles reserve and commit separately. In addition to the allocation offset, you need a commit offset. Then expand the committed region ahead of the allocation offset as it grows. If you ever manually reset the allocation offset, you could decommit as well, or at least MEM_RESET. At some point commit may fail, which should then trigger the out-of-memory policy, but the system is probably in poor shape by that point — i.e. use an abort policy to release it all quickly.
While allocations out of an arena don’t require individual error checks, allocating the arena itself at startup requires error handling. It would be nice if the arena could be allocated out of .bss, punting that job to the loader. While you could make a big, global char[] array to back your arena, it’s technically not permitted (strict aliasing). A “clean” .bss region could be obtained with a bit of assembly — .comm plus assembly to get the address into C without involving an array. I wanted a more portable solution, so I came up with this:
arena getarena(void)
{
static char mem[1<<28];
arena r = {0};
r.beg = mem;
asm ("" : "+r"(r.beg)); // launder the pointer
r.end = r.beg + countof(mem);
return r;
}
The asm accepts a pointer and returns a pointer ("+r"). The compiler cannot “see” that it’s actually empty, and so it returns the same pointer. The arena will be backed by mem, but by laundering the address through asm, I’ve disconnected the pointer from its origin. As far as the compiler is concerned, this is some foreign, assembly-provided pointer, not a pointer into mem. It can’t optimize away mem because it’s been given to a mysterious assembly black box.
While inappropriate for a real project, I think it’s a neat trick.
In my initial example I used a linked list to store lines. This data structure is great with arenas. It only takes a few lines of code to implement a linked list on top of an arena, and no “destroy” code is needed. Simple.
What about arena-backed associative arrays? Or arena-backed dynamic arrays? See these follow-up articles for details!