As usual, it helps to begin with a concrete example of the problem. The
following is a conventional .pc
file much like you’d find on your own
system:
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: Example
Version: 1.0
Description: An example .pc file
Cflags: -I${includedir}
Libs: -L${libdir} -lexample
It begins by defining the library’s installation prefix from which it
derives additional paths, which are finally used in the package fields
that generate build flags (Cflags
, Libs
). If I run u-config against
this configuration:
$ pkg-config --cflags --libs example
-I/usr/include -L/usr/lib -lexample
Typically prefix
is populated by the library’s build system, which knows
where the library is to be installed. In some situations that’s not
possible, and there is no opportunity to set prefix
to a meaningful
path. In that case, pkg-config can automatically override it
(--define-prefix
) with a path relative to the .pc
file, making the
installation relocatable. This works quite well on Windows, where it’s
the default:
$ pkg-config --cflags --libs example
-IC:/Users/me/example/include -LC:/Users/me/example/lib -lexample
This just works… so long as the path does not contain spaces. If it does, it
risks splitting into separate fields. The .pc
format supports quoting to
control how such output is escaped. Regions between quotes are escaped in
the output so that they retain their spaces when field split in the shell.
If a .pc
file author is careful, they’d write it with quotes:
Cflags: -I"${includedir}"
Libs: -L"${libdir}" -lexample
The paths are carefully placed within quoted regions so that they come out properly:
$ pkg-config --cflags example
-IC:/Program\ Files/example/include
Almost nobody writes their .pc
files this way! The convention is not
to quote. My original solution was to implicitly wrap prefix
in quotes
on assignment, which fixes the vast majority of .pc
files. That
effectively looks like this in the “virtual” .pc
file:
prefix="C:/Program Files/example"
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
So the important region is quoted, its spaces preserved. However, the
occasional library author actively supporting Windows inevitably runs into
this problem, and their system’s pkg-config implementation does not quote
prefix
. They soon figure out explicit quoting and apply it, which then
undermines u-config’s implicit quoting. The quotes essentially cancel out:
"$includedir" -> ""C:/Program Files/example"/include"
The quoted regions are inverted and nothing happens. Though this is a
small minority, the libraries that do this and the ones you’re likely to
use on Windows are correlated. I was stumped: How to support quoted and
unquoted .pc
files simultaneously?
I recently had the thought: What if u-config somehow tracked which spans
of string were paths? prefix
is initially a path span, and that property would be tracked
through macro-expansion and concatenation. Soon after that I realized it’s
even simpler: Encode the spaces in a path as a value other than space,
but also a value that cannot appear in the input. Recall that certain
octets can never appear in UTF-8 text: the 8 values whose highest 5
bits are set. Such an octet would begin a 5-octet, or longer, encoding,
but those are forbidden.
11111xxx
When paths enter the macro system, special characters are encoded as one of these 8 values. They’re converted back to their original ASCII values during output encoding, escaped. It doesn’t interact with the pkg-config quoting mechanism, so there’s no quote cancellation. Both quoting cases are supported equally.
For example, if space is mapped onto \xff
(255), then:
in: C:/Program Files/foo -> C:/Program\xffFiles/foo
out: C:/Program\xffFiles/foo -> C:/Program\ Files/foo
Which prints the same regardless of ${includedir}
or "${includedir}"
.
Problem solved!
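To sketch the output side of that idea (not u-config’s actual code; out() here is a hypothetical helper that appends bytes to the output buffer):

#include <stddef.h>

typedef unsigned char u8;

static void out(u8 *, ptrdiff_t);  // hypothetical: append bytes to the output

static void outescaped(u8 *s, ptrdiff_t len)
{
    for (ptrdiff_t i = 0; i < len; i++) {
        if (s[i] == 0xff) {
            out((u8 *)"\\ ", 2);  // encoded space: restore it, escaped
        } else {
            out(s+i, 1);
        }
    }
}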
That’s not the only complication. Outputs may deliberately include shell
metacharacters, though typically these are Makefile fragments. For
example, the default value of ${pc_top_builddir}
is $(top_builddir)
,
which make
will later expand. While these characters are special to a
shell, and certainly special to make
, they must not be escaped.
What if a path contains these characters? The pkg-config quoting mechanism
won’t help. It’s only concerned with spaces, and $(...)
prints the same
quoted or not. As before, u-config must track provenance — whether or not
such characters originated from a path.
If $PKG_CONFIG_TOP_BUILD_DIR
is set, then pc_top_builddir
is set to
this environment variable, useful when the result isn’t processed by
make
. In this case it’s a path, and $(...)
ought to be escaped. Even
without $
it must be quoted, because the parentheses would still invoke
a subshell. But who would put parentheses in a path? Lo and behold!
C:/Program Files (x86)/example
Again, extending UTF-8 solves this as well: Encode $
, (
, and )
in
paths using three of those forbidden octets, and escape them on the way
out, allowing unencoded instances to go straight through.
in: C:/Program\xffFiles\xff\xfdx86\xfe/example
out: C:/Program\ Files\ \(x86\)/example
This makes pc_top_builddir
straightforward: default to a raw string,
otherwise a path-encoded environment variable (note: s8
is a string
type and upsert
is a hash map):
s8 top_builddir = s8("$(top_builddir)");
if (envvar_set) {
    top_builddir = s8pathencode(envvar, perm);
}
*upsert(&global, s8("pc_top_builddir"), perm) = top_builddir;
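s8pathencode isn’t shown here. A sketch of the idea, assuming the s8 layout above and an arena-backed new() array helper, with illustrative byte choices for the encoded characters:

static s8 s8pathencode(s8 path, arena *perm)
{
    s8 r = {0};
    r.len  = path.len;
    r.data = new(perm, u8, path.len);
    for (ptrdiff_t i = 0; i < path.len; i++) {
        switch (path.data[i]) {
        case ' ': r.data[i] = 0xff; break;  // space
        case '$': r.data[i] = 0xfc; break;  // dollar
        case '(': r.data[i] = 0xfd; break;  // open paren
        case ')': r.data[i] = 0xfe; break;  // close paren
        default:  r.data[i] = path.data[i];
        }
    }
    return r;
}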
For a particularly wild case, consider deliberately using a uname -m
command substitution to construct a path, i.e. the path contains the
target machine architecture (i686
, x86_64
, etc.):
Cflags: -I${prefix}/$(uname -m)/include
(Not that I condone such nonsense. This is merely a reality of real-world
.pc
files.) With prefix
automatically set as above, this will print:
-IC:/Program\ Files\ \(x86\)/example/$(uname -m)/include
Path parentheses are escaped because they came from a path, but command
substitution passes through because it came from the .pc
source. Quite
cool!
You may have encountered ___chkstk_ms
, perhaps in an error message. It’s a little piece of
runtime provided by GCC via libgcc which ensures enough of the stack is
committed for the caller’s stack frame. The “function” uses a custom ABI
and is implemented in assembly. So is the subject of this article: a
slightly improved implementation soon to be included in w64devkit as
libchkstk (-lchkstk
).
The MSVC toolchain has an identical (x64) or similar (x86) function named
__chkstk
. We’ll discuss that as well, and w64devkit will include x86 and
x64 implementations, useful when linking with MSVC object files. The new
x86 __chkstk
in particular is also better than the MSVC definition.
A note on spelling: ___chkstk_ms
is spelled with three underscores, and
__chkstk
is spelled with two. On x86, cdecl
functions are
decorated with a leading underscore, and so may be rendered, e.g. in error
messages, with one fewer underscore. The true name is undecorated, and the
raw symbol name is identical on x86 and x64. Further complicating matters,
libgcc defines a ___chkstk
with three underscores. As far as I can tell,
this spelling arose from confusion regarding name decoration, but nobody’s
noticed for the past 28 years. libgcc’s x64 ___chkstk
is obviously and
badly broken, so I’m sure nobody has ever used it anyway, not even by
accident thanks to the misspelling. I’ll touch on that below.
When referring to a particular instance, I will use a specific spelling.
Otherwise the term “chkstk” refers to the family. If you’d like to skip
ahead to the source for libchkstk: libchkstk.S
.
The header of a Windows executable lists two stack sizes: a reserve size
and an initial commit size. The first is the largest the main thread
stack can grow, and the second is the amount committed when the
program starts. A program gradually commits stack pages as needed up to
the reserve size. Binutils objdump
option -p
lists the sizes. Typical
output for a Mingw-w64 program:
$ objdump -p example.exe | grep SizeOfStack
SizeOfStackReserve 0000000000200000
SizeOfStackCommit 0000000000001000
The values are in hexadecimal, and this indicates 2MiB reserved and 4KiB
initially committed. With the Binutils linker, ld
, you can set them at
link time using --stack
. Via gcc
, use -Xlinker
. For example, to
reserve an 8MiB stack and commit half of it:
$ gcc -Xlinker --stack=$((8<<20)),$((4<<20)) ...
MSVC link.exe
similarly has /stack
.
The purpose of this mechanism is to avoid paying the commit charge for
unused stack. It made sense 30 years ago when stacks were a potentially
large portion of physical memory. These days it’s a rounding error and
silly we’re still dealing with it. Using the above options you can choose
to commit the entire stack up front, at which point a chkstk helper is no
longer needed (-mno-stack-arg-probe
, /Gs2147483647
). This
requires link-time control of the main module, which isn’t always an
option, like when supplying a DLL for someone else to run.
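For example, with GCC that might look like the following, committing the entire 8MiB reservation up front (sizes arbitrary):
$ gcc -mno-stack-arg-probe -Xlinker --stack=$((8<<20)),$((8<<20)) ...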
The program grows the stack by touching the singular guard page mapped between the committed and uncommitted portions of the stack. This action triggers a page fault, and the default fault handler commits the guard page and maps a new guard page just below. In other words, the stack grows one page at a time, in order.
In most cases nothing special needs to happen. The guard page mechanism is transparent and in the background. However, if a function stack frame exceeds the page size then there’s a chance that it might leap over the guard page, crashing the program. To prevent this, compilers insert a chkstk call in the function prologue. Before local variable allocation, chkstk walks down the stack — that is, towards lower addresses — nudging the guard page with each step. (As a side effect it provides stack clash protection — the only security aspect of chkstk.) For example:
void callee(char *);
void example(void)
{
    char large[1<<20];
    callee(large);
}
Compiled with 64-bit gcc -O
:
example:
movl $1048616, %eax
call ___chkstk_ms
subq %rax, %rsp
leaq 32(%rsp), %rcx
call callee
addq $1048616, %rsp
ret
I used GCC, but this is practically identical to the code generated by
MSVC and Clang. Note the call to ___chkstk_ms
in the function prologue
before allocating the stack frame (subq
). Also note that it sets eax
.
As a volatile register, this would normally accomplish nothing because
it’s done just before a function call, but recall that ___chkstk_ms
has
a custom ABI. That’s the argument to chkstk. Further note that it uses
rax
on the return. That’s not a value returned by chkstk; rather, x64 chkstk
preserves all registers.
Well, maybe. The official documentation says that registers r10 and r11 are volatile, but that information conflicts with Microsoft’s own implementation. Just in case, I choose a conservative interpretation that all registers are preserved.
In a high level language, chkstk might look something like so:
// NOTE: hypothetical implementation
void ___chkstk_ms(ptrdiff_t frame_size)
{
    volatile char frame[frame_size]; // NOTE: variable-length array
    for (ptrdiff_t i = frame_size - PAGE_SIZE; i >= 0; i -= PAGE_SIZE) {
        frame[i] = 0; // touch the guard page
    }
}
This wouldn’t work for a number of reasons, but if it did, volatile
would serve two purposes. First, forcing the side effect to occur. The
second is more subtle: The loop must happen in exactly this order, from
high to low. Without volatile
, loop iterations would be independent — as
there are no dependencies between iterations — and so a compiler could
reverse the loop direction.
The store can happen anywhere within the guard page, so it’s not necessary
to align frame
to the page. Simply touching at least one byte per page
is enough. This is essentially the definition of libgcc ___chkstk_ms
.
How many iterations occur? In example
above, the stack frame will be
around 1MiB (2^20). With pages of 4KiB (2^12) that’s
256 iterations. The loop happens unconditionally, meaning every function
call requires 256 iterations of this loop. Wouldn’t it be better if the
loop ran only as needed, i.e. the first time? MSVC x64 __chkstk
skips
iterations if possible, and the same goes for my new ___chkstk_ms
. Much
like the command line string, the low address of the current
thread’s guard page is accessible through the Thread Information
Block (TIB). A chkstk can cheaply query this address, only looping
during initialization or so. (In contrast to Linux, a thread’s
stack is fundamentally managed by the operating system.)
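For reference, a rough C sketch of that query using GCC-style inline assembly on x64 (offset 0x10 in the TIB holds StackLimit, the low address of the committed stack):

// Read the committed stack's low address from the TIB (x64, GCC/Clang).
static char *stack_low(void)
{
    char *low;
    asm ("mov %%gs:(0x10), %0" : "=r"(low));
    return low;
}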
Taking that into account, an improved algorithm:
1. Save the registers it will use.
2. Compute the frame’s low address.
3. Load the stack’s low (guard page) address from the TIB.
4. Jump to step 7.
5. Move the probe pointer down one page.
6. Touch that page, faulting in the guard page if necessary.
7. If the probe pointer is still above the frame’s low address, go to step 5.
8. Restore registers and return.
An unconditional forward jump is a little unusual for pseudo-code, but it closely matches my assembly. The loop causes page faults, and it’s the slow, uncommon path. The common, fast path never executes 5–6. I also chose smaller instructions in order to keep the function small and reduce instruction cache pressure. My x64 implementation as of this writing:
___chkstk_ms:
push %rax // 1.
push %rcx // 1.
neg %rax // 2. rax = frame low address
add %rsp, %rax // 2. "
mov %gs:(0x10), %rcx // 3. rcx = stack low address
jmp 1f // 4.
0: sub $0x1000, %rcx // 5.
test %eax, (%rcx) // 6. page fault (very slow!)
1: cmp %rax, %rcx // 7.
ja 0b // 7.
pop %rcx // 8.
pop %rax // 8.
ret // 8.
I’ve labeled each instruction with its corresponding pseudo-code. Step 6
is unusual among chkstk implementations: It’s not a store, but a load,
still sufficient to fault the page. That test
instruction is just two
bytes, and unlike other two-byte options, doesn’t write garbage onto the
stack — which would be allowed — nor use an extra register. I searched
through single byte instructions that can page fault, all of which involve
implicit addressing through rdi
or rsi
, but they increment rdi
or
rsi
, and would require another instruction to correct it.
Because of the return address and two push
operations, the low stack
frame address is technically too low by 24 bytes. That’s fine. If this
exhausts the stack, the program is really cutting it close and the stack
is too small anyway. I could be more precise — which, as we’ll soon see,
is required for x86 __chkstk
— but it would cost an extra instruction
byte.
On x64, ___chkstk_ms
and __chkstk
have identical semantics, so name it
__chkstk
— which I’ve done in libchkstk — and it works with MSVC. The
only practical difference between my chkstk and MSVC __chkstk
is that
mine is smaller: 36 bytes versus 48 bytes. Largest of all, despite lacking
the optimization, is libgcc ___chkstk_ms
, weighing 50 bytes, or in
practice, due to an unfortunate Binutils default of padding sections, 64
bytes.
I’m no assembly guru, and I bet this can be even smaller without hurting the fast path, but this is the best I could come up with at this time.
Update: Stefan Kanthak, who has extensively explored this topic, points out that large stack frame requests might overflow my low frame address calculation at (3), effectively disabling the probe. Such requests might occur from alloca calls or variable-length arrays (VLAs) with untrusted sizes. As far as I’m concerned, such programs are already broken, but it only cost a two-byte instruction to deal with it. I have not changed this article, but the source in w64devkit has been updated.
On x86 ___chkstk_ms
has identical semantics to x64. Mine is a copy-paste
of my x64 chkstk but with 32-bit registers and an updated TIB lookup. GCC
was ahead of the curve on this design.
However, x86 __chkstk
is bonkers. It not only commits the stack, but
also allocates the stack frame. That is, it returns with a different stack
pointer. The return pointer is initially inside the new stack frame, so
chkstk must retrieve it and return by other means. It must also precisely
compute the low frame address.
__chkstk:
push %ecx // 1.
neg %eax // 2.
lea 8(%esp,%eax), %eax // 2.
mov %fs:(0x08), %ecx // 3.
jmp 1f // 4.
0: sub $0x1000, %ecx // 5.
test %eax, (%ecx) // 6. page fault (very slow!)
1: cmp %eax, %ecx // 7.
ja 0b // 7.
pop %ecx // 8.
xchg %eax, %esp // ?. allocate frame
jmp *(%eax) // 8. return
The main differences are:
- eax is treated as volatile, so it is not saved
- the push and return address are folded into the precise low frame address computed by the lea (2)
MSVC x86 __chkstk
does not query the TIB (3), and so unconditionally
runs the loop. So there’s an advantage to my implementation besides size.
libgcc x86 ___chkstk
has this behavior, and so it’s also a suitable
__chkstk
aside from the misspelling. Strangely, libgcc x64 ___chkstk
also allocates the stack frame, which is never how chkstk was supposed
to work on x64. I can only conclude it’s never been used.
Does the skip-the-loop optimization matter in practice? Consider a function using a large-ish, stack-allocated array, perhaps to process environment variables or long paths, each of which max out around 64KiB.
_Bool path_contains(wchar_t *name, wchar_t *path)
{
    wchar_t var[1<<15];
    GetEnvironmentVariableW(name, var, countof(var));
    // ... search for path in var ...
}

int64_t getfilesize(char *path)
{
    wchar_t wide[1<<15];
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wide, countof(wide));
    // ... look up file size via wide path ...
}

void example(void)
{
    if (path_contains(L"PATH", L"c:\\windows\\system32")) {
        // ...
    }
    int64_t size = getfilesize("π.txt");
    // ...
}
Each call to these functions with such large local arrays is also a call to chkstk. Though with a 64KiB frame, that’s only 16 iterations; barely detectable in a benchmark. If the function touches the file system, which is likely when processing paths, then chkstk doesn’t matter at all. My starting example had a 1MiB array, or 256 chkstk iterations. That starts to become measurable, though it’s also pushing the limits. At that point you ought to be using a scratch arena.
So ultimately after writing an improved ___chkstk_ms
I could only
measure a tiny difference in contrived programs, and none in any real
application. Though there’s still one more benefit I haven’t yet
mentioned…
My original motivation for this project wasn’t the optimization — which I didn’t even discover until after I had started — but licensing. I hate software licenses, and the tools I’ve written for w64devkit are dedicated to the public domain. Both source and binaries (as distributed). I can do so because I don’t link runtime components, not even libgcc. Not even header files. Every byte of code in those binaries is my work or the work of my collaborators.
Every once in awhile ___chkstk_ms
rears its ugly head, and I have to
make a decision. Do I re-work my code to avoid it? Do I take the reins of
the linker and disable stack probes? I haven’t necessarily allocated a
large local array: A bit of luck with function inlining can combine
several smaller stack frames into one that’s just large enough to require
chkstk.
Since libgcc falls under the GCC Runtime Library Exception, if it’s linked into my program through an “Eligible Compilation Process” — which I believe includes w64devkit — then the GPL-licensed functions embedded in my binary are legally siloed and the GPL doesn’t infect the rest of the program. These bits are still GPL in isolation, and if someone were to copy them out of the program then they’d be normal GPL code again. In other words, it’s not a 100% public domain binary if libgcc was linked!
(If some FSF lawyer says I’m wrong, then this is an escape hatch through which anyone can scrub the GPL from GCC runtime code, and then ignore the runtime exception entirely.)
MSVC is worse. Hardly anyone follows its license, but fortunately for most
the license is practically unenforced. Its chkstk, which currently resides
in a loose chkstk.obj
, falls into what Microsoft calls “Distributable
Code.” Its license requires “external end users to agree to terms that
protect the Distributable Code.” In other words, if you compile a program
with MSVC, you’re required to have a EULA including the relevant terms
from the Visual Studio license. You’re not legally permitted to distribute
software in the manner of w64devkit — no installer, just a portable zip
distribution — if that software has been built with MSVC. At least not
without special care which nobody does. (Don’t worry, I won’t tell.)
To avoid libgcc entirely you need -nostdlib
. Otherwise it’s implicitly
offered to the linker, and you’d need to manually check if it picked up
code from libgcc. If ld
complains about a missing chkstk, use -lchkstk
to get a definition. If you use -lchkstk
when it’s not needed, nothing
happens, so it’s safe to always include.
I also recently added a libmemory to w64devkit, providing tiny,
public domain definitions of memset
, memcpy
, memmove
, memcmp
, and
strlen
. All compilers fabricate calls to these five functions even if
you don’t call them yourself, which is how they were selected. (Not
because I like them. I really don’t.) If a -nostdlib
build
complains about these, too, then add -lmemory
.
$ gcc -nostdlib ... -lchkstk -lmemory
In MSVC the equivalent option is /nodefaultlib
, after which you may see
missing chkstk errors, and perhaps more. libchkstk.a
is compatible with
MSVC, and link.exe
doesn’t care that the extension is .a
rather than
.lib
, so supply it at link time. Same goes for libmemory.a
if you need
any of those, too.
$ cl ... /link /nodefaultlib libchkstk.a libmemory.a
While I despise licenses, I still take them seriously in the software I distribute. With libchkstk I have another tool to get it under control.
Big thanks to Felipe Garcia for reviewing and correcting mistakes in this article before it was published!
The assert
macro in typical C implementations leaves a lot to be
desired, as do raise
and abort
, so I’ve suggested
alternative definitions that behave better under debuggers:
#define assert(c) while (!(c)) __builtin_trap()
#define assert(c) while (!(c)) __builtin_unreachable()
#define assert(c) while (!(c)) *(volatile int *)0 = 0
Each serves a slightly different purpose but still has the most important property: Immediately halt the program directly on the defect. None have an occasionally useful secondary property: Optionally allow the program to continue through the defect. If the program reaches the body of any of these macros then there is no reliable continuation. Even manually nudging the instruction pointer over the assertion isn’t enough. Compilers assume that the program cannot continue through the condition and generate code accordingly.
The MSVC ecosystem has a solution for this on x86: int3
. The portable
name is __debugbreak
, a name I’ve borrowed elsewhere.
#define assert(c) do if (!(c)) __debugbreak(); while (0)
On x86 it inserts an int3
instruction, which fires an interrupt,
trapping in the attached debugger, or otherwise abnormally terminating the
program. Because it’s an interrupt, it’s expected that the program might
continue. It even leaves the instruction pointer on the next instruction.
As of this writing, GCC has no matching intrinsic, but Clang recently
added __builtin_debugtrap
. In GCC you need some less portable inline
assembly: asm ("int3")
.
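Pulling those together, one way to paper over the differences is a small dispatch macro (a sketch, and the name is mine):

#if defined(_MSC_VER) && !defined(__clang__)
#  define debugbreak() __debugbreak()
#elif defined(__clang__)
#  define debugbreak() __builtin_debugtrap()
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define debugbreak() asm ("int3")
#endif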
However, regardless of how you get an int3
in your program, GDB does not
currently understand it. The problem is that feature I mentioned: The
instruction pointer does not point at the int3
but the next instruction.
This confuses GDB, causing it to break in the wrong places, possibly even
in the wrong scope. For example:
for (int i = 0; i < n; i++) {
    // ...
    int3_assert(...);
}
With int3
at the very end of the loop, GDB will break at the top of
the next loop iteration, because that’s where the instruction pointer
lands by the time GDB is involved. It’s a similar story when placed at the
end of a function, leaving GDB to break in the caller. To resolve this, we
need the instruction pointer to still be “inside” the breakpoint after the
interrupt fires. Easy! Add a nop
:
#define breakpoint() asm ("int3; nop")
This behaves beautifully, eliminating all the problems GDB has with a
plain int3
. Not only is this a solid basis for a continuable assertion,
it’s also useful as a fast conditional breakpoint, where conventional
conditional breakpoints are far too slow.
for (int i = 0; i < 1000000000; i++) {
    if (/* rare condition */) breakpoint();
    // ...
}
Could GDB handle int3
better? Yes! Visual Studio, for instance, does not
require the nop
instruction. As far as I know there is no ARM equivalent
compatible with GDB (or even LLDB). The closest instruction, brk #0x1
,
does not behave as needed.
GDB’s built-in user interface understands three classes of breakpoint positions: symbols, context-free line numbers, and absolute addresses. When you set some breakpoints and (re)start a program under GDB, each kind of breakpoint is handled differently:
Resolve each symbol, placing a breakpoint on its run-time address.
Map each file+lineno tuple to a run-time address, and place a breakpoint on that address. If the line does not exist (i.e. the file is shorter), skip it.
Place breakpoints exactly on each absolute address. If it’s not a mapped address, don’t start the program.
The first is the best case because it adapts to program changes. Modify the code, recompile, and the breakpoint generally remains where you want it.
The third is the least useful. These breakpoints rarely survive across rebuilds, and sometimes not even across reruns.
The second is in the middle between useful and useless. If you edit the source file which has the breakpoint — likely, because you placed the breakpoint there for a reason — chances are high that the line number is no longer correct. Instead it drifts, requiring manual replacement. This is tedious and GDB ought to do better. Think that’s unreasonable? The Visual Studio debugger does exactly that quite effectively through external code edits! GDB front ends tend to handle it better, especially when they’re also the code editor and so directly observe all edits.
As a workaround we can get the first kind by temporarily naming a line number. This requires editing the source, but remember, the very reason we need it is because the source in question is actively changing. How to name a line? C and C++ labels give a name to program position:
void example(double *nums, int n, ...)
{
    for (int i = 0; i < n; i++) {
    loop: // named position at the start of the loop
        // ...
    }
}
The name loop
is local to example
, but the qualified example:loop
is
a global name, as suitable as any other symbol. I could, say, reliably
trace the progress of this loop despite changes to its position in the
source.
(gdb) dprintf example:loop,"nums[%d] = %g\n",i,nums[i]
One downside is dealing with -Wunused-label
(enabled by -Wall
), and so
I’ve considered disabling the warning in my defaults. Update:
Matthew Fernandez pointed out that the unused
label attribute eliminates
the warning, solving my problem:
for (int i = 0; i < n; i++) {
    loop: __attribute((unused))
    // ...
}
More often I use an assembly label, usually named b
for convenience:
for (int i = 0; i < n; i++) {
    asm ("b:");
    // ...
}
Like int3
, sometimes it’s necessary to give it a nop
so that GDB has
something on which to break. “Enabling” it at any time is quick:
(gdb) b b
Because it’s not .globl
, it’s a local symbol, and I can place up to
one per translation unit, all covered by the same GDB breakpoint item
(less useful than it sounds). I haven’t actually checked, but I probably
more often use dprintf
with such named lines than actual breakpoints.
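For example, tracing through the asm label above might look something like:
(gdb) dprintf b,"i = %d\n",i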
If you have similar tips and tricks of your own, I’d like to learn about them!
Users of mature C libraries conventionally get to choose how memory is allocated — that is, when it cannot be avoided entirely. The C standard never laid down a convention — perhaps for the better — so each library re-invents an allocator interface. Not all are created equal, and most repeat a few fundamental mistakes. Often the interface is merely a token effort, to check off that it’s “supported” without actual consideration to its use. This article describes the critical features of a practical allocator interface, and demonstrates why they’re important.
Before diving into the details, here’s the checklist for library authors:
1. Pass a user-supplied context pointer to the allocator functions.
2. Pass the original allocation size to the “free” function.
3. Pass both the old and new sizes to the “realloc” function.
The standard library allocator keeps its state in global variables. This makes for a simple interface, but comes with significant performance and complexity costs. These costs likely motivate custom allocator use in the first place, in which case slavishly duplicating the standard interface is essentially the worst possible option. Unfortunately this is typical:
#define LIB_MALLOC malloc
#define LIB_FREE free
I could observe the library’s allocations, and I could swap in a library functionality equivalent to the standard library allocator — jemalloc, mimalloc, etc. — but that’s about it. Better than nothing, I suppose, but only just so. Function pointer callbacks are slightly better:
typedef struct {
    void *(*malloc)(size_t);
    void  (*free)(void *);
} allocator;
session *session_new(..., allocator);
At least I could use different allocators at different times, and there are even tricks to bind a context pointer to the callback. It also works when the library is dynamically linked.
Either case barely qualifies as custom allocator support, and they’re useless when it matters most. Only a small ingredient is needed to make these interfaces useful: a context pointer.
// NOTE: Better, but still not great
typedef struct {
    void *(*malloc)(size_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void  *ctx;
} allocator;
Users can choose from where the library will allocate at any given time. It liberates the allocator from global variables (or janky workarounds), and multithreading woes. The default can still hook up to the standard library through stubs that fit these interfaces.
static void *lib_malloc(size_t size, void *ctx)
{
    (void)ctx;
    return malloc(size);
}

static void lib_free(void *ptr, void *ctx)
{
    (void)ctx;
    free(ptr);
}

static allocator lib_allocator = {lib_malloc, lib_free, 0};
Note that the context pointer came after the “standard” arguments. All things being equal, “extra” arguments should go after standard ones. But don’t sweat it! In the most common calling conventions this allows stub implementations to be merely an unconditional jump. It’s as though the stubs are a kind of subtype of the original functions.
lib_malloc:
jmp malloc
lib_free:
jmp free
Typically the decision is completely arbitrary, and so this minutia tips the balance.
So what’s the big deal? It means we can trivially plug in, say, a tiny
arena allocator. To demonstrate, consider this fictional string
set and partial JSON API, each of which supports a custom allocator. For
simplicity — I’m attempting to balance substance and brevity — they share
an allocator interface. (Note: Because subscripts and sizes should be
signed, and we’re now breaking away from the standard library
allocator, I will use ptrdiff_t
for the rest of the examples.)
typedef struct {
    void *(*malloc)(ptrdiff_t, void *ctx);
    void  (*free)(void *, void *ctx);
    void  *ctx;
} allocator;
typedef struct set set;
set *set_new(allocator *);
set *set_free(set *);
bool set_add(set *, char *);
typedef struct json json;
json *json_load(char *buf, ptrdiff_t len, allocator *);
json *json_free(json *);
ptrdiff_t json_length(json *);
json *json_subscript(json *, ptrdiff_t i);
json *json_getfield(json *, char *field);
double json_getnumber(json *);
char *json_getstring(json *);
set
and json
objects retain a copy of the allocator
object for all
allocations made through that object. Given nothing, they default to the
standard library using the pass-through definitions above. Used together
with the standard library allocator:
typedef struct {
    double sum;
    bool   ok;
} sum_result;

sum_result sum_unique(char *buf, ptrdiff_t len)
{
    sum_result r = {0};

    json *namevals = json_load(buf, len, 0);
    if (!namevals) {
        return r; // parse error
    }

    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        json_free(namevals);
        return r; // not an array
    }

    set *seen = set_new(0);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            set_free(seen);
            json_free(namevals);
            return r; // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }
    set_free(seen);
    json_free(namevals);

    r.ok = 1;
    return r;
}
Which given as JSON input:
[
{"name": "foo", "value": 123},
{"name": "bar", "value": 456},
{"name": "foo", "value": 1000}
]
Would return 579.0
. Because it’s using standard library allocation, it
must carefully clean up before returning. There’s also no out-of-memory
handling because, in practice, programs typically do not get to observe
and respond to the standard allocator running out of memory.
We can improve and simplify it with an arena allocator:
typedef struct {
    char    *beg;
    char    *end;
    jmp_buf *oom;
} arena;

void *arena_malloc(ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    ptrdiff_t available = a->end - a->beg;
    ptrdiff_t alignment = -size & 15;
    if (size > available-alignment) {
        longjmp(*a->oom, 1);
    }
    return a->end -= size + alignment;
}

void arena_free(void *ptr, void *ctx)
{
    // nothing to do (yet!)
}
I’m allocating from the end rather than the beginning because it will make a later change simpler. Applying that to the function:
sum_result sum_unique(char *buf, ptrdiff_t len, arena scratch)
{
    sum_result r = {0};

    allocator a = {0};
    a.malloc = arena_malloc;
    a.free   = arena_free;
    a.ctx    = &scratch;

    json *namevals = json_load(buf, len, &a);
    if (!namevals) {
        return r; // parse error
    }

    ptrdiff_t arraylen = json_length(namevals);
    if (arraylen < 0) {
        return r; // not an array
    }

    set *seen = set_new(&a);
    for (ptrdiff_t i = 0; i < arraylen; i++) {
        json *element = json_subscript(namevals, i);
        json *name    = json_getfield(element, "name");
        json *value   = json_getfield(element, "value");
        if (!name || !value) {
            return r; // invalid element
        } else if (set_add(seen, json_getstring(name))) {
            r.sum += json_getnumber(value);
        }
    }

    r.ok = 1;
    return r;
}
Calls to set_free
and json_free
are no longer necessary because the
arena automatically frees these on any return, in O(1). I almost feel bad
the library authors bothered to write them! It also handles allocation
failure without introducing it to sum_unique
. We may even deliberately
restrict the memory available to this function — perhaps because the input
is untrusted, and we want to quickly abort denial-of-service attacks — by
giving it a small arena, relying on out-of-memory to reject pathological
inputs.
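A sketch of such a caller, with an arbitrary 1MiB cap (the static buffer keeps the example short at the cost of reentrancy):

sum_result sum_unique_bounded(char *buf, ptrdiff_t len)
{
    static char mem[1<<20];  // arbitrary cap on the parse
    jmp_buf oom;
    arena scratch = {mem, mem+sizeof(mem), &oom};
    if (setjmp(oom)) {
        sum_result r = {0};
        return r;  // out of memory: reject the input
    }
    return sum_unique(buf, len, scratch);
}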
There are so many possibilities unlocked by the context pointer.
When an application frees an object it always has the original, requested allocation size on hand. After all, it’s a necessary condition to use the object correctly. In the simplest case it’s the size of the freed object’s type: a static quantity. If it’s an array, then it’s a multiple of the tracked capacity: a dynamic quantity. In any case the size is either known statically or tracked dynamically by the application.
Yet free()
does not accept a size, meaning that the allocator must track
the information redundantly! That’s a needless burden on custom
allocators, and with a bit of care a library can lift it.
This was noticed in C++, and WG21 added sized deallocation in
C++14. It’s now the default on two of the three major implementations (and
probably not the two you’d guess). In other words, object size is so
readily available that it can mostly be automated away. Notable exception:
operator new[]
and operator delete[]
with trivial destructors. With
non-trivial destructors, operator new[]
must track the array length for
its own purposes on top of libc bookkeeping. In other words, array
allocations have their size stored in at least three different places!
That means the “free” interface should look like this:
void *lib_free(void *ptr, ptrdiff_t len, void *ctx);
And calls inside the library might look like:
lib_free(p, sizeof(*p), ctx);
lib_free(a, sizeof(*a)*len, ctx);
Now that arena_free
has size information, it can free an allocation if
it was the most recent:
void arena_free(void *ptr, ptrdiff_t size, void *ctx)
{
    arena *a = ctx;
    if (ptr == a->end) {
        ptrdiff_t alignment = -size & 15;
        a->end += size + alignment;
    }
}
If the library allocates short-lived objects to compute some value, then discards in reverse order, the memory can be reused. The arena doesn’t have to do anything special. The library merely needs to share its knowledge with the allocator.
Beyond arena allocation, an allocator could use the size to locate the allocation’s size class and, say, push it onto a freelist of its size class. Size-class freelists compose well with arenas, and an implementation is short and simple when the caller of “free” communicates object size.
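As an illustration, and not a finished implementation, an allocator with power-of-two size classes needs only a few lines when built on the arena and arena_malloc above (names are mine):

typedef struct node node;
struct node { node *next; };

typedef struct {
    arena *a;
    node  *free[64];  // one freelist per power-of-two size class
} pool;

static int sizeclass(ptrdiff_t size)
{
    int c = 4;  // smallest class: 16 bytes, enough for a freelist node
    while (((ptrdiff_t)1 << c) < size) c++;
    return c;
}

static void *pool_malloc(ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    int   c = sizeclass(size);
    node *n = p->free[c];
    if (n) {
        p->free[c] = n->next;  // reuse a previously freed block
        return n;
    }
    return arena_malloc((ptrdiff_t)1 << c, p->a);  // otherwise carve from the arena
}

static void pool_free(void *ptr, ptrdiff_t size, void *ctx)
{
    pool *p = ctx;
    node *n = ptr;
    int   c = sizeclass(size);
    n->next = p->free[c];  // push onto the freelist for its class
    p->free[c] = n;
}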
Another idea: During testing, use a debug allocator that tracks object size and validates the reported size against its own bookkeeping. This can help catch mistakes sooner.
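A bare-bones sketch of that idea, stashing the size in a header just ahead of each allocation (no out-of-memory handling):

typedef struct {
    ptrdiff_t   size;
    max_align_t pad;  // keep the user block aligned
} dbghdr;

static void *dbg_malloc(ptrdiff_t size, void *ctx)
{
    (void)ctx;
    dbghdr *h = malloc(sizeof(dbghdr) + size);
    h->size = size;
    return h + 1;
}

static void dbg_free(void *ptr, ptrdiff_t size, void *ctx)
{
    (void)ctx;
    if (ptr) {
        dbghdr *h = (dbghdr *)ptr - 1;
        assert(size == h->size);  // caller misreported the size
        free(h);
    }
}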
Resizing an allocation requires a lot from an allocator, and it should be avoided if possible. At the very least it cannot be done at all without knowing the original allocation size. An allocator can’t simply no-op it like it can with “free.” With the standard library interface, allocators have no choice but to redundantly track object sizes when “realloc” is required.
So, just as with “free,” the allocator should be given the old object size!
void *lib_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx);
At the very least, an allocator could implement “realloc” with “malloc”
and memcpy
:
void *arena_realloc(void *ptr, ptrdiff_t old, ptrdiff_t new, void *ctx)
{
    assert(new > old);
    void *r = arena_malloc(new, ctx);
    return memcpy(r, ptr, old);
}
Of the three checklist items, this is the most neglected. Exercise for the
reader: The last-allocated object can be resized in place, instead using
memmove
. If this is frequently expected, allocate from the front, adjust
arena_free
as needed, and extend the allocation in place as discussed a
previous addendum, without any copying.
Let’s examine real world examples to see how well they fit the checklist. First up is uthash, a popular, easy-to-use, intrusive hash table:
#define uthash_malloc(sz) my_malloc(sz)
#define uthash_free(ptr, sz) my_free(ptr)
No “realloc” so it trivially checks (3). It optionally provides the old size to “free” which checks (2). However it misses (1) which is the most important, greatly limiting its usefulness.
Next is the venerable zlib. It has function pointers with these
prototypes on its z_stream
object.
void *zlib_malloc(void *ctx, unsigned items, unsigned size);
void zlib_free(void *ctx, void *ptr);
The context pointer checks (1), and I can confirm from experience that it’s genuinely useful with a custom allocator. No “realloc” so it passes (3) automatically. It misses (2), but in practice this hardly matters: It allocates everything up front, and frees at the very end, meaning a no-op “free” is quite sufficient.
Finally there’s the Lua programming language with this economical, single-function interface:
void *lua_Alloc(void *ctx, void *ptr, size_t old, size_t new);
It packs all three allocator functions into one function. It includes a context pointer (1), a free size (2), and two realloc sizes (3). It’s a simple allocator’s best friend!
This has been a ground-breaking year for my C skills, and paradigm shifts in my technique have provoked me to reconsider my habits and coding style. It’s been my largest personal style change in years, so I’ve decided to take a snapshot of its current state and my reasoning. These changes have produced significant productive and organizational benefits, so while most is certainly subjective, it likely includes a few objective improvements. I’m not saying everyone should write C this way, and when I contribute code to a project I follow their local style. This is about what works well for me.
Starting with the fundamentals, I’ve been using short names for primitive
types. The resulting clarity was more than I had expected, and it’s made
my code more enjoyable to review. These names appear frequently throughout
a program, so conciseness pays. Also, now that I’ve gone without, _t
suffixes are more visually distracting than I had realized.
typedef uint8_t u8;
typedef char16_t c16;
typedef int32_t b32;
typedef int32_t i32;
typedef uint32_t u32;
typedef uint64_t u64;
typedef float f32;
typedef double f64;
typedef uintptr_t uptr;
typedef char byte;
typedef ptrdiff_t size;
typedef size_t usize;
Some people prefer an s
prefix for signed types. I prefer i
, plus as
you’ll see, I have other designs for s
. For sizes, isize
would be more
consistent, and wouldn’t hog the identifier, but signed sizes are the
way and so I want them in a place of privilege. usize
is niche,
mainly for interacting with external interfaces where it might matter.
b32
is a “32-bit boolean” and communicates intent. I could use _Bool
,
but I’d rather stick to a natural word size and stay away from its weird
semantics. To beginners it might seem like “wasting memory” by using a
32-bit boolean, but in practice that’s never the case. It’s either in a
register (return value, local variable) or would be padded anyway (struct
field). When it actually matters, I pack booleans into a flags
variable,
and a 1-byte boolean is rarely important.
While UTF-16 might seem niche, it’s a necessary evil when dealing with
Win32, so c16
(“16-bit character”) has made a frequent appearance. I
could have based it on uint16_t
, but putting the name char16_t
in its
“type hierarchy” communicates to debuggers, particularly GDB, that for
display purposes these variables hold character data. Officially Win32
uses a type named wchar_t
, but I like being explicit about UTF-16.
u8
is for octets, usually UTF-8 data. It’s distinct from byte
, which
represents raw memory and is a special aliasing type. In theory these
can be distinct types with differing semantics, though I’m not aware of
any implementation that does so (yet?). For now it’s about intent.
What about systems that don’t support fixed width types? That’s academic,
and far too much time has been wasted worrying about it. That includes
time wasted on typing out int_fast32_t
and similar nonsense. Virtually
no existing software would actually work correctly on such systems — I’m
certain nobody’s testing it after all — so it seems nobody else cares
either.
I don’t intend to use these names in isolation, such as in code snippets
(outside of this article). If I did, examples would require the typedefs
to give readers the complete context. That’s not worth extra explanation.
Even in the most recent articles I’ve used ptrdiff_t
instead of size
.
Next, some “standard” macros:
#define countof(a) (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
While I still prefer ALL_CAPS
for constants, I’ve adopted lowercase for
function-like macros because it’s nicer to read. They don’t have the same
namespace problems as other macro definitions: I can have a macro named
new()
and also variables and fields named new
because they don’t look
like function calls.
For GCC and Clang, my favorite assert
macro now looks like this:
#define assert(c) while (!(c)) __builtin_unreachable()
It has useful properties beyond the usual benefits:
It does not require separate definitions for debug and release builds. Instead it’s controlled by the presence of Undefined Behavior Sanitizer (UBSan), which is already present/absent in these circumstances. That includes fuzz testing.
libubsan
provides a diagnostic printout with a file and line number.
In release builds it turns into a practical optimization hint.
To enable assertions in release builds, put UBSan in trap mode with
-fsanitize-trap
and then enable at least -fsanitize=unreachable
. In
theory this can also be done with -funreachable-traps
, but as of this
writing it’s been broken for the past few GCC releases.
No const
. It serves no practical role in optimization, and I cannot
recall an instance where it caught, or would have caught, a mistake. I
held out for awhile as prototype documentation, but on reflection I found
that good parameter names were sufficient. Dropping const
has made me
noticeably more productive by reducing cognitive load and eliminating
visual clutter. I now believe its inclusion in C was a costly mistake.
(One small exception: I still like it as a hint to place static tables in
read-only memory closer to the code. I’ll cast away the const
if needed.
This is only of minor importance.)
Literal 0
for null pointers. Short and sweet. This is not new, but a
style I’ve used for about 7 years now, and has appeared all over my
writing since. There are some theoretical edge cases where it may cause
defects, and lots of ink has been spilled on the subject, but
after a couple 100K lines of code I’ve yet to see it happen.
restrict
when necessary, but better to organize code so that it’s not,
e.g. don’t write to “out” parameters in loops, or don’t use out parameters
at all (more on that momentarily). I don’t bother with inline
because I
compile everything as one translation unit anyway.
typedef
all structures. I used to shy away from it, but eliminating the
struct
keyword makes code easier to read. If it’s a recursive structure,
use a forward declaration immediately above so that such fields can use
the short name:
typedef struct map map;
struct map {
    map *child[4];
    // ...
};
Declare all functions static
except for entry points. Again, with
everything compiled as a single translation unit there’s no reason to do
otherwise. It was probably a mistake for C not to default to static
,
though I don’t have a strong opinion on the matter. With the clutter
eliminated through short types, no const
, no struct
, etc. functions
fit comfortably on the same line as their return type. I used to break
them apart so that the function name began on its own line, but that’s no
longer necessary.
In my writing I sometimes omit static
to simplify, and because outside
the context of a complete program it’s mostly irrelevant. However, I will
use it below to emphasize this style.
For awhile I capitalized type names as that effectively put them in a kind of namespace apart from variables and functions, but I eventually stopped. I may try this idea in different way in the future.
One of my most productive changes this year has been the total rejection of null terminated strings — another of those terrible mistakes — and the embrace of this basic string type:
#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
    u8  *data;
    size len;
} s8;
I’ve used a few names for it, but this is my favorite. The s
is for
string, and the 8
is for UTF-8 or u8
. The s8
macro (sometimes just
spelled S
) wraps a C string literal, making a s8
string out of it. A
s8
is handled like a fat pointer, passed and returned by copy.
s8
makes for a great function prefix, unlike str
, all of which are
reserved. Some examples:
static s8 s8span(u8 *, u8 *);
static b32 s8equals(s8, s8);
static size s8compare(s8, s8);
static u64 s8hash(s8);
static s8 s8trim(s8);
static s8 s8clone(s8, arena *);
Then when combined with the macro:
if (s8equals(tagname, s8("body"))) {
// ...
}
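For a sense of scale, minimal sketches of two of those functions under the definitions above:

static s8 s8span(u8 *beg, u8 *end)
{
    s8 s = {0};
    s.data = beg;
    s.len  = end - beg;
    return s;
}

static b32 s8equals(s8 a, s8 b)
{
    if (a.len != b.len) {
        return 0;
    }
    for (size i = 0; i < a.len; i++) {
        if (a.data[i] != b.data[i]) {
            return 0;
        }
    }
    return 1;
}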
You might be tempted to use a flexible array member to pack the size and array together as one allocation. Tried it. Its inflexibility is totally not worth whatever benefits it might have. Consider, for instance, how you’d create such a string out of a literal, and how it would be used.
A few times I’ve thought, “This program is simple enough that I don’t need
a string type for this data.” That thought is nearly always wrong. Having
it available helps me think more clearly, and makes for simpler programs.
(C++ got it only a few years ago with std::string_view
and std::span
.)
It has a natural UTF-16 counterpart, s16
:
#define s16(s) (s16){u##s, lengthof(u##s)}
typedef struct {
    c16 *data;
    size len;
} s16;
I’m not entirely sold on gluing u
to the literal in the macro, versus
writing it out on the string literal.
Another change has been preferring structure returns instead of out parameters. It’s effectively a multiple value return, though without destructuring. A great organizational change. For example, this function returns two values, a parse result and a status:
typedef struct {
    i32 value;
    b32 ok;
} i32parsed;
static i32parsed i32parse(s8);
Worried about the “extra copying?” Have no fear, because in practice
calling conventions turn this into a hidden, restrict
-qualified out
parameter — if it’s not inlined such that any return value overhead would
be irrelevant anyway. With this return style I’m less tempted to use
in-band signals like special null returns to indicate errors, which is
less clear.
It’s also led to a style of defining a zero-initialized return value at
the top of the function, i.e. ok
is false, and then use it for all
return
statements. On error, it can bail out with an immediate return.
The success path sets ok
to true before the return.
static i32parsed i32parse(s8 s)
{
    i32parsed r = {0};
    for (size i = 0; i < s.len; i++) {
        u8 digit = s.data[i] - '0';
        // ...
        if (overflow) {
            return r;
        }
        r.value = r.value*10 + digit;
    }
    r.ok = 1;
    return r;
}
Aside from static data, I’ve also moved away from initializers except the
conventional zero initializer. (Notable exception: s8
and s16
macros.)
This includes designated initializers. Instead I’ve been initializing with
assignments. For example, this buffered output “constructor”:
typedef struct {
    u8 *buf;
    i32 len;
    i32 cap;
    i32 fd;
    b32 err;
} u8buf;

static u8buf newu8buf(arena *perm, i32 cap, i32 fd)
{
    u8buf r = {0};
    r.buf = new(perm, u8, cap);
    r.cap = cap;
    r.fd  = fd;
    return r;
}
I like how this reads, but it also eliminates a cognitive burden: The assignments are separated by sequence points, giving them an explicit order. It doesn’t matter here, but in other cases it does:
example e = {
    .name = randname(&rng),
    .age  = randage(&rng),
    .seat = randseat(&rng),
};
There are 6 possible values for e
from the same seed. I like no longer
thinking about these possibilities.
Prefer __attribute
to __attribute__
. The __
suffix is excessive and
unnecessary.
__attribute((malloc, alloc_size(2, 4)))
For Win32 systems programming, which typically only requires a modest
number of declarations and definitions, rather than include windows.h
,
write the prototypes out by hand using custom types. It reduces
build times, declutters namespaces, and interfaces more cleanly with the
program (no more DWORD
/BOOL
/ULONG_PTR
, but u32
/b32
/uptr
).
#define W32(r) __declspec(dllimport) r __stdcall
W32(void) ExitProcess(u32);
W32(i32) GetStdHandle(u32);
W32(byte *) VirtualAlloc(byte *, usize, u32, u32);
W32(b32) WriteConsoleA(uptr, u8 *, u32, u32 *, void *);
W32(b32) WriteConsoleW(uptr, c16 *, u32, u32 *, void *);
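As a quick sketch of using those prototypes (the function name is mine, and -11 is STD_OUTPUT_HANDLE):

static void printmsg(void)
{
    u8  msg[] = "hello from win32\n";
    u32 n;
    i32 conout = GetStdHandle(-11);  // STD_OUTPUT_HANDLE
    WriteConsoleA(conout, msg, lengthof(msg), &n, 0);
}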
For inline assembly, treat the outer parentheses like braces, put a space
before the opening parenthesis, just like if
, and start each constraint
line with its colon.
static u64 rdtscp(void)
{
    u32 hi, lo;
    asm volatile (
        "rdtscp"
        : "=d"(hi), "=a"(lo)
        :
        : "cx", "memory"
    );
    return (u64)hi<<32 | lo;
}
There’s surely a lot more to my style than this, but unlike the above,
those details haven’t changed this year. To see most of the mentioned
items in action in a small program, see wordhist.c
, one of my
testing grounds for hash-tries, or for a slightly larger program,
asmint.c
, a mini programming language implementation.
Unlike a hash map or linked list, a dynamic array — a data buffer with a
size that varies during run time — is more difficult to square with arena
allocation. They’re contiguous by definition, and we cannot resize objects
in the middle of an arena, i.e. realloc
. So while convenient, they come
with trade-offs. At least until they stop growing, dynamic arrays are more
appropriate for shorter-lived, temporary contexts, where you would use a
scratch arena. On average they consume about twice the memory of a fixed
array of the same size.
As before, I begin with a motivating example of its use. The guts of the
generic dynamic array implementation are tucked away in a push()
macro,
which is essentially the entire interface.
typedef struct {
    int32_t  *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} int32s;

int32s fibonacci(int32_t max, arena *perm)
{
    static int32_t init[] = {0, 1};
    int32s fib = {0};
    fib.data = init;
    fib.len = fib.cap = countof(init);
    for (;;) {
        int32_t a = fib.data[fib.len-2];
        int32_t b = fib.data[fib.len-1];
        if (a+b > max) {
            return fib;
        }
        *push(&fib, perm) = a + b;
    }
}
Anyone familiar with Go will quickly notice a pattern: int32s
looks an
awful lot like a Go slice. That was indeed my inspiration, and
there is enough context that you could infer similar semantics. I
will even call these “slice headers.” Initially I tried a design based on
stretchy buffers, but I didn’t like the macros nor the ergonomics.
I wouldn’t write a fibonacci
this way in practice, but it’s useful for
highlighting certain features. Of particular note:
The dynamic array initially wraps a static array, yet I can append to it as though it were a dynamic allocation. If I don’t append at all, it still works. (Though of course the caller then shouldn’t modify the elements.)
push()
operates on any object which is slice-shaped. That is, it has
a pointer field named data
, a ptrdiff_t
length field named len
, a
ptrdiff_t
capacity field named cap
, and all in that order.
push()
evaluates to a pointer to the newly-pushed element. In my
example I immediately dereference and assign a value.
An element is zero-initialized the first time it’s pushed. I say “first
time” because you can truncate an array by reducing len
, and “pushing”
afterward will simply reveal the original elements.
The name int32s
is intended to evoke plurality. I’ll use this
convention again in a moment.
The arena passed to push()
is only used if the array needs to grow.
The new backing array will be allocated out of this arena regardless of
the original backing array.
Resizes always change the backing array address, and the old array remains valid. This is also just like slices in Go.
Despite the name perm
, I expect it points to the caller’s scratch
arena. It’s “permanent” only relative to the fibonacci
call. Otherwise
I might build the array in a scratch arena, then create a final copy in
a permanent arena.
For a slightly more realistic example: rendering triangles. Suppose we need data in array format for OpenGL, but we don’t know the number of vertices ahead of time. A dynamic array is convenient, especially if we discard the array as soon as OpenGL is done with it. We could build up entire scenes like this for each display frame.
typedef struct {
    GLfloat x, y, z;
} GLvert;

typedef struct {
    GLvert   *data;
    ptrdiff_t len;
    ptrdiff_t cap;
} GLverts;

void renderobj(char *buf, ptrdiff_t len, arena scratch)
{
    GLverts vs = {0};
    objparser parser = newobjparser(buf, len);
    for (...) {
        *push(&vs, &scratch) = nextvert(&parser);
    }
    glVertexPointer(3, GL_FLOAT, 0, vs.data);
    glDrawArrays(GL_TRIANGLES, 0, vs.len);
}
As before, GLverts
is slice-shaped. This time it’s zero-initialized,
which is a valid empty dynamic array. As with maps, that means any object
with such a field comes with a ready-to-use empty dynamic array. Putting
it together, here’s an example that gradually appends vertices to named
dynamic arrays, randomly accessed by string name:
typedef struct map map;
struct map {
    map    *child[4];
    str     name;
    GLverts verts;
};

GLverts *upsert(map **, str, arena *); // from the last article

map *example(..., arena *perm)
{
    map *m = 0;
    for (...) {
        str name = ...;
        GLvert v = ...;
        GLverts *vs = upsert(&m, name, perm);
        *push(vs, perm) = v;
    }
    return m;
}
That’s what Go would call map[str][]vert
, but allocated entirely out of
an arena. Ever thought C could do this so simply and conveniently? The
memory allocator (~15 lines), map (~30 lines), dynamic array (~30 lines),
constructors (0 lines), and destructors (0 lines) that power this total to
~75 lines of zero-dependency code!
I despise macro abuse, and programs substantially implemented in macros are annoying. They’re difficult to understand and debug. A good dynamic array implementation will require a macro, and one of my goals was to keep it as simple and minimal as possible. The macro’s job is to:
- detect when the dynamic array is out of capacity
- if so, call a conventional function to grow it, passing the element size (via sizeof
) to that function
- evaluate to a pointer to the newly-pushed element
Here’s what I came up with:
#define push(s, arena) \
    ((s)->len >= (s)->cap \
        ? grow(s, sizeof(*(s)->data), arena), \
          (s)->data + (s)->len++ \
        : (s)->data + (s)->len++)
The macro will be used as an expression, so it cannot use statements like
if
. The condition is therefore a ternary operator. If it’s full, it
calls the supporting grow
function. In either case, it computes the
result from data
. In particular, note that the grow
branch uses a
comma operator to sequence growth before pointer derivation, as grow
will change the value of data
as a side effect.
To be generic, the grow
function uses memcpy
-based type punning:
static void grow(void *slice, ptrdiff_t size, arena *a)
{
struct {
void *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
replica.cap = replica.cap ? replica.cap : 1;
ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
replica.cap *= 2;
if (replica.len) {
memcpy(data, replica.data, size*replica.len);
}
replica.data = data;
memcpy(slice, &replica, sizeof(replica));
}
The slice header is copied into a local replica, the archetypal slice header, avoiding conflicts with strict aliasing. It still requires that different pointer types have identical memory representation. That’s virtually always true, and certainly true anywhere I’d use an arena.
If the capacity was zero, it behaves as though it was one, and so, through
doubling, zero-capacity arrays become capacity-2 arrays on the first push.
It’s better to let alloc
— whose definition, you may recall, included an
overflow check — handle size overflow so that it can invoke the out of
memory policy, so instead of doubling cap
, which would first require an
overflow check, it doubles the object size. This is a small constant
(i.e. from sizeof
), so doubling it is always safe.
Copying over old data includes a special check for zero-length inputs,
because, quite frustratingly, memcpy
does not accept null even
when the length is zero. I check for zero length instead of null so that
it’s more sensitive to defects. If the pointer is null with a non-zero
length, it will trip Undefined Behavior Sanitizer, or at least crash the
program, rather than silently skip copying.
Finally the updated replica is copied over the original slice header,
updating it with the new data
pointer and capacity. The original backing
array is untouched but is no longer referenced through this slice header.
Old slice headers will continue to function with the old backing array,
such as when the arena is reset to a point where the dynamic array was
smaller.
int32s vals = {0};
*push(&vals, &scratch) = 1; // resize: cap=2
*push(&vals, &scratch) = 2;
*push(&vals, &scratch) = 3; // resize: cap=4
{
arena tmp = scratch; // scoped arena
int32s extended = vals;
*push(&extended, &tmp) = 4;
*push(&extended, &tmp) = 5; // resize: cap=8
example(extended);
}
// vals still works, cap=4, extension freed
In practice, a dynamic array leaves behind a trail of old backing arrays whose total size adds up to just shy of the current capacity. For example, if the current capacity is 16, the discarded arrays have capacities 2+4+8 = 14.
If you’re worried about misuse, such as slice header fields being in the
wrong order, a couple of assertions can quickly catch such mistakes at run
time, typically under the lightest of testing. In fact, I planned for this
by using the more-sensitive len>=cap
instead of just len==cap
, so that
it would direct execution towards assertions in grow
:
assert(replica.len >= 0);
assert(replica.cap >= 0);
assert(replica.len <= replica.cap);
This also demonstrates another benefit of signed sizes: Exactly half the range is invalid and so defects tend to quickly trip these assertions.
Alignment is unfortunately fixed, and I picked a “safe” value of 16. In my
new()
macro I used _Alignof
to pass type information to alloc
. Due
to an oversight, unlike sizeof
, _Alignof
cannot be applied
to expressions, and so it cannot be used in dynamic arrays. GCC and Clang
support _Alignof
on expressions just like sizeof
, as it’s such an
obvious idea, but Microsoft chose to strictly follow the oversight in the
standard. To support MSVC, I’ve deliberately limited the capabilities of
push
. If that doesn’t matter, fixing it is easy:
--- a/example.c
+++ b/example.c
@@ -2,3 +2,3 @@
((s)->len >= (s)->cap \
- ? grow(s, sizeof(*(s)->data), arena), \
+ ? grow(s, sizeof(*(s)->data), _Alignof(*(s)->data), arena), \
(s)->data + (s)->len++ \
@@ -6,3 +6,3 @@
-static void grow(void *slice, ptrdiff_t size, arena *a)
+static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
@@ -16,3 +16,2 @@
replica.cap = replica.cap ? replica.cap : 1;
- ptrdiff_t align = 16;
void *data = alloc(a, 2*size, align, replica.cap);
While you’re at it, if you’re already using extensions, you might want to
switch push
to a statement expression so that the slice header s
is not evaluated more than once — i.e. so that upsert()
in my example above could be used inside the push()
expression.
#define push(s, a) ({ \
typeof(s) s_ = (s); \
typeof(a) a_ = (a); \
if (s_->len >= s_->cap) { \
grow(s_, sizeof(*s_->data), _Alignof(*s_->data), a_); \
} \
s_->data + s_->len++; \
})
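With single evaluation, an expression with side effects is safe as the slice argument. For instance, the upsert() call from the map example above could go directly inside push(), as in this one-line sketch reusing those names:

*push(upsert(&m, name, perm), perm) = v;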
So far this approach to dynamic arrays has been useful on a number of occasions, and I’m quite happy with the results. As with arena-friendly hash maps, I’ve no doubt they’ll become a staple in my C programs.
Dennis Schön suggests checking whether the array ends at the arena’s next
allocation and, if so, extending the array in place. grow()
already has the necessary information on hand, so it needs only the
additional check:
static void grow(void *slice, ptrdiff_t size, ptrdiff_t align, arena *a)
{
struct {
char *data;
ptrdiff_t len;
ptrdiff_t cap;
} replica;
memcpy(&replica, slice, sizeof(replica));
if (!replica.data) {
replica.cap = 1;
replica.data = alloc(a, 2*size, align, replica.cap);
} else if (a->beg == replica.data + size*replica.cap) {
alloc(a, size, 1, replica.cap);
} else {
void *data = alloc(a, 2*size, align, replica.cap);
memcpy(data, replica.data, size*replica.len);
replica.data = data;
}
replica.cap *= 2;
memcpy(slice, &replica, sizeof(replica));
}
Because that’s yet another check for null, I’ve split it out into an independent third case. Not quite as simple, but it improves the most common case.
I’ve written before about MSI hash tables, a simple, very fast map that can be quickly implemented from scratch as needed, tailored to the problem at hand. The trade-off is that one must know the upper bound a priori in order to size the base array. Scaling up requires resizing the array — an impedance mismatch with arena allocation. Search trees scale better, as there’s no underlying array, but tree balancing tends to be finicky and complex, unsuitable to rapid, on-demand implementation. We want the ease of an MSI hash table with the scaling of a tree.
I’ll motivate the discussion with example usage. Suppose we have an array of pointer+length strings, as defined last time:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
And we need a function that removes duplicates in place, but (for the
moment) we’re not worried about preserving order. This could be done
naively in quadratic time. Smarter is to sort, then look for runs.
Instead, I’ve used a hash map to track seen strings. It maps str
to
bool
, and it is represented as type strmap
and one insert+lookup
function, upsert
.
// Insert/get bool value for given str key.
bool *upsert(strmap **, str key, arena *);
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
strmap *seen = 0;
while (count < len) {
bool *b = upsert(&seen, strings[count], &scratch);
if (*b) {
// previously seen (discard)
strings[count] = strings[--len];
} else {
// newly-seen (keep)
count++;
*b = 1;
}
}
return count;
}
In particular, note:
A null pointer is an empty hash map and initialization is trivial. As discussed in the last article, one of my arena allocation principles is default zero-initialization. Put together, that means any data structure containing a map comes with a ready-to-use, empty map.
The map is allocated out of the scratch arena so it’s automatically freed upon any return. It’s as care-free as garbage collection.
The map directly uses strings in the input array as keys, without making copies or worrying about ownership. Arenas own objects, not references. If I wanted to carve out some fixed keys ahead of time, I could even insert static strings.
upsert
returns a pointer to a value. That is, a pointer into the map.
This is not strictly required, but usually makes for a simple interface.
When an entry is new, this value will be false (zero-initialized).
So, what is this wonderful data structure? Here’s the basic shape:
typedef struct {
hashmap *child[4];
keytype key;
valtype value;
} hashmap;
The child
and key
fields are essential to the map. Adding a child
to any data structure turns it into a hash map over whatever field you
choose as the key. In other words, a hash-trie can serve as an intrusive
hash map. In several programs I’ve combined intrusive lists and hash maps
to create an insert-ordered hash map. Going the other direction, omitting
value
turns it into a hash set. (Which is what unique
really needs!)
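For instance, here is a sketch of an insert-ordered intrusive variant: the hash-trie fields plus an intrusive list link. The names are illustrative, not taken from any particular program.

typedef struct node node;
struct node {
    node *child[4];   // hash-trie branches (the map part)
    node *next;       // insertion-order list (the intrusive part)
    str   key;
    int   value;
};

An upsert would link each newly-allocated node onto the tail of the list, so iteration visits entries in insertion order while lookups still go through the trie.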
As you probably guessed, this hash-trie is a 4-ary tree. It can easily be
2-ary (leaner but slower) or 8-ary (bigger and usually no faster), but
4-ary strikes a good balance, if a bit bulky. In the example above,
keytype
would be str
and valtype
would be bool
. The most general
form of upsert
looks like this:
valtype *upsert(hashmap **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return &(*m)->value;
}
m = &(*m)->child[h>>62];
}
if (!perm) {
return 0;
}
*m = new(perm, hashmap);
(*m)->key = key;
return &(*m)->value;
}
This will take some unpacking. The first argument is a pointer to a
pointer. That’s the destination for any newly-allocated element. As it
travels down the tree, this points into the parent’s child
array. If
it points to null, then it’s an empty tree which, by definition, does not
contain the key.
We need two “methods” for keys: hash
and equals
. The hash function
should return a uniformly distributed integer. As is usually the case,
less uniform fast hashes generally do better than highly-uniform slow
hashes. For hash maps under ~100K elements a 32-bit hash is fine, but
larger maps should use a 64-bit hash state and result. Hash collisions
revert to linear, linked list performance and, per the birthday paradox,
that will happen often with 32-bit hashes on large hash maps.
If you’re worried about pathological inputs, add a seed parameter to
upsert
and hash
. Or maybe even use the address m
as a seed. The
specifics depend on your security model. It’s not an issue for most hash
maps, so I don’t demonstrate it here.
The top two bits of the hash are used to select a branch. These tend to be higher quality for multiplicative hash functions. At each level two bits are shifted out. This is what gives it its name: a trie of the hash bits. Though it’s un-trie-like in the way it deposits elements at the first empty spot. To make it 2-ary or 8-ary, use 1 or 3 bits at a time.
I initially tried a Multiplicative Congruential Generator (MCG) to select the next branch at each trie level, instead of bit shifting, but NRK noticed it was consistently slower than shifting.
While “delete” could be handled using gravestones, heavy deletion would not work well. After all, the underlying allocator is an arena, so nothing is ever truly freed. A combination of uniformly distributed branching and no deletion means that rebalancing is unnecessary. This is what grants it its simplicity!
If no arena is provided, it reverts to a lookup and returns null when the
key is not found. It allows one function to flexibly serve both modes. In
unique
, pure lookups are unneeded, so this condition could be skipped in
its strmap
.
Sometimes it’s useful to return the entire hashmap
object itself rather
than an internal pointer, particularly when it’s intrusive. Use whichever
works best for the situation. Regardless, exploit zero-initialization to
detect newly-allocated elements when possible.
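As a sketch, the node-returning variant is the same loop with a different return. The name is hypothetical, and the lookup-only (null arena) branch is omitted:

hashmap *upsertnode(hashmap **m, keytype key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return *m;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, hashmap);   // zero-initialized, so callers can detect a fresh node
    (*m)->key = key;
    return *m;
}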
In some cases we may deep copy the key into the map’s arena before inserting it
into the map. The provided key may be a temporary (e.g. sprintf
) which
the map outlives, and the caller doesn’t want to allocate a longer-lived
key unless it’s needed. It’s all part of tailoring the map to the problem,
which we can do because it’s so short and simple!
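For example, with str keys a copy-on-insert variant of upsert might look like this sketch (string.h assumed for memcpy; the copy is the only change from the version above):

valtype *upsert_copykey(hashmap **m, str key, arena *perm)
{
    for (uint64_t h = hash(key); *m; h <<= 2) {
        if (equals(key, (*m)->key)) {
            return &(*m)->value;
        }
        m = &(*m)->child[h>>62];
    }
    *m = new(perm, hashmap);
    str copy = {0};
    copy.len  = key.len;
    copy.data = new(perm, uint8_t, key.len);
    if (key.len) {
        memcpy(copy.data, key.data, key.len);  // deep copy into the map's arena
    }
    (*m)->key = copy;
    return &(*m)->value;
}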
Putting it all together, unique
could look like the following, with
strmap
/upsert
renamed to strset
/ismember
:
uint64_t hash(str s)
{
uint64_t h = 0x100;
for (ptrdiff_t i = 0; i < s.len; i++) {
h ^= s.data[i];
h *= 1111111111111111111u;
}
return h;
}
bool equals(str a, str b)
{
return a.len==b.len && !memcmp(a.data, b.data, a.len);
}
typedef struct strset strset;
struct strset {
strset *child[4];
str key;
};
bool ismember(strset **m, str key, arena *perm)
{
for (uint64_t h = hash(key); *m; h <<= 2) {
if (equals(key, (*m)->key)) {
return 1;
}
m = &(*m)->child[h>>62];
}
*m = new(perm, strset);
(*m)->key = key;
return 0;
}
ptrdiff_t unique(str *strings, ptrdiff_t len, arena scratch)
{
ptrdiff_t count = 0;
for (strset *seen = 0; count < len;) {
if (ismember(&seen, strings[count], &scratch)) {
strings[count] = strings[--len];
} else {
count++;
}
}
return count;
}
The FNV hash multiplier is 19 ones, my favorite prime. I don’t bother with
an xorshift finalizer because the bits are used most-significant first.
Exercise for the reader: Support retaining the original input order using
an intrusive linked list on strset
.
As mentioned, four pointers per entry — 32 bytes on 64-bit hosts — makes these hash-tries a bit heavier than average. It’s not an issue for smaller hash maps, but has practical consequences for huge hash maps.
In an attempt to address this, I experimented with relative pointers
(example: markov.c
). That is, instead of pointers I use signed
integers whose value indicates an offset relative to itself. Because
relative pointers can only refer to nearby memory, a custom allocator is
imperative, and arenas fit the bill perfectly. Range can be extended by
exploiting memory alignment. In particular, 32-bit relative pointers can
reference up to 8GiB in either direction. Zero is reserved to represent a
null pointer, and relative pointers cannot refer to themselves.
As a bonus, data structures built out of relative pointers are position independent. A collection of them — perhaps even a whole arena — can be dumped out to, say, a file, loaded back at a different position, then continue to operate as-is. Very cool stuff.
Using 32-bit relative pointers on 64-bit hosts cuts the hash-trie overhead in half, to 16 bytes. With an arena no larger than 8GiB, such pointers are guaranteed to work. No object is ever too far away. It’s a compounding effect, too. Smaller map nodes means a larger number of them are in reach of a relative pointer. Also very cool.
However, as far as I know, no generally available programming language implementation supports this concept well enough to put into practice. You could implement relative pointers with language extension facilities, such as C++ operator overloads, but no tools will understand them — a major bummer. You can no longer use a debugger to examine such structures, and it’s just not worth that cost. If only arena allocation was more popular…
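To make the idea concrete, here is a minimal sketch of self-relative 32-bit pointers. The names are mine, and the alignment-shift trick that extends the range to 8GiB is omitted for clarity:

typedef int32_t relptr;   // 0 means null; otherwise a self-relative byte offset

static void *rel_get(relptr *field)
{
    return *field ? (char *)field + *field : 0;
}

static void rel_set(relptr *field, void *target)
{
    *field = target ? (relptr)((char *)target - (char *)field) : 0;
}

A hash-trie node would then hold relptr child[4] instead of four full pointers, with every access going through these helpers, which is exactly the tooling friction described above.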
For the finale, let’s convert upsert
into a concurrent, lock-free hash
map. That is, multiple threads can call upsert concurrently on the same
map. Each must still have its own arena, probably per-thread arenas, so
there’s no implicit locking for allocation.
The structure itself requires no changes! Instead we need two atomic
operations: atomic load (acquire), and atomic compare-and-exchange
(acquire/release). They operate only on child
array elements and the
tree root. To illustrate I will use GCC atomics, also supported by
Clang.
valtype *upsert(map **m, keytype key, arena *perm)
{
for (uint64_t h = hash(key);; h <<= 2) {
map *n = __atomic_load_n(m, __ATOMIC_ACQUIRE);
if (!n) {
if (!perm) {
return 0;
}
arena rollback = *perm;
map *new = new(perm, map, 1);
new->key = key;
int pass = __ATOMIC_RELEASE;
int fail = __ATOMIC_ACQUIRE;
if (__atomic_compare_exchange_n(m, &n, new, 0, pass, fail)) {
return &new->value;
}
*perm = rollback;
}
if (equals(n->key, key)) {
return &n->value;
}
m = n->child + (h>>62);
}
}
First an atomic load retrieves the current node. If there is no such node, then attempt to insert one using atomic compare-and-exchange. The ABA problem is not an issue thanks again to lack of deletion: Once set, a pointer never changes. Before allocating a node, take a snapshot of the arena so that the allocation can be reverted on failure. If another thread got there first, continue tumbling down the tree as though a null was never observed.
On compare-and-swap failure, it turns into an acquire load, just as it began. On success, it’s a release store, synchronizing with acquire loads on other threads.
The key
field does not require atomics because it’s synchronized by the
compare-and-swap. That is, the assignment will happen before the node is
inserted, and keys do not change after insertion. The same goes for any
zeroing done by the arena.
Loads and stores through the returned pointer are the caller’s
responsibility. These likely require further synchronization. If
valtype
is a shared counter then an atomic increment is sufficient. In
other cases, upsert
should probably be modified to accept an initial
value to be assigned alongside the key so that the entire key/value pair
is inserted atomically. Alternatively, break it into two steps. The
details depend on the needs of the program.
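For the shared-counter case, the caller-side synchronization is just an atomic read-modify-write on the returned pointer. A sketch, assuming valtype is int and reusing the map, str, and upsert names above (the word-count framing is mine):

void count_word(map **m, str word, arena *perm)
{
    int *n = upsert(m, word, perm);
    __atomic_fetch_add(n, 1, __ATOMIC_RELAXED);  // value starts at zero via the arena
}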
On small trees there will be much contention near the root during inserts. Fortunately, a contentious tree will not stay small for long! The hash function will spread threads around a large tree, generally keeping them off each other’s toes.
A complete demo you can try yourself: concurrent-hash-trie.c
.
It returns a value pointer like above, and store/load is synchronized by
the thread join. Each thread is given a per-thread subarena allocated out
of the main arena, and the final tree is built from these subarenas.
For a practical example: a multithreaded rainbow table to find hash function collisions. Threads are synchronized solely through atomics in the shared hash-trie.
A complete fast, concurrent, lock-free hash map in under 30 lines of C sounds like a sweet deal to me!
Over the past year I’ve refined my approach to arena allocation. With practice, it’s effective, simple, and fast; typically as easy to use as garbage collection but without the costs. Depending on need, an allocator can weigh just 7–25 lines of code — perfect when lacking a runtime. With the core details of my own technique settled, now is a good time to document and share lessons learned. This is certainly not the only way to approach arena allocation, but these are practices I’ve worked out to simplify programs and reduce mistakes.
An arena is a memory buffer and an offset into that buffer, initially zero. To allocate an object, grab a pointer at the offset, advance the offset by the size of the object, and return the pointer. There’s a little more to it, such as ensuring alignment and availability. We’ll get to that. Objects are not freed individually. Instead, groups of allocations are freed at once by restoring the offset to an earlier value. Without individual lifetimes, you don’t need to write destructors, nor do your programs need to walk data structures at run time to take them apart. You also no longer need to worry about memory leaks.
A minority of programs inherently require general purpose allocation, at least in part, that linear allocation cannot fulfill. This includes, for example, most programming language runtimes. If you like arenas, avoid accidentally creating such a situation through an over-flexible API that allows callers to assume you have general purpose allocation underneath.
To get warmed up, here’s my style of arena allocation in action that shows off multiple features:
typedef struct {
uint8_t *data;
ptrdiff_t len;
} str;
typedef struct {
strlist *next;
str item;
} strlist;
typedef struct {
str head;
str tail;
} strpair;
// Defined elsewhere
void towidechar(wchar_t *, ptrdiff_t, str);
str loadfile(wchar_t *, arena *);
strpair cut(str, uint8_t);
strlist *getlines(str path, arena *perm, arena scratch)
{
int max_path = 1<<15;
wchar_t *wpath = new(&scratch, wchar_t, max_path);
towidechar(wpath, max_path, path);
strpair pair = {0};
pair.tail = loadfile(wpath, perm);
strlist *head = 0;
strlist **tail = &head;
while (pair.tail.len) {
pair = cut(pair.tail, '\n');
*tail = new(perm, strlist, 1);
(*tail)->item = pair.head;
tail = &(*tail)->next;
}
return head;
}
Take note of these details, each to be later discussed in detail:
getlines
takes two arenas, “permanent” and “scratch”. The former is
for objects that will be returned to the caller. The latter is for
temporary objects whose lifetime ends when the function returns. They
have stack lifetimes just like local variables.
Objects are not explicitly freed. Instead, all allocations from a scratch arena are implicitly freed upon return. This would include error return paths automatically.
The scratch arena is passed by copy — i.e. a copy of the “header” not the memory region itself. Allocating only changes the local copy, and so cannot survive the return. The semantics are obvious to callers, so they’re less likely to get mixed up.
While wpath
could be an automatic local variable, it’s relatively
large for the stack, so it’s allocated out of the scratch arena. A
scratch arena safely permits large, dynamic allocations that would never
be safe on the stack. In other words, a sane alloca
!
Same for variable-length arrays (VLAs). A scratch arena means you’ll
never be tempted to use either of these terrible ideas.
The second parameter to new
is a type, so it’s obviously a macro. As
you will see momentarily, this is not some complex macro magic, just a
convenience one-liner. There is no implicit cast, and you will get a
compiler diagnostic if the type is incorrect.
Despite all the allocation, there is not a single sizeof
operator nor
size computation. That’s because size computations are a major source
of defects. That job is handled by specialized code.
Allocation failures are not communicated by a null return. Lifting this burden greatly simplifies programs. Instead such errors are handled non-locally by the arena.
All allocations are zero-initialized by default. This makes for simpler, less error-prone programs. When that’s too expensive, this can become an opt-out without changing the default.
See also u-config.
An arena suitable for most cases can be this simple:
typedef struct {
char *beg;
char *end;
} arena;
void *alloc(arena *a, ptrdiff_t size, ptrdiff_t align, ptrdiff_t count)
{
ptrdiff_t padding = -(uintptr_t)a->beg & (align - 1);
ptrdiff_t available = a->end - a->beg - padding;
if (available < 0 || count > available/size) {
abort(); // one possible out-of-memory policy
}
void *p = a->beg + padding;
a->beg += padding + count*size;
return memset(p, 0, count*size);
}
Yup, just a pair of pointers! When allocating, all sizes are signed just as they ought to be. Unsigned sizes are another historically common source of defects, and offer no practical advantages in return.
The align
parameter allows the arena to handle any unusual alignments,
something that’s surprisingly difficult to do with libc. It’s difficult to
appreciate its usefulness until it’s convenient.
The uintptr_t
business may look unusual if you’ve never come across it
before. To align beg
, we need to compute the number of bytes to advance
the address (padding
) until the alignment evenly divides the address.
The modulo with align
computes the number of bytes since the last
alignment:
extra = addr % align
We can’t operate numerically on an address like this, so in the code we
first convert to uintptr_t
. Alignment is always a power of two, which
notably excludes zero, so no worrying about division by zero. That also
means we can compute modulo by subtracting one and masking with AND:
extra = addr & (align - 1)
However, we want the number of bytes to advance to the next alignment, which is the inverse:
padding = -addr & (align - 1)
Add the uintptr_t
cast and you have the code in alloc
.
The if
tests if there’s enough memory and simultaneously for overflow on
size*count
. If either fails, it invokes the out-of-memory policy, which
in this case is abort
. I strongly recommend, at least when testing,
always having something in place to, at minimum, abort when allocation
fails, even when you think it cannot happen. It’s easy to use more memory
than you anticipate, and you want a reliable signal when it happens.
An alternative policy is to longjmp to a “handler”, which with
GCC and Clang doesn’t even require runtime support. In that case add a
jmp_buf
to the arena:
typedef struct {
char *beg;
char *end;
void **jmp_buf;
} arena;
void *alloc(...)
{
// ...
if (/* out of memory */) {
__builtin_longjmp(a->jmp_buf, 1);
}
// ...
}
bool example(..., arena scratch)
{
void *jmp_buf[5];
if (__builtin_setjmp(jmp_buf)) {
return 0;
}
scratch.jmp_buf = jmp_buf;
// ...
return 1;
}
example
returns failure to the caller if it runs out of memory, without
needing to check individual allocations and, thanks to the implicit free
of scratch arenas, without needing to clean up. If callees receiving the
scratch arena don’t set their own jmp_buf
, they’ll return here, too. In
a real program you’d probably wrap the setjmp
setup in a macro.
Suppose zeroing is too expensive or unnecessary in some cases. Add a flag to opt out:
void *alloc(..., int flags)
{
// ...
return flags&NOZERO ? p : memset(p, 0, total);
}
Similarly, perhaps there’s a critical moment where you’re holding a non-memory resource (lock, file handle), or you don’t want allocation failure to be fatal. In either case, it’s important that the out-of-memory policy isn’t invoked. You could request a “soft” failure with another flag, and then do the usual null pointer check:
void *alloc(..., int flags)
{
// ...
if (/* out of memory */) {
if (flags & SOFTFAIL) {
return 0;
}
abort();
}
// ...
}
Most non-trivial programs will probably have at least one of these flags.
In case it wasn’t obvious, allocating an arena is simple:
arena newarena(ptrdiff_t cap)
{
arena a = {0};
a.beg = malloc(cap);
a.end = a.beg ? a.beg+cap : 0;
return a;
}
Or make a direct allocation from the operating system, e.g. mmap
,
VirtualAlloc
. Typically arena lifetime is the whole program, so you
don’t need to worry about freeing it. (Since you’re using arenas, you can
also turn off any memory leak checkers while you’re at it.)
If you need more arenas then you can always allocate smaller ones out of the first! In multi-threaded applications, each thread may have at least its own scratch arena.
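Carving out a smaller arena is itself just one allocation. A sketch, where the helper name and the fixed 16-byte alignment are my choices:

arena subarena(arena *from, ptrdiff_t cap)
{
    arena a = {0};
    a.beg = alloc(from, 1, 16, cap);   // cap bytes out of the parent arena
    a.end = a.beg + cap;
    return a;
}

Each worker thread then gets its own arena, which it can pass around as scratch in the usual way.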
The new macro

I’ve shown alloc
, but few parts of the program should be calling it
directly. Instead they have a macro to automatically handle the details. I
call mine new
, though of course if you’re writing C++ you’ll need to
pick another name (make
? PushStruct
?):
#define new(a, t, n) (t *)alloc(a, sizeof(t), _Alignof(t), n)
The cast is an extra compile-time check, especially useful for avoiding
mistakes in levels of indirection. It also keeps normal code from directly
using the sizeof
operator, which is easy to misuse. If you added a
flags
parameter, pass in zero for this common case. Keep in mind that
the goal of this macro is to make common allocation simple and robust.
Often you’ll allocate single objects, and so the count is 1. If you think
that’s ugly, you could make a variadic version of new
that fills in common
defaults. In fact, that’s partly why I put count
last!
#define new(...) newx(__VA_ARGS__,new4,new3,new2)(__VA_ARGS__)
#define newx(a,b,c,d,e,...) e
#define new2(a, t) (t *)alloc(a, sizeof(t), alignof(t), 1, 0)
#define new3(a, t, n) (t *)alloc(a, sizeof(t), alignof(t), n, 0)
#define new4(a, t, n, f) (t *)alloc(a, sizeof(t), alignof(t), n, f)
Not quite so simple, but it optionally makes for more streamlined code:
thing *t = new(perm, thing);
thing *ts = new(perm, thing, 1000);
char *buf = new(perm, char, len, NOZERO);
Side note: If sizeof
should be avoided, what about array lengths? That’s
part of the problem! Hardly ever do you want the size of an array, but
rather the number of elements. That includes char
arrays where this
happens to be the same number. So instead, define a countof
macro that
uses sizeof
to compute the value you actually want. I like to have this
whole collection:
#define sizeof(x) (ptrdiff_t)sizeof(x)
#define countof(a) (sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)
Yes, you can convert sizeof
into a macro like this! It won’t expand
recursively and bottoms out as an operator. countof
also, of course,
produces a less error-prone signed count so users don’t fumble around with
size_t
. lengthof
statically produces null-terminated string length.
char msg[] = "hello world";
write(fd, msg, lengthof(msg));
#define MSG "hello world"
write(fd, MSG, lengthof(MSG));
alloc with attributes

At least for GCC and Clang, we can further improve alloc
with three
function attributes:
__attribute((malloc, alloc_size(2, 4), alloc_align(3)))
void *alloc(...);
malloc
indicates that the pointer returned by alloc
does not alias any
existing object. Enables some significant optimizations that are otherwise
blocked, most often by breaking potential loop-carried dependencies.
alloc_size
tracks the allocation size for compile-time diagnostics and
run-time assertions (__builtin_object_size
). This generally
requires a non-zero optimization level. In other words, you will get
compiler warnings about some out-of-bounds accesses of arena objects, and
with Undefined Behavior Sanitizer you’ll get run-time bounds checking.
It’s a great complement to fuzzing.
In theory alloc_align
may also allow better code generation, but I’ve
yet to observe a case. Consider it optional and low-priority. I mention it
only for completeness.
How large an arena should you allocate? The simple answer: As much as is necessary for the program to successfully complete. Usually the cost of untouched arena memory is low or even zero. Most programs should probably have an upper limit, at which point they assume something has gone wrong. Arenas allow this case to be handled gracefully, simplifying recovery and paving the way for continued operation.
While a sufficient answer for most cases, it’s unsatisfying. There’s a common assumption that programs should increase their memory usage as much as needed and let the operating system respond if it’s too much. However, if you’ve ever tried this yourself, you probably noticed that mainstream operating systems don’t handle it well. The typical results are system instability — thrashing, drivers crashing — possibly necessitating a reboot.
If you insist on this route, on 64-bit hosts you can reserve a gigantic
virtual address space and gradually commit memory as needed. On Linux that
means leaning on overcommit by allocating the largest arena possible at
startup, which will automatically commit through use. Use MADV_FREE
to
decommit.
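Here is a rough sketch of that Linux setup, assuming the two-pointer arena above. The 1TiB reservation is illustrative, and the MADV_FREE decommit path is omitted:

#include <sys/mman.h>

arena newhugearena(void)
{
    ptrdiff_t cap = (ptrdiff_t)1 << 40;   // reserve 1TiB of address space
    char *mem = mmap(0, cap, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    arena a = {0};
    if (mem != MAP_FAILED) {
        a.beg = mem;
        a.end = mem + cap;   // pages are committed lazily on first touch
    }
    return a;
}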
On Windows, VirtualAlloc
handles reserve and commit separately. In
addition to the allocation offset, you need a commit offset. Then expand
the committed region ahead of the allocation offset as it grows. If you
ever manually reset the allocation offset, you could decommit as well, or
at least MEM_RESET
. At some point commit may fail, which should then
trigger the out-of-memory policy, but the system is probably in poor shape
by that point — i.e. use an abort policy to release it all quickly.
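A minimal Windows sketch along those lines. The commit field, chunk size, and helper name are my own, and the decommit/MEM_RESET path is left out:

#include <stddef.h>
#include <stdlib.h>
#include <windows.h>

typedef struct {
    char *beg, *end;   // allocation offset and reserved end, as before
    char *commit;      // end of the committed region
} varena;

// Commit, in 1MiB steps, up to the requested allocation offset.
static void commitupto(varena *a, char *needed)
{
    if (needed > a->commit) {
        ptrdiff_t chunk = 1 << 20;
        ptrdiff_t len = (needed - a->commit + chunk - 1) & ~(chunk - 1);
        if (!VirtualAlloc(a->commit, len, MEM_COMMIT, PAGE_READWRITE)) {
            abort();   // commit failed: invoke the out-of-memory policy
        }
        a->commit += len;
    }
}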
While allocations out of an arena don’t require individual error checks,
allocating the arena itself at startup requires error handling. It would
be nice if the arena could be allocated out of .bss
and punt that job to
the loader. While you could make a big, global char[]
array to back
your arena, it’s technically not permitted (strict aliasing). A “clean”
.bss
region could be obtained with a bit of assembly — .comm
plus assembly to get the address into C without involving an array. I
wanted a more portable solution, so I came up with this:
arena getarena(void)
{
static char mem[1<<28];
arena r = {0};
r.beg = mem;
asm ("" : "+r"(r.beg)); // launder the pointer
r.end = r.beg + countof(mem);
return r;
}
The asm
accepts a pointer and returns a pointer ("+r"
). The compiler
cannot “see” that it’s actually empty, and so returns the same pointer.
The arena will be backed by mem
, but by laundering the address through
asm
, I’ve disconnected the pointer from its origin. As far as the compiler
is concerned, this is some foreign, assembly-provided pointer, not a
pointer into mem
. It can’t optimize away mem
because it’s been given
to a mysterious assembly black box.
While inappropriate for a real project, I think it’s a neat trick.
In my initial example I used a linked list to store lines. This data structure is great with arenas. It only takes a few lines of code to implement a linked list on top of an arena, and no “destroy” code is needed. Simple.
What about arena-backed associative arrays? Or arena-backed dynamic arrays? See these follow-up articles for details!
When your program declares a function and you tell the link editor (ld,
link) that a DLL
exports a symbol with that name (import library), it matches the declared
name with this export, and it becomes an import in your program’s import
table. What happens when two different DLLs export the same symbol? The
link editor will pick the first found. But what if you want to use both
exports? If they have the same name, how could program or link editor
distinguish them? In this article I’ll demonstrate a technique to resolve
this by creating a program which links with and directly uses two
different C runtimes (CRTs) simultaneously.
In PE executable images, an import isn’t just a symbol, but a tuple
of DLL name and symbol. For human display, a tuple is typically formatted
with an exclamation point delimiter, as in msvcrt.dll!malloc
, though
sometimes without the .dll
suffix. You’ve likely seen this in stack
traces. Because it’s a tuple and not just a symbol, it’s possible to refer
to, and import, the same symbol from different DLLs. Contrast that with
ELF, which has a list of shared objects, and a separate list of symbols,
with the dynamic linker pairing them up at load time. That permits cool
tricks like LD_PRELOAD
, but for the same reason loading is less
predictable.
Windows comes with several CRTs, and various libraries and applications
use one or another (or none) depending on how they were built. As
C standard library implementations they export mostly the same symbols,
malloc
, printf
, etc. With imports as tuples, it’s not so unusual for
an application to load multiple CRTs at once. Typically coexistence is
transitive. That is, a module does not directly access both CRTs but
depends on modules that use different CRTs. One module calls, say,
msvcrt.dll!malloc
, and another module calls ucrtbase.dll!malloc
. With
DLL-qualified symbols, this is sound so long as modules don’t cross the
streams, e.g. an allocation in one module must not be freed in the other.
Libraries in this ecosystem must avoid exposing their CRT through their
interfaces, such as expecting the library’s caller to free()
objects:
The caller might not have access to the right free
!
Contrast again with the unix ecosystem generally, where a process can only
load one libc and everyone is expected to share. Libraries commonly expect
callers to free()
their objects (e.g. libreadline, xcb),
blending their interface with libc.
Suppose you’re in such a situation where, due to unix-oriented libraries,
your application must use functions from two different CRTs at once. One
might have been compiled with Mingw-w64 and linked with MSVCRT, and the
other compiled with MSVC and linked with UCRT. We need to call malloc
and free
in each, but they have the same name. What a pickle!
There’s an obvious, and probably most common, solution: run-time dynamic linking. Use load-time linking on one CRT, and LoadLibrary on the other CRT with GetProcAddress to obtain function pointers. However, it’s possible to do this entirely with load-time linking!
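For contrast, the run-time route might look like this sketch, a hypothetical helper rather than the approach the rest of this article takes:

#include <stddef.h>
#include <windows.h>

static void *(*ucrt_malloc)(size_t);

static void init_ucrt(void)
{
    HMODULE ucrt = LoadLibraryA("ucrtbase.dll");
    ucrt_malloc = (void *(*)(size_t))GetProcAddress(ucrt, "malloc");
}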
Think about it a moment and you might wonder: If the names are the same,
how can I pick which I’m calling? The tuple representation won’t work
because !
cannot appear in an identifier, which is, after all, why it
was chosen. The trick is that we’re going to rename one of them! To
demonstrate, I’ll use my Windows development kit, w64devkit, a
Mingw-w64 distribution that links MSVCRT. I’m going to use UCRT as the
second CRT to access ucrtbase.dll!malloc
.
I can choose whatever valid identifier I’d like, so I’m going to pick
ucrt_malloc
. This will require a declaration:
__declspec(dllimport) void *ucrt_malloc(size_t);
If I stop here and try to use it, of course it won’t work:
ld: undefined reference to `__imp_ucrt_malloc'
The linker hasn’t yet been informed of the change in management. For that
we’ll need an import library. I’ll define one using a .def file,
which I’ll name ucrtbase.def
:
LIBRARY ucrtbase.dll
EXPORTS
ucrt_malloc == malloc
The last line says that this library has the symbol ucrt_malloc
, but
that it should be imported as malloc
. This line is the lynchpin to the
whole scheme. Note: The double equals is important, as a single equals
sign means something different. Next, use dlltool
to build the import
library:
$ dlltool -d ucrtbase.def -l ucrtbase.lib
The equivalent MSVC tool is lib
, but as far as I know it cannot
quite do this sort of renaming. However, MSVC link
will work just fine
with this dlltool
-created import library. The name ucrtbase.lib
, while
obvious, is irrelevant. It’s that LIBRARY
line that ties it to the DLL.
My test source file looks like this:
#include <stdlib.h>
__declspec(dllimport) void *ucrt_malloc(size_t);
int main(void)
{
void *msvcrt[] = {malloc(1), malloc(1), malloc(1)};
void *ucrt[] = {ucrt_malloc(1), ucrt_malloc(1), ucrt_malloc(1)};
return 0;
}
It compiles successfully:
$ cc -g3 -o main.exe main.c ucrtbase.lib
I can see the two malloc
imports with objdump
:
$ objdump -p main.exe
...
DLL Name: msvcrt.dll
...
844a 1021 malloc
...
DLL Name: ucrtbase.dll
847e 1 malloc
It loads and runs successfully, too:
$ gdb main.exe
Reading symbols from main.exe...
(gdb) break 9
Breakpoint 1 at 0x1400013cd: file main.c, line 9.
(gdb) run
Thread 1 hit Breakpoint 1, main () at main.c:9
9 return 0;
(gdb) p msvcrt
$1 = {0xd06a30, 0xd06a70, 0xd06ab0}
(gdb) p ucrt
$2 = {0x6e9490, 0x6eb7c0, 0x6eb800}
The pointer addresses confirm that these are two, distinct allocators. Perhaps you’re wondering what happens if I cross the streams?
int main(void)
{
free(ucrt_malloc(1));
}
The MSVCRT allocator justifiably panics over the bad pointer:
$ cc -g3 -o chaos.exe chaos.c ucrtbase.lib
$ gdb -ex run chaos.exe
Starting program: chaos.exe
warning: HEAP[chaos.exe]:
warning: Invalid address specified to RtlFreeHeap
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffc42c369af in ntdll!RtlRegisterSecureMemoryCacheCallback ()
(gdb)
While you’re probably not supposed to meddle with ucrtbase.dll
like
this, the general principle of export renames is reasonable. I don’t
expect I’ll ever need to do it, but I like that I have the option.
Win32 has two interfaces for interacting with environment variables: GetEnvironmentVariable/SetEnvironmentVariable, and GetEnvironmentStrings/FreeEnvironmentStrings.
The first, which I’ll call get/set, is the easy interface, with Windows doing all the searching and sorting on your behalf. It’s also the only supported interface through which a process can manipulate its own variables. It has no function for enumerating variables.
The second, which I’ll call get/free, allocates a copy of the environment block. Calls to get/set do not modify existing copies. Similarly, manipulating this block has no effect on the environment as viewed through get/set. In other words, it’s read only. We can enumerate our environment variables by walking the environment block. As I will discuss below, enumeration is its only consistently useful purpose!
Technically it’s possible to access the actual environment block through undocumented fields in the PEB. It’s the same content as returned by get/free except that it’s not a copy. It cannot be accessed safely, so I’m ignoring this route.
The environment block format is a null-terminated block of null-terminated strings:
keyA=a\0keyBB=bb\0keyCCC=ccc\0\0
Each string begins with a character other than =
and contains at least
one =
. In my tests this rule was strictly enforced by Windows, and I
could not construct an environment block that broke this rule. This list
is usually, but not always, sorted. It may contain repeated variables, but
they’re always assigned the same value, which is also strictly enforced by
Windows.
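For example, enumerating variables is a simple walk over such a block using the get/free interface. A sketch with the wide-character functions, error handling elided:

#include <stdio.h>
#include <wchar.h>
#include <windows.h>

void dumpenv(void)
{
    wchar_t *env = GetEnvironmentStringsW();
    for (wchar_t *var = env; *var; var += wcslen(var) + 1) {
        wprintf(L"%ls\n", var);
    }
    FreeEnvironmentStringsW(env);
}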
The get/free interface has no “set” function, and a process cannot set its own environment block to a custom buffer. There is one interface where a process gets to provide a raw environment block: CreateProcess. That is, a parent can construct one for its children.
wchar_t env[] = L"HOME=C:\\Users\\me\0PATH=C:\\bin;C:\\Windows\0";
CreateProcessW(L"example.exe", ..., env, ...);
Windows imposes some rules upon this environment block:
If an element begins with =
or does not contain =
, CreateProcess
fails.
Repeated variables are modified to match the first instance. If you’re potentially overriding using a duplicate, put the override first.
Some cases of bad formatting become memory access violations.
As usual for Win32, there are no rules against ill-formed UTF-16, and I could always pass such “UTF-16” through into the child environment block. Keep that in mind even when using the get/set interface.
The SetEnvironmentVariable documentation gives a maximum variable size:
The maximum size of a user-defined environment variable is 32,767 characters. There is no technical limitation on the size of the environment block.
At least on more recent versions of Windows, my experiments proved exactly the opposite. There is no limit on the size of a user-defined environment variable, but environment blocks are limited to 2GiB, for both 32-bit and 64-bit processes. I could even create such huge environments in large address aware 32-bit processes, though the interfaces are prone to error due to allocation problems.
There’s one special case where CreateProcess is illogical, and it’s
certainly a case of confusion within its implementation. An environment
block is not allowed to be empty. An empty environment is represented as
a block containing one empty (zero length) element. That is, two null
terminators in a row. It’s the one case where an environment block may
contain an element without a =
. The logical empty environment block
would be just one null terminator, to terminate the block itself, because
it contains no variables. You can safely pretend that’s the case when
parsing an environment block, as this special case is superfluous.
However, CreateProcess partially enforces this silly, unnecessary special case! If an environment block begins with a null terminator, the next character must be in a mapped memory region because it will read this character. If it’s not mapped, the result is a memory access violation. Its actual value doesn’t matter, and CreateProcess will treat it as though it was another null terminator. Surely someone at Microsoft would have noticed by now that this behavior makes no sense, but I guess it’s kept for backwards compatibility?
The CreateProcess documentation says that “the system uses a sorted environment” but this made no difference in my tests. The word “must” appears in this sentence, but it’s unclear if it applies to sorting, or even outside the special case being discussed. GetEnvironmentVariable works fine on an unsorted environment block. SetEnvironmentVariable maintains sorting, but given an unsorted block it goes somewhere in the middle, probably wherever a bisection happens to land. Perhaps look-ups in sorted blocks are faster, but environment blocks are so small — a maximum of 32K characters — that, in practice, it really does not matter.
Suppose you’re meticulous and want to sort your environment block before spawning a process. How do you go about it? There’s the rub: The official documentation is incomplete! The Changing Environment Variables page says:
All strings in the environment block must be sorted alphabetically by name. The sort is case-insensitive, Unicode order, without regard to locale.
What do they mean by “case-insensitive” sort? Does “Unicode order” mean
case folding? A reasonable guess, but no, that’s not how get/set
works. Besides, how does “Unicode order” apply to ill-formed UTF-16?
Worse, get/set sorting is certainly not “Unicode order” even outside of
case-insensitivity! For example, U+1F31E
(SUN WITH FACE) sorts ahead of
U+FF01
(FULLWIDTH EXCLAMATION MARK) because the former encodes in UTF-16
as U+D83C U+DF1E
. Maybe it’s case-insensitive only in ASCII? Nope, π
(U+03C0
) and Π (U+03A0
) are considered identical. Windows uses some
kind of case-insensitive, but not case-folded, undocumented early 1990s
UCS-2 sorting logic for environment variables.
Update: John Doty suspects the RtlCompareUnicodeString function for sorting. It lines up perfectly with get/set for all possible inputs.
Without better guidance, the only reliable way to “correctly” sort an environment block is to build it with get/set, then retrieve the result with get/free. The algorithm looks like:
1. Delete every variable in the current environment using get/set.
2. Set each desired variable using get/set.
3. Retrieve the now-sorted, de-duplicated block using get/free.
4. Restore the original environment.
Unfortunately that’s all global state, so you can only construct one new environment block at a time.
If you know all your variable names ahead of time, then none of this is a problem. Determine what Windows thinks the order should be, then use that in your program when constructing the environment block. It’s the general case where this is a challenge, such as a language runtime designed to operate on arbitrary environment variables with behavior congruent to the rest of the system.
There are similar issues with looking up variables in an environment block. How does case-insensitivity work? Sorting is “without regard to locale” but what about when comparing variable names? The documentation doesn’t say. When enumerating variables using get/free, you might read what get/set considers to be duplicates, though at least values will always agree with get/set, i.e. they’re aliases of one variable. Windows maintains that invariant in my tests. The above algorithm would also delete these duplicates.
For example, if someone passed you a “dirty” environment with duplicates, or one that was unsorted, the following would clean it up in a way that allows get/free to be traversed in order without duplicates.
wchar_t *env = GetEnvironmentStringsW();
// Clear out the environment
for (wchar_t *var = env; *var;) {
size_t len = wcslen(var);
size_t split = wcscspn(var, L"=");
var[split] = 0;
SetEnvironmentVariableW(var, 0);
var[split] = '=';
var += len + 1;
}
// Restore the original variables
for (wchar_t *var = env; *var;) {
size_t len = wcslen(var);
size_t split = wcscspn(var, L"=");
var[split] = 0;
SetEnvironmentVariableW(var, var+split+1);
var += len + 1;
}
FreeEnvironmentStringsW(env);
On the second pass, SetEnvironmentVariableW will gobble up all the duplicates.
As a final note, the CreateProcess page had said this up until February 2023 about the environment block parameter:
If this parameter is
NULL
and the environment block of the parent process contains Unicode characters, you must also ensure thatdwCreationFlags
includesCREATE_UNICODE_ENVIRONMENT
.
That seems to indicate it’s virtually always wrong to call CreateProcess without that flag — that is, Windows will trash the child’s environment unless this flag is passed — which is a bonkers default. Fortunately this appears to be wrong, which is probably why the documentation was finally corrected (after several decades). Omitting this flag was fine under all my tests, and I was unable to produce surprising behavior on any system.
In summary:
Elements must contain = and must not begin with =
CREATE_UNICODE_ENVIRONMENT
is necessary only for a non-null environment
, using an integer.
We’ll need only three basic atomic operations — store, load, and increment
— and futex wait/wake. It will be zero-initialized and the entire source
small enough to fit on an old-fashioned terminal display. The interface
will also get an overhaul, more to my own tastes.
If you’d like to skip ahead: once.c
What’s the purpose? Suppose a concurrent program requires initialization, but has no definite moment to do so. Threads are already in motion, and it’s unpredictable which will arrive first, and when. It might be because this part of the program is loaded lazily, or initialization is expensive and only done lazily as needed. A “once” object is a control allowing the first arrival to initialize, and later arrivals to wait until initialization done.
The pthread version has this interface:
pthread_once_t once = PTHREAD_ONCE_INIT;
int pthread_once(pthread_once_t *, void (*init)(void));
It’s deliberately quite limited, and the specification refers to it merely as “dynamic package initialization.” That is, it’s strictly for initializing global package data, not individual objects, and a “once” object must be a static variable, not dynamically allocated. Also note the lack of context pointer for the callback. No pthread implementation I examined was actually so restricted, but the specification is written for the least common denominator, and the interface is clearly designed against more general use.
An example of lazily static table initialization for a cipher:
// Blowfish subkey tables (constants)
static uint32_t blowfish_p[20];
static uint32_t blowfish_s[256];
static pthread_once_t once = PTHREAD_ONCE_INIT;
static void init(void)
{
// ... populate blowfish_p and blowfish_s with pi ...
}
void blowfish_encrypt(struct blowfish *ctx, void *buf, size_t len)
{
pthread_once(&once, init);
// ... lookups into blowfish_p and blowfish_s ...
}
The pthread_once
allows blowfish_encrypt
to be called concurrently (on
different context objects). The first call populates lookup tables and
others wait as needed. A good pthread_once
will speculate initialization
has already completed and make that the fast path. The tables do not
require locks or atomics because pthread_once
establishes a
synchronization edge: initialization happens-before the return from
pthread_once
.
Go’s sync.Once
has a similar interface:
func (o *Once) Do(f func())
It’s more flexible and not restricted to global data, but retains the callback interface.
Callbacks are clunky, especially without closures, so in my re-imagining I wanted to remove the callback from the interface. Instead I broke it out into separate entry and exit calls. The code in between takes the place of the callback, and it runs in its original context.
_Bool do_once(int *);
void once_done(int *);
This is similar to breaking “push” and “pop” each into two steps in my
concurrent queue. do_once
returns true if initialization is required,
otherwise it returns false after initialization has completed, i.e. it
blocks. The initializing thread signals that initialization is complete by
calling once_done
. As mentioned, the “once” object would be
zero-initialized. Reworking the above example:
// Blowfish subkey tables (constants)
static uint32_t blowfish_p[20];
static uint32_t blowfish_s[256];
static int once = 0;
void blowfish_encrypt(struct blowfish *ctx, void *buf, size_t len)
{
if (do_once(&once)) {
// ... populate blowfish_p and blowfish_s with pi ...
once_done(&once);
}
// ... lookups into blowfish_p and blowfish_s ...
}
It gets more interesting when taken beyond global initialization. Here each object is lazily initialized by the first thread to use it:
typedef struct {
int once;
// ...
} Thing;
static void expensive_init(Thing *, ptrdiff_t);
static double compute(Thing *t, ptrdiff_t index)
{
if (do_once(&t->once)) {
expensive_init(t, index);
once_done(&t->once);
}
// ...
}
int main(void)
{
// ...
Thing *things = calloc(1000000, sizeof(Thing));
#pragma omp parallel for
for (int i = 0; i < iterations; i++) {
ptrdiff_t which = random_access(i);
double r = compute(&things[which], which);
// ...
}
// ...
}
A “once” object must express at least these three states:
1. Initialization has not yet started.
2. Initialization is in progress.
3. Initialization is complete.
To support zero-initialization, (1) must map into zero. A thread observing (1) must successfully transition to (2) before attempting to initialize. A thread observing (2) must wait for a transition to (3). Observing (3) is the fast path, and the implementation should optimize for it.
The trickiest part is the state transition from (1) to (2). If multiple threads are attempting the transition concurrently, only one should “win”. The obvious choice is a compare-and-swap atomic, which will fail if another thread has already made the transition. However, with a more careful selection of state representation, we can do this with just an atomic increment!
The secret sauce: (2) will be any positive value and (3) will be any negative value. The “winner” is the thread that increments from zero to one. Other threads that also observed zero will increment to a different value, after which they behave as though they did not observe (1) in the first place.
I chose shorthand names for the three atomic and two futex operations. Each can be defined with a single line of code — the atomics with compiler intrinsics and the futex with system calls, as they interact with the system scheduler. (See the “four elements” of the wait group article.) Technically it will still work correctly if the futex calls are no-ops, though it would waste time spinning on the slow path. In a real program you’d probably use less pithy names.
static int load(int *);
static void store(int *, int);
static int incr(int *);
static void wait(int *, int);
static void wake(int *);
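For concreteness, here is one way those five operations might be defined on Linux with GCC or Clang. This is a sketch only; the article’s once.c may differ in details such as memory orders:

#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int  load(int *p)         { return __atomic_load_n(p, __ATOMIC_ACQUIRE); }
static void store(int *p, int v) { __atomic_store_n(p, v, __ATOMIC_RELEASE); }
static int  incr(int *p)         { return __atomic_add_fetch(p, 1, __ATOMIC_ACQ_REL); }
static void wait(int *p, int v)  { syscall(SYS_futex, p, FUTEX_WAIT, v, 0, 0, 0); }
static void wake(int *p)         { syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0); }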
From here it’s useful to work backwards, starting with once_done
,
because there’s an important detail, another secret sauce ingredient:
void once_done(int *once)
{
store(once, INT_MIN);
wake(once);
}
Recall that the “initialized” state (3) is negative. We don’t just pick
any arbitrary negative, especially not the obvious -1, but the most
negative value. Keep that in mind. Once set, wake up any waiters. Since
this is the slow path, we don’t care to avoid the system call if there are
no waiters. Now do_once
:
_Bool do_once(int *once)
{
int r = load(once);
if (r < 0) {
return 0;
} else if (r == 0) {
r = incr(once);
if (r == 1) {
return 1;
}
}
while (r > 0) {
wait(once, r);
r = load(once);
}
return 0;
}
First, check for the fast path. If we’re already in state (3), return
immediately. If do_once
will be placed in a separate translation unit
from the caller, we might extract this check such that it can be inlined
at the call site. Once initialization has settled, nobody will be mutating
*once
, so this will be a fast, uncontended atomic load, though mind your
cache lines for false sharing.
If we’re in state (1), try to transition to state (2). If we incremented
to 1, we won so tell the caller to initialize. Otherwise continue as
though we never saw state (1). There’s an important subtlety easy to miss:
Initialization may have already completed before the increment. That is,
*once
may have been negative for the increment! Fortunately since we
chose INT_MIN
in once_done
, it will stay negative. (Assuming you
have less than 2 billion threads contending *once
. Ha!) So it’s vital to
check r
again for negative after the increment, hence while
instead of
do while
.
Losers continuing to increment *once
may interfere with the futex wait,
but, again, this is the slow path so that’s fine. Eventually we will wake
up and observe (3), then give control back to the caller.
That’s all there is to it. If you haven’t already, check out the source
including tests for Windows and Linux: once.c
. Suggested
experiments to try, particularly under a debugger:
Change INT_MIN to -1.
Change while (r > 0) { ... } to do { ... } while (r > 0);.
Increase the number of threads (NTHREADS).

The Two Sum exercise, restated:
Given an integer array and target, return the distinct indices of two elements that sum to the target.
In particular, the solution doesn’t find elements, but their indices. The exercise also constrains input ranges — important but easy to overlook:
2 <= count <= 10^4
-10^9 <= nums[i] <= 10^9
-10^9 <= target <= 10^9
Notably, indices fit in a 16-bit integer with lots of room to spare. In fact, they fit in a 14-bit address space (16,384) with still plenty of overhead. Elements fit in a signed 32-bit integer, and we can add and subtract elements without overflow, if just barely. The last constraint isn’t redundant, but it’s not readily exploitable either.
The naive solution is to linearly search the array for the complement. With nested loops, it’s obviously quadratic time. At 10k elements, we expect an abysmal 25M comparisons on average.
int16_t count = ...;
int32_t *nums = ...;
for (int16_t i = 0; i < count-1; i++) {
for (int16_t j = i+1; j < count; j++) {
if (nums[i]+nums[j] == target) {
// found
}
}
}
The nums
array is “keyed” by index. It would be better to also have the
inverse mapping: key on elements to obtain the nums
index. Then for each
element we could compute the complement and find its index, if any, using
this second mapping.
The input range is finite, so an inverse map is simple. Allocate an array, one element per integer in range, and store the index there. However, the input range is 2 billion, and even with 16-bit indices that’s a 4GB array. Feasible on 64-bit hosts, but wasteful. The exercise is certainly designed to make it so. This array would be very sparse, with at most 0.0005% of its elements populated. That’s a hint: Associative arrays are far more appropriate for representing such sparse mappings. That is, a hash table.
Using Go’s built-in hash table:
func TwoSumWithMap(nums []int32, target int32) (int, int, bool) {
seen := make(map[int32]int16)
for i, num := range nums {
complement := target - num
if j, ok := seen[complement]; ok {
return int(j), i, true
}
seen[num] = int16(i)
}
return 0, 0, false
}
In essence, the hash table folds the sparse 2 billion element array onto a smaller array, with collision resolution when elements inevitably land in the same slot. For this exercise, that small array could be as small as 10,000 elements because that’s the most we’d ever need to track. For folding the large key space onto the smaller, we could use modulo. For collision resolution, we could keep walking the table.
int16_t seen[10000] = {0};
// Find or insert nums[index].
int16_t lookup(int32_t *nums, int16_t index)
{
int i = (uint32_t)nums[index] % 10000; // unsigned cast avoids a negative index for negative elements
for (;;) {
int16_t j = seen[i] - 1; // unbias
if (j < 0) { // empty slot
seen[i] = index + 1; // insert biased index
return -1;
} else if (nums[j] == nums[index]) {
return j; // match found
}
i = (i + 1) % 10000; // keep looking
}
}
Take note of a few details:
1. An empty slot is zero, and an empty table is a zero-initialized array. Since zero is a valid value, and all values are non-negative, values are biased by 1 in the table.
2. The nums array is part of the table structure, necessary for lookups. The two mappings — element-by-index and index-by-element — share structure.
3. It uses open addressing with linear probing, and so walks the table until it either finds the element or hits an empty slot.
4. The "hash" function is modulo. If inputs are not random, they'll tend to bunch up in the table. Combined with linear probing, this makes for lots of collisions. For the worst case, imagine sequentially ordered inputs.
5. Sometimes the table will almost completely fill, and lookups will be no better than the linear scans of the naive solution.
6. Most subtle of all: This hash table is not enough for the exercise. The keyed-on element may not even be in nums, and when lookup fails, that element is not inserted in the table. Instead, a different element is inserted. The conventional solution has at least two hash table lookups. In the Go code, it's seen[complement] for lookups and seen[num] for inserts.
To solve (4) we’ll use a hash function to more uniformly distribute elements in the table. We’ll also probe the table in a random-ish order that depends on the key. In practice there will be little bunching even for non-random inputs.
To solve (5) we'll use a larger table: 2^14 or 16,384 elements. This has breathing room, and with a power of two we can use a fast mask instead of a slow division (though in practice, compilers usually implement division by a constant denominator with modular multiplication).
To solve (6) we’ll key complements together under the same key. It looks for the complement, but on failure it inserts the current element in the empty slot. In other words, this solution will only need a single hash table lookup per element!
Laying down some groundwork:
typedef struct {
int16_t i, j;
_Bool ok;
} TwoSum;
TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
TwoSum r = {0};
int16_t seen[1<<14] = {0};
for (int16_t n = 0; n < count; n++) {
// ...
}
return r;
}
The seen
array is a 32KiB hash table large enough for all inputs, small
enough that it can be a local variable. In the loop:
int32_t complement = target - nums[n];
int32_t key = complement>nums[n] ? complement : nums[n];
uint32_t hash = key * 489183053u;
unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
unsigned step = hash>>13 | 1;
Compute the complement, then apply a “max” operation to derive a key. Any commutative operation works, though obviously addition would be a poor choice. XOR is similar enough to cause many collisions. Multiplication works well, and is probably better if the ternary produces a branch.
The hash function is multiplication with a randomly-chosen prime.
As we’ll see in a moment, step
will also be added to the hash before use.
The initial index will be the bottom 14 bits of this hash. For step
,
recall from the MSI article that it must be odd so that every slot is
eventually probed. I shift out 13 bits and then override the 14th bit, so
step
effectively skips over the 14 bits used for the initial table
index.
I used unsigned
because I don’t really care about the width of the hash
table index, but more importantly, I want defined overflow from all the
bit twiddling, even in the face of implicit promotion. As a bonus, it can
help in reasoning about indirection: seen
indices are unsigned
, nums
indices are int16_t
.
for (unsigned i = hash;;) {
i = (i + step) & mask;
int16_t j = seen[i] - 1; // unbias
if (j < 0) {
seen[i] = n + 1; // bias and insert
break;
} else if (nums[j] == complement) {
r.i = j;
r.j = n;
r.ok = 1;
return r;
}
}
The step is added before using the index the first time, helping to scatter the start point and reduce collisions. If it’s an empty slot, insert the current element, not the complement — which wouldn’t be possible anyway. Unlike conventional solutions, this doesn’t require another hash and lookup. If it finds the complement, problem solved, otherwise keep going.
Putting it all together, it’s only slightly longer than solutions using a generic hash table:
TwoSum twosum(int32_t *nums, int16_t count, int32_t target)
{
TwoSum r = {0};
int16_t seen[1<<14] = {0};
for (int16_t n = 0; n < count; n++) {
int32_t complement = target - nums[n];
int32_t key = complement>nums[n] ? complement : nums[n];
uint32_t hash = key * 489183053u;
unsigned mask = sizeof(seen)/sizeof(*seen) - 1;
unsigned step = hash>>13 | 1;
for (unsigned i = hash;;) {
i = (i + step) & mask;
int16_t j = seen[i] - 1; // unbias
if (j < 0) {
seen[i] = n + 1; // bias and insert
break;
} else if (nums[j] == complement) {
r.i = j;
r.j = n;
r.ok = 1;
return r;
}
}
}
return r;
}
Applying this technique to Go:
func TwoSumWithBespoke(nums []int32, target int32) (int, int, bool) {
var seen [1 << 14]int16
for n, num := range nums {
complement := target - num
hash := int(num * complement * 489183053)
mask := len(seen) - 1
step := hash>>13 | 1
for i := hash; ; {
i = (i + step) & mask
j := int(seen[i] - 1) // unbias
if j < 0 {
seen[i] = int16(n) + 1 // bias
break
} else if nums[j] == complement {
return j, n, true
}
}
}
return 0, 0, false
}
With Go 1.20 this is an order of magnitude faster than map[int32]int16
,
which isn’t surprising. I used multiplication as the key operator because,
in my first take, Go produced a branch for the “max” operation — at a 25%
performance penalty on random inputs.
A full-featured, generic hash table may be overkill for your problem, and a bit of hashed indexing with collision resolution over a small array might be sufficient. The problem constraints might open up such shortcuts.
windows.h
. This header has an enormous
number of definitions and declarations and so, for C programs, it tends to
dominate the build time of those translation units. Most programs,
especially systems software, only need a tiny portion of it. For example,
when compiling u-config with GCC, two thirds of the debug build was
spent processing windows.h
just for 4 types, 16 definitions, and 16
prototypes.
To give a sense of the numbers, here’s empty.c
, which does nothing but
include windows.h
.
#include <windows.h>
With the current Mingw-w64 headers, that’s ~82kLOC (non-blank):
$ gcc -E empty.c | grep -vc '^$'
82041
With w64devkit this takes my system ~450ms to compile with GCC:
$ time gcc -c empty.c
real 0m 0.45s
user 0m 0.00s
sys 0m 0.00s
Compiling an actually empty source file takes ~10ms, so it really is
spending practically all that time processing headers. MSVC is a faster
compiler, and this extends to processing an even larger windows.h
that
crosses over 100kLOC (VS2022). It clocks in at 120ms on the same system:
$ cl /nologo /E empty.c | grep -vc '^$'
empty.c
100944
$ time cl /nologo /c empty.c
empty.c
real 0m 0.12s
user 0m 0.09s
sys 0m 0.01s
That’s just low enough to be tolerable, but I’d like the situation with
GCC to be better. Defining WIN32_LEAN_AND_MEAN
reduces the number of
included headers, which has a significant effect:
$ gcc -E -DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
55025
$ time gcc -c -DWIN32_LEAN_AND_MEAN empty.c
real 0m 0.30s
user 0m 0.00s
sys 0m 0.00s
$ cl /nologo /E /DWIN32_LEAN_AND_MEAN empty.c | grep -vc '^$'
empty.c
41436
$ time cl /nologo /c /DWIN32_LEAN_AND_MEAN empty.c
empty.c
real 0m 0.07s
user 0m 0.01s
sys 0m 0.01s
The official solution is precompiled headers. Put all the system header
includes, or similar, into a dedicated header, then compile that
header into a special format. For example, headers.h
:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
Then main.c
includes windows.h
through this header:
#include "headers.h"
int mainCRTStartup(void)
{
return 0;
}
If I ask GCC to compile headers.h
:
$ gcc headers.h
It produces headers.h.gch
. When a source includes headers.h
, GCC first
searches for an appropriate .gch
. Not only must the name match, but so
must all the definitions at the moment of inclusion: headers.h
should
always be the first included header, otherwise it may not work. Now when I
compile main.c
:
$ time gcc -c main.c
real 0m 0.04s
user 0m 0.00s
sys 0m 0.00s
Much better! MSVC has a conventional name for this header recognizable to
every Visual Studio user: stdafx.h
. It works a bit differently, and I’ve
never used it myself, but I trust it has similar results.
Precompiled headers require some extra steps that vary by toolchain. Can we do better? That depends on your definition of "better!"
As mentioned, systems software tends to need only a few declarations: open, read, write, stat, etc. What if I wrote these out manually? A bit tedious, but it doesn’t require special precompiled header handling. It also creates some new possibilities. To illustrate, a CRT-free “hello world” program:
#include <windows.h>
int mainCRTStartup(void)
{
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
char message[] = "Hello, world!\n";
DWORD len;
return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}
This takes my system half a second to compile — quite long to produce just 26 assembly instructions:
$ time cc -nostartfiles -o hello.exe hello.c
real 0m 0.50s
user 0m 0.00s
sys 0m 0.00s
$ ./hello.exe
Hello, world!
The program requires prototypes only for GetStdHandle and WriteFile, a
definition for STD_OUTPUT_HANDLE
, and some typedefs. Starting with the
easy stuff, the definition and types look like this:
#define STD_OUTPUT_HANDLE ((DWORD)-11)
typedef int BOOL;
typedef void *HANDLE;
typedef unsigned long DWORD;
By the way, here’s a cheat code for quickly finding preprocessor definitions, faster than looking them up elsewhere:
$ echo '#include <windows.h>' | gcc -E -dM - | grep 'STD_\w*_HANDLE'
#define STD_INPUT_HANDLE ((DWORD)-10)
#define STD_ERROR_HANDLE ((DWORD)-12)
#define STD_OUTPUT_HANDLE ((DWORD)-11)
Did you catch the pattern? It’s -10 - fd
, where fd
is the conventional
unix file descriptor number: a kind of mnemonic.
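A tiny illustration of that mnemonic, as a hypothetical macro (not part of the Windows headers):
// Map a unix-style file descriptor to its Win32 standard handle constant.
#define STD_HANDLE(fd) ((DWORD)(-10 - (fd)))
// STD_HANDLE(0) == STD_INPUT_HANDLE
// STD_HANDLE(1) == STD_OUTPUT_HANDLE
// STD_HANDLE(2) == STD_ERROR_HANDLE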
Prototypes are a little trickier, especially if you care about 32-bit. The Windows API uses the “stdcall” calling convention, which is distinct from the “cdecl” calling convention on x86, though the same on x64. Of course, you must already be aware of this merely using the API, as your own callbacks must usually be stdcall themselves. Further, API functions are DLL imports and should be declared as such. Putting it together, here’s GetStdHandle:
__declspec(dllimport)
HANDLE __stdcall GetStdHandle(DWORD);
This works with both Mingw-w64 and MSVC. MSVC requires __stdcall
between
the return type and function name, so don’t get clever about it. If you
only care about GCC then you can declare both at once using attributes:
HANDLE GetStdHandle(DWORD)
__attribute__((dllimport,stdcall));
I like to hide all this behind a macro, with a “table” of all my imports listed just below:
#define W32(r) __declspec(dllimport) r __stdcall
W32(HANDLE) GetStdHandle(DWORD);
W32(BOOL) WriteFile(HANDLE, const void *, DWORD, DWORD *, void *);
In WriteFile you may have noticed I’m taking shortcuts. The “official”
definition uses an ugly pointer typedef, LPCVOID
, instead of pointer
syntax, but I skipped that type definition. I also replaced the last
argument, an OVERLAPPED
pointer, with a generic pointer. I only need to
pass null. I can keep sanding it down to something more ergonomic:
W32(int) WriteFile(void *, void *, int, int *, void *);
That’s how I typically write these prototypes. I dropped the const
because it doesn’t help me. I used signed sizes because I like them better
and it’s what I’m usually holding at the call site. But doesn’t
changing the signedness potentially break compatibility? It makes no
difference to any practical ABI: It’s passed the same way. In general,
signedness is a matter for operators, and only some of them — mainly
comparisons (<
, >
, etc.) and division. It’s a similar story for
pointers starting with the 32-bit era, so I can choose whatever pointer
types are convenient.
In general, I can do anything I want so long as I know my compiler will
produce an appropriate function call. These are not standard functions,
like printf
or memcpy
, which are implemented in part by the compiler
itself, but foreign functions. It’s no different than teaching an
FFI how to make a call. This is also, in essence, how OpenGL and
Vulkan work, with applications defining the API for themselves.
Considering all this, my new hello world:
#define W32(r) __declspec(dllimport) r __stdcall
W32(void *) GetStdHandle(int);
W32(int) WriteFile(void *, void *, int, int *, void *);
int mainCRTStartup(void)
{
void *stdout = GetStdHandle(-10 - 1);
char message[] = "Hello, world!\n";
int len;
return !WriteFile(stdout, message, sizeof(message)-1, &len, 0);
}
You know, there’s a kind of beauty to a program that requires no external definitions. It builds quickly and produces a binary bit-for-bit identical to the original:
$ time cc -nostartfiles -o hello.exe main.c
real 0m 0.04s
user 0m 0.00s
sys 0m 0.00s
$ time cl /nologo hello.c /link /subsystem:console kernel32.lib
hello.c
real 0m 0.03s
user 0m 0.00s
sys 0m 0.00s
I’ve also been using this to patch over API rough edges. For example, WSARecvFrom takes WSAOVERLAPPED, but GetQueuedCompletionStatus takes OVERLAPPED. These types are explicitly compatible, and only defined separately for annoying technical reasons. I must use the same overlapped object with both APIs at once, meaning I would normally need ugly pointer casts on my Winsock calls, or vice versa with I/O completion ports. But because I’m writing all these definitions myself, I can define a common overlapped structure for both!
Perhaps you’re worried that this would be too fragile. Well, as a legacy software aficionado, I enjoy building and running my programs on old platforms. So far these programs still work properly going back 30 years to Windows NT 3.5 and Visual C++ 4.2. When I do hit a snag, it’s always been a bug (now long fixed) in the old operating system, not in my programs or these prototypes. So, in effect, this technique has worked well for the past 30 years!
Writing out these definitions is a bit of a chore, but after paying that price I’ve been quite happy with the results. I will likely continue doing it in the future, at least for non-graphical applications.
The major compilers have an enormous number of knobs. Most are
highly specialized, but others are generally useful even if uncommon. For
warnings, the venerable -Wall -Wextra
is a good start, but
circumstances improve by tweaking this warning set. This article covers
high-hitting development-time options in GCC, Clang, and MSVC that ought
to get more consideration.
There’s an irony that the more you use these options, the less useful they become. Given a reasonable workflow, they are a harsh mistress in a fast, tight feedback loop quickly breaking the habits that cause warnings and errors. It’s a kind of self-improvement, where eventually most findings will be false positives. With heuristics internalized, you will be able spot the same issues just reading code — a handy skill during code review.
Traditionally, C and C++ compilers are by default conservative with
warnings. Unless configured otherwise, they only warn about the most
egregious issues where they're highly confident. That's too conservative. For
gcc
and clang
, the first order of business is turning on more warnings
with -Wall
. Despite the name, this doesn’t actually enable all
warnings. (clang
has -Weverything
which does literally this, but
trust me, you don’t want it.) However, that still falls short, and you’re
better served enabling extra warnings with -Wextra
.
$ cc -Wall -Wextra ...
That should be the baseline on any new project, and closer to what these compilers should do by default. Not using these means leaving value on the table. If you come across such a project, there’s a good chance you can find bugs statically just by using this baseline. Some warnings only occur at higher optimization levels, so leave these on for your release builds, too.
For MSVC, including clang-cl
, a similar baseline is /W4, though it goes a bit far, warning about use of unary minus on unsigned types (C4146) and sign conversions (C4245). If you're using a CRT, also
disable the bogus and irresponsible “security” warnings. Putting it
together, the warning baseline becomes:
$ cl /W4 /wd4146 /wd4245 /D_CRT_SECURE_NO_WARNINGS ...
As for gcc
and clang
, I dislike unused parameter warnings, so I often
turn it off, at least while I’m working: -Wno-unused-parameter
.
Rarely is it a defect to not use a parameter. It’s common for a function
to fit a fixed prototype but not need all its parameters (e.g. WinMain
).
Were it up to me, this would not be part of -Wextra
.
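A typical false positive looks like this (a generic sketch): a callback whose prototype is fixed by some API, so it cannot drop the parameter it doesn't use. The usual workaround is a cast to void.
// Hypothetical callback: the prototype is fixed, but this handler has no
// use for its context argument.
static int on_tick(void *context, int count)
{
    (void)context;   // quiets -Wunused-parameter without -Wno-...
    return count > 0;
}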
I also dislike unused function warnings: -Wno-unused-function
.
I can’t say this is wrong for the baseline since, in most cases, ultimately
I do want to know if there are unused functions, e.g. to be deleted. But
while I’m working it’s usually noise.
If I’m working with OpenMP, I may also disable warnings about
unknown pragmas: -Wno-unknown-pragmas
. One cool feature of
OpenMP is that the typical case gracefully degrades to single-threaded
behavior when not enabled. That is, compiling without -fopenmp
.
I’ll test both ways to ensure I get deterministic results, or just to ease
debugging, and I don’t want warnings when it’s disabled. It’s fine for the
baseline to have this warning, but sometimes it’s a poor match.
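For instance, a loop like this (a generic sketch) runs in parallel with -fopenmp and degrades to an ordinary serial loop without it, but -Wall then flags the pragma as unknown:
// With -fopenmp the loop is parallelized; without it, it compiles and runs
// serially, and GCC/Clang warn about the unknown pragma.
void transform(float *out, const float *in, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * 2.0f;
    }
}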
When working with single-precision floats, perhaps on games or graphics,
it’s easy to accidentally introduce promotion to double precision, which
can hurt performance. It could be neglecting an f
suffix on a constant
or using sin
instead of sinf
. Use -Wdouble-promotion
to
catch such mistakes. Honestly, this is important enough that it should go
into the baseline.
#define PI 3.141592653589793
float degs = ...;
float rads = degs * PI / 180; // warns about promotion
It can be awkward around variadic functions, particularly printf
, which
cannot receive float
arguments, and so implicitly converts. You’ll need
an explicit cast to disable the warning. I imagine this is the main reason
the warning is not part of -Wextra
.
float x = ...;
printf("%.17g\n", (double)x);
Finally, an advanced option: -Wconversion -Wno-sign-conversion
.
It warns about implicit conversions that may result in data loss. Sign
conversions do not have data loss, the implicit conversions are useful,
and in my experience they’re not a source of defects, so I disable that
part using the second flag (like MSVC /wd4245
). The important warning
here is truncation of size values, warning about unsound uses of sizes and
subscripts. For example:
// NOTE: would be declared/defined via windows.h
typedef uint32_t DWORD;
BOOL WriteFile(HANDLE, const void *, DWORD, DWORD *, OVERLAPPED *);
void logmsg(char *msg, size_t len)
{
HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
DWORD out;
WriteFile(err, msg, len, &out, 0); // len truncation warning
}
On 64-bit targets, it will warn about truncating the 64-bit len
for the
32-bit parameter. To dismiss the warning, you must either address it by
using a loop to call WriteFile
multiple times, or acknowledge the
truncation with an explicit cast and accept the consequences. In this case
I may know from context it’s impossible for the program to even construct
such a large message, so I’d use an assertion and truncate.
void logmsg(char *msg, size_t len)
{
HANDLE err = GetStdHandle(STD_ERROR_HANDLE);
DWORD out;
assert(len <= 0xffffffff);
WriteFile(err, msg, (DWORD)len, &out, 0);
}
You might consider changing the interface instead:
void logmsg(char *msg, uint32_t len);
That probably passes the buck and doesn’t solve the underlying problem.
The caller may be holding a size_t
length, so the truncation happens
there instead. Or maybe you keep propagating this change backwards until
it, say, dissipates on a known constant. -Wconversion
leads to
these ripple effects that improve the overall program, which is why I
like it.
The catch is that the above warning only happens for 64-bit targets. So you might miss it. The inverse is true in other cases. This is one area where cross-architecture testing can pay off.
Unfortunately since this warning is off the beaten path, it seems like it doesn’t quite get the attention it could use. It warns about simple cases where truncation has been explicitly handled/avoided. For example:
int x = ...;
char digit = '0' + x%10; // false warning
The '0'
is a known constant. The operation x%10
has a known range (-9
to 9). Therefore the addition result has a known range, and all results
can be represented in a char
. Yet it still warns. This often comes up
dealing with character data like this.
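When the range really is known to be safe, an explicit cast is the usual way to acknowledge it and quiet the warning:
char digit = (char)('0' + x%10); // same result, no -Wconversion warning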
In my logmsg
fix I had used an assertion to check that no truncation
actually occurred. But wouldn’t it be nice if the compiler could generate
that for us somehow? That brings us to dynamic checks.
Sanitizers have been around for nearly a decade but are still criminally
underused. They insert run-time assertions into programs at the flip of a
switch, typically at a modest performance cost — less than the cost of a
debug build. All three major compilers support at least one sanitizer on
all targets. In most cases, failing to use them is practically the same as
not even trying to find defects. Every beginner tutorial ought to be using
sanitizers from page 1 where they teach how to compile a program with
gcc
. (That this is universally not the case, and that these same
tutorials also do not begin with teaching a debugger, is a major, on-going
education failure.)
There are multiple different sanitizers with lots of overlap, but Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan) are the most general. They are compatible with each other and form a solid, general baseline. To use address sanitizer, at both compile and link time do:
$ cc ... -fsanitize=address ...
It’s even spelled the same way in MSVC. It’s needed at link time because it includes a runtime component. When working properly it’s aware of all allocations and checks all memory accesses that might be out of bounds, producing a run-time error if that occurs. It’s not always appropriate, but most projects that can use it probably should.
UBSan is enabled similarly:
$ cc ... -fsanitize=undefined ...
It adds checks around operations that might be undefined, emitting a
run-time error if it occurs. It has an optional runtime component to
produce a helpful diagnostic. You can instead insert a trap instruction,
which is how I prefer to use it: -fsanitize-trap=undefined
.
(Until recently it was -fsanitize-undefined-trap-on-error
.)
This works on platforms where the UBSan runtime is unsupported. Some
instrumentation is only inserted at higher optimization levels.
For me, the most useful UBSan check is signed overflow — e.g. computing the wrong result — and it’s instrumentation I miss when not working in C. In programs where this might be an issue, combine it with a fuzzer to search for inputs that cause overflows. This is yet another argument in favor of signed sizes, as UBSan can detect such overflows. (Yes, UBSan optionally instruments unsigned overflow, too, but then you must somehow distinguish intentional from unintentional overflow.)
On Linux, ASan and UBSan strangely do not have debugger-oriented defaults. Fortunately that’s easy to address with a couple of environment variables, which cause them to break on error instead of uselessly exiting:
export ASAN_OPTIONS=abort_on_error=1:halt_on_error=1
export UBSAN_OPTIONS=abort_on_error=1:halt_on_error=1
Also, when compiling you can combine sanitizers like so:
$ cc ... -fsanitize=address,undefined ...
As of this writing, MSVC does not have UBSan, but it does have a similar
feature, run-time error checks. Three sub-flags (c
, s
, u
)
enable different checks, and /RTCcsu
turns them all on. The c
flag
generates the assertion I had manually written with -Wconversion
,
and traps any truncation at run time. There’s nothing quite like this in
UBSan! It’s so extreme that it’s compatible with neither standard runtime
libraries (fortunately not a big deal) nor with ASan.
Caveat: Explicit casts aren’t enough, you must actually truncate variables
using a mask in order to pass the check. For example, to accept truncation
in the logmsg
function:
WriteFile(err, msg, len&0xffffffff, &out, 0);
Thread Sanitizer (TSan) is occasionally useful for finding — or, more often, proving the presence of — data races. It has a runtime component and so must be used at compile time and link time.
$ cc ... -fsanitize=thread ...
Unfortunately it only works in a narrow context. The target must use pthreads, not C11 threads, OpenMP, nor direct cloning. It must only synchronize through code that was compiled with TSan. That means no synchronization through system calls, especially no futexes. Most non-trivial programs do not meet the criteria.
Another common mistake in tutorials is using plain old -g
instead
of -g3
(read: “debug level 3”). That’s like using -O
instead of -O3
. It adds a lot more debug information to the
output, particularly enums and macros. The extra information is useful and
you’re better off having it!
$ cc ... -g3 ...
All the major build systems — CMake, Autotools, Meson, etc. — get this
wrong in their standard debug configurations. Producing a fully-featured
debug build from these systems is a constant battle for me. Often it’s
easier to ignore the build system entirely and cc -g3 **/*.c
(plus
sanitizers, etc.).
(Short term note: GCC 11, released in March 2021, switched to DWARF5 by
default. However, GDB could not access the extra -g3
debug
information in DWARF5 until GDB 13, released February 2023. If you have a
toolchain from that two year window — except mine because I patched
it — then you may also need -gdwarf-4
to switch back to DWARF4.)
What about -Og
? In theory it enables optimizations that do not
interfere with debugging, and potentially some additional warnings. In
practice I still get far too many “optimized out” messages from GDB when I
use it, so I don’t bother. Fortunately C is such a simple language that
debug builds are nearly as fast as release builds anyway.
On MSVC I like having debug information embedded in binaries, as GCC does,
which is done using /Z7
.
$ cl ... /Z7 ...
Though I certainly understand the value of separate debug information,
/Zi
, in some cases. Sometimes I wish the GNU toolchain made this easier.
My personal rigorous baseline for development using gcc
and clang
looks like this (all platforms):
$ cc -g3 -Wall -Wextra -Wconversion -Wdouble-promotion
-Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion
-fsanitize=undefined -fsanitize-trap ...
While ASan is great for quickly reviewing and evaluating other people’s projects, I don’t find it useful for my own programs. I avoid that class of defects through smarter paradigms (region-based allocation, no null terminated strings, etc.). I also prefer the behavior of trap instruction UBSan versus a diagnostic, as it behaves better under debuggers.
For cl
and clang-cl
, my personal baseline looks like this:
$ cl /Z7 /W4 /wd4146 /wd4245 /RTCcsu ...
I don’t normally need /D_CRT_SECURE_NO_WARNINGS
since I don’t use a CRT
anyway.
The catch is that there’s no way to avoid using a bit of assembly. Neither
the clone
nor clone3
system calls have threading semantics compatible
with C, so you’ll need to paper over it with a bit of inline assembly per
architecture. This article will focus on x86-64, but the basic concept
should work on all architectures supported by Linux. The glibc clone(2)
wrapper fits a C-compatible interface on top of the raw system call,
but we won’t be using it here.
Before diving in, the complete, working demo: stack_head.c
On Linux, threads are spawned using the clone
system call with semantics
like the classic unix fork(2)
. One process goes in, two processes come
out in nearly the same state. For threads, those processes share almost
everything and differ only by two registers: the return value — zero in
the new thread — and stack pointer. Unlike typical thread spawning APIs,
the application does not supply an entry point. It only provides a stack
for the new thread. The simple form of the raw clone API looks something
like this:
long clone(long flags, void *stack);
Sounds kind of elegant, but it has an annoying problem: The new thread begins life in the middle of a function without any established stack frame. Its stack is a blank slate. It’s not ready to do anything except jump to a function prologue that will set up a stack frame. So besides the assembly for the system call itself, it also needs more assembly to get the thread into a C-compatible state. In other words, a generic system call wrapper cannot reliably spawn threads.
void brokenclone(void (*threadentry)(void *), void *arg)
{
// ...
long r = syscall(SYS_clone, flags, stack);
// DANGER: new thread may access non-existent stack frame here
if (!r) {
threadentry(arg);
}
}
For odd historical reasons, each architecture’s clone
has a slightly
different interface. The newer clone3
unifies these differences, but it
suffers from the same thread spawning issue above, so it’s not helpful
here.
I figured out a neat trick eight years ago which I continue to use
today. The parent and child threads are in nearly identical states when
the new thread starts, but the immediate goal is to diverge. As noted, one
difference is their stack pointers. To diverge their execution, we could
make their execution depend on the stack. An obvious choice is to push
different return pointers on their stacks, then let the ret
instruction
do the work.
Carefully preparing the new stack ahead of time is the key to everything,
and there’s a straightforward technique that I like call the stack_head
,
a structure placed at the high end of the new stack. Its first element
must be the entry point pointer, and this entry point will receive a
pointer to its own stack_head
.
struct __attribute((aligned(16))) stack_head {
void (*entry)(struct stack_head *);
// ...
};
The structure must have 16-byte alignment on all architectures. I used an
attribute to help keep this straight, and it can help when using sizeof
to place the structure, as I’ll demonstrate later.
Now for the cool part: The ...
can be anything you want! Use that area
to seed the new stack with whatever thread-local data is necessary. It’s a
neat feature you don’t get from standard thread spawning interfaces. If I
plan to “join” a thread later — wait until it’s done with its work — I’ll
put a join futex in this space:
struct __attribute((aligned(16))) stack_head {
void (*entry)(struct stack_head *);
int join_futex;
// ...
};
More details on that futex shortly.
I call the clone
wrapper newthread
. It has the inline assembly for the
system call, and since it includes a ret
to diverge the threads, it’s a
“naked” function just like with setjmp
. The compiler will
generate no prologue or epilogue, and the function body is limited to
inline assembly without input/output operands. It cannot even reliably
reference its parameters by name. Like clone
, it doesn’t accept a thread
entry point. Instead it accepts a stack_head
seeded with the entry
point. The whole wrapper is just six instructions:
__attribute((naked))
static long newthread(struct stack_head *stack)
{
__asm volatile (
"mov %%rdi, %%rsi\n" // arg2 = stack
"mov $0x50f00, %%edi\n" // arg1 = clone flags
"mov $56, %%eax\n" // SYS_clone
"syscall\n"
"mov %%rsp, %%rdi\n" // entry point argument
"ret\n"
: : : "rax", "rcx", "rsi", "rdi", "r11", "memory"
);
}
On x86-64, both function calls and system calls use rdi
and rsi
for
their first two parameters. Per the reference clone(2)
prototype above:
the first system call argument is flags
and the second argument is the
new stack
, which will point directly at the stack_head
. However, the
stack pointer arrives in rdi
. So I copy stack
into the second argument
register, rsi
, then load the flags (0x50f00
) into the first argument
register, rdi
. The system call number goes in rax
.
Where does that 0x50f00
come from? That’s the bare minimum thread spawn
flag set in hexadecimal. If any flag is missing then threads will not
spawn reliably — as discovered the hard way by trial and error across
different system configurations, not from documentation. It’s computed
normally like so:
long flags = 0;
flags |= CLONE_FILES;
flags |= CLONE_FS;
flags |= CLONE_SIGHAND;
flags |= CLONE_SYSVSEM;
flags |= CLONE_THREAD;
flags |= CLONE_VM;
When the system call returns, it copies the stack pointer into rdi
, the
first argument for the entry point. In the new thread the stack pointer
will be the same value as stack
, of course. In the old thread this is a
harmless no-op because rdi
is a volatile register in this ABI. Finally,
ret
pops the address at the top of the stack and jumps. In the old
thread this returns to the caller with the system call result, either an
error (negative errno) or the new thread ID. In the new thread
it pops the first element of stack_head
which, of course, is the
entry point. That’s why it must be first!
The thread has nowhere to return from the entry point, so when it’s done
it must either block indefinitely or use the exit
(not exit_group
)
system call to terminate itself.
The caller side looks something like this:
static void threadentry(struct stack_head *stack)
{
// ... do work ...
__atomic_store_n(&stack->join_futex, 1, __ATOMIC_SEQ_CST);
futex_wake(&stack->join_futex);
exit(0);
}
__attribute((force_align_arg_pointer))
void _start(void)
{
struct stack_head *stack = newstack(1<<16);
stack->entry = threadentry;
// ... assign other thread data ...
stack->join_futex = 0;
newthread(stack);
// ... do work ...
futex_wait(&stack->join_futex, 0);
exit_group(0);
}
Despite the minimalist, 6-instruction clone wrapper, this is taking the shape of a conventional threading API. It would only take a bit more to hide the futex, too. Speaking of which, what's going on there? The same principle as a WaitGroup. The futex, an integer, is zero-initialized, indicating the thread is running ("not done"). The joiner tells the kernel to wait until the integer is non-zero, which it may already be since I don't bother to check first. When the child thread is done, it atomically sets the futex to non-zero and wakes all waiters, which might be nobody.
Caveat: It’s not safe to free/reuse the stack after a successful join. It
only indicates the thread is done with its work, not that it exited. You’d
need to wait for its SIGCHLD
(or use CLONE_CHILD_CLEARTID
). If this
sounds like a problem, consider your context more carefully: Why do
you feel the need to free the stack? It will be freed when the process
exits. Worried about leaking stacks? Why are you starting and exiting an
unbounded number of threads? In the worst case park the thread in a thread
pool until you need it again. Only worry about this sort of thing if
you’re building a general purpose threading API like pthreads. I know it’s
tempting, but avoid doing that unless you absolutely must.
What’s with the force_align_arg_pointer
? Linux doesn’t align the stack
for the process entry point like a System V ABI function call. Processes
begin life with an unaligned stack. This attribute tells GCC to fix up the
stack alignment in the entry point prologue, just like on Windows.
If you want to access argc
, argv
, and envp
you’ll need more
assembly. (I wish doing really basic things without libc on Linux
didn’t require so much assembly.)
__asm (
".global _start\n"
"_start:\n"
" movl (%rsp), %edi\n"
" lea 8(%rsp), %rsi\n"
" lea 8(%rsi,%rdi,8), %rdx\n"
" call main\n"
" movl %eax, %edi\n"
" movl $60, %eax\n"
" syscall\n"
);
int main(int argc, char **argv, char **envp)
{
// ...
}
Getting back to the example usage, it has some regular-looking system call wrappers. Where do those come from? Start with this 6-argument generic system call wrapper.
long syscall6(long n, long a, long b, long c, long d, long e, long f)
{
register long ret;
register long r10 asm("r10") = d;
register long r8 asm("r8") = e;
register long r9 asm("r9") = f;
__asm volatile (
"syscall"
: "=a"(ret)
: "a"(n), "D"(a), "S"(b), "d"(c), "r"(r10), "r"(r8), "r"(r9)
: "rcx", "r11", "memory"
);
return ret;
}
I could define syscall5
, syscall4
, etc. but instead I’ll just wrap it
in macros. The former would be more efficient since the latter wastes
instructions zeroing registers for no reason, but for now I’m focused on
compacting the implementation source.
#define SYSCALL1(n, a) \
syscall6(n,(long)(a),0,0,0,0,0)
#define SYSCALL2(n, a, b) \
syscall6(n,(long)(a),(long)(b),0,0,0,0)
#define SYSCALL3(n, a, b, c) \
syscall6(n,(long)(a),(long)(b),(long)(c),0,0,0)
#define SYSCALL4(n, a, b, c, d) \
syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),0,0)
#define SYSCALL5(n, a, b, c, d, e) \
syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),0)
#define SYSCALL6(n, a, b, c, d, e, f) \
syscall6(n,(long)(a),(long)(b),(long)(c),(long)(d),(long)(e),(long)(f))
Now we can have some exits:
__attribute((noreturn))
static void exit(int status)
{
SYSCALL1(SYS_exit, status);
__builtin_unreachable();
}
__attribute((noreturn))
static void exit_group(int status)
{
SYSCALL1(SYS_exit_group, status);
__builtin_unreachable();
}
Simplified futex wrappers:
static void futex_wait(int *futex, int expect)
{
SYSCALL4(SYS_futex, futex, FUTEX_WAIT, expect, 0);
}
static void futex_wake(int *futex)
{
SYSCALL3(SYS_futex, futex, FUTEX_WAKE, 0x7fffffff);
}
And so on.
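For reference, the constants these wrappers rely on, with their x86-64 Linux values (the demo presumably defines them itself rather than pulling in system headers):
enum {
    SYS_mmap       = 9,
    SYS_exit       = 60,
    SYS_futex      = 202,
    SYS_exit_group = 231,
};

enum {
    FUTEX_WAIT = 0,
    FUTEX_WAKE = 1,
};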
Finally I can talk about that newstack
function. It’s just a wrapper
around an anonymous memory map allocating pages from the kernel. I’ve
hardcoded the constants for the standard mmap allocation since they’re
nothing special or unusual. The return value check is a little tricky
since a large portion of the negative range is valid, so I only want to
check for a small range of negative errnos. (Allocating an arena looks
basically the same.)
static struct stack_head *newstack(long size)
{
unsigned long p = SYSCALL6(SYS_mmap, 0, size, 3, 0x22, -1, 0);
if (p > -4096UL) {
return 0;
}
long count = size / sizeof(struct stack_head);
return (struct stack_head *)p + count - 1;
}
The aligned
attribute comes into play here: I treat the result like an
array of stack_head
and return the last element. The attribute ensures
each individual element is aligned.
That’s it! There’s not much to it other than a few thoughtful assembly instructions. It took doing this a few times in a few different programs before I noticed how simple it can be.
I no longer call these "freestanding" programs since that term is, at best, inaccurate. In fact, we will be actively avoiding GCC features associated with that label. Instead I call these CRT-free programs, where CRT stands for the C runtime, the Windows-oriented term for libc. This term communicates both intent and scope.
You should already know that main
is not the program’s entry point, but
a C application’s entry point. The CRT provides the entry point, where it
initializes the CRT, including parsing command line options, then
calls the application’s main
. The real entry point doesn’t have a name.
It’s just the address of the function to be called by the loader without
arguments.
You might naively assume you could continue using the name main
and tell
the linker to use it as the entry point. You would be wrong. Avoid the
name main
! It has a special meaning in C and gets special treatment. Using
it without a conventional CRT will confuse your tools and may cause build
issues.
While you can use almost any other name you like, the conventional names
are mainCRTStartup
(console subsystem) and WinMainCRTStartup
(windows
subsystem). It’s easy to remember: Append CRTStartup
to the name you’d
use in a normal CRT-linking application. I strongly recommend using these
names because it reduces friction. Your tools are already familiar with
them, so you won’t need to do anything special.
int mainCRTStartup(void); // console subsystem
int WinMainCRTStartup(void); // windows subsystem
The MSVC linker documentation says the entry point uses the __stdcall
calling convention. Ignore this and do not use __stdcall
for your
entry point! Since entry points take no arguments, there is no practical
difference from the __cdecl
calling convention, so it does not actually
matter. Rather, the goal is to avoid __stdcall
function decorations.
In particular, the GNU linker --entry
option does not understand them,
nor can it find decorated entry points on its own. If you use __stdcall
,
then the 32-bit GNU linker will silently (!) choose the beginning of your
.text
section as the entry point.
If you’re using C++, then of course you will also need to use extern "C"
so that it’s not name-mangled. Otherwise the results are similarly bad.
If using -fwhole-program
, you will need to mark your entry point as
externally visible for GCC so that it knows it's an entry point. While
linkers are familiar with conventional entry point names, GCC the
compiler is not. Normally you do not need to worry about this.
__attribute((externally_visible)) // for -fwhole-program
int mainCRTStartup(void)
{
return 0;
}
The entry point returns int
. If there are no other threads then the
process will exit with the returned value as its exit status. In practice
this is only useful for console programs. Windows subsystem programs have
threads started automatically, without warning, and it’s almost certain
your main thread is not the last thread. You probably want to use
ExitProcess
or even TerminateProcess
instead of returning. The latter
exits more abruptly and can avoid issues with certain subsystems, like
DirectSound, not shutting down gracefully: It doesn’t even let them try.
int WinMainCRTStartup(void)
{
// ...
TerminateProcess(GetCurrentProcess(), 0);
}
Starting with the GNU toolchain, you have two ways to get into “CRT-free
mode”: -nostartfiles
and -nostdlib
. The former is more dummy-proof,
and it’s what I use in build documentation. The latter can be a more
complicated, but when it succeeds you get guarantees about the result. I
use it in build scripts I intend to run myself, which I want to fail if
they don’t do exactly what I expect. To illustrate, consider this trivial
program:
#include <windows.h>
int mainCRTStartup(void)
{
ExitProcess(0);
}
This program uses ExitProcess
from kernel32.dll
. Compiling is easy:
$ cc -nostartfiles example.c
The -nostartfiles
prevents it from linking the CRT entry point, but it
still implicitly passes other “standard” linker flags, including libraries
-lmingw32
and -lkernel32
. Programs can use kernel32.dll
functions
without explicitly linking that DLL. But, hey, isn’t -lmingw32
the CRT,
the thing we’re avoiding? It is, but it wasn’t actually linked because the
program didn’t reference it.
$ objdump -p a.exe | grep -Fi .dll
DLL Name: KERNEL32.dll
However, -nostdlib
does not pass any of these libraries, so you need to
do so explicitly.
$ cc -nostdlib example.c -lkernel32
The MSVC toolchain behaves a little like -nostartfiles
, not linking a
CRT unless you need it, semi-automatically. However, you’ll need to list
kernel32.dll
and tell it which subsystem you’re using.
$ cl example.c /link /subsystem:console kernel32.lib
However, MSVC has a handy little feature to list these arguments in the source file.
#ifdef _MSC_VER
#pragma comment(linker, "/subsystem:console")
#pragma comment(lib, "kernel32.lib")
#endif
This information must go somewhere, and I prefer the source file rather than a build script. Then anyone can point MSVC at the source without worrying about options.
$ cl example.c
I try to make all my Windows programs so simply built.
On Windows, it’s expected that stacks will commit dynamically. That is, the stack is merely reserved address space, and it’s only committed when the stack actually grows into it. This made sense 30 years ago as a memory saving technique, but today it no longer makes sense. However, programs are still built to use this mechanism.
To function properly, programs must touch each stack page for the first
time in order. Normally that’s not an issue, but if your stack frame
exceeds the page size, there’s a chance it might step over a page. When a
function has a large stack frame, GCC inserts a call to a “stack probe” in
libgcc
that touches its pages in the prologue. It’s not unlike stack
clash protection.
For example, if I have a 4kiB local variable:
int mainCRTStartup(void)
{
char buf[1<<12] = {0};
return 0;
}
When I compile with -nostdlib
:
$ cc -nostdlib example.c
ld: ... undefined reference to `___chkstk_ms'
It’s trying to link the CRT stack probe. You can disable this behavior
with -mno-stack-arg-probe
.
$ cc -mno-stack-arg-probe -nostdlib example.c
Or you can just link -lgcc
to provide a definition:
$ cc -nostdlib example.c -lgcc
Had you used -nostartfiles
, you wouldn’t have noticed because it passes
-lgcc
automatically. It’s “dummy-proof” because this sort of issue goes
away before it comes up, though for the same reason it’s harder to tell
exactly what went into a program.
If you disable the probe altogether — my preference — you’ve only solved the linker problem, but the underlying stack commit problem remains and your program may crash. You can solve that by telling the linker to ask the loader to commit a larger stack up front rather than grow it at run time. Say, 2MiB:
$ cc -mno-stack-arg-probe -Xlinker --stack=0x200000,0x200000 example.c
Of course, I wish that this was simply the default behavior because it’s far more sensible! A much better option is to avoid large stack frames in the first place. Allocate locals larger than, say, 1KiB in a scratch arena instead of on the stack.
MSVC doesn’t have libgcc
of course, but it still generates stack probes
both for growing the stack and for security checks. The latter requires
kernel32.dll
, so if I compile the same program with MSVC, I get a bunch
of linker failures:
$ cl example.c /link /subsystem:console
... unresolved external symbol __imp_RtlCaptureContext ...
... and 7 more ...
Using /Gs1000000000
turns off the stack probes, /GS-
turns off the
checks, /stack
commits a larger stack:
$ cl /GS- /Gs1000000000 example.c /link
/subsystem:console /stack:0x200000,0x200000
Though, as before, better to avoid large stack frames in the first place.
The three major C and C++ compilers — GCC, MSVC, Clang — share a common,
evil weakness: “built-in” functions. No matter what, they each assume
you will supply definitions for standard string functions at link time,
particularly memset
and memcpy
. They do this no matter how many
“seriously now, do not use standard C functions” options you pass. When
you don’t link a CRT, you may need to define them yourself.
With GCC there’s a catch: it will transform your memset
definition —
that is, in a function named memset
— into a call to itself. After
all, it looks an awful lot like memset
! This typically manifests as an
infinite loop. Use -fno-builtin
to prevent GCC from mis-compiling
built-in functions.
Even with -fno-builtin
, both GCC and Clang will continue inserting calls
to built-in functions elsewhere. For example, making an especially large
local variable (and using volatile
to prevent it from being optimized
out):
int mainCRTStartup(void)
{
volatile char buf[1<<14] = {0};
return 0;
}
As of this writing, the latest GCC and Clang will generate a memset
call
despite -fno-builtin
:
$ cc -mno-stack-arg-probe -fno-builtin -nostdlib example.c
ld: ... undefined reference to `memset' ...
To be absolutely pure, you will need to address this in just about any
non-trivial program. On the other hand, -nostartfiles
will grab a
definition from msvcrt.dll
for you:
$ cc -nostartfiles example.c
$ objdump -p a.exe | grep -Fi .dll
DLL Name: msvcrt.dll
To be clear, this is a completely legitimate and pragmatic route! You get the benefits of both worlds: the CRT is still out of the way, but there’s also no hassle from misbehaving compilers. If this sounds like a good deal, then do it! (For on-lookers feeling smug: there is no such easy, general solution for this problem on Linux.)
When you write your own definitions, I suggest putting each definition in
its own section so that they can be discarded via -Wl,--gc-sections
when
unused:
__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
// ...
}
So far, for all three compilers, I’ve only needed to provide definitions
for memset
and memcpy
.
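For reference, minimal definitions along those lines (a sketch; byte-at-a-time loops are enough to satisfy the linker, and each sits in its own section as suggested above):
#include <stddef.h>

// Build with -fno-builtin so GCC doesn't transform these loops back into
// calls to themselves.
__attribute((section(".text.memset")))
void *memset(void *d, int c, size_t n)
{
    unsigned char *dst = d;
    for (size_t i = 0; i < n; i++) {
        dst[i] = (unsigned char)c;
    }
    return d;
}

__attribute((section(".text.memcpy")))
void *memcpy(void *restrict d, const void *restrict s, size_t n)
{
    unsigned char *dst = d;
    const unsigned char *src = s;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
    return d;
}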
GCC expects a 16-byte aligned stack and generates code accordingly. Such is dictated by the x64 ABI, so that’s a given on 64-bit Windows. However, the x86 ABIs only guarantee 4-byte alignment. If no care is taken to deal with it, there will likely be unaligned loads. Some may not be valid (e.g. SIMD) leading to a crash. UBSan disapproves, too. Fortunately there’s a function attribute for this:
__attribute((force_align_arg_pointer))
int mainCRTStartup(void)
{
// ...
}
GCC will now align the stack in this function’s prologue. Adjustment is
only necessary at entry points, as GCC will maintain alignment through its
own frames. This includes all entry points, not just the program entry
point, particularly thread start functions. Rule of thumb for i686 GCC:
If WINAPI
or __stdcall
appears in a definition, the stack likely
requires alignment.
__attribute((force_align_arg_pointer))
DWORD WINAPI mythread(void *arg)
{
// ...
}
It’s harmless to use this attribute on x64. The prologue will just be a
smidge larger. If you’re worried about it, use #ifdef __i686__
to limit
it to 32-bit builds.
If I’ve written a graphical application with WinMainCRTStartup
, used
large stack frames, marked my entry point as externally visible, plan to
support 32-bit builds, and defined a couple of needed string functions, my
optimal entry point may look something like:
#ifdef __GNUC__
__attribute((externally_visible))
#endif
#ifdef __i686__
__attribute((force_align_arg_pointer))
#endif
int WinMainCRTStartup(void)
{
// ...
}
Then my “optimize all the things” release build may look something like:
$ cc -O3 -fno-builtin -Wl,--gc-sections -s -nostdlib -mwindows
-fno-asynchronous-unwind-tables -o app.exe app.c -lkernel32
Or with MSVC:
$ cl /O2 /GS- app.c /link kernel32.lib /subsystem:windows
Or if I’m taking it easy maybe just:
$ cc -O3 -fno-builtin -s -nostartfiles -mwindows -o app.exe app.c
Or with MSVC (linker flags in source):
$ cl /O2 app.c
When not using the C standard library, how does one deal with
formatted output? Re-implementing the entirety of printf
from scratch
seems like a lot of work, and indeed it would be. Fortunately it’s rarely
necessary. With the right mindset, and considering your program’s actual
formatting needs, it’s not as difficult as it might appear. Since it goes
hand-in-hand with buffering, I’ll cover both topics at once, including
sprintf
-like capabilities, which is where we’ll start.
Buffering amortizes the costs of write (and read) system calls. Many small writes are queued via the buffer into a few large writes. This isn’t just an implementation detail. It’s key in the mindset to tackle formatted output: Printing is appending.
The mindset includes the reverse: Appending is like printing. Consider
this next time you reach for strcat
or similar. Is this the appropriate
destination for this data, or am I just going to print it — i.e. append it
to another, different buffer — afterward?
This concept may sound obvious, but consider that there are major, popular programming paradigms where the norm is otherwise. I’ll pick on Python to illustrate, but it’s not alone.
print(f"found {count} items")
This line of code allocates a buffer; formats the value of the variable
count
into it; allocates a second buffer; copies into it the prefix
("found "
), the first buffer, and the suffix (" items"
); copies the
contents of this second buffer into the standard output buffer; then
discards the two temporary buffers. To see for yourself, use the CPython
bytecode disassembler on it. (It is pretty neat that string
formatting is partially implemented in the compiler and partially parsed
at compile time.)
With the print-is-append mindset, you know it’s ultimately being copied into the standard output buffer, and that you can skip the intermediate appending and copying. Avoiding that pessimization isn’t just about the computer’s time, it’s even more about saving your own time implementing formatted output.
In C that line looks like:
printf("found %d items\n", count);
The format string is a domain-specific language (DSL) that is (usually) parsed and evaluated at run time. In essence it’s a little program that says:
"found "
to the output buffer" items\n"
to the output bufferFor sprintf
the output buffer is caller-supplied instead of a buffered
stream.
In this implementation we’re doing to skip the DSL and express such
“format programs” in C itself. It’s more verbose at the call site, but it
simplifies the implementation. As a bonus, it’s also faster since the
format program is itself compiled by the C compiler. In your own formatted
output implementation you could write a printf
that, following the
format string, calls the append primitives we’ll build below.
Let’s begin by defining an output buffer. An output buffer tracks the
total capacity and how much has been written. I’ll include a sticky error
flag to simplify error checks. For a first pass we’ll start with a
sprintf
rather than full-blown printf
because there’s nowhere yet for
the data to go.
#define MEMBUF(buf, cap) {buf, cap, 0, 0}
struct buf {
unsigned char *buf;
int cap;
int len;
_Bool error;
};
I’m using unsigned char
since these are bytes, best understood as
unsigned (0–255), particularly important when dealing with encodings. I
also wrote a “constructor” macro, MEMBUF
, to help with initialization.
Next we need a function to append bytes — the core operation:
void append(struct buf *b, unsigned char *src, int len)
{
int avail = b->cap - b->len;
int amount = avail<len ? avail : len;
for (int i = 0; i < amount; i++) {
b->buf[b->len+i] = src[i];
}
b->len += amount;
b->error |= amount < len;
}
If there wasn’t room, it copies as much as possible and sets the error flag to indicate truncation. It doesn’t return the error. Rather than check after each append, the caller will check after multiple appends, effectively batching the checks into one check. The typical, expected case is that there is no error, so make that path fast.
Since it’s an easy point to miss: append
is the only place in the entire
implementation where bounds checking comes into play. Everything else can
confidently throw bytes at the buffer without worrying if it fits. If
it doesn’t, the sticky error flag will indicate such at a more appropriate
time.
I could have used memcpy
for the loop, but the goal is not to use libc.
Besides, not using memcpy
means we can pass a null pointer without
making it a special exception.
append(b, 0, 0); // append nothing (no-op)
I expect that static strings are common sources for append, so I’ll add a helper macro which gets the length as a compile-time constant. The null terminator will not be used.
#define APPEND_STR(b, s) append(b, s, sizeof(s)-1)
If that’s not clear yet, it will be once you see an example. It’s also useful to append single bytes:
void append_byte(struct buf *b, unsigned char c)
{
append(b, &c, 1);
}
With primitive appends done, we can build ever “higher-level” appends. For
example, to append a formatted long
to the buffer:
void append_long(struct buf *b, long x)
{
unsigned char tmp[64];
unsigned char *end = tmp + sizeof(tmp);
unsigned char *beg = end;
long t = x>0 ? -x : x;
do {
*--beg = '0' - t%10;
} while (t /= 10);
if (x < 0) {
*--beg = '-';
}
append(b, beg, end-beg);
}
By working from the negative end — recall that the negative range is
larger than the positive — it supports the full range of signed long
,
whatever it happens to be on this host. With less than 50 lines of code we
now have enough to format the example:
char message[256];
struct buf b = MEMBUF(message, sizeof(message));
APPEND_STR(&b, "found ");
append_long(&b, count);
APPEND_STR(&b, "items\n");
if (b.error) {
// truncated
}
We can continue defining append functions for whatever types we need.
void append_ptr(struct buf *b, void *p)
{
APPEND_STR(b, "0x");
uintptr_t u = (uintptr_t)p;
for (int i = 2*sizeof(u) - 1; i >= 0; i--) {
append_byte(b, "0123456789abcdef"[(u>>(4*i))&15]);
}
}
struct vec2 { int x, y; };
void append_vec2(struct buf *b, struct vec2 v)
{
APPEND_STR(&b, "vec2{");
append_long(&b, v.x);
APPEND_STR(&b, ", ");
append_long(&b, v.y);
append_byte(&b, '}');
}
Perhaps you want features like field width? Add a parameter for it… but only if you need it!
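For instance, here’s a minimal sketch of a minimum-width variant built from the primitives above — the name and the behavior (right-justify, pad with spaces) are just one possible choice, not a prescription:
void append_long_width(struct buf *b, long x, int width)
{
    unsigned char tmp[64];
    struct buf scratch = MEMBUF(tmp, sizeof(tmp));
    append_long(&scratch, x);            // format into scratch first
    for (int i = scratch.len; i < width; i++) {
        append_byte(b, ' ');             // pad on the left with spaces
    }
    append(b, scratch.buf, scratch.len);
}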
As mentioned before, precise float formatting is challenging
because it’s full of edge cases. However, if you only need to output a
simple format at reduced precision, it’s not difficult. To illustrate,
this nearly matches %f
, built atop append_long
:
void append_double(struct buf *b, double x)
{
long prec = 1000000; // i.e. 6 decimals
if (x < 0) {
append_byte(b, '-');
x = -x;
}
x += 0.5 / prec; // round last decimal
if (x >= (double)(-1UL>>1)) { // out of long range?
APPEND_STR(b, "inf");
} else {
long integral = x;
long fractional = (x - integral)*prec;
append_long(b, integral);
append_byte(b, '.');
for (long i = prec/10; i > 1; i /= 10) {
if (i > fractional) {
append_byte(b, '0');
}
}
append_long(b, fractional);
}
}
So far this writes output to a buffer and truncates when it runs out of space. Usually we want output going to a sink — a kernel object such as a file, pipe, or socket — to which we have a handle like a file descriptor. Instead of truncating, we flush the buffer to this sink, at which point there’s room for more output. The error flag is set if the flush fails, but otherwise this is essentially the same concept as before.
In these examples I will use a file descriptor int
, but you can use
whatever sort of handle is appropriate. I’ll add an fd
field to the
buffer and a new constructor macro:
#define MEMBUF(buf, cap) {buf, cap, 0, -1, 0}
#define FDBUF(fd, buf, cap) {buf, cap, 0, fd, 0}
struct buf {
unsigned char *buf;
int cap;
int len;
int fd;
_Bool error;
};
The buffered stream will be polymorphic: Output can go to a memory buffer
or to an operating system handle using the same append interface. This is
a handy feature standard C doesn’t even have, though POSIX does in the
form of fmemopen
. Nothing else changes except append
,
which, if given a valid handle, will flush when full. Attempting to flush
a memory buffer sets the error flag.
_Bool os_write(int fd, void *, int);
void flush(struct buf *b)
{
b->error |= b->fd < 0;
if (!b->error && b->len) {
b->error |= !os_write(b->fd, b->buf, b->len);
b->len = 0;
}
}
I’ve arranged it so that output stops when there’s an error. Also I’m using a
hypothetical os_write
in the platform layer as a full, unbuffered write.
Note that unix write(2)
experiences partial writes and so must be used
in a loop. Win32 WriteFile
doesn’t have partial writes, so on Windows an
os_write
could pass its arguments directly to the operating system.
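For example, a POSIX sketch of such an os_write — hypothetical platform-layer code looping over write(2), and for brevity treating any failure, including EINTR, as an error:
#include <unistd.h>

_Bool os_write(int fd, void *buf, int len)
{
    for (int off = 0; off < len;) {
        int r = (int)write(fd, (char *)buf + off, len - off);
        if (r < 1) {
            return 0;   // error (for brevity, EINTR is not retried)
        }
        off += r;
    }
    return 1;
}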
The program will need to call flush
directly when it’s done writing
output, or to display output early, e.g. line buffering. In append
we’ll
use a loop to continue appending and flushing until the input is consumed
or an error occurs.
void append(struct buf *b, unsigned char *src, int len)
{
unsigned char *end = src + len;
while (!b->error && src<end) {
int left = end - src;
int avail = b->cap - b->len;
int amount = avail<left ? avail : left;
for (int i = 0; i < amount; i++) {
b->buf[b->len+i] = src[i];
}
b->len += amount;
src += amount;
if (amount < left) {
flush(b);
}
}
}
That completes formatted output! We can now do stuff like:
int main(void)
{
unsigned char mem[1<<10]; // arbitrarily-chosen 1KiB buffer
struct buf stdout = FDBUF(1, mem, sizeof(mem));
for (long i = 0; i < 1000000; i++) {
APPEND_STR(&stdout, "iteration ");
append_long(&stdout, i);
append_byte(&stdout, '\n');
// ...
}
flush(&stdout);
return stdout.error;
}
Except for the lack of format DSL, this should feel familiar.
Yesterday I wrote that setjmp
is handy and that it would be nice
to have without linking the C standard library. It’s conceptually simple,
after all. Today let’s explore some differently-portable implementation
possibilities with distinct trade-offs. At the very least it should
illuminate why setjmp
sometimes requires the use of volatile
.
First, a quick review: setjmp
and longjmp
are a form of non-local
goto.
typedef void *jmp_buf[N];
int setjmp(jmp_buf);
void longjmp(jmp_buf, int);
Calling setjmp
saves the execution context in a jmp_buf
, and longjmp
restores this context, returning the thread to this previous point of
execution. This means setjmp
returns twice: (1) after saving the
context, and (2) from longjmp
. To distinguish these cases, the first
time it returns zero and the second time it returns the value passed to
longjmp
.
jmp_buf
is an array of some platform-specific type and length. I’ll be
using void pointers in this article because it’s a register-sized type
that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal
addresses, which made working all this out easier.
Let’s start with the easiest option. GCC has two intrinsics doing
all the hard work for us: __builtin_setjmp
and __builtin_longjmp
. Its
worst case jmp_buf
is length 5, but the most popular architectures only
use the first 3 elements. Clang supports these intrinsics as well for GCC
compatibility.
Be mindful that the semantics are slightly different from the standard C
definition, namely that you cannot use longjmp
from the same function as
setjmp
. It also doesn’t touch the signal mask. However, it’s easier to
use and you don’t need to worry about volatile
.
// NOTE to copy-pasters: semantics differ slightly from standard C
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp
If you only care about GCC and/or Clang, then that’s it! It works as-is on every supported target and nothing more is needed. As a bonus, it will be more efficient than the libc version, though I should hope that won’t matter in practice. These are so awesome and convenient that I’m already second-guessing myself: “Do I really need to support other compilers…?”
If I want to support more compilers I’ll need to write it myself. It’s
also an excuse to dig into the details. The execution context is no more
than an array of saved registers, and longjmp
is merely restoring those
registers. One of the registers is the instruction pointer, and setting
the instruction pointer is called a jump.
Since we’re talking about registers, that means assembly. We’ll also need to know the target’s calling convention, so this really narrows things down. This implementation will target x86-64, a.k.a x64, Windows, but it will support MSVC as an additional compiler. So it’s a different kind of portability. I’ll start with GCC via w64devkit then massage it into something MSVC can use.
I mentioned before that setjmp
returns twice. So to return a second time
we just need to simulate a normal function return. Obviously that
includes restoring the stack pointer like the ret
instruction, but it
means preserving all the non-volatile registers a callee is supposed to
preserve. These will all go in the execution context.
The x64 calling convention specifies 9 non-volatile registers: rsp, rbp, rbx, rdi, rsi, r12, r13, r14, and r15. We’ll also need the instruction pointer, rip, making it 10 total.
typedef void *jmp_buf[10];
The tricky issue is that we need to save the registers immediately inside
setjmp
before the compiler has manipulated them in a function prologue.
That will take more than mere inline assembly. We’ll start with a naked
function, which means that GCC will not create a prologue or epilogue.
However, that means no local variables, and the function body will be
limited to inline assembly, including a ret
instruction for the
epilogue.
__attribute__((naked))
int setjmp(jmp_buf buf)
{
__asm(
// ...
);
}
The x64 calling convention uses rcx
for the first pointer argument, so
that’s where we’ll find buf
. I’ve arbitrarily decided to store rip
first, then the other registers in order. However, the current value of
rip
isn’t the one we need. The rip
we need was just pushed on top of
the stack by the caller. I’ll read that off the stack into a scratch
register, rax
, and then store it in the first element of buf
.
mov (%rsp), %rax
mov %rax, 0(%rcx)
The stack pointer, rsp
, is also indirect since I want the pointer just
before rip
was pushed, as it would be just after a ret
. I use a lea
,
load effective address, to add 8 bytes (recall: stack grows down),
placing the result in a scratch register, then write it into the second
element of buf
(i.e. 8 bytes into %rcx
).
lea 8(%rsp), %rax
mov %rax, 8(%rcx)
Everything else is a matter of elbow grease.
mov %rbp, 16(%rcx)
mov %rbx, 24(%rcx)
mov %rdi, 32(%rcx)
mov %rsi, 40(%rcx)
mov %r12, 48(%rcx)
mov %r13, 56(%rcx)
mov %r14, 64(%rcx)
mov %r15, 72(%rcx)
With all work complete, return zero to the caller.
xor %eax, %eax
ret
Putting it all together, and avoiding a -Wunused-parameter
:
__attribute__((naked,returns_twice))
int setjmp(jmp_buf buf)
{
(void)buf;
__asm(
"mov (%rsp), %rax\n"
"mov %rax, 0(%rcx)\n"
"lea 8(%rsp), %rax\n"
"mov %rax, 8(%rcx)\n"
"mov %rbp, 16(%rcx)\n"
"mov %rbx, 24(%rcx)\n"
"mov %rdi, 32(%rcx)\n"
"mov %rsi, 40(%rcx)\n"
"mov %r12, 48(%rcx)\n"
"mov %r13, 56(%rcx)\n"
"mov %r14, 64(%rcx)\n"
"mov %r15, 72(%rcx)\n"
"xor %eax, %eax\n"
"ret\n"
);
}
Also take note of the returns_twice attribute. It informs GCC of this function’s unusual nature — that it may return more than once — so GCC avoids keeping variables in registers across the call, and it enables -Wclobbered diagnostics. Technically
this means we could get away with saving only rip
, rsp
, and rbp
—
exactly as __builtin_setjmp
does — but we’ll need the others for MSVC
anyway.
In longjmp
we need to restore all those registers. For purely aesthetic
reasons I’ve decided to do it in reverse order. Everything but rip
is
easy.
mov 72(%rcx), %r15
mov 64(%rcx), %r14
mov 56(%rcx), %r13
mov 48(%rcx), %r12
mov 40(%rcx), %rsi
mov 32(%rcx), %rdi
mov 24(%rcx), %rbx
mov 16(%rcx), %rbp
mov 8(%rcx), %rsp
The instruction set doesn’t have direct access to rip
. It will be a
jmp
instead of mov
, but before jumping we’ll need to prepare the
return value. The x64 calling convention says the second argument is
passed in rdx
, so move that to rax
, then jmp
to the caller. It’s
only a 32-bit operand, C int
, so edx
instead of rdx
.
mov %edx, %eax
jmp *0(%rcx)
Putting it all together, and adding the noreturn
attribute:
__attribute__((naked,noreturn))
void longjmp(jmp_buf buf, int ret)
{
(void)buf;
(void)ret;
__asm(
"mov 72(%rcx), %r15\n"
"mov 64(%rcx), %r14\n"
"mov 56(%rcx), %r13\n"
"mov 48(%rcx), %r12\n"
"mov 40(%rcx), %rsi\n"
"mov 32(%rcx), %rdi\n"
"mov 24(%rcx), %rbx\n"
"mov 16(%rcx), %rbp\n"
"mov 8(%rcx), %rsp\n"
"mov %edx, %eax\n"
"jmp *0(%rcx)\n"
);
}
The C standard says that if ret
is zero then longjmp
will return 1
from setjmp
instead. I leave that detail as a reader exercise. Otherwise
this is a complete, working setjmp
. It works perfectly when I swap it in
for setjmp.h
in my u-config test suite.
Now that you’ve seen the guts, let’s talk about volatile
and why it’s
necessary. Consider this function, example
, which calls a work
function that may return through setjmp
(e.g. on failure).
void work(jmp_buf);
int example(void)
{
int r = 0;
jmp_buf buf;
if (!setjmp(buf)) {
// first return
r = 1;
work(buf);
} else {
// second return
}
return r;
}
It stores to r
after the first setjmp
return, then loads r
after the
second setjmp
return. However, r
may have been stored in the execution
context. Since it’s used across function calls, it would be reasonable to
store this variable in a non-volatile register like ebx
. If so, it will be
restored to its value at the moment of the first call to setjmp
, in
which case the old r
would be read after restoration by longjmp
. If
it’s not stored in a register, but on the stack, then on the second return
the function will read the latest value out of the stack. In practice, if
work
returns through longjmp
, this function may return either 0 or 1,
probably determined by the optimization level.
The solution is to qualify r
with volatile
, which forces the compiler
to store the variable on the stack and never cache it in a register.
volatile int r = 0;
Though since our setjmp
is marked returns_twice
, GCC will never store
r
in a register across setjmp
calls. This potentially hides a bug in
the program that would occur under some other compilers, but GCC will
(usually) warn about it.
MSVC doesn’t understand __attribute__
nor the inline assembly, so it
cannot compile these functions. I could compile my setjmp
with GCC and
the rest of the program with MSVC, which means I need two compilers.
Instead, I’ll move to pure assembly, assemble with GNU as
(TODO: port
to MASM?) so we’ll only need a tiny piece of the GNU toolchain.
.global setjmp
setjmp:
mov (%rsp), %rax
mov %rax, 0(%rcx)
lea 8(%rsp), %rax
mov %rax, 8(%rcx)
mov %rbp, 16(%rcx)
mov %rbx, 24(%rcx)
mov %rdi, 32(%rcx)
mov %rsi, 40(%rcx)
mov %r12, 48(%rcx)
mov %r13, 56(%rcx)
mov %r14, 64(%rcx)
mov %r15, 72(%rcx)
xor %eax, %eax
ret
.globl longjmp
longjmp:
mov 72(%rcx), %r15
mov 64(%rcx), %r14
mov 56(%rcx), %r13
mov 48(%rcx), %r12
mov 40(%rcx), %rsi
mov 32(%rcx), %rdi
mov 24(%rcx), %rbx
mov 16(%rcx), %rbp
mov 8(%rcx), %rsp
mov %edx, %eax
jmp *0(%rcx)
Then some declarations in C:
typedef void *jmp_buf[10];
int setjmp(jmp_buf);
_Noreturn void longjmp(jmp_buf, int);
I’ll need to enable C11 for that _Noreturn
in MSVC. Assemble, compile,
and link:
$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj
That generally works! If I rename to xsetjmp
and xlongjmp
to avoid
conflicting with the CRT definitions, drop them into the u-config test
suite in place of setjmp.h
, then compile with MSVC, it passes all tests
using my alternate implementation in MSVC as well as GCC. Pretty cool!
I’m not sure if I’ll ever use the assembly, but writing this article led me to try the GCC intrinsics, and I’m so impressed I’m still thinking about ways I can use them. My main thought is out-of-memory situations in arena allocators, using a non-local exit to roll back to a savepoint, even if just to return an error. This is nicer than either terminating the program or handling OOM errors on every allocation. Very roughly:
typedef struct {
size_t cap;
size_t off;
void *jmp_buf[5];
} Arena;
// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)
// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
assert(size >= sizeof(Arena));
Arena *a = mem;
a->cap = size;
a->off = sizeof(Arena);
return a;
}
void *alloc(Arena *a, size_t size)
{
size_t avail = a->cap - a->off;
if (avail < size) {
__builtin_longjmp(a->jmp_buf, 1);
}
void *p = (char *)a + a->off;
a->off += size;
return p;
}
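The PUSHSTRUCT macro in the usage sketch below isn’t defined above, so here’s a minimal hypothetical stand-in (a real version would probably zero the object and mind alignment):
// Hypothetical helper: allocate one object of the given type from the arena
#define PUSHSTRUCT(a, t) (t *)alloc(a, sizeof(t))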
Usage would look like:
int compute(void *workmem, size_t memsize)
{
Arena *arena;
if (OOM(arena, workmem, memsize)) {
// jumps here when out of memory
return COMPUTE_OOM;
}
Thing *t = PUSHSTRUCT(arena, Thing);
// ...
return COMPUTE_OK;
}
More granular snapshots can be made further down the stack by allocating subarenas out of the main arena. I have yet to try this out in a practical program.
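Again untested, but roughly I imagine carving a subarena out of the parent, then setting a new savepoint on it — the name subarena is made up:
Arena *subarena(Arena *parent, size_t size)
{
    // a block from the parent becomes a fresh arena with its own jmp_buf
    return place(alloc(parent, size), size);
}
A caller would then establish a more local savepoint with __builtin_setjmp(sub->jmp_buf) before handing the subarena to a subtask.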
In general, when working in C I avoid the standard library, libc, as much as possible. If possible I won’t even link it. For people not used to working and thinking this way, the typical response is confusion. Isn’t that like re-inventing the wheel? For me, libc is a wheel barely worth using — too many deficiencies in both interface and implementation. Fortunately, it’s easy to build a better, simpler wheel when you know the terrain ahead of time. In this article I’ll review the functions and function-like macros of the C standard library and discuss practical issues I’ve faced with them.
Fortunately the flexibility of C-in-practice makes up for the standard library. I already have all the tools at hand to do what I need — not beholden to any runtime.
How does one write portable software while relying little on libc? Implement the bulk of the program as platform-agnostic, libc-free code then write platform-specific code per target — a platform layer — each in its own source file. The platform code is small in comparison: mostly unportable code, perhaps raw system calls, graphics functions, or even assembly. It’s where you get access to all the coolest toys. On some platforms it will still link libc anyway because it’s got useful platform-specific features, or because it’s mandatory.
The discussion below is specifically about standard C. Some platforms provide special workarounds for their standard function shortcomings, but that’s irrelevant. If I need to use a non-standard function then I’m already writing platform-specific code and I might as well take full advantage of that fact, bypassing the original issue entirely by calling directly into the platform.
The rest of this article goes through the standard library listing in the C18 draft mostly in order.
I wrote about the assert
macro last year. While C assertions
are better than the same in any other language I know — a trap without
first unwinding the stack — the typical implementation doesn’t have the
courtesy to trap in the macro itself, creating friction. Or worse, it
doesn’t trap at all and instead exits the process normally with a non-zero
status. It’s not optimized for debuggers.
My non-trivial programs quickly pick up this definition instead, adjusted later as needed:
#define ASSERT(c) if (!(c)) __builtin_trap()
There’s no diagnostic, but I usually don’t want that anyway. The vast majority of the time these are caught in a debugger, and I don’t need or want a diagnostic.
I have no objections to static_assert
, but it’s also not part of the
runtime.
By this I mean all the stuff in math.h
, complex.h
, etc. It’s good that
these are, in practice, pseudo-intrinsics. They’re also one of the more
challenging parts of libc to replace. It prioritizes precision more than I
usually need, but that’s a reasonable default.
Includes isalnum
, isalpha
, isascii
, isblank
, iscntrl
, isdigit
,
isgraph
, islower
, isprint
, ispunct
, isspace
, isupper
,
isxdigit
, tolower
, and toupper
. The interface is misleading, almost
maliciously so, and these functions are misused in every case I’ve seen in
the wild. If you see #include <ctype.h>
in a source file then it’s
probably defective. I’ve been guilty of it myself. When it’s up to me,
these functions are banned without exception.
Their prototypes are all shaped like so:
int isXXXXX(int);
However, the domain of the input is unsigned char
plus EOF
. Negative
arguments, aside from EOF
, are undefined behavior, despite the obvious
use case being strings. So this is incorrect:
char *s = ...;
if (isdigit(s[0])) { // WRONG!
...
}
If char
is signed, as it is on x86, then it’s undefined for arbitrary
strings, s
. Some implementations even crash on such inputs.
If the argument was unsigned char
, then it would at least truncate into
range, usually leading to the desired result. (Though not so if passing
Unicode code points, which is an odd mistake to make.) Except that
it has to accommodate EOF
. Why that? These functions are defined for
use with fgetc
, not strings!
You could patch over it with truncation by masking:
if (isdigit(s[0] & 255)) {
...
}
However, you’re still left with locales. This is a bit of global state
that changes how a number of libc functions behave, including
character classification. While locales have some niche uses, most of the
time the behavior is surprising and undesirable. It’s also bad for
performance. I’ve developed a habit of using LC_ALL=C
before some GNU
programs so that they behave themselves. If you’re parsing a fixed format
that doesn’t adapt to locale — virtually everything — you definitely do
not want locale-based character classification of input.
Since the interface and behavior are both unsuited to most uses, you’re
better off making your own range checks or lookup tables for your use
case. When you name it, probably avoid starting the function with is
since it’s reserved.
_Bool xisdigit(char c)
{
return c>='0' && c<='9';
}
I used char
, but this still works fine for naive UTF-8 parsing.
Without libc you don’t have to use this global, hopefully thread-local, pseudo-variable. Good riddance. Return your errors, and use a struct if necessary.
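As a sketch of what “return your errors” can look like — hypothetical code, not from any particular program — the result and its status travel together in a small struct:
typedef struct {
    long  value;
    _Bool ok;   // false on bad input; no global errno involved
} ParsedLong;

ParsedLong parse_long(unsigned char *buf, int len)
{
    ParsedLong r = {0, len > 0};
    for (int i = 0; i < len; i++) {
        if (buf[i] < '0' || buf[i] > '9') {
            r.ok = 0;
            break;
        }
        r.value = r.value*10 + (buf[i] - '0');  // overflow not guarded
    }
    return r;
}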
As discussed, locales have some niche uses — formatting dates comes to
mind — but what little use they have is trapped behind global state set by
setlocale
, making it sometimes impossible to use correctly.
On Windows I’ve instead used GetLocaleInfoW to get information like, “What is the local name of the current month?”
Sometimes tricky to use correctly, particularly with regard to qualifying
local variables as volatile
. It can compose with region-based allocation
to automatically and instantly free all objects created between set
and jump. These macros are fine, but don’t overdo it.
Variadic functions are occasionally useful, and the va_start
/va_end
macros make them possible. These are, unfortunately, notoriously complex
because calling conventions do not go out of their way to make them any
simpler. They require compiler assistance, and in practice they’re
implemented as part of the compiler rather than libc. They’re okay, but I
can live without it.
While important on unix-like systems, signals as defined in the C standard library are essentially useless. If you’re dealing with signals, or even something like signals, it will be in platform-specific code that goes beyond the C standard library.
I’ve used the _Atomic
qualifier in examples since it helps with
conciseness, but I hardly use it in practice. In part because it has the
inconvenient effect of bleeding into APIs and ABIs. As with volatile
, C
is using the type system to indirectly achieve a goal. Types are not
atomic, loads and stores are atomic. Predating standardization, C
implementations have been expressing these loads and stores using
intrinsics, functions, or macros rather than through types.
The _Atomic
qualifier provides access to the most basic and most strict
atomic operations without libc. That is, it’s implemented purely in the
compiler. However, everything outside that involves libc, and potentially
even requires linking a special atomics library.
Even more, one major implementation (MSVC) still doesn’t support C11 atomics. Anywhere I care about using C atomics, I can already use the richer set of GCC built-ins, which Clang also supports. If I’m writing code intended for Windows, I’ll use the interlocked macros, which work across all the compilers for that platform.
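To illustrate the difference in spirit — a sketch, not from a real program — the built-ins express atomicity as operations on a plain object rather than as a property of its type:
int counter;    // plain int: the type, ABI, and API stay ordinary

void bump(void)
{
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

int snapshot(void)
{
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}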
Standard input and output, stdio, is perhaps the primary driving factor for my own routing around libc. Nearly every program does some kind of input or output, but going through stdio makes things harder.
To read or write a file, one must first open it, e.g. fopen
. However,
all the implementations for one platform in particular does not allow
fopen
to access most of the file system, so using libc immediately
limits the program’s capabilities on that platform.
The standard library distinguishes between “text” and “binary” streams. It makes no difference on unix-like platforms, but it does on others, where input and output are translated. Besides destroying your data, text streams have terrible performance. Opening everything in binary mode is a simple enough workaround, but standard input, output, and error are opened as text streams, and there is no standard function for changing them to binary streams.
When using fread
, some implementations use the entire buffer as a
temporary work space, even if it returns a length less than the
entire buffer. So the following won’t work reliably:
char buf[N] = {0};
fread(buf, N-1, 1, f);
puts(buf);
It may print junk after the expected output because fread
overwrote the
zeroes beyond it.
Streams are buffered, and there’s no reliable access to unbuffered input
and output, such as when an application is already buffering, perhaps as a
natural consequence of how it works. There’s setvbuf
and _IONBF
(“unbuffered”), but in at least one case this really just means
“one byte at a time.” It’s common for my libc-using programs to end up
with double buffering since I can’t reliably turn off stdio buffering.
Typical implementations assume streams will be used by multiple threads, and so every access goes through a mutex. This causes terrible performance for small reads and writes — exactly the case buffering is supposed to most help. Not only is this unusual, such programs are probably broken anyway — oblivious to the still-present race conditions — and so stdio is optimized for the unusual, broken case at the cost of the most needed typical case.
There is no reliable way to interactively input and display Unicode text. The C standard makes vague concessions for dealing with “wide characters” but it’s useless in practice. I’ve tried! The most common need for me is printing a path to standard error such that it displays properly to the user.
Seek offsets are limited to long
. Some real implementations can’t even
open files larger than 2GiB.
Rather than deal with all this, I add a couple of unbuffered I/O functions to the platform layer, then put a small buffered stream implementation in the application which flushes to the platform layer. UTF-8 for text input and output, and if the platform layer detects it’s connected to a terminal or console, it does the appropriate translation. It doesn’t take much to get something more reliable than stdio. The details are the topic for a future article, especially since you might be wondering about formatted output.
As for formatted input, don’t ever bother with scanf
.
Float conversion is generally a difficult problem, especially if you care about round trips. It’s one of the better and most useful parts of libc. Though even with libc it’s still difficult to get the simplest or shortest round-trip representation. Also, this is an area where changing locales can be disastrous!
The question is then: How much does this matter in your application’s context? There’s a good chance you only need to display a rounded, low-precision representation of a float to users — perhaps displaying a player’s position in a debug window, etc. Or you only need to parse medium-precision non-integral inputs following a relatively simple format. These are not so difficult.
Parsing (atoi
, strtol
, strtod
, etc.) requires null-terminated
strings, which is generally inconvenient. These integers likely came from
something not null-terminated like a file, and so I need to first append
a null terminator. I can’t just feed it a token from a memory-mapped file.
Even when using libc, I often write my own integer parser anyway since the
libc parsers lack an appropriate interface.
Update: NRK points out that unsigned integer parsing treats negative
inputs as in range. This is both surprising and rarely useful.
Looking more closely at the specification, I see it is also affected by
locale. Given these revelations, I would ban without exception atoi
,
atol
, strtoul
, and strtoull
, and avoid strtol
and strtoll
.
Formatting integers is easy. Parsing integers within a narrow range (e.g. up to a million) is easy. Parsing integers to the very limits of the numeric type is tricky because every operation must guard against overflow, whether signed or unsigned. Fortunately the first two are common and the last is rarely necessary!
We have rand
, srand
, and RAND_MAX
. As a PRNG enthusiast, I
could never recommend using this under any circumstances. It’s a PRNG with
mediocre output, poor performance, and global state. RAND_MAX
being
unknown ahead of time makes it even more difficult to make effective use
of rand
. You can do better on all dimensions with just a few lines of
code.
To make matters worse, typical implementations expect it to be accessed concurrently from multiple threads, so they wrap it in a mutex. Again, it optimizes for the unusual, broken case — threads fighting each other over non-deterministic racy results from a deterministic PRNG — at the cost of the typical, sensible case. Programs relying on that mutex are already broken.
Includes malloc
, calloc
, realloc
, free
, etc. Okay, but in practice
used too granularly and too much such that many C programs are tangles of
lifetimes. Sometimes I wish there was a standard region allocator
so that independently-written libraries could speak a common, sensible,
caller-controlled allocation interface.
A major standardization failure here has been not moving size computations
into the allocators themselves. calloc
is a start: You say how big and
how many, and it works out the total allocation, checking for overflow.
There should be more of this, even if just to discourage individual
allocations and encourage group allocations.
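To make that concrete, here’s a hedged sketch of such an interface layered over malloc — the name mallocn is invented:
#include <stdlib.h>

void *mallocn(size_t size, size_t count)
{
    if (count && size > (size_t)-1/count) {
        return 0;              // size*count would overflow
    }
    return malloc(size*count);
}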
There are some edge cases around zero sizes, like malloc(0)
, and the
standard leaves the behavior a bit too open ended. However, if your
program is so poorly structured such that it may possibly pass zero to
malloc
then you have bigger problems anyway.
getenv
is straightforward, though I’d prefer to just access the
environment block directly, a la the non-standard third argument to
main
.
exit
is fine, but atexit
is jank.
system
is essentially useless in practice.
qsort
is poor because it lacks a context argument.
Quality varies. Not difficult to implement from scratch if
necessary. I rarely need to sort.
Similar story for bsearch
. Though if I need a binary search over an
array, bsearch
probably isn’t sufficient because I usually want to find
lower and upper bounds of a range.
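For example, a lower bound — the index of the first element not less than the key — is a small function to write yourself. A sketch over a sorted int array:
int lower_bound(int *a, int n, int key)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo)/2;
        if (a[mid] < key) {
            lo = mid + 1;      // key lies in the upper half
        } else {
            hi = mid;
        }
    }
    return lo;                 // n if every element is less than key
}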
mblen
, mbtowc
, wctomb
, mbstowcs
, and wcstombs
are
connected to the locale system and don’t necessarily operate on any
particular encodings like UTF-8, which makes them unreliable. This is the
case for all the other wide character functionality, which is quite a few
functions. Fortunately I only ever need wide characters on one platform in
particular, not in portable code.
More recently are mbrtoc16
, c16rtomb
, mbrtoc32
, and c32rtomb
where
the “wide” side is specified (UTF-16, UTF-32) but not the multi-byte side.
Limited support in implementations and not particularly useful.
Like ctype.h
, string.h
is another case where everything is terrible,
and some functions are virtually always misused.
memcpy
, memmove
, memset
, and memcmp
are fine except for one issue:
it is undefined behavior to pass a null pointer to these functions, even
with a zero size. That’s ridiculous. A null pointer legitimately and
usefully points to a zero-sized object. As mentioned, even malloc(0)
is
permitted to behave this way. These functions would be fine if not for
this one defect.
strcpy
, strncpy
, strcat
, and strncat
have no legitimate
uses and their use indicates confusion. As such, any code calling
them is suspect and should receive extra scrutiny. In fact, I have yet
to see a single correct use of strncpy
in a real program. (Usage hint:
the length argument should refer to the destination, not the source.) When
it’s up to me, these functions are banned without exception. This applies
equally to non-standard versions of these functions like strlcpy
.
strlen
has legitimate uses, but is used too often. It should only appear
at system boundaries when receiving strings of unknown size (e.g. argv
,
getenv
), and should never be applied to a static string. (Hint: you can
use sizeof
on those.)
When I see strchr
, strcmp
or strncmp
I wonder why you don’t know the
lengths of your strings. On the other hand, strcspn
, strpbrk
,
strrchr
, strspn
, and strstr
do not have mem
equivalents, though
the null termination requirement hurts their usefulness.
strcoll
and strxfrm
depend on locale and so are at best niche.
Otherwise unpredictable. Avoid.
memchr
is fine except for the aforementioned null pointer restriction,
though it comes up less often here.
strtok
has hidden global state. Besides that, how long is the returned
token? It knew the length before it returned. You mean I have to call
strlen
to find out? Banned.
strerror
has an obvious, simple, robust solution: return a pointer to a
static string in a lookup table corresponding to the error number. No
global state, thread-safe, re-entrant, and the returned string is good
until the program exits. Some implementations do this, but unfortunately
it’s not true for at least one real world implementation,
which instead writes to a shared, global buffer. Hopefully you were
avoiding errno
anyway.
Introduced in C11, but never gained significant traction. Anywhere you can use C threads you can use pthreads, which are better anyway.
Besides, thread creation probably belongs in the platform layer.
Fairly niche, and I can’t remember using any of these except for time
and clock
for seeding.
I hand-waved away a long list of vestigial wide character functions, but
the above is pretty much all there is to the C standard library. The only
things I miss when avoiding it altogether are the math functions, and
occasionally setjmp
/longjmp
. Everything else I can do better
myself, with little difficulty, starting from the platform layer.
All of the C implementations I had in mind above are very old. They will rarely, if ever, change, just accrue. There isn’t a lot of innovation happening in this space, which is fine since I like stable targets. If you would like to see interesting innovation, check out what Cosmopolitan Libc is up to. It’s what I imagine C could be if it continued evolving along practical dimensions.
In my common SDL2 mistakes listing, the first was about winging it
instead of using the sdl2-config
script. It’s just one of three official
options for portably configuring SDL2, but I had dismissed the others from
consideration. One is the pkg-config facility common to unix-like
systems. However, the SDL maintainers recently announced SDL3, which will
not have a sdl3-config
. The concept has been deprecated in favor of the
existing pkg-config option. I’d like to support this on w64devkit, except
that it lacks pkg-config — not the first time this has come up. So last
weekend I wrote a new pkg-config from scratch with first-class Windows
support: u-config (“micro-config”). It will serve as pkg-config
in w64devkit starting in the next release.
Ultimately pkg-config’s entire job is to find named .pc
text files in
one of several predetermined locations, read fields from them, then write
those fields to standard output. Additional search directories may be
supplied through the $PKG_CONFIG_PATH
environment variable. At a high
level there’s really not much to it.
As a concrete example, here’s a hypothetical example.pc
which might live
in /usr/lib/pkgconfig
.
prefix = /usr
major = 1
minor = 2
patch = 3
version = ${major}.${minor}.${patch}
Name: Example Library
Description: An example of a .pc file
Version: ${version}
Requires: zlib >= 1.2, sdl2
Libs: -L${prefix}/lib -lexample
Libs.private: -lm
Cflags: -I${prefix}/include
Cflags.private: -DEXAMPLE_STATIC
If you invoke pkg-config with --cflags
you get the Cflags
field. With
--libs
, you get the Libs
field. With --static
, you also get the
“private” fields. It will also recursively pull in packages mentioned in
Requires
. The prefix
variable is more than convention and is designed
to be overridden (and u-config does so by default). In theory pkg-config
is supposed to be careful about maintaining argument order and removing
redundant arguments, but in practice… well, pkg-config’s actual behavior
often makes little sense. We’ll get to that.
For SDL2, where you might use:
$ cc app.c $(sdl2-config --cflags --libs)
You could instead use:
$ eval cc app.c $(pkg-config sdl2 --cflags --libs)
Which is still a build command that works uniformly for all supported
platforms, even cross-compiling, given a correctly-configured pkg-config.
For w64devkit, the first command requires placing the directory containing
sdl2-config
on your $PATH
. The second instead requires placing the
directory containing sdl2.pc
in your $PKG_CONFIG_PATH
. To upgrade to
SDL3, replace the sdl2
with sdl3
in the second command.
There are already two major, mostly-compatible pkg-config implementations: the original from freedesktop.org (2001), and pkgconf (2011). Both ostensibly support Windows, but in practice this support is second class, which is a reason why I hadn’t included one in w64devkit. A lot of hassle for what is ultimately a relatively simple task.
As for the original pkg-config, I’ve been unable to produce a functioning
Windows build. It’s obvious from the compiler warnings that there are many
problems, and my builds immediately crash on start. I’d try debugging it,
except that I’ve been cross-compiling this whole time. I cannot build it
on Windows because (1) GNU Autotools and (2) pkg-config wants
pkg-config as a build dependency. That’s right, you have to bootstrap
pkg-config! Remember, this is a tool whose entire job is to copy some
bits of text from a text file to its output. One could use pkg-config as a
case study of accidental complexity, and this is just the beginning.
Update: It was pointed out that I wouldn’t need the full, two-stage bootstrap just for my debugging scenario.
The bootstrap issue is part of pkgconf’s popularity as an alternative. It’s also a tidier code base, does a far better job of sorting and arranging its outputs than the original pkg-config, and its overall behavior makes more sense. However, despite its three independent build systems, pkgconf is still annoying to build, not to mention its memory corruption bugs. We’ll get to that, too.
Considering pkg-config’s relatively simple job, obtaining one shouldn’t be this difficult! I could muddle through until one or the other worked, or I could just write my own. I’m glad I did, since I’m extremely happy with the results.
As of this writing, u-config is about 2,000 lines of C. It doesn’t support
every last pkg-config feature, nor will it ever. The goal is to support
existing pkg-config based builds, not make more of them. So, for
example, features for debugging .pc
files are omitted. Some features are
of dubious usefulness (--errors-to-stdout
) even if they’d be simple to
implement; there are already way too many flags. Other features clearly
don’t work correctly — either not as documented or the results don’t make
sense — so I skipped those as well.
It comes in two flavors: “generic” C and Windows. The former works on any system with a C99 compiler. In fact, it only uses these 9 standard library functions:
exit
fclose
ferror
fflush
fopen
fread
fwrite
getenv
malloc
That is, it needs to open .pc
files, read from them, close those
handles, write to standard output and standard error, check for I/O
errors, and exactly once call malloc
to allocate a block of memory for
an arena allocator. It’s not even important the streams are buffered
because u-config does its own buffering. Not that it would be useful, but
porting to an unhosted 16-bit microcontroller, with fopen
implemented as
a virtual file system, would be trivial. (You know… it could be dropped
into busybox-w32 as a new app with little effort…)
It’s also a unity build — compiled as a single translation unit — so building u-config is as easy as it gets:
$ cc -o pkg-config generic_main.c
Reminder: the original pkg-config cannot even be built without a bootstrapping step.
Since standard C functions are implemented poorly on Windows, but
also so that it can do some smarter self-configuration at run-time based
on the .exe
location, the Windows platform layer calls directly into
Win32 and no C runtime (CRT) is used. Input .pc
files are memory mapped.
Internally u-config is all UTF-8, and the platform layer does the Unicode
translations at the Win32 boundaries for paths, arguments, environment
variables, and console outputs.
Building is slightly more complicated:
$ cc -o pkg-config -nostartfiles win32_main.c
Greenfield projects present a great opportunity for trying new things, and
this is no exception. Contrary to my usual style, I decided I would make
substantial use of typedef
and capitalize all the type names.
typedef int Bool;
typedef unsigned char Byte;
typedef struct {
Byte *s;
Size len;
} Str;
typedef struct {
Str head;
Str tail;
Bool ok;
} Cut;
I like it! It makes the type names stand apart, avoids conflicts with
variable names, and cuts down the visual noise of struct
. I’ve more
recently realized that const
is doing virtually nothing for me — it has
never prevented me from making a mistake — so I left it out (aside from
static lookup tables). That’s even more visual noise gone, and reduced
cognitive load.
In recent years I’ve been convinced that unsigned sizes were a serious
error, probably even one of the great early computing mistakes, and that
sizes and subscripts should be signed. Not only that, pkg-config
has no business dealing with gigantic objects! We’re talking about short
strings and tiny files. If it ends up with a large object, then there’s a
defect somewhere — either in itself or the system — and it should abort.
Therefore sizes and subscripts are a natural int
!
typedef int Size;
typedef unsigned Usize;
#define Size_MAX (Size)((Usize)-1 >> 1)
#define SIZEOF(x) (Size)(sizeof(x))
The Usize
is just for the occasional bit-twiddling, like in Size_MAX
,
and not for regular use. However, u-config objects are no smaller by this
decision because the unused space is nearly always padded on 64-bit
machines. Further, the x86-64 code is about 5% larger with 32-bit sizes
compared to 64-bit sizes — opposite my expectation. Curious.
You might have noticed that Str
type above. Aside from interfaces with
the host that make it mandatory, u-config makes no use of null-terminated
strings anywhere. Every string is a pointer and a size. There’s even a
macro to do this for string literals:
#define S(s) (Str){(Byte *)s, SIZEOF(s)-1}
Then I can use and pass them casually:
if (equals(realname, S("pkg-config"))) {
// ...
}
*insert(arena, &global, S("pc_sysrootdir")) = S("/");
return startswith(arg, S("-I"));
Like strings in other languages, I can also slice out the middle of
strings without copying, handy for parsing and constructing paths. It also
works well with memory-mapped .pc
files since I can extract tokens from
them for use directly in data structures without copying.
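The Cut type shown earlier hints at how that slicing works. A hedged sketch of such a helper — the real u-config code may differ — splits a Str at the first occurrence of a byte without copying:
static Cut cut(Str s, Byte delim)
{
    for (Size i = 0; i < s.len; i++) {
        if (s.s[i] == delim) {
            Cut c = {{s.s, i}, {s.s+i+1, s.len-i-1}, 1};
            return c;
        }
    }
    Cut c = {s, {0, 0}, 0};    // no delimiter: everything is the head
    return c;
}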
That leads into the next item: How does one free or manipulate a data structure where the different parts are arbitrarily allocated across static storage, heap storage, and memory-mapped files? The hash tables in u-config are exactly this, the keys themselves allocated in every possible fashion. Don’t you have to keep track of how each pointed-at part is allocated? No! The individual objects do not have individual lifetimes due to the arena allocator. The gist of it:
typedef struct {
Str mem;
Size off;
} Arena;
static void *alloc(Arena *a, Size size)
{
ASSERT(size >= 0);
Size avail = a->mem.len - a->off;
if (avail < size) {
oom();
}
Byte *p = a->mem.s + a->off;
a->off += size;
return p;
}
Since it’s passed often, arena parameters are conventionally named a
throughout the program and are always the first argument when needed. If
it runs out of memory, it bails. On 32-bit and 64-bit hosts, the default
arena is 256MiB. If pkg-config needs more than that, then something’s
seriously wrong and it should give up.
While u-config could quite reasonably never “free” (read: reuse) memory, it does do so in practice. In some cases it computes a temporary result, then resets the arena to an earlier state to discard its allocations. A simplified, hypothetical:
for (int i = 0; ...) {
Arena tmparena = *a;
// Use only tmparena in the loop
Env env = {0};
Str value = fmtint(&tmparena, i);
*insert(&tmparena, &env, S("i")) = value;
// ...
// allocations freed when tmparena goes out of scope
}
I had mentioned that u-config does its own output buffering. It’s an
object I call an Out
, modeled loosely after a Plan 9 bio
or a Go
bufio.Writer
. It has a destination “file descriptor”, a memory buffer,
and an integer to track the fill level of the buffer.
typedef struct {
Str buf;
Size fill;
Arena *a;
int fd;
} Out;
Output bytes are copied into the buffer. When it fills, the buffer is automatically emptied into the file descriptor. The caller can manually flush the buffer at any time, and it’s up to the caller to do so before exiting the program.
But wait, what’s the Arena
pointer doing in there? That’s a little extra
feature of my own invention! I can open a stream on an arena, writes into
the stream go into a growing buffer, and “closing” the stream gives me a
string allocated in the arena with the written content. The arena is held
in order to manage all this. It’s also locked out from other allocations
until the stream is closed. The entire implementation is only about a
dozen lines of code.
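To show the shape of it — a sketch in the spirit of u-config rather than its actual source — the stream borrows the arena’s free space as its buffer, and finalizing commits whatever was written as an allocation:
static Out newmembuf(Arena *a)
{
    Out out = {0};
    out.buf = (Str){a->mem.s + a->off, a->mem.len - a->off};
    out.a = a;
    out.fd = -1;               // memory-only: no file descriptor
    return out;
}

static Str finalize(Out *out)
{
    Str s = {out->buf.s, out->fill};
    out->a->off += out->fill;  // commit the written bytes to the arena
    return s;
}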
What use is this? It’s nice when I might want to output either to standard output or to a memory buffer for further use. It’s even more useful when I need to build a string but don’t know its final length ahead of time.
The variable expansion function is both cases. Given a string like
${version}
I want to recursively interpolate until there’s nothing left
to interpolate. The output could go to standard output to print it out, or
into a string for further use. For example, here I have my global variable
environment global
, a package pkg
, its environment (pkg->env
), and I
want to expand its Version:
field, pkg->version
.
Out mem = newmembuf(a);
expand(&mem, global, pkg, pkg->version);
Str version = finalize(&mem);
Or I just print it to standard output, and the value is free to expand beyond what would fit in memory since it flushes as it goes:
Out out = newoutput(1); // 1 == standard output
expand(&out, global, pkg, pkg->version);
flush(&out);
I’m particularly happy about this, and I’m sure I’ll use such “arena streams” again in the future.
While pkgconf tries, and succeeds at, being a faithful (if smarter) clone,
in certain ways u-config more closely follows pkg-config’s behavior. For
example, pkg-config behaves as though it concatenates all its positional
arguments with commas in between, then re-tokenizes them like a Requires
field. For example, these commands are all equivalent:
$ pkg-config 'sdl2 > 2' --libs
$ pkg-config 'sdl2 >' --libs 2
$ pkg-config sdl2 --libs '> 2'
$ pkg-config --libs 'sdl2 > 2'
pkgconf does not copy this behavior, but u-config does. Similarly, the
original .pc
format has undocumented, arcane quoting syntax that sort of
works like shell quotes. I tried to match this closely in u-config, while
pkgconf tries to be more logical. For example, pkg-config allows this:
quote = "
Cflags: "-I${prefix}/include${quote}
Where the ${quote}
will actually close the quote. I retained this but
pkgconf did not.
Does anyone use quoting? On my own system I have one package using quotes,
but it’s probably a mistake since they’re used improperly. In theory,
everyone should be quoting almost everything. For example, this is a very
common Cflags
:
Cflags: -I${prefix}/include
If a crazy person — or well-known multinational corporation — comes along
and has a space in their system’s installation “prefix”, this .pc
will
not work. The output would be:
-I/Program Files/include
Actually, that’s a lie. I suspect that’s the intended output, and it’s the output of pkgconf and u-config, but pkg-config instead outputs this head-scratcher:
Files/include -I/Program
Seeing this sort of thing repeatedly is why I have little concern with matching every last pkg-config nuance. Regardless, this parses as two arguments, but if written with quotes:
Cflags: "-I${prefix}/include"
Then pkg-config will escape spaces in the expansion:
-I/Program\ Files/include
This will actually work correctly in the eval
context where pkg-config
is intended for use (read: not command substitution). I’ve made
u-config automatically quote the prefix if it contains spaces, so it will
work correctly despite the lack of .pc
file quotes when the library is
under a path containing a space.
Here’s a fun input. pkg-config has its own billion laughs:
v9=lol
v8=${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}${v9}
v7=${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}${v8}
v6=${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}${v7}
v5=${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}${v6}
v4=${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}${v5}
v3=${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}${v4}
v2=${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}${v3}
v1=${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}${v2}
v0=${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}${v1}
Name: One Billion Laughs
Version: ${v0}
Description: Don't install this!
That expands to 1,000,000,001 “lol” (an extra for good luck!) and in
theory --modversion
will print it out:
$ pkg-config --modversion lol.pc
Some different outcomes:
pkg-config will expand it in memory and see it through to the bitter end, using
however many GiBs are necessary. Add a few more lines and your computer
will thrash. By the way, bash-completion will ask pkg-config to load .pc
files named in the command when completing further arguments. Ask me how
I know.
u-config could fully output it with only a few kB of memory if directed
to a “file descriptor” output, but alas, the Version
field must be
processed in memory for comparison with another version string, so it
doesn’t attempt to do so. It runs out of arena memory and gives up.
That’s a feature, especially if you’re using bash-completion.
pkgconf I had built with Address Sanitizer in case it found anything, and boy did it. This input overflows a stack variable and then ASan kills it. I’m unsure what’s supposed to happen next, but I suspect silent truncation.
But that’s a crazy edge case right? Well, it also overflows on empty
.pc
files, or for all sorts of inputs. I probed both pkg-config and
pkgconf with weird inputs to learn how it’s supposed to work, and it was
rather irritating having pkgconf crash for so many of them. Someone on the
project ought to do testing with ASan sometime. Important note: This is
not a security vulnerability!
Further, as you might notice when you build it, pkgconf first tries to
link the system strlcpy
, if it exists. Failing that, it uses its own
version. That’s one of the annoying details about building it. However,
using strlcpy
never, ever makes sense! Now that I think about
it, there’s probably a connection with those buffer overflows.
In general, neither pkg-config nor pkgconf fare well when fuzz tested with sanitizers.
I had a lot of fun writing u-config, and I’m excited about this new
addition to w64devkit. Despite my pkg-config grumbling, it is neat that
it’s established this de facto standard and encouraged a distributed
database of .pc
files to exist, at least as documentation if not for a
mechanical process like this.
For u-config, there’s still more testing to do, and I’m still open to
picking up more behaviors from pkg-config or pkgconf where they make
sense. Though given its primary use case — building software on Windows
without a package manager — it will probably never be stressed hard enough
to matter. Further, w64devkit does not include any .pc
files of its own,
and since I do not intend to add libraries — that is, beyond the standard
language libraries and Windows SDK — that probably won’t change.
If you’d like to try it early, build it with w64devkit, toss in on your
PATH
, point PKG_CONFIG_PATH
at a library with .pc
files, and try it
out. It already works flawlessly with at least SDL2.
SDL has grown on me over the past year. I didn’t understand its value until viewing it in the right lens: as a complete platform and runtime replacing the host’s runtime, possibly including libc. Ideally an SDL application links exclusively against SDL and otherwise not directly against host libraries, though in practice it’s somewhat porous. With care — particularly in avoiding mistakes covered in this article — that ideal is quite achievable for C applications that fit within SDL’s feature set.
SDL applications are always interesting one way or another, so I like to dig in when I come across them. The items in this article are mistakes I’ve either made myself or observed across many such passion projects in the wild.
sdl2-config
This shell script comes with SDL2 and smooths over differences between
platforms, even when cross compiling. It informs your compiler where to
find and how to link SDL2. The script even works on Windows if you have a
unix shell, such as via w64devkit. Use it as a command substitution at
the end of the build command, particularly when using --libs
. A one-shot
or unity build (my preference) looks like so:
$ cc app.c $(sdl2-config --cflags --libs)
Or under separate compilation:
$ cc -c app.c $(sdl2-config --cflags)
$ cc app.o $(sdl2-config --libs)
Alternatively, static link by replacing --libs
with --static-libs
,
though this is discouraged by the SDL project. When dynamically linked,
users can, and do, trivially substitute a different SDL2 binary, such as
one patched for their system. In my experience, static linking works
reliably on Windows but poorly on Linux.
Alternatively, use the general purpose pkg-config
. Don’t forget eval
!
$ eval cc app.c $(pkg-config sdl2 --cflags --libs)
I wrote a pkg-config for Windows specifically for this case.
Caveats:
Some circumstances require special treatment, and sdl2-config
may be
too blunt a tool. That’s fine, but generally prefer sdl2-config
as the
default approach.
sdl2-config
does not support extensions such as SDL2_image
, so you
will need to use pkg-config
. Personally I don’t think they’re worth
the trouble when there’s stb, or QOI instead of PNG.
There’s an alternative build option using CMake, without any use of
sdl2-config
, but I won’t discuss it here.
SDL2/SDL.h
A lot of examples, including tutorials linked from the official SDL
website, have SDL2/
in their include paths. That’s because they’re
making mistake 1, not using sdl2-config
, and are instead relying on
Linux distributions having installed SDL2 in a place coincidentally
accessible through that include path.
This is annoying when SDL2 is not installed there, or if I don’t want it using the system’s SDL2. Worse, it can result in subtly broken builds as it mixes and matches different SDL installations. The correct SDL2 include is the following:
#include "SDL.h"
Note the quotes, which helps prevent picking up an arbitrary system header by accident. When carefully and narrowly targeting SDL-the-platform, this will be the only “system” include anywhere in your application.
main
A conventional SDL application has a main
function defined in its
source, but despite the name, this is distinct from C main
. To smooth
over platform differences, SDL may rename the application’s main
to SDL_main
and substitute its own C main
. Because of this, main
must have the conventional argc
/argv
prototype and must return a
value. (As a special case, C permits main
to implicitly return 0
, so
it’s an easy mistake to make.)
With this in mind, the bare minimum SDL2 application:
#include "SDL.h"
int main(int argc, char **argv)
{
return 0;
}
Caveat: Like with sdl2-config
, some special circumstances require
control over the application entry point — see SDL_MAIN_HANDLED
and
SDL_SetMainReady
— but that should be reserved until there’s a need.
One such special case is avoiding linking a CRT on Windows. In principle it’s this simple:
#include "SDL.h"
int WinMainCRTStartup(void)
{
SDL_SetMainReady();
// ...
return 0;
}
Then it’s the usual compiler and linker flags:
$ cc -nostdlib -o app.exe app.c $(sdl2-config --cflags --libs)
This will create a tiny .exe
that doesn’t link any system DLL, just
SDL2.dll
. Quite platform agnostic indeed!
$ objdump -p app.exe | grep -Fi .dll
DLL Name: SDL2.dll
Alas, as of this writing, this does not work reliably. SDL2’s accelerated
renderers on Windows do not clean up properly in SDL_QuitSubSystem
nor
SDL_Quit
, so the process cannot exit without calling ExitProcess in
kernel32.dll
(or similar). This is still an open experiment.
The SDL wiki is not authoritative documentation, merely a convenient web-linkable — and downloadable (see “offline html”) — information source. However, anyone who’s spent time on it can tell you it’s incomplete. The authoritative API documentation is the SDL headers, which fortunately are already on hand for building SDL applications. The SDL maintainers themselves use the headers, not the wiki.
If, like me, you’re using ctags, this is actually good news! With a
bit of configuration, you can jump to any bit of SDL documentation at any
time in your editor, treating the SDL headers like a hyperlinked wiki
built into your editor. Just like building, sdl2-config
can tell ctags
where to find those headers:
$ ctags -a -R --kinds-c=dept $(sdl2-config --prefix)/include/SDL2
I’m using -a
(--append
) to append to the tags file I’ve already
generated for my own program, -R
(--recurse
) to automatically find all
the headers, and --kinds-c=dept
to capture exactly the kinds of symbols I
care about — #define
, enum
, prototypes, typedef
— no more no less.
In Vim I CTRL-]
over any SDL symbol to jump to its documentation, and
then I can use it again within its documentation comment to jump further
still to any symbols it mentions, then finally use the jump or tag stack
to return. As long as I have t
in 'complete'
('cpt'
), which
is the default, I can also “tab”-complete any SDL symbol using the tags
table. There are a few rough edges here and there, but overall it’s a
solid editing paradigm.
By the way, with sdl2-config
in your $PATH
, all the above works out of
the box in w64devkit! That’s where I’ve mostly been working with SDL.
A common bit of code in real SDL programs and virtually every tutorial:
if (SDL_Init(...)) {
fprintf(stderr, "SDL_Init(): %s\n", SDL_GetError());
return 1;
}
This is not ideal:
fprintf
is not part of the SDL platform. This is going behind SDL’s
back, reaching around the abstraction to a different platform. Strictly
speaking, this API may not even be available to an SDL application.
SDL applications are graphical, so stderr
is likely disconnected from
anything useful. Few would ever see this message.
Fortunately SDL provides two alternatives:
SDL_Log
: like C printf
, but SDL will strive to connect it to
somewhere useful. If the application was launched from a terminal or
console, SDL will find it and hook it up to the logger. On Windows, if
there’s a debugger attached, SDL will use OutputDebugString to
send logs to the debugger.
SDL_ShowSimpleMessageBox
: using any means possible, attempt to display
a message to the user. Like SDL_Log
, it’s safe to use before/without
initializing SDL subsystems.
If you’re paranoid, you could even use both:
if (SDL_Init(...)) {
SDL_ShowSimpleMessageBox(
SDL_MESSAGEBOX_ERROR, "SDL_Init()", SDL_GetError(), 0
);
SDL_Log("SDL_Init(): %s", SDL_GetError());
return 1;
}
Though note that SDL_ShowSimpleMessageBox
can fail, which will set a
new, different error message for SDL_Log
!
There’s a similar story again with fopen
and loading assets. SDL has an
I/O API, SDL_RWops
. It’s probably better than the host’s C equivalent,
particularly with regards to paths. If you’re not already embedding your
assets, use the SDL API instead.
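For illustration, here is a minimal sketch of reading a whole asset through SDL_RWops. The helper name and shape are mine; only the SDL calls themselves come from the API.
// Hypothetical helper: read an entire file into an SDL-allocated buffer.
static void *load_asset(const char *path, size_t *len)
{
    SDL_RWops *rw = SDL_RWFromFile(path, "rb");
    if (!rw) {
        return 0;  // SDL_GetError() has the details
    }
    Sint64 size = SDL_RWsize(rw);
    void *buf = size < 0 ? 0 : SDL_malloc((size_t)size);
    if (buf) {
        SDL_RWread(rw, buf, 1, (size_t)size);
        *len = (size_t)size;
    }
    SDL_RWclose(rw);
    return buf;
}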
SDL_RENDERER_ACCELERATED
This flag — and its surrounding bit set, SDL_RendererFlags
— are a
subtle design flaw in the SDL2 API. Its existence is misleading, leading
to widespread misuse. It does not help that the documentation, both header
and wiki, is incomplete and unclear. The SDL_CreateRenderer
function
accepts a bit set as its third argument, and it serves two simultaneous
purposes:
Indicates mandatory properties of the renderer. Examples: “must use accelerated rendering,” “must use software rendering,” “must support vertical synchronization (vsync).” Drivers without the chosen properties are skipped.
If SDL_RENDERER_PRESENTVSYNC
is set, also enables vsync in the created
renderer.
The common mistake is thinking that this bit indicates preference: “prefer an accelerated renderer if possible”. But it really means “accelerated renderer or bust.”
Given a zero for renderer flags, SDL will first attempt to create an accelerated renderer. Failing that, it will then attempt to create a software renderer. A software renderer fallback is exactly the behavior you want! After all, this fallback is one of the primary features of the SDL renderer API. This is so straightforward there are no caveats.
For a game, you probably ought to enable vsync in your renderer. The hint:
You’re using SDL_PollEvent
in your main event loop. Otherwise you will
waste lots of resources rendering thousands of frames per second. If my
laptop fan spins up running your SDL application, it’s probably because
you didn’t do this. The following should be the most conventional SDL
renderer configuration:
r = SDL_CreateRenderer(window, -1, SDL_RENDERER_PRESENTVSYNC);
The software renderer supports vsync, so it will not be excluded from the driver search when vsync is requested.
That’s only for SDL renderers. If you’re using OpenGL, set a non-zero
SDL_GL_SetSwapInterval
so that SDL_GL_SwapWindow
synchronizes. For the
other rendering APIs, consult their documentation. (I can only speak to
SDL and OpenGL from experience.)
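For the OpenGL case that might look like the following fragment, a sketch assuming the window and GL context already exist:
SDL_GL_SetSwapInterval(1);   // non-zero: SDL_GL_SwapWindow waits for vsync
// ... render the frame ...
SDL_GL_SwapWindow(window);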
Caveat: Beware accidentally relying on vsync for timing in your game. You don’t want your game’s physics to depend on the host’s display speed. Even the pros make this mistake from time to time.
However, if you’re not making a game – perhaps instead an IMGUI
application without active animations — there’s a good chance you don’t
need or want vsync. The hint: You’re using SDL_WaitEvent
in your main
event loop.
In summary, graphical SDL applications fall into one of two cases:
SDL_PollEvent with vsync, or SDL_WaitEvent without vsync.
assert.h instead of SDL_assert
Alright, this one isn’t so common, but I’d like to highlight it. The
SDL_assert
macro is fantastic, easily beating assert.h
which
doesn’t even break in the right place. It uses SDL to present a
user interface to the assertion, with support for retrying and ignoring.
It also works great under debuggers, breaking exactly as it should. I have
nothing but praise for it, so don’t pass up the chance to use it when you
can.
While I’m at it: during development and testing, always always always run your application under a debugger. Don’t close the debugger, just launch through it again after rebuilding. Also, enable UBSan and ASan when available for the extra assertions.
For months I had wondered why SDL provides no memory allocation API. I’m
fine if it doesn’t have a general purpose allocator since I just want to
grab a chunk of host memory for an arena. However, SDL does
have allocation functions — SDL_malloc
, etc. I didn’t know about them
until I stopped making mistake 4.
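As a sketch of what I mean (the Arena type is my own illustration, not an SDL facility), grabbing one big chunk up front looks like:
typedef struct { char *beg, *end; } Arena;

// Reserve a single arena-sized allocation through SDL's allocator.
static Arena arena_new(size_t cap)
{
    Arena a = {0};
    a.beg = SDL_malloc(cap);
    a.end = a.beg ? a.beg + cap : 0;
    return a;
}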
It was the same story again with math functions: I’d like not to stray
from SDL as a platform, but what if I need transcendental functions? I
could whip up crude implementations myself, but I’d prefer not. SDL has
those too: SDL_sin
, etc. Caveat: The math.h
functions are built-ins,
and compilers use that information to better optimize programs, e.g. cool
stuff like -mrecip
, or SIMD vectorization. That cannot be done with
SDL’s equivalents.
I’m surprised SDL has no random number generator considering how important
it is to games. Since I prefer to handle this myself, I don’t mind
that so much, but it does leave a lot of toy programs out there calling C
rand
. I would like it if SDL provided a single, good seed early during
startup. There isn’t even a wall clock function for the classic
srand(time(0))
seeding event! My solution has been to mix event
timestamps into the random state:
static Uint32 rand32(Uint64 *);
Uint64 rng = 0;
for (SDL_Event e; SDL_PollEvent(&e);) {
rng ^= e.common.timestamp;
rand32(&rng); // stir
switch (e.type) { /* ... */ }
}
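rand32 is only declared above. A possible definition, my own sketch of a truncated 64-bit LCG rather than anything canonical:
// Possible rand32: step a 64-bit LCG, return the better-mixed high bits.
static Uint32 rand32(Uint64 *s)
{
    *s = *s*0x3243f6a8885a308d + 1;
    return (Uint32)(*s >> 32);
}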
As I learn more in the future, I may come back and add to this list. At the very least I expect to use SDL increasingly in my own projects.
]]>The Quite OK Image (QOI) format was announced late last year and finalized into a specification a month later. I was initially dismissive, but a revisit shifted my opinion to impressed. The format hits a sweet spot in the trade-off space between complexity, speed, and compression ratio. Also considering its alpha channel support, QOI has become my default choice for embedded image assets. It’s not perfect, but at the very least it’s a solid foundation.
Since I’m now working with QOI images, I need a good QOI viewer, and so I added support to my ill-named pbmview tool, which I wrote to serve the same purpose for Netpbm. I will continue to use Netpbm as an output format, especially for raw video output, but no longer will I use it for an embedded asset (nor re-invent yet another RLE over Netpbm).
I was dismissive because the website claimed, and still claims today, QOI images are “a similar size” to PNG. However, for the typical images where I would use PNG, QOI is around 3x larger, and some outliers are far worse. The 745 PNGs on my blog — a perfect test corpus for my own needs — convert to QOIs 2.8x larger on average. The official QOI benchmark has much better results, 1.3x larger, but that’s because it includes a lot of photography where PNG and QOI both do poorly, making QOI seem more comparable.
However, as I said, QOI’s strength is its trade-off sweet spot. The specification is one page, and an experienced developer can write a complete implementation from scratch in a single sitting. My own implementation is about 100 lines of libc-free C for each of the encoder and decoder. With error checking removed, my decoder is ~600 bytes of x86 object code — a great story for embedding alongside assets. It’s more complex than Netpbm or farbfeld, but it’s far simpler than BMP. I’ve already begun experimenting with converting assets to QOI, and the results have so far exceeded my expectations.
To my surprise, the encoder was easier to write than the decoder. The format is so straightforward that two different encoders will produce identical files. There’s little room for specialized optimization, and no meaningful “compression level” knob.
There are a lot of dimensions on which QOI could be improved, but most cases involve trade-offs, e.g. more complexity for better compression. The areas where QOI could have been strictly better, the dimensions on which it is not on the Pareto frontier, are more meaningful criticisms — missed opportunities. My criticisms of this kind:
Big endian fields are an odd choice for a 2020s file format. Little endian dominates the industry, and it would have made for a slightly smaller decoder footprint on typical machines today if QOI used little endian.
The header has two flags and spends an entire byte on each. It should have instead had a flag byte, with two bits assigned to these flags. One flag indicates if the alpha channel is important, and the other selects between two color spaces (sRGB, linear). Both flags are only advisory.
The 4-channel encoded pixel format is ABGR (or RGBA), placing the alpha channel next to the blue channel. This is somewhat unconventional. A decoder is likely to use a single load into 32-bit integer, and ideally it’s already in the desired format or close to it. A few times already I’ve had to shuffle the RGB bytes within the 32-bit sample to be compatible with some other format. QOI channel ordering is arbitrary, and I would have chosen ARGB (when viewed as little endian).
The QOI hash function operates on channels individually, with individual overflow, making it slower and larger than necessary. The hash function should have been over a packed 32-bit sample. I would have used a multiplication by a carefully-chosen 32-bit integer, then a right shift using the highest 6 bits of the result for the index.
More subjective criticisms that might count as having trade-offs:
Given a “flag byte” (mentioned above) it would have been free to assign another flag bit indicating pre-multiplied alpha, also still advisory. You want to use pre-multiplied alpha for your assets, and the option to store them this way would help.
There’s an 8-byte end-of-stream marker — a bit excessive — deliberately an invalid encoding so that reads past the end of the image will result in a decoding error. I probably would have chosen a dead simple 32-bit checksum of packed 32-bit image samples, even if literally a sum.
Of course, you’re not obligated to follow QOI exactly to spec for your own assets, so you could always use a modified QOI with one or more of these tweaks. That’s what I meant about it being a solid foundation: You don’t have to start from scratch with some custom RLE. Since the format is so simple, you can easily build your own tools — as I’ve already begun doing myself — so you don’t need to rely on tools supporting your QOI fork.
I’m really happy with my QOI implementation, particularly since it’s another example of a minimalist C API: no allocating, no input or output, and no standard library use. As usual, the expectation is that it’s in the same translation unit where it’s used, so it’s likely inlined into callers.
The encoder is streaming — it accepts and returns only a little bit of input and output at a time. It has three functions and one struct with no “public” fields:
struct qoiencoder qoiencoder(void *buf, int w, int h, const char *flags);
int qoiencode(struct qoiencoder *, void *buf, unsigned color);
int qoifinish(struct qoiencoder *, void *buf);
The first function initializes an encoder and writes a fixed-length header
into the QOI buffer. The flags
field is a mode string, like fopen
. I
would normally use bit flags, but this is a little experiment. The
second function encodes a single pixel into the QOI buffer, returning the
number of bytes written (possibly zero). The last flushes any encoding
state and writes the end-of-stream marker. There are no errors. My typical
use so far looks like:
char buf[16];
struct qoiencoder q = qoiencoder(buf, width, height, "a");
fwrite(buf, QOIHDRLEN, 1, file);
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
// ... compute 32-bit ABGR sample at (x, y) ...
fwrite(buf, qoiencode(&q, buf, abgr), 1, file);
}
}
fwrite(buf, qoifinish(&q, buf), 1, file);
fflush(file);
return ferror(file);
This appends encoder outputs to a buffered stream, but it could just as well accumulate directly into a larger buffer, advancing the write pointer a little after each call.
The decoder is two functions, but its struct has some “public” fields.
struct qoidecoder {
int width, height;
_Bool alpha, srgb, error;
// ...
};
struct qoidecoder qoidecoder(const void *buf, int len);
static unsigned qoidecode(struct qoidecoder *);
The input is not streamed and the entire buffer must be loaded into memory at once — not too bad since it’s compressed, and perhaps even already loaded as part of the executable image — but the output is streamed, delivering one packed 32-bit ABGR sample per call. The decoder makes no assumptions about the output format, and the caller unpacks samples and stores them in whatever format is appropriate (shader texture, etc.).
To make it easier to use, my decoder range checks to guarantee that width and height can be multiplied without overflow. Unlike encoding, there may be errors due to invalid input, including that failed range check. The decoder error flag is “sticky” and the decoder returns zero samples when in an error state, so callers can wait to check for errors until the end. (Though if you’re only decoding embedded assets, then there are no practical errors, and checks can be removed/ignored.)
Example usage, copied almost verbatim from a real program:
int loadimage(Image *image, const uint8_t *qoi, int len)
{
struct qoidecoder q = qoidecoder(qoi, len);
if (/* image dimensions too large */) {
return 0;
}
image->width = q.width;
image->height = q.height;
int count = q.width * q.height;
for (int i = 0; i < count; i++) {
unsigned abgr = qoidecode(&q);
image->data[4*i+0] = abgr >> 16;
image->data[4*i+1] = abgr >> 8;
image->data[4*i+2] = abgr >> 0;
image->data[4*i+3] = abgr >> 24;
}
return !q.error;
}
Note the aforementioned awkward RGB shuffle.
It’s safe to say that I’m excited about QOI, and that it now has a permanent slot on my developer toolbelt.
]]>The source: dandelions.c
The game is played on a 5-by-5 grid where one player plays the dandelions, the other plays the wind. Players alternate, dandelions placing flowers and wind blowing in one of the eight directions, spreading seeds from all flowers along the direction of the wind. Each side gets seven moves, and the wind cannot blow in the same direction twice. The dandelions’ goal is to fill the grid with seeds, and the wind’s goal is to prevent this.
Try playing a few rounds with a friend, and you will probably find that dandelions is difficult, at least in your first games, as though it cannot be won. However, my engine proves the opposite: The dandelions always win with perfect play. In fact, it’s so lopsided that the dandelions’ first move is irrelevant. Every first move is winnable. If the dandelions blunder, typically wind has one narrow chance to seize control, after which wind probably wins with any (or almost any) move.
For reasons I’ll discuss later, I only solved the 5-by-5 game, and the situation may be different for the 6-by-6 variant. Also, unlike British Square, my engine does not exhaustively explore the entire game tree because it’s far too large. Instead it does a minimax search to the bottom of the tree and stops when it finds a branch where all leaves are wins for the current player. Because of this, it cannot maximize the outcome — winning as early as possible as dandelions or maximizing the number of empty grid spaces as wind. I also can’t quantify the exact size of the tree.
Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the >>
prompt.
I used bitboards of course: a 25-bit bitboard for flowers, a 25-bit bitboard for seeds, and an 8-bit set to track which directions the wind has blown. It’s especially well-suited for this game since seeds can be spread in parallel using bitwise operations. Shift the flower bitboard in the direction of the wind four times, ORing it into the seeds bitboard on each shift:
int wind;
uint32_t seeds, flowers;
flowers >>= wind; seeds |= flowers;
flowers >>= wind; seeds |= flowers;
flowers >>= wind; seeds |= flowers;
flowers >>= wind; seeds |= flowers;
Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with >>>
below). Consider, to rotate an N-bit
integer left by R, one can right-rotate it by N-R
— ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative wind
that goes in the other direction:
flowers >>> (wind & 31);
With such a “programmable shift” I can implement the bulk of the game rules using a couple of tables and no branches:
// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &= mask[dir]; f >>>= rot[dir] & 31; s |= f;
f &= mask[dir]; f >>>= rot[dir] & 31; s |= f;
f &= mask[dir]; f >>>= rot[dir] & 31; s |= f;
f &= mask[dir]; f >>>= rot[dir] & 31; s |= f;
The masks clear out the column/row about to be shifted “out” so that it doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5 times.
The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58 bits, which fits in a 64-bit integer with bits to spare. How incredibly convenient! So I represent the game state using a 64-bit integer, using a packing like I did with British Square. The bottom 25 bits are the seeds, the next 25 bits are the flowers, and the next 8 is the wind set.
000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
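To make the layout concrete, unpacking the three components follows directly from the diagram:
uint32_t seeds   = (uint32_t)(g       & 0x1ffffff);
uint32_t flowers = (uint32_t)(g >> 25 & 0x1ffffff);
uint32_t wind    = (uint32_t)(g >> 50 & 0xff);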
Even more convenient, I could reuse my bitboard canonicalization code from British Square, also a 5-by-5 grid packed in the same way, saving me the trouble of working out all the bit sieves. I only had to figure out how to transpose and flip the wind bitset. Turns out that’s pretty easy, too. Here’s how I represent the 8 wind directions:
567
4 0
321
Flipping this vertically I get:
321
4 0
567
Unroll these to show how old maps onto new:
old: 01234567
new: 07654321
The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See revrot
.)
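revrot itself isn't listed here, but a sketch of the idea, reversing the 8 bits through a nibble table and then rotating, might look like this (the exact rotation count per transformation is the part I'm glossing over):
// Reverse the bits of an 8-bit wind set using a 16-entry nibble table.
static uint8_t rev8(uint8_t b)
{
    static const uint8_t t[16] = {
        0x0, 0x8, 0x4, 0xc, 0x2, 0xa, 0x6, 0xe,
        0x1, 0x9, 0x5, 0xd, 0x3, 0xb, 0x7, 0xf,
    };
    return (uint8_t)(t[b&15]<<4 | t[b>>4]);
}

// Reverse, then rotate left by r (0 <= r < 8).
static uint8_t revrot(uint8_t w, int r)
{
    uint8_t rev = rev8(w);
    return (uint8_t)((rev<<r | rev>>(8-r)) & 255);
}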
To determine how many moves have been made, popcount the flower bitboard and wind bitset.
int moves = POPCOUNT64(g & 0x3fffffffe000000);
To test if dandelions have won:
int win = (g&0x1ffffff) == 0x1ffffff;
Since the plan is to store all the game states in a big hash table — an MSI double hash in this case — I’d like to reserve the zero value as a “null” board state. This lets me zero-initialize the hash table. To do this, I invert the wind bitset such that a 1 indicates the direction is still available. So the initial game state looks like this (in the real program this is accounted for in the previously-discussed turn popcount):
#define GAME_INIT ((uint64_t)255 << 50)
The remaining 6 bits can be used to cache information about the rest of tree under this game state, namely who wins from this position, and this serves as the “value” in the hash table. Turns out the bitboards are already noisy enough that a single xorshift makes for a great hash function. The hash table, including hash function, is under a dozen lines of code.
// Find the hash table slot for the given game state.
uint64_t *lookup(uint64_t *ht, uint64_t g)
{
uint64_t hash = g ^ g>>32;
size_t mask = (1L << HASHTAB_EXP) - 1;
size_t step = hash>>(64 - HASHTAB_EXP) | 1;
for (size_t i = hash;;) {
i = (i + step)&mask;
if (!ht[i] || (ht[i] & 0x3ffffffffffffff) == g) {
return ht + i;
}
}
}
To explore a 6-by-6 grid I’d need to change my representation, which is part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit integer, so I’d need to double my storage requirements, which are already strained.
Due to the way seeds spread, game states resulting from different moves rarely converge back to a common state later in the tree, so the hash table isn’t doing much deduplication. Exhaustively exploring the entire game tree, even cutting it down to an 8th using canonicalization, requires substantial computing resources, more than I personally have available for this project. So I had to stop at the slightly weaker form, finding a winning branch rather than maximizing a “score.”
I configure the program to allocate 2GiB for the hash table, but if you run just a few dozen games off the same table (same program instance), each exploring different parts of the game tree, you’ll exhaust this table. A 6-by-6 doubles the memory requirements just to represent the game, but it also slows the search and substantially increases the width of the tree, which grows 44% faster. I’m sure it can be done, but it’s just beyond the resources available to me.
As a side effect, I wrote a small routine to randomly play out games in search for “mate-in-two”-style puzzles. The dandelions have two flowers to place and can force a win with two specific placements — and only those two placements — regardless of how the wind blows. Here are two of the better ones, each involving a small trick that I won’t give away here (note: arrowheads indicate directions wind can still blow):
There are a variety of potential single-player puzzles of this form.
There could be a whole “crossword book” of such dandelion puzzles.
]]>In case you’re not familiar with it, a typical WaitGroup use case in Go:
var wg sync.WaitGroup
for _, task := range tasks {
wg.Add(1)
go func(t Task) {
// ... do task ...
wg.Done()
}(task)
}
wg.Wait()
I zero-initialize the WaitGroup, the main goroutine increments the counter before starting each task goroutine, each goroutine decrements the counter when done, and the main goroutine waits until the counter reaches zero. My goal is to build the same mechanism in C:
void workfunc(task t, int *wg)
{
// ... do task ...
waitgroup_done(wg);
}
int main(void)
{
// ...
int wg = 0;
for (int i = 0; i < ntasks; i++) {
waitgroup_add(&wg, 1);
go(workfunc, tasks[i], &wg);
}
waitgroup_wait(&wg);
// ...
}
When it’s done, the WaitGroup is back to zero, and no cleanup is required.
I’m going to take it a little further than that: Since its meaning and
contents are explicit, you may initialize a WaitGroup to any non-negative
task count! In other words, waitgroup_add
is optional if the total
number of tasks is known up front.
int wg = ntasks;
for (int i = 0; i < ntasks; i++) {
go(workfunc, tasks[i], &wg);
}
waitgroup_wait(&wg);
A sneak peek at the full source: waitgroup.c
To build this WaitGroup, we’re going to need four primitives from the host
platform, each operating on an int
. The first two are atomic operations,
and the second two interact with the system scheduler. To port the
WaitGroup to a platform you need only implement these four functions,
typically as one-liners.
static int load(int *); // atomic load
static int addfetch(int *, int); // atomic add-then-fetch
static void wait(int *, int); // wait on change at address
static void wake(int *); // wake all waiters by address
The first two should be self-explanatory. The wait
function waits for
the pointed-at integer to change its value, and the second argument is its
expected current value. The scheduler will double-check the integer before
putting the thread to sleep in case it changes at the last moment — in
other words, an atomic check-then-maybe-sleep. The wake
function is the
other half. After changing the integer, a thread uses it to wake all
threads waiting for the pointed-at integer to change. Together, this
mechanism is known as a futex.
I’m going to simplify the WaitGroup semantics a bit in order to make my
implementation even simpler. Go’s WaitGroup allows adding negatives, and
the Add
method essentially does double-duty. My version forbids adding
negatives. That means the “add” operation is just an atomic increment:
void waitgroup_add(int *wg, int delta)
{
addfetch(wg, delta);
}
Since it cannot bring the counter to zero, there’s nothing else to do. The “done” operation can decrement to zero:
void waitgroup_done(int *wg)
{
if (!addfetch(wg, -1)) {
wake(wg);
}
}
If the atomic decrement brought the count to zero, we finished the last task, so we need to wake the waiters. We don’t know if anyone is actually waiting, but that’s fine. Some futex use cases will avoid making the relatively expensive system call if nobody’s waiting — i.e. don’t waste time on a system call for each unlock of an uncontended mutex — but in the typical WaitGroup case we expect a waiter when the count finally goes to zero. That’s the common case.
The most complicated of the three is waiting:
void waitgroup_wait(int *wg)
{
for (;;) {
int c = load(wg);
if (!c) {
break;
}
wait(wg, c);
}
}
First check if the count is already zero and return if it is. Otherwise use the futex to wait for it to change. Unfortunately that’s not exactly the semantics we want, which would be to wait for a certain target. This doesn’t break the wait, but it’s a potential source of inefficiency. If a thread finishes a task between our load and wait, we don’t go to sleep, and instead try again. However, in practice, I ran thousands of threads through this thing concurrently and I couldn’t observe such a “miss.” As far as I can tell, it’s so rare it doesn’t matter.
If this was a concern, the WaitGroup could instead be a pair of integers: the counter and a “latch” that is either 0 or 1. Waiters wait on the latch, and the latch is modified (atomically) when the counter transitions to or from zero. That gives waiters a stable value on which to wait, proxying the counter. However, since this doesn’t seem to matter in practice, I prefer the elegance and simplicity of the single-integer WaitGroup.
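If it ever did matter, a two-integer version might look like the sketch below. It's mine, not from the linked source, and it needs a fifth primitive, store, an atomic store. A reusable version would also reset the latch in waitgroup_add when the count leaves zero.
struct waitgroup { int count, latch; };

static void store(int *, int);  // atomic store (assumed, like the others)

void waitgroup_done(struct waitgroup *wg)
{
    if (!addfetch(&wg->count, -1)) {
        store(&wg->latch, 1);  // counter hit zero: open the latch
        wake(&wg->latch);
    }
}

void waitgroup_wait(struct waitgroup *wg)
{
    while (!load(&wg->latch)) {
        wait(&wg->latch, 0);   // sleep until the latch leaves zero
    }
}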
With the WaitGroup done at a high level, we now need the per-platform parts. Both GCC and Clang support GNU-style atomics, so I’ll just assume these are available on Linux without worrying about the compiler. The first two functions wrap these built-ins:
static int load(int *p)
{
return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}
static int addfetch(int *p, int addend)
{
return __atomic_add_fetch(p, addend, __ATOMIC_SEQ_CST);
}
For wait
and wake
we need the futex(2)
system call. In an
attempt to discourage its direct use, glibc doesn’t wrap this system call
in a function, so we must make the system call ourselves.
static void wait(int *p, int current)
{
syscall(SYS_futex, p, FUTEX_WAIT, current, 0, 0, 0);
}
static void wake(int *p)
{
syscall(SYS_futex, p, FUTEX_WAKE, INT_MAX, 0, 0, 0);
}
The INT_MAX
means “wake as many as possible.” The other common value is
1 for waking a single waiter. Also, these system calls can’t meaningfully
fail, so there’s no need to check the return value. If wait
wakes up
early (e.g. EINTR
), it’s going to check the counter again anyway. In
fact, if your kernel is more than 20 years old, predating futexes, and
returns ENOSYS
(“Function not implemented”), it will still work
correctly, though it will be incredibly inefficient.
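With the Linux pieces in place, a quick smoke test might look like this sketch, using pthreads as a stand-in for the go() pseudo-function from earlier:
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    // ... do task ...
    waitgroup_done(arg);
    return 0;
}

int main(void)
{
    int ntasks = 64;
    int wg = ntasks;  // pre-initialized, so no waitgroup_add needed
    for (int i = 0; i < ntasks; i++) {
        pthread_t t;
        pthread_create(&t, 0, worker, &wg);
        pthread_detach(t);
    }
    waitgroup_wait(&wg);
    puts("all tasks done");
}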
Windows didn’t support futexes until Windows 8 in 2012, and Microsoft was still supporting Windows versions without them into 2020, so they’re still relatively “new” on this platform. Nonetheless, they’re now mature enough that we can count on them being available.
I’d like to support both GCC-ish (via Mingw-w64) and MSVC-ish
compilers. Mingw-w64 provides a compatible intrin.h
, so I can stick to
MSVC-style atomics and cover both at once. On the other hand, MSVC doesn’t
define atomics for int
(or even int32_t
), strictly long
, so I have
to sneak in a little cast. (Recall: sizeof(long) == sizeof(int)
on every
version of Windows supporting futexes.) The other option is to typedef
the WaitGroup so that it’s int
on Linux (for the futex) and long
on
Windows (for atomics).
static int load(int *p)
{
return _InterlockedOr((long *)p, 0);
}
static int addfetch(int *p, int addend)
{
return addend + _InterlockedExchangeAdd((long *)p, addend);
}
The official, sanctioned futex functions are WaitOnAddress and
WakeByAddressAll. They used to be in kernel32.dll
, but as of
this writing they live in API-MS-Win-Core-Synch-l1-2-0.dll
, linked via
-lsynchronization
. Gross. Since I can’t stomach this, I instead call the
low-level RTL functions where it’s actually implemented: RtlWaitOnAddress
and RtlWakeAddressAll. These live in the nice neighborhood of ntdll.dll
.
They’re undocumented as far as I can tell, but thankfully Wine comes to
the rescue, providing both documentation and several different
implementations. Reading through it is educational, and hints at ways to
construct futexes on systems lacking them.
These functions aren’t declared in any headers, so I have to do it myself.
On the plus side, so far I haven’t paid the substantial compile-time costs
of including windows.h
, and so I can continue avoiding it. These
functions are listed in the ntdll.dll
import library, so I don’t need
to invent the import library entries.
__declspec(dllimport)
long __stdcall RtlWaitOnAddress(void *, void *, size_t, void *);
__declspec(dllimport)
long __stdcall RtlWakeAddressAll(void *);
Rather conveniently, the semantics perfectly line up with Linux futexes!
static void wait(int *p, int current)
{
RtlWaitOnAddress(p, &current, sizeof(*p), 0);
}
static void wake(int *p)
{
RtlWakeAddressAll(p);
}
Like with Linux, there’s no meaningful failure, so the return values don’t matter.
That’s the whole implementation. Considering just a single platform, a flexible, lightweight, and easy-to-use synchronization facility in ~50 lines of relatively simple code is a pretty good deal if you ask me!
]]>This article’s motivation is multi-threaded epoll. I mitigate TSan false positives each time it comes up, enough to have gotten the hang of it, so I ought to document it. On Windows I would also run into the same issue with the Win32 message queue, crossing the synchronization edge between PostMessage (release) and GetMessage (acquire), except for the general lack of TSan support in Windows tooling. The same technique would work there as well.
My typical epoll scenario looks like so:
One epoll instance (epoll_create1), worker threads blocking on epoll_wait, and a main thread accepting connections (accept) and adding sockets to epoll (epoll_ctl).
Between accept
and EPOLL_CTL_ADD
, the main thread allocates and
initializes the client session state, then attaches it to the epoll event.
The client socket is added with the EPOLLONESHOT
flag, and the
session state is not touched after the call to epoll_ctl
(note: sans
error checks):
for (;;) {
int fd = accept(...);
struct session *session = ...;
session->fd = fd;
// ...
struct epoll_event event;
event.events = EPOLLET | EPOLLONESHOT | ...;
event.data.ptr = session;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
}
In this example, struct session
is defined by the application to contain
all the state for handling a session (file descriptor, buffers, state
machine, parser state, allocation arena, etc.). Everything
else is part of the epoll interface.
When a socket is ready, one of the worker threads receives it. Due to
EPOLLONESHOT
, it’s immediately disabled and no other thread can receive
it. The thread does as much work as possible (i.e. read/write until
EAGAIN
), then reactivates it with epoll_ctl
:
for (;;) {
struct epoll_event event;
epoll_wait(epfd, &event, 1, -1);
struct session *session = event.data.ptr;
int fd = session->fd;
// ...
event.events = EPOLLET | EPOLLONESHOT | ...;
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
}
The shared variables in session
are passed between threads through
epoll
using the event’s .data.ptr
. These variables are potentially
read and mutated by every thread, but it’s all perfectly safe without any
further synchronization — i.e. no need for mutexes, etc. All the necessary
synchronization is implicit in epoll.
In the initial hand-off, that EPOLL_CTL_ADD
must happen before the
corresponding epoll_wait
in a worker thread. This establishes that the
main thread and worker thread do not touch session variables concurrently.
After all, how could the worker see an event on the file descriptor before
it’s been added to epoll? The synchronization in epoll itself will also
ensure all the architecture-level stores are visible to other threads
before the hand-off. We can call the “add” a release and the “wait” an
acquire, forming a synchronization edge.
Similarly, in the hand-off between worker threads, the EPOLL_CTL_MOD
that reactivates the file descriptor must happen before the wait that
observes the next event because, until reactivation, it’s disabled. The
EPOLL_CTL_MOD
is another release in relation to the acquire wait.
Unfortunately TSan won’t see things this way. It can’t see into the kernel, and it doesn’t know these subtle epoll semantics, so it can’t see these synchronization edges. As far as it can tell, threads might be accessing a session concurrently, and TSan will reliably produce warnings about it. You could shrug your shoulders and give up on using TSan in this case, but there’s an easy solution: introduce redundant, semantically identical synchronization edges, but only when TSan is looking.
WARNING: ThreadSanitizer: data race
I prefer to solve this by introducing the weakest possible synchronization so that I’m not synchronizing beyond epoll’s semantics. This will help TSan catch real mistakes that stronger synchronization might hide.
The weakest option is memory fences. These wouldn’t introduce extra loads
or stores. At most it would be a fence instruction. I would use GCC’s
built-in __atomic_thread_fence
for the job. However, TSan does not
currently understand thread fences, so that defeats the purpose. Instead,
I introduce a new field to struct session
:
struct session {
int fd;
// ...
int _sync;
};
Then just before epoll_ctl
I’ll do a release store on this field,
“releasing” the session. All session stores are ordered before the
release.
// main thread
// ...
__atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
// worker thread
// ...
__atomic_store_n(&session->_sync, 0, __ATOMIC_RELEASE);
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
After epoll_wait
I add an acquire load, “acquiring” the session. All
session loads are ordered after the acquire.
epoll_wait(epfd, &event, 1, -1);
struct session *session = event.data.ptr;
__atomic_load_n(&session->_sync, __ATOMIC_ACQUIRE);
int fd = session->fd;
// ...
For this to work, the thread must not touch session variables in any way
before the acquire or after the release. For example, note how I obtained
the client file descriptor before the release, i.e. no session->fd
argument in the epoll_ctl
call.
That’s it! This redundantly establishes the happens before relationship
already implicit in epoll, but now it’s visible to TSan. However, I don’t
want to pay for this unless I’m actually running under TSan, so some
macros are in order. __SANITIZE_THREAD__
is automatically defined when
running under TSan:
#if __SANITIZE_THREAD__
# define TSAN_SYNCED int _sync
# define TSAN_ACQUIRE(s) __atomic_load_n(&(s)->_sync, __ATOMIC_ACQUIRE)
# define TSAN_RELEASE(s) __atomic_store_n(&(s)->_sync, 0, __ATOMIC_RELEASE)
#else
# define TSAN_SYNCED
# define TSAN_ACQUIRE(s)
# define TSAN_RELEASE(s)
#endif
This also makes it more readable, and intentions clearer:
struct session {
int fd;
// ...
TSAN_SYNCED;
};
// main thread
for (;;) {
// ...
TSAN_RELEASE(session);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
}
// worker thread
for (;;) {
epoll_wait(epfd, &event, 1, -1);
struct session *session = event.data.ptr;
TSAN_ACQUIRE(session);
int fd = session->fd;
// ...
TSAN_RELEASE(session);
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
}
Now I can use TSan again, and it didn’t cost anything in normal builds.
]]>I generally prefer C, so I’m accustomed to building whatever I need on the fly, such as heaps, linked lists, and especially hash tables. Few programs use more than a small subset of a data structure’s features, making their implementation smaller, simpler, and more efficient than the general case, which must handle every edge case. A typical hash table tutorial will describe a relatively lengthy program, but in practice, bespoke hash tables are only a few lines of code. Over the years I’ve worked out some basic principles for hash table construction that aid in quick and efficient implementation. This article covers the technique and philosophy behind what I’ve come to call the “mask-step-index” (MSI) hash table, which is my standard approach.
MSI hash tables are nothing novel, just a double hashed, open address hash table layered generically atop an external array. It’s best regarded as a kind of database index — a lookup index over an existing array. The array exists independently, and the hash table provides an efficient lookup into that array over some property of its entries.
The core of the MSI hash table is this iterator function:
// Compute the next candidate index. Initialize idx to the hash.
int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
uint32_t mask = ((uint32_t)1 << exp) - 1;
uint32_t step = (hash >> (64 - exp)) | 1;
return (idx + step) & mask;
}
The name should now make sense. I literally sound it out in my head when I
type it, like a mnemonic. Compute a mask, then a step size, finally an
index. The exp
parameter is a power-of-two exponent for the hash table
size, which may look familiar. I’ve used int32_t
for the index,
but it’s easy to substitute, say, size_t
. I try to optimize for the
common case, where a 31-bit index is more than sufficient, and a signed
type since subscripts should be signed. Internally it uses unsigned
types since overflow is both expected and harmless thanks to the
power-of-two hash table size.
It’s the caller’s responsibility to compute the hash, and the MSI iterator tells the caller where to look next. For insertion, the caller (maybe) looks either for an existing entry to override, or an empty slot. For lookup, the caller looks for a matching entry, giving up as soon as it finds an empty slot. An insertion loop looks like this string intern table:
#define EXP 15
// Initialize all slots to an "empty" value (null)
#define HT_INIT { {0}, 0 }
struct ht {
char *ht[1<<EXP];
int32_t len;
};
char *intern(struct ht *t, char *key)
{
uint64_t h = hash(key, strlen(key)+1);
for (int32_t i = h;;) {
i = ht_lookup(h, EXP, i);
if (!t->ht[i]) {
// empty, insert here
if ((uint32_t)t->len+1 == (uint32_t)1<<EXP) {
return 0; // out of memory
}
t->len++;
t->ht[i] = key;
return key;
} else if (!strcmp(t->ht[i], key)) {
// found, return canonical instance
return t->ht[i];
}
}
}
The caller initializes the iterator to the hash result. This will probably
be out of range, even negative, but that doesn’t matter. The iterator
function will turn it into a valid index before use. This detail is key to
double hashing: The low bits of the hash tell it where to start, and the
high bits tell it how to step. The hash table size is a power of two, and
the step size is forced to an odd number (via | 1
), so it’s guaranteed
to visit each slot in the table exactly once before restarting. It’s
important that the search halts before looping, such as by guaranteeing
the existence of an empty slot (i.e. the “out of memory” check).
Note: The example out of memory check pushes the hash table to the absolute limit, and in practice you’d want to stop at a smaller load factor — perhaps even as low as 50% since that’s simple and fast. Otherwise it degrades into a linear search as the table approaches capacity.
Even if two keys start or land at the same place, they’ll quickly diverge
due to differing steps. For a while I used plain linear probing — i.e.
step=1
— but double hashing came out ahead every time I benchmarked,
steering me towards this “MSI” construction. Ideally ht_lookup
would be
placed so that it’s inlined — e.g. in the same translation unit — so that
the mask and step are not actually recomputed each iteration.
What about deletion? First, consider how infrequently you delete entries
from a hash table. When was the last time you used del
on a dictionary
in Python, or delete
on a map
in Go? This operation is rarely needed.
However, when you do need it, reserve a gravestone value in addition to
the empty value.
static char gravestone[] = "(deleted)";
char *intern(struct ht *t, char *key)
{
char **dest = 0;
// ...
if (!t->ht[i]) {
// ...
dest = dest ? dest : &t->ht[i];
*dest = key;
return key;
} else if (t->ht[i] == gravestone) {
dest = dest ? dest : &t->ht[i];
} else if (!strcmp(...)) {
// ...
}
// ...
}
char *unintern(struct ht *t, char *key)
{
// ...
if (!t->ht[i]) {
return 0;
} else if (t->ht[i] == gravestone) {
// skip over
} else if (!strcmp(...)) {
char *old = t->ht[i];
t->ht[i] = gravestone;
return old;
}
// ...
}
When searching, skip over gravestones. Note that gravestones are compared
with ==
(identity), so this does not preclude a string "(deleted)"
.
When inserting, use the first gravestone found if no entry was found.
Iterating over the example string intern table is simple: Iterate over the underlying array, skipping empty slots (and maybe gravestones). Entries will be in a random order rather than, say, insertion order. This is a useful introductory example, but this isn’t where MSI most shines. As mentioned, it’s best when treated like a database index.
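Concretely, that iteration over the earlier fixed-size table is just:
for (int32_t i = 0; i < (int32_t)1<<EXP; i++) {
    char *s = t->ht[i];
    if (s && s != gravestone) {
        puts(s);  // visit entry, in table order rather than insertion order
    }
}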
Let’s take a step back and consider the caller of intern
. How does it
allocate these strings? Perhaps they’re appended to a buffer, and
intern
indicates whether or not the string is unique so far.
struct buf {
// lookup table over the buffer
struct ht ht;
// a collection of strings
int32_t len;
char buf[BUFLEN];
};
Strings are only appended to the buffer when unique, and the hash table can make that determination in constant time.
char *buf_push(struct buf *b, char *s)
{
size_t len = strlen(s) + 1;
if (b->len+len > sizeof(b->buf)) {
return 0; // out of memory
}
char *candidate = b->buf + b->len;
memcpy(candidate, s, len);
char *result = intern(&b->ht, candidate);
if (result == candidate) {
// string is unique, keep it
b->len += len;
}
return result;
}
In my first example, EXP
was fixed. This could be converted into a
dynamic allocation and the hash table resized as needed. Here’s a new
constructor, which I’m including since I think it’s instructive:
struct ht {
int32_t len;
int exp;
char **ht;
};
static struct ht
ht_new(int exp)
{
struct ht ht = {0, exp, 0};
assert(exp >= 0);
if (exp >= 32) {
return ht; // request too large
}
ht.ht = calloc((size_t)1<<exp, sizeof(ht.ht[0]));
return ht;
}
If intern
fails, the hash table can be replaced with a new table twice
as large, and since, like a database index, its contents are entirely
redundant, the hash table can be discarded and rebuilt from scratch. The
new and old table don’t need to exist simultaneously. Here’s a routine to
populate an empty hash table from the buffer:
void buf_rehash(struct buf *b)
{
assert(b->ht.len == 0);
for (int32_t off = 0; off < b->len;) {
char *s = b->buf + off;
int32_t len = strlen(s) + 1;
off += len;
uint64_t h = hash(s, len);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ht.exp, i);
if (!b->ht.ht[i]) {
b->ht.len++;
b->ht.ht[i] = s;
break;
}
}
}
}
Note how this iterates in insertion order, which may be useful in other
cases, too. On the rehash it doesn’t need to check for existing entries,
as all entries are already known to be unique. Later when intern
hits
its capacity:
char *result = intern(&b->ht, candidate);
if (!result) {
free(b->ht.ht);
b->ht = ht_new(b->ht.exp+1);
if (!b->ht.ht) {
return 0; // out of memory
}
buf_rehash(b);
result = intern(&b->ht, candidate); // cannot fail
}
I freed and reallocated the table, but it would be trivial to use a
realloc
instead, unlike the case where the old table isn’t redundant.
An MSI hash table is trivially converted into a multimap, a hash table with multiple values per key. Callers just make one small change: Don’t stop searching until an empty slot is found. Each match is an additional multimap value. The “value array” is stored along the hash table itself, in insertion order, without additional allocations.
For example, imagine the strings in the string buffer have a namespace
prefix, delimited by a colon, like city:Austin
and state:Texas
. We’d
like a fast lookup of all strings under a particular namespace. The
solution is to add another hash table as you would an index to a database
table.
struct buf {
// ..
struct ht ns;
// ..
};
When a unique string is appended it’s also registered in the namespace multimap. It doesn’t check for an existing key, only for an empty slot, since it’s a multimap:
// Check outside the loop since it always inserts.
if (/* ... ns multimap lacks capacity ... */) {
// ... grow+rehash ns multimap ...
}
int32_t nslen = strcspn(s, ":") + 1;
uint64_t h = hash(s, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
b->ns.len++;
b->ns.ht[i] = s;
break;
}
}
It includes the :
as a terminator which simplifies lookups. Here’s a
lookup loop to print all strings under a namespace (includes terminal :
in the key):
char *ns = "city:";
int32_t nslen = strlen(ns);
// ...
uint64_t h = hash(ns, nslen);
for (int32_t i = h;;) {
i = ht_lookup(h, b->ns.exp, i);
if (!b->ns.ht[i]) {
break;
} else if (!strncmp(b->ns.ht[i], ns, nslen)) {
puts(b->ns.ht[i]+nslen);
}
}
An alternative approach to multimaps is to additionally key over a value
subscript. For example, the first city is keyed {"city", 0}
, the next
{"city", 1}
, etc. The value subscript could be mixed into the string
hash with an integer permutation (more on this below):
uint64_t h = hash64(val_idx ^ hash(s, nslen));
The lookup loop would compare both the string and the value subscript, and stop when it finds a match. The underlying hash table is not truly a multimap, but rather a plain hash table with a larger key. This requires extra bookkeeping — tracking individual subscripts and the number of values per key — but provides constant time random access on the multimap value array.
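hash64 here is any 64-bit integer permutation; one possible definition, in the multiply-xorshift style discussed next:
uint64_t hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= 1111111111111111111u;
    x ^= x >> 32;
    return x;
}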
The MSI iterator leaves hashing up to the caller, who has better knowledge about the input and how to hash it, though this takes a bit of knowledge of how to build a hash function. The good news is that it’s easy, and less is more. Better to do too little than too much, and a faster, weaker hash function is worth a few extra collisions.
The first rule is to never lose sight of the goal: The purpose of the hash function is to uniformly distribute entries over a table. The better you know and exploit your input, the less you need to do in the hash function. Sometimes your keys already contain random data, and so your hash function can be the identity function! For example, if your keys are “version 4” UUIDs, don’t waste time hashing them, just load a few bytes from the end as an integer and you’re done.
// "Hash" a v4 UUID
uint64_t uuid4_hash(unsigned char uuid[16])
{
uint64_t h;
memcpy(&h, uuid+8, 8);
return h;
}
A reasonable start for strings is FNV-1a, such as this possible
implementation for my hash()
function above:
uint64_t hash(char *s, int32_t len)
{
uint64_t h = 0x100;
for (int32_t i = 0; i < len; i++) {
h ^= s[i] & 255;
h *= 1111111111111111111;
}
return h ^ h>>32;
}
The hash state is initialized to a basis, some arbitrary value. This is a useful place to introduce a seed or hash key. It’s best that at least one bit above the low mix-in bits is set so that it’s not trivially stuck at zero. Above, I’ve chosen the most trivial basis with reasonable results, though often I’ll use the digits of π.
Next XOR some input into the low bits. This could be a byte, a Unicode
code point, etc. More is better, since otherwise you’re stuck doing more
work per unit, the main weakness of FNV-1a. Carefully note the byte mask,
& 255
, which inhibits sign extension. Do not mix sign-extended inputs
into FNV-1a — a widespread implementation mistake.
Multiply by a large, odd random-ish integer. A prime is a reasonable choice, and I usually pick my favorite prime, shown above: 19 ones in base 10.
Finally, my own touch, an xorshift finalizer. The high bits are much better mixed than the low bits, so this improves the overall quality. Though if you take time to benchmark, you might find that this finalizer isn’t necessary. Remember, do just enough work to keep the number of collisions low — not lowest — and no more.
If your input is made of integers, or is a short, fixed length, use an integer permutation, particularly multiply-xorshift. It takes very little to get a sufficient distribution. Sometimes one multiplication does the trick. Fixed-sized, integer-permutation hashes tend to be the fastest, easily beating fancier SIMD-based hashes, including AES-NI. For example:
// Hash a timestamp-based, version 1 UUID
uint64_t uuid1_hash(unsigned char uuid[16])
{
uint64_t s[2];
memcpy(s, uuid, 16);
s[0] += 0x3243f6a8885a308d; // digits of pi
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
s[0] += s[1];
s[0] *= 1111111111111111111;
s[0] ^= s[0] >> 33;
return s[0];
}
If I benchmarked this in a real program, I would probably cut it down even
further, deleting hash operations one at a time and measuring the overall
hash table performance. This memcpy
trick works well with floats, too,
especially packing two single precision floats into one 64-bit integer.
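For instance, a sketch of hashing a pair of floats this way, assuming the usual 4-byte float:
uint64_t pair_hash(float x, float y)
{
    uint64_t h;
    float pair[2] = {x, y};
    memcpy(&h, pair, 8);      // pack both floats into one 64-bit integer
    h *= 1111111111111111111u;
    return h ^ h>>32;
}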
If you ever hesitate to build a hash table when the situation calls, I hope the MSI technique will make the difference next time. I have more hash table tricks up my sleeve, but since they’re not specific to MSI I’ll save them for a future article.
There have been objections to my claims about performance, so I’ve assembled some benchmarks. These demonstrate that:
]]>debugbreak is a small command, now included in w64devkit. Though, of course, you
already have everything you need to build it and try it out right
now. I’ve also worked out a Linux implementation.
It’s named after an MSVC intrinsic and Win32 function. It takes no arguments, and its operation is indiscriminate: It raises a breakpoint exception in all debuggee processes system-wide. Reckless? Perhaps, but certainly convenient. You don’t need to tell it which process you want to pause. It just works, and a good debugging experience is one of ease and convenience.
The linchpin is DebugBreakProcess. The command walks the process list and fires this function at each process. Nothing happens for programs without a debugger attached, so it doesn’t even bother checking if it’s a debuggee. It couldn’t be simpler. I’ve used it on everything from Windows XP to Windows 11, and it’s worked flawlessly.
HANDLE s = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
PROCESSENTRY32W p = {sizeof(p)};
for (BOOL r = Process32FirstW(s, &p); r; r = Process32NextW(s, &p)) {
HANDLE h = OpenProcess(PROCESS_ALL_ACCESS, 0, p.th32ProcessID);
if (h) {
DebugBreakProcess(h);
CloseHandle(h);
}
}
I use it almost exclusively from Vim, where I’ve given it a leader mapping. With the editor focused, I can type backslash then d to pause the debuggee.
map <leader>d :call system("debugbreak")<cr>
With the debuggee paused, I’m free to add new breakpoints or watchpoints,
or print the call stack to see what the heck it’s busy doing. The
mechanism behind DebugBreakProcess is to create a new thread in the
target, with that thread raising the breakpoint exception. The debugger
will be stopped in this new thread. In GDB you can use the thread
command to switch over to the thread that actually matters, usually thr
1
.
On unix-like systems the equivalent of a breakpoint exception is a
SIGTRAP
. There’s already a standard command for sending signals,
kill
, so a debugbreak
command can be built using nothing more
than a few lines of shell script. However, unlike DebugBreakProcess,
signaling every process with SIGTRAP
will only end in tears. The script
will need a way to determine which processes are debuggees.
Linux exposes processes in the file system as virtual files under /proc
,
where each process appears as a directory. Its status
file includes a
TracerPid
field, which will be non-zero for debuggees. The script
inspects this field, and if non-zero sends a SIGTRAP
.
#!/bin/sh
set -e
for pid in $(find /proc -maxdepth 1 -printf '%f\n' | grep '^[0-9]\+$'); do
grep -q '^TracerPid:\s[^0]' /proc/$pid/status 2>/dev/null &&
kill -TRAP $pid
done
This script, now part of my dotfiles, has worked very well so
far, and effectively smoothes over some debugging differences between
Windows and Linux, reducing my context switching mental load. There’s
probably a better way to express this script, but that’s the best I could
do so far. On the BSDs you’d need to parse the output of ps
, though each
system seems to do its own thing for distinguishing debuggees.
I had originally planned for one flag, -k
. Rather than breakpoint
debuggees, it would terminate all debuggee processes. This is especially
important on Windows where debuggee processes block builds due to file
locking shenanigans. I’d just run debugbreak -k
as part of the build.
However, it’s not possible to terminate debuggees paused in the debugger —
the common situation. I’ve given up on this for now.
]]>There is an assert feature in basically every programming language implementation.
It ought to work better with debuggers.
An assertion verifies a program invariant, and so if one fails then there’s undoubtedly a defect in the program. In other words, assertions make programs more sensitive to defects, allowing problems to be caught more quickly and accurately. Counter-intuitively, crashing early and often makes for more robust and reliable software in the long run. For exactly this reason, assertions go especially well with fuzzing.
assert(i >= 0 && i < len); // bounds check
assert((ssize_t)size >= 0); // suspicious size_t
assert(cur->next != cur); // circular reference?
They’re sometimes abused for error handling, which is a reason they’ve also been (wrongfully) discouraged at times. For example, failing to open a file is an error, not a defect, so an assertion is inappropriate.
Normal programs have implicit assertions all over, even if we don’t usually think of them as assertions. In some cases they’re checked by the hardware. Examples of implicit assertion failures:
a null pointer dereference, an out-of-bounds access caught by memory protection, integer division by zero, or signed integer overflow with trapping enabled (-ftrapv).
Programs are generally not intended to recover from these situations because, had they been anticipated, the invalid operation wouldn’t have been attempted in the first place. The program simply crashes because there’s no better alternative. Sanitizers, including Address Sanitizer (ASan) and Undefined Behavior Sanitizer (UBSan), are in essence additional, implicit assertions, checking invariants that aren’t normally checked.
Ideally a failing assertion should have these two effects:
Execution should immediately stop. The program is in an unknown state, so it’s neither safe to “clean up” nor attempt to recover. Additional execution will only make debugging more difficult, and may obscure the defect.
When run under a debugger — or visited as a core dump — it should break exactly at the failed assertion, ready for inspection. I should not need to dig around the call stack to figure out where the failure occurred. I certainly shouldn’t need to manually set a breakpoint and restart the program hoping to fail the assertion a second time. The whole reason for using a debugger is to save time, so if it’s wasting my time then it’s failing at its primary job.
I examined standard assert
features across various language
implementations, and none strictly meet the criteria. Fortunately, in some
cases, it’s trivial to build a better assertion, and you can substitute
your own definition. First, let’s discuss the way assertions disappoint.
My test for C and C++ is minimal but establishes some state and gives me a variable to inspect:
#include <assert.h>
int main(void)
{
for (int i = 0; i < 10; i++) {
assert(i < 5);
}
}
Then I compile and debug in the most straightforward way:
$ cc -g -o test test.c
$ gdb test
(gdb) r
(gdb) bt
The r
in GDB stands for run
, which immediately breaks because of the
assert
. The bt
prints a backtrace. On a typical Linux distribution
that shows this backtrace:
#0 __GI_raise
#1 __GI_abort
#2 __assert_fail_base
#3 __GI___assert_fail
#4 main
Well, actually, it’s much messier than this, but I manually cleaned it up:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linu
x/raise.c:50
#1 0x00007ffff7df4537 in __GI_abort () at abort.c:79
#2 0x00007ffff7df440f in __assert_fail_base (fmt=0x7ffff7f5d
128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x
55555555600b "i < 5", file=0x555555556004 "test.c", line=6, f
unction=<optimized out>) at assert.c:92
#3 0x00007ffff7e03662 in __GI___assert_fail (assertion=0x555
55555600b "i < 5", file=0x555555556004 "test.c", line=6, func
tion=0x555555556011 <__PRETTY_FUNCTION__.0> "main") at assert
.c:101
#4 0x0000555555555178 in main () at test.c:6
That’s a lot to take in at a glance, and about 95% of it is noise that will never contain useful information. Most notably, GDB didn’t stop at the failing assertion. Instead there’s four stack frames of libc junk I have to navigate before I can even begin debugging.
(gdb) up
(gdb) up
(gdb) up
(gdb) up
I must wade through this for every assertion failure. This is some of the friction that made me avoid the debugger in the first place. glibc loves indirection, so maybe the other libc implementations do better? How about musl?
#0 setjmp
#1 raise
#2 ??
#3 ??
#4 ??
#5 ??
#6 ??
#7 ??
#8 ??
#9 ??
#10 ??
#11 ??
Oops, without musl debugging symbols I can’t debug assertions at all
because GDB can’t read the stack, so it’s lost. If you’re on Alpine you
can install musl-dbg
, but otherwise you’ll probably need to build your
own from source. With debugging symbols, musl is no better than glibc:
#0 __restore_sigs
#1 raise
#2 abort
#3 __assert_fail
#4 main
Same with FreeBSD:
#0 thr_kill
#1 in raise
#2 in abort
#3 __assert
#4 main
OpenBSD has one fewer frame:
#0 thrkill
#1 _libc_abort
#2 _libc___assert2
#3 main
How about on Windows with Mingw-w64?
[Inferior 1 (process 7864) exited with code 03]
Oops, on Windows GDB doesn’t break at all on assert
. You must first set
a breakpoint on abort
:
(gdb) b abort
Besides that, it’s the most straightforward so far:
#0 msvcrt!abort
#1 msvcrt!_assert
#2 main
With MSVC (default CRT) I get something slightly different:
#0 abort
#1 common_assert_to_stderr
#2 _wassert
#3 main
#4 __scrt_common_main_seh
RemedyBG leaves me at the abort
like GDB does elsewhere. Visual Studio
recognizes that I don’t care about its stack frames and instead puts the
focus on the assertion, ready for debugging. The other stack frames are
there, but basically invisible. It’s the only case that practically meets
all my criteria!
I can’t entirely blame these implementations. The C standard requires that
assert
print a diagnostic and call abort
, and that abort
raises
SIGABRT
. There’s not much implementations can do, and it’s up to the
debugger to be smarter about it.
ASan doesn’t break GDB on assertion failures, which is yet another source of friction. You can work around this with an environment variable:
export ASAN_OPTIONS=abort_on_error=1:print_legend=0
This works, but it’s the worst case of all: I get 7 junk stack frames on
top of the failed assertion. It’s also very noisy when it traps, so the
print_legend=0
helps to cut it down a bit. I want this variable so often
that I set it in my shell’s .profile
so that it’s always set.
With UBSan you can use -fsanitize-undefined-trap-on-error
, which behaves
like the improved assertion. It traps directly on the defect with no junk
frames, though it prints no diagnostic. As a bonus, it also means you
don’t need to link libubsan
. Thanks to the bonus, it fully supplants
-ftrapv
for me on all platforms.
Update November 2022: This “stop” hook eliminates ASan friction by
popping runtime frames — functions with the reserved __
prefix — from
the call stack so that they’re not in the way when GDB takes control. It
requires Python support, which is the purpose of the feature-sniff outer
condition.
if !$_isvoid($_any_caller_matches)
    define hook-stop
        while $_thread && $_any_caller_matches("^__")
            up-silently
        end
    end
end
This is now part of my .gdbinit
.
At least when under a debugger, here’s a much better assertion macro for GCC and Clang:
#define assert(c) if (!(c)) __builtin_trap()
__builtin_trap
inserts a trap instruction — a built-in breakpoint. By
not calling a function to raise a signal, there are no junk stack frames
and no need to breakpoint on abort
. It stops exactly where it should as
quickly as possible. This definition works reliably with GCC across all
platforms, too. On MSVC the equivalent is __debugbreak
. If you’re really
in a pinch then do whatever it takes to trigger a fault, like
dereferencing a null pointer. A more complete definition might be:
#ifdef DEBUG
# if __GNUC__
# define assert(c) if (!(c)) __builtin_trap()
# elif _MSC_VER
# define assert(c) if (!(c)) __debugbreak()
# else
# define assert(c) if (!(c)) *(volatile int *)0 = 0
# endif
#else
# define assert(c)
#endif
None of these print a diagnostic, but that’s unnecessary when a debugger is involved.
Unfortunately the situation mostly gets worse with other language implementations, and it’s generally not possible to build a better assertion. Assertions typically have exception-like semantics, if not literally just another exception, and so they are far less reliable. If a failed assertion raises an exception, then the program won’t stop until it’s unwound the stack — running destructors and such along the way — all the way to the top level looking for a handler. It only knows there’s a problem when nobody was there to catch it.
Go officially doesn’t have assertions, though panics are a kind of assertion. However, panics have exception-like semantics, and so suffer the problems of exceptions. A Go version of my test:
package main

import "fmt"

func main() {
    defer fmt.Println("DEFER")
    for i := 0; i < 10; i++ {
        if i >= 5 {
            panic(i)
        }
    }
}
If I run this under Go’s premier debugger, Delve, the unrecovered panic causes it to break. So far so good. However, I get two junk frames:
#0 runtime.fatalpanic
#1 runtime.gopanic
#2 main.main
#3 runtime.main
#4 runtime.goexit
It only knows to stop because the Go runtime called fatalpanic
, but the
backtrace is a fiction: The program continued to run after the panic,
enough to run all the registered defers (including printing “DEFER”),
unwinding the stack to the top level, and only then did it fatalpanic
.
Fortunately it’s still possible to inspect all those stack frames even if
some variables may have changed while unwinding, but it’s more like
inspecting a core dump than a paused process.
The situation in Python is similar: assert
raises AssertionError — a
plain old exception — and pdb
won’t break until the stack has unwound,
exiting context managers and such. Only once the exception reaches the top
level does it enter “post mortem debugging,” like a core dump. At least
there are no junk stack frames on top. If you’re using asyncio then your
program may continue running for quite awhile before the right tasks are
scheduled and the exception finally propagates to the top level, if ever.
The worst offender of all is Java. First jdb
never breaks for unhandled
exceptions. It’s up to you to set a breakpoint before the exception is
thrown. But it gets worse: assertions are disabled under jdb
. The Java
assert
statement is worse than useless.
The largest friction-reducing change I made is never exiting the debugger.
Previously I would enter GDB, run my program, exit, edit/rebuild, repeat.
However, there’s no reason to exit GDB! It automatically and reliably
reloads symbols and updates breakpoints on symbols. It remembers your run
configuration, so re-running is just r
rather than interacting with
shell history.
My workflow on all platforms (including Windows) is a vertically
maximized Vim window and a vertically maximized terminal window. The new
part for me: The terminal runs a long-term GDB session exclusively, with
file
set to the program I’m writing, usually set by the initial command
line.
$ gdb myprogram
gdb>
Alternatively use file
after starting GDB. Occasionally useful if my
project has multiple binaries, and I want to examine a different program.
gdb> file myprogram
I use make
and Vim’s :mak
command for building from within the editor,
so I don’t need to change context to build. The quickfix list takes me
straight to warnings/errors. Often I’m writing something that takes input
from standard input. So I use the run
(r
) command to set this up
(along with any command line arguments).
gdb> r <test.txt
You can redirect standard output as well. It remembers these settings for
plain run
later, so I can test my program by entering r
and nothing
else.
gdb> r
My usual workflow is edit, :mak
, r
, repeat. If I want to test a
different input or use different options, change the run configuration
using run
again:
gdb> r -a -b -c <test2.txt
On Windows you cannot recompile while the program is running. If GDB is
sitting on a breakpoint but I want to build, use kill
(k
) to stop it
without exiting GDB.
gdb> k
GDB has an annoying, flow-breaking yes/no prompt for this, so I recommend
set confirm no
in your .gdbinit
to disable it.
Sometimes a program is stuck in a loop and I need it to break in the
debugger. I try to avoid CTRL-C in the terminal since it can confuse
GDB. A safer option is to signal the process from Vim with pkill
, which
GDB will catch (except on Windows):
:!pkill myprogram
I suspect many people don’t know this, but if you’re on Windows and developing a graphical application, you can press F12 in the debuggee’s window to immediately break the program in the attached debugger. This is a general platform feature and works with any native debugger. I’ve been using it quite a lot.
On that note, you can run commands from GDB with !
, which is another way
to avoid having an extra terminal window around:
gdb> !git diff
In any case, GDB will re-read the binary on the next run
and update
breakpoints, so it’s mostly seamless. If there’s a function I want to
debug, I set a breakpoint on it, then run.
gdb> b somefunc
gdb> r
Alternatively I’ll use a line number, which I read from Vim, though GDB, not being involved in the editing process, cannot track how that line moves between builds.
An empty command repeats the last command, so once I’m at a breakpoint,
I’ll type next
(n
) — or step
(s
) to enter function calls — then
press enter each time I want to advance a line, often with my eye on the
context in Vim in the other window:
gdb> n
gdb>
gdb>
(I wish GDB could print a source listing around the breakpoint as
context, like Delve, but no such feature exists. The woeful list
command is inadequate. Update: GDB’s TUI is a reasonable compromise for GUI
applications or terminal applications running under a separate tty/console
with either tty or set new-console. I can access it everywhere since
w64devkit now supports GDB TUI.)
If I want to advance to the next breakpoint, I use continue
(c
):
gdb> c
If I’m walking through a loop, I want to see how variables change, but
it’s tedious to keep print
ing (p
) the same variables again and again.
So I use display
(disp
) to display an expression with each prompt,
much like the “watch” window in Visual Studio. For example, if my loop
variable is i
over some string str
, this will show me the current
character in character format (/c
).
gdb> disp/c str[i]
You can accumulate multiple expressions. Use undisplay
to remove them.
Too many breakpoints? Use info breakpoints
(i b
) to list them, then
delete
(d
) the unwanted ones by ID.
gdb> i b
gdb> d 3 5 8
GDB has many more features than this, but 10 commands cover 99% of use
cases: r
, c
, n
, s
, disp
, k
, b
, i
, d
, p
.
Earlier this month Ted Unangst researched compiling the OpenBSD kernel
50% faster, which involved stubbing out the largest, extraneous
branches of the source tree. To find the lowest-hanging fruit, he wrote a
tool called watc — where’s all the code — that displays an
interactive “usage” summary of a source tree oriented around line count. A
followup post about exploring the tree in parallel got me thinking
about the problem, especially since I had just written about a concurrent
queue. Turning it over in my mind, I saw opportunities for interesting
data structures and memory management, and so I wanted to write my own
version of the tool, watc.c
, which is the subject of this
article.
The original watc
is interactive and written in idiomatic Go. My version
is non-interactive, written in C, and currently only supports Windows. Not
only do I prefer batch programs generally, building an interactive user
interface would be complicated and distract from the actual problem I
wanted to tackle. As for the platform restriction, it has some convenient
constraints (for implementers), and my projects are often about shooting
multiple birds with one stone:
The longest path is MAX_PATH, a meager 260 pseudo-UTF-16 code points,
which is nice and short. Technically users can now opt in to a maximum path
length of 32,767, but so little software supports it, including
much of Windows itself, that it’s not worth considering. Even with the
upper limit, each path component is still restricted by MAX_PATH. I
can rely on this platform restriction in my design.
Symbolic links, an annoying edge case, are outside of consideration. Technically Windows has them, but they’re sufficiently locked away that they don’t come up in practice.
After years of deliberating, I was finally convinced to buy and try RemedyBG, a super slick Windows debugger. I especially wanted to try out its multi-threading support, and I knew I’d be using multiple threads in this project. Since it’s incompatible with my development kit, my program also supports the MSVC compiler.
The very same day I improved GDB support in my development kit, and this was a great opportunity to dogfood the changes. I’ve used my kit so much these past two years, especially since both it and I have matured enough that I’m nearly as productive in it as I am on Linux.
It’s practice and experience with the wide API, and the tool fully supports Unicode paths. Perhaps that’s a bit unnecessary considering how few source trees stray beyond ASCII, even just in source text, but there are too many ways things go wrong otherwise.
Running my tool on nearly the same source tree as the original example yields:
C:\openbsd>watc sys
. 6.89MLOC 364.58MiB
├─dev 5.69MLOC 332.75MiB
│ ├─pci 4.46MLOC 293.80MiB
│ │ ├─drm 3.99MLOC 280.25MiB
│ │ │ ├─amd 3.33MLOC 261.24MiB
│ │ │ │ ├─include 2.61MLOC 238.48MiB
│ │ │ │ │ ├─asic_reg 2.53MLOC 235.07MiB
│ │ │ │ │ │ ├─nbio 689.56kLOC 69.33MiB
│ │ │ │ │ │ ├─dcn 583.67kLOC 58.60MiB
│ │ │ │ │ │ ├─gc 290.26kLOC 28.90MiB
│ │ │ │ │ │ ├─dce 210.16kLOC 16.81MiB
│ │ │ │ │ │ ├─mmhub 155.60kLOC 16.03MiB
│ │ │ │ │ │ ├─dpcs 123.90kLOC 12.97MiB
│ │ │ │ │ │ ├─gca 105.91kLOC 5.87MiB
│ │ │ │ │ │ ├─bif 71.45kLOC 4.41MiB
│ │ │ │ │ │ ├─gmc 64.24kLOC 3.41MiB
│ │ │ │ │ │ └─(other) 230.99kLOC 18.73MiB
│ │ │ │ │ └─(other) 2.10kLOC 139.29kiB
│ │ │ │ └─(other) 718.93kLOC 22.76MiB
│ │ │ └─(other) 583.63kLOC 16.86MiB
│ │ └─(other) 8.53kLOC 259.07kiB
│ └─(other) 1.20MLOC 38.34MiB
└─(other) 1.20MLOC 31.83MiB
In place of interactivity it has -n
(lines) and -d
(depth) switches to
control tree pruning, where branches are summarized as (other)
entries.
My idea is for users to run the tool repeatedly with different cutoffs and
filters to get a feel for where’s all the code. (It could really use
more such knobs.) Repeated counting makes performance all the more
important. On my machine, and a hot cache, the above takes ~180ms to count
those 6.89 million lines of code across 8,607 source files.
Each directory is treated like one big source file of its recursively concatenated contents, so the tool only needs to track directories. Each directory entry comprises a variable-length string name, line and byte totals, and tree linkage such that it can be later navigated for sorting and printing. That linkage has a clever solution, which I’ll get to later. First, let’s deal with strings.
It’s important to get out of the null-terminated string business early,
only reverting to their use at system boundaries, such as constructing
paths for the operating system. Better to handle strings as offset/length
pairs into a buffer. Definitely avoid silly things like allocating many
individual strings, as encouraged by strdup
— and most other
programming language idioms — and certainly avoid useless functions like
strcpy
.
When the operating system provides a path component that I need to track for later, I intern it into a single, large buffer. That buffer looks like so:
#define BUF_MAX (1 << 22)
struct buf {
    int32_t len;
    wchar_t buf[BUF_MAX];
};
Empirically I determined that even large source trees cumulatively total on the order of 10,000 characters of directory names. The OpenBSD kernel source tree is only 2,992 characters of names.
$ find sys -type d -printf %f | wc -c
2992
The biggest I found was the LLVM source tree at 121,720 characters, not
only because of its sheer volume but also because it generally has
relatively long names. So for my maximum buffer size I just maxed it out
(explained in a moment) and called it good. Even with UTF-16, that’s only
8MiB which is perfectly reasonable to allocate all at once up front. Since
my string handles don’t contain pointers, this buffer could be freely
relocated in the case of realloc
.
The operating system provides a null-terminated string. The buffer makes a copy and returns a handle. A handle is a 32-bit integer encoding offset and length.
int32_t buf_push(struct buf *b, wchar_t *s)
{
    int32_t off = b->len;
    int32_t len = wcslen(s);
    if (b->len+len > BUF_MAX) {
        return -1; // out of memory
    }
    memcpy(b->buf+off, s, len*sizeof(*s));
    b->len += len;
    return len<<22 | off;
}
The negative range is reserved for errors, leaving 31 bits. I allocate 9
to the length — enough for MAX_PATH
of 260 — and the remaining 22 bits
for the buffer offset, exactly matching the range of my BUF_MAX
.
Splitting on a nibble boundary would have displayed more nicely in
hexadecimal during debugging, but oh well.
A couple of helper functions are in order:
int str_len(int32_t s) { return s >> 22; }
int32_t str_off(int32_t s) { return s & 0x3fffff; }
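For illustration (a hypothetical helper, not part of the original program), consuming a handle later is just the two accessors plus the buffer base:
// Hypothetical helper: turn a handle back into a pointer/length view.
typedef struct {
    wchar_t *s;
    int      len;
} strview;

strview str_view(struct buf *b, int32_t handle)
{
    strview v = {b->buf + str_off(handle), str_len(handle)};
    return v;
}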
Rather than allocate the string buffer on the heap, it’s a static
(read:
too big for the stack) scoped to main
. I consistently call it b
.
static struct buf b;
That’s string management solved efficiently in a dozen lines of code. I briefly considered a hash table to de-duplicate strings in the buffer, but real source trees aren’t redundant enough to make up for the hash table itself, plus there’s no reason here to make that sort of time/memory trade-off.
I settled on 24-byte directory entries:
struct dir {
    uint64_t nbytes;
    uint32_t nlines;
    int32_t  name;
    int32_t  link;
    int32_t  nsubdirs;
};
For nbytes
I teetered between 32 bits and 64 bits for the byte count. No
source tree I found overflows an unsigned 32-bit integer, but LLVM comes
close, just barely overflowing a signed 31-bit integer as of this year.
Since I wanted 10x over the worst case I could find, that left me with a
64-bit integer for bytes.
For nlines
, 32 bits has plenty of overhead. More importantly, this field
is updated concurrently and atomically by multiple threads — line counting
is parallelized — and I want this program to work on 32-bit hosts limited
to 32-bit atomics.
The name
is the string handle for that directory’s name.
The link
and nsubdirs
is the tree linkage. The link
field is an
index, and serves two different purposes at different times. Initially it
will identify the directory’s parent directory, and I had originally named
it parent
. nsubdirs
is the number of subdirectories, but there is
initially no link to a directory’s children.
Like with the buffer, I pre-allocate all the directory entries I’ll need:
#define DIRS_MAX (1 << 17)
int32_t ndirs = 0;
static struct dir dirs[DIRS_MAX];
A directory handle is just an index into dirs
. The link
field is one
such handle. Like string handles, directory entries contain no pointers,
and so this dirs
buffer could be freely relocated, a la realloc
, if
the context called for such flexibility. In my program, rather than
allocate this on the heap, it’s just a static
(read: too big for the
stack) scoped to main
.
For DIRS_MAX
, I again looked at the worst case I could find, LLVM, which
requires 12,163 entries. I had hoped for 16-bit directory handles, but
that would limit source trees to 32,768 directories — not quite 10x over
the worst case. I settled on 131,072 entries: 3MiB. At only 11MiB total so
far, in the very worst case, it hardly matters that I couldn’t shave off
these extra few bytes.
$ find llvm-project -type d | wc -l
12163
Allocating a directory entry is just a matter of bumping the ndirs
counter. Reading a directory into dirs
looks roughly like so:
int32_t glob = buf_push(&b, L"*");
static struct dir dirs[DIRS_MAX];
int32_t parent = ...; // an existing directory handle
wchar_t path[MAX_PATH];
buildpath(path, &b, dirs, parent, glob);
WIN32_FIND_DATAW fd;
HANDLE h = FindFirstFileW(path, &fd);
do {
    if (FILE_ATTRIBUTE_DIRECTORY & fd.dwFileAttributes) {
        int32_t name = buf_push(&b, fd.cFileName);
        if (name < 0 || ndirs == DIRS_MAX) {
            // out of memory
        }
        int32_t i = ndirs++;
        dirs[i].name = name;
        dirs[i].link = parent;
        dirs[parent].nsubdirs++;
    } else {
        // ... process file ...
    }
} while (FindNextFileW(h, &fd));
FindClose(h);
Mentally bookmark that “process file” part. It will be addressed later.
The buildpath
function walks the link
fields, copying (memcpy
) path
components from the string buffer into the path
, separated by
backslashes.
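The buildpath function itself isn’t listed in the article, but a minimal sketch under the structures above might look like so. This is not the original definition; bounds checks are omitted, and the recursion depth is safely bounded by MAX_PATH/2.
// Hedged sketch of buildpath. It recurses up the parent links so that
// ancestor components land in the path first, then appends the final
// name component (a subdirectory name, file name, or "*" glob).
static int32_t pathcat(wchar_t *path, struct buf *b, struct dir *dirs,
                       int32_t d)
{
    int32_t n = 0;
    if (dirs[d].link >= 0) {
        n = pathcat(path, b, dirs, dirs[d].link);
        path[n++] = L'\\';
    }
    int32_t len = str_len(dirs[d].name);
    memcpy(path+n, b->buf+str_off(dirs[d].name), len*sizeof(wchar_t));
    return n + len;
}

static void buildpath(wchar_t *path, struct buf *b, struct dir *dirs,
                      int32_t parent, int32_t name)
{
    int32_t n = pathcat(path, b, dirs, parent);
    path[n++] = L'\\';
    int32_t len = str_len(name);
    memcpy(path+n, b->buf+str_off(name), len*sizeof(wchar_t));
    path[n+len] = 0;
}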
At the top-level the program must first traverse a tree. There are two strategies for traversing a tree (or any graph): depth-first, which maps naturally onto recursion or an explicit stack, and breadth-first, which maps naturally onto a queue.
Recursion makes me nervous, but besides this, a queue is already a natural
fit for this problem. The tree I build in dirs
is also the breadth-first
processing queue. (Note: This is entirely distinct from the message
queue that I’ll introduce later, and is not a concurrent queue.) Further,
building the tree in dirs
via breadth-first traversal will have useful
properties later.
The queue is initialized with the root directory, then iterated over until the iterator reaches the end. Additional directories may be added during iteration, per the last section.
int32_t root = ndirs++;
dirs[root].name = buf_push(&b, L".");
dirs[root].link = -1; // terminator
for (int32_t parent = 0; parent < ndirs; parent++) {
    // ... FindFirstFileW / FindNextFileW ...
}
When the loop exits, the program has traversed the full tree. Counts are
now propagated up the tree using the link
field, pointing from leaves to
root. In this direction it’s just a linked list. Propagation starts at the
root and works towards leaves to avoid multiple-counting, and the
breadth-first dirs
is already ordered for this.
for (int32_t i = 1; i < ndirs; i++) {
    for (int32_t j = dirs[i].link; j >= 0; j = dirs[j].link) {
        dirs[j].nbytes += dirs[i].nbytes;
        dirs[j].nlines += dirs[i].nlines;
    }
}
Since this is really another traversal, this could be done during the first traversal. However, line counting will be done concurrently, and it’s easier, and probably more efficient, to propagate concurrent results after the concurrent part of the code is complete.
Printing the graph will require a depth-first traversal. Given an entry, the program will iterate over its children. However, the tree links are currently backwards, pointing from child to parent. To traverse from root to leaves, those links will need to be inverted.
However, there’s only one link
on each node, but potentially multiple
children. The breadth-first traversal comes to the rescue: All child nodes
for a given directory are adjacent in dirs
. If link
points to the
first child, finding the rest is trivial. There’s an implicit link between
siblings by virtue of position:
An entry’s first child immediately follows the previous entry’s last
child. So to flip the links around, manually establish the root’s link
field, then walk the tree breadth-first and hook link
up to each entry’s
children based on the previous entry’s link
and nsubdirs
:
dirs[0].link = 1;
for (int32_t i = 1; i < ndirs; i++) {
    dirs[i].link = dirs[i-1].link + dirs[i-1].nsubdirs;
}
The tree is now restructured for sorting and depth-first traversal.
I won’t include it here, but I have a qsort
-compatible comparison
function, dircmp, that compares by line count descending, then by name
ascending. As a file system tree, siblings cannot have equal names.
int dircmp(const void *, const void *);
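A hedged sketch of that comparison (not the original definition) might look like the following. Since qsort comparators carry no context, it assumes a file-scope pointer, cmpbuf, to the string buffer; the real program may arrange this differently.
// Line count descending, then name ascending, using the handle accessors.
static struct buf *cmpbuf;  // assumed file-scope pointer to the buffer

static int dircmp(const void *pa, const void *pb)
{
    const struct dir *a = pa, *b = pb;
    if (a->nlines != b->nlines) {
        return a->nlines > b->nlines ? -1 : +1;
    }
    int alen = str_len(a->name);
    int blen = str_len(b->name);
    int n = alen < blen ? alen : blen;
    int r = wmemcmp(cmpbuf->buf + str_off(a->name),
                    cmpbuf->buf + str_off(b->name), n);
    return r ? r : alen - blen;
}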
Since child entries are adjacent, it’s trivial to qsort
each entry’s
children. A loop sorts the whole tree:
for (int32_t i = 0; i < ndirs; i++) {
    struct dir *beg = dirs + dirs[i].link;
    qsort(beg, dirs[i].nsubdirs, sizeof(*dirs), dircmp);
}
We’re almost to the finish line.
As I said, recursion makes me nervous, so I took the slightly more
complicated route of an explicit stack. Path components must be separated
by a backslash delimiter, so the deepest possible stack is MAX_PATH/2
.
Each stack element tracks a directory handle (d
) and a subdirectory
index (i
).
I have a printstat
to output an entry. It takes an entry, the string
buffer, and a depth for indentation level.
void printstat(struct dir *d, struct buf *b, int depth);
Here’s a simplified depth-first traversal calling printstat
. (The real
one has to make decisions about when to stop and summarize, and it’s
dominated by edge cases.) I initialize the stack with the root directory,
then loop until it’s empty.
int n = 0; // top of stack
struct {
    int32_t d;
    int32_t i;
} stack[MAX_PATH/2];
stack[n].d = 0;
stack[n].i = 0;
printstat(dirs+0, &b, n);
while (n >= 0) {
    int32_t d = stack[n].d;
    int32_t i = stack[n].i++;
    if (i >= dirs[d].nsubdirs) {
        n--; // pop
    } else {
        int32_t cur = dirs[d].link + i;
        printstat(dirs+cur, &b, n);
        n++; // push
        stack[n].d = cur;
        stack[n].i = 0;
    }
}
At this point the “process file” part of traversal was a straightforward
CreateFile
, ReadFile
loop, CloseHandle
. I suspected it spent most of
its time in the loop counting newlines since I didn’t do anything special,
like SIMD, aside from not over-constraining code
generation.
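For a concrete picture, here’s a hedged sketch of that per-file step; processfile is named in the article, but this body is an illustration, not the original. The atomic add matches the concurrent update of nlines described earlier.
// Open, read in chunks, count newlines, then atomically add the total
// to the owning directory entry.
static void processfile(wchar_t *path, struct dir *dirs, int32_t d)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING, 0, 0);
    if (h == INVALID_HANDLE_VALUE) {
        return;
    }
    uint32_t nlines = 0;
    for (;;) {
        char buf[1<<16];
        DWORD len;
        if (!ReadFile(h, buf, sizeof(buf), &len, 0) || !len) {
            break;
        }
        for (DWORD i = 0; i < len; i++) {
            nlines += buf[i] == '\n';
        }
    }
    CloseHandle(h);
    #if __GNUC__
    __atomic_fetch_add(&dirs[d].nlines, nlines, __ATOMIC_RELAXED);
    #else
    InterlockedExchangeAdd((volatile LONG *)&dirs[d].nlines, (LONG)nlines);
    #endif
}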
However after taking some measurements, I found the program was spending
99.9% its time waiting on Win32 functions. CreateFile
was the most
expensive at nearly 50% of the total run time, and even CloseHandle
was
a substantial blocker. These two alone meant overlapped I/O wouldn’t help
much, and threads were necessary to run these Win32 blockers concurrently.
Counting newlines, even over gigabytes of data, was practically free, and
so required no further attention.
So I set up my lock-free work queue.
#define QUEUE_LEN (1<<15)
struct queue {
    uint32_t q;
    int32_t  d[QUEUE_LEN];
    int32_t  f[QUEUE_LEN];
};
As before, q
here is the atomic. A max-size queue for QUEUE_LEN
worked
best in my tests. Larger queues were rarely full. Or empty, except at
startup and shutdown. Queue elements are a pair of directory handle (d
)
and file string handle (f
), stored in separate arrays.
I didn’t need to push the file name strings into the string buffer before, but now it’s a great way to supply strings to other threads. I push the string into the buffer, then send the handle through the queue. The recipient re-constructs the path on its end using the directory tree and this file name. Unfortunately this puts more stress on the string buffer, which is why I had to max out the size, but it’s worth it.
The “process files” part now looks like this:
dirs[parent].nbytes += fd.nFileSizeLow;
dirs[parent].nbytes += (uint64_t)fd.nFileSizeHigh << 32;
int32_t name = buf_push(&b, fd.cFileName);
if (!queue_send(&queue, parent, name)) {
    wchar_t path[MAX_PATH];
    buildpath(path, &b, dirs, parent, name);
    processfile(path, dirs, parent);
}
If queue_send()
returns false then the queue is full, so it processes
the job itself. There might be room later for the next file.
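The queue_send and queue_recv functions aren’t shown in the article either, but hypothetical wrappers over the 32-bit queue primitives covered later in this document might look like so. This sketch assumes the q field is declared _Atomic uint32_t; in the multi-consumer case the d and f arrays would also want atomic qualification, a detail elided here.
// Hypothetical wrappers; the real definitions may differ.
static _Bool queue_send(struct queue *q, int32_t d, int32_t f)
{
    int i = queue_push(&q->q, 15);  // 15 matches QUEUE_LEN == 1<<15
    if (i < 0) {
        return 0;  // full: the caller processes the file itself
    }
    q->d[i] = d;
    q->f[i] = f;
    queue_push_commit(&q->q);
    return 1;
}

static _Bool queue_recv(struct queue *q, int32_t *d, int32_t *f)
{
    uint32_t save;
    int i = queue_mpop(&q->q, 15, &save);
    if (i < 0) {
        return 0;  // empty
    }
    *d = q->d[i];
    *f = q->f[i];
    return queue_mpop_commit(&q->q, save);  // on failure the caller retries
}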
Worker threads look similar, spinning until an item arrives in the queue:
for (;;) {
    int32_t d;
    int32_t name;
    while (!queue_recv(q, &d, &name));
    if (d == -1) {
        return 0;
    }
    wchar_t path[MAX_PATH];
    buildpath(path, buf, dirs, d, name);
    processfile(path, dirs, d);
}
A special directory entry handle of -1 tells the worker to exit. When traversal completes, the main thread becomes a worker until the queue empties, pushes one termination handle for each worker thread, then joins the worker threads — a synchronization point that indicates all work is complete, and the main thread can move on to propagation and sorting.
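The shutdown sequence might be sketched like this, with assumed names (NWORKERS and the workers handle array are not in the article). Before this point the main thread has drained the queue itself by running the same receive-and-process loop as the workers.
for (int i = 0; i < NWORKERS; i++) {
    while (!queue_send(&queue, -1, 0));  // -1 directory handle: exit signal
}
for (int i = 0; i < NWORKERS; i++) {
    WaitForSingleObject(workers[i], INFINITE);  // join each worker thread
}
// All counts are now final; proceed to propagation and sorting.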
This was a substantial performance boost. At least on my system, running just 4 threads total is enough to saturate the Win32 interface, and additional threads do not make the program faster despite more available cores.
Aside from overall portability, I’m quite happy with the results.
While considering concurrent queue design I came up with a generic, lock-free queue that fits in a 32-bit integer. The queue is “generic” in that a single implementation supports elements of any arbitrary type, despite an implementation in C. It’s lock-free in that there is guaranteed system-wide progress. It can store up to 32,767 elements at a time — more than enough for message queues, which must always be bounded. I will first present a single-consumer, single-producer queue, then expand support to multiple consumers at a cost. Like my lightweight barrier, I’m not presenting this as a packaged solution, but rather as a technique you can apply when circumstances call.
How can the queue store so many elements when it’s just 32 bits? It only handles the indexes of a circular buffer. The caller is responsible for allocating and manipulating the queue’s storage, which, in the single-consumer case, doesn’t require anything fancy. Synchronization is managed by the queue.
Like a typical circular buffer, it has a head index and a tail index. The head is the next element to be pushed, and the tail is the next element to be popped. The queue storage must have a power-of-two length, but the capacity is one less than the length. If the head and tail are equal then the queue is empty. This “wastes” one element, which is why the capacity is one less than the length of the storage. So already there are some notable constraints imposed by this design, but I believe the main use case for such a queue — a job queue for CPU-bound jobs — has no problem with these constraints.
Since this is a concurrent queue it’s worth noting “ownership” of storage elements. The consumer owns elements from the tail up to, but excluding, the head. The producer owns everything else. Both pushing and popping involve a “commit” step that transfers ownership of an element to the other thread. No elements are accessed concurrently, which makes things easy for either caller.
Pushing (to the front) and popping (from the back) are each a three-step process: obtain an index from the queue, access the storage element at that index, then commit the operation.
I’ll be using C11 atomics for my implementation, but it should be easy to
translate these into something else no matter the programming language. As
I mentioned, the queue fits in a 32-bit integer, and so it’s represented
by an _Atomic uint32_t
. Here’s the entire interface:
int queue_pop(_Atomic uint32_t *queue, int exp);
void queue_pop_commit(_Atomic uint32_t *queue);
int queue_push(_Atomic uint32_t *queue, int exp);
void queue_push_commit(_Atomic uint32_t *queue);
Both queue_pop
and queue_push
return -1 if the queue is empty/full.
To create a queue, initialize an atomic 32-bit integer to zero. Also choose a size exponent and allocate some storage. Here’s a 63-element queue of jobs:
#define EXP 6 // note: 2**6 == 64
struct job slots[1<<EXP];
_Atomic uint32_t q = 0;
Rather than a length, the queue functions accept a base-2 exponent, which
is why I’ve defined EXP
. If you don’t like this, you can just accept a
length in your own implementation, though remember it’s constrained to
powers of two. The producer might look like so:
for (;;) {
    int i;
    do {
        i = queue_push(&q, EXP);
    } while (i < 0); // note: busy-wait while full
    slots[i] = job_create();
    queue_push_commit(&q);
}
This is a busy-wait loop, which makes for a simple illustration but isn’t ideal. In a real program I’d have the producer run a job while it waits for a queue slot, or just have it turn into a consumer (if this wasn’t a single-consumer queue). Similarly, if the queue is empty, then maybe a consumer turns into the producer. It all depends on the context.
The consumer might look like so:
for (;;) {
    int i;
    do {
        i = queue_pop(&q, EXP);
    } while (i < 0); // note: busy-wait while empty
    struct job job = slots[i];
    queue_pop_commit(&q);
    job_run(job);
}
In either case it’s important that neither touches the element after committing since that transfers ownership away.
The queue is actually a pair of 16-bit integers, head and tail, each stored in the low and high halves of the 32-bit integer. So the first thing to do is atomically load the integer, then extract these “fields.”
If for some reason a capacity of 32,767 is insufficient, you can trivially upgrade your queue to an Enterprise Queue: a 64-bit integer with a capacity of over 2 billion elements. I’m going to stick with the 32-bit queue.
Starting with the pop operation since it’s simpler:
int queue_pop(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q; // consider "acquire"
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}
If the indexes are equal, the queue is empty. Otherwise return the tail
field. The *q
is an atomic load since it’s qualified _Atomic
. The load
might be more efficient if this were an explicit “acquire” operation,
which is what I used in some of my tests.
To complete the pop, atomically increment the tail index so that the
element falls out of the range of elements owned by the consumer. The tail
is the high half of the integer so add 0x10000
rather than just 1.
void queue_pop_commit(_Atomic uint32_t *q)
{
    *q += 0x10000; // consider "release"
}
It’s harmless if this overflows since it’s congruent with the power-of-two storage length, and an overflow won’t affect the head index. The increment might be more efficient if this were an explicit “release” operation, which, again, is what I used in some of my tests.
Pushing is a little more complex. As is typical with circular buffers, before doing anything it must ensure the result won’t ambiguously create an empty queue.
int queue_push(_Atomic uint32_t *q, int exp)
{
    uint32_t r = *q; // consider "acquire"
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    int next = (head + 1u) & mask;
    if (r & 0x8000) { // avoid overflow on commit
        *q &= ~0x8000;
    }
    return next == tail ? -1 : head;
}
It’s important that incrementing the head field won’t overflow into the tail field, so it atomically clears the high bit if set, giving the increment overhead into which it can overflow.
void queue_push_commit(_Atomic uint32_t *q)
{
    *q += 1; // consider "release"
}
The single producer and single consumer didn’t require locks nor atomic accesses to the storage array since the queue guaranteed that accesses at the specified index were not concurrent. However, this is not the case with multiple-consumers. Consumers race when popping. The loser’s access might occur after the winner’s commit, making its access concurrent with the producer. Both producer and consumers must account for this.
_Atomic struct job slots[1<<EXP];
To prepare for multiple consumers, the array now has an atomic qualifier: one of the costs of multiple consumers. Fortunately these new atomic accesses can use a “relaxed” ordering since there are no required ordering constraints. Even if it wasn’t atomic, and the load was torn, we’d detect it when attempting to commit. It’s simply against the rules to have a data race, and I don’t know how else to avoid it other than dropping into assembly.
The next cost is that committing can fail. Another consumer might have won
the race, which means you must start over. Here’s my multiple-consumer
interface, which I’ve uncreatively called mpop
(“multiple-consumer
pop”). Besides a _Bool
for indicating failure, the main change is a new
save
parameter:
int queue_mpop(_Atomic uint32_t *, int, uint32_t *save);
_Bool queue_mpop_commit(_Atomic uint32_t *, uint32_t save);
The caller must carry some temporary state (save
), which is how failures
are detected, ultimately communicated by that _Bool
return.
for (;;) {
    int i;
    uint32_t save;
    struct job job;
    do {
        do {
            i = queue_mpop(&q, EXP, &save);
        } while (i < 0); // note: busy-wait while empty
        job = slots[i];
    } while (!queue_mpop_commit(&q, save));
    job_run(job);
}
It’s important that the consumer doesn’t attempt to use job
until a
successful commit, since it might not be valid. As noted, that load could
be relaxed (what a mouthful):
job = atomic_load_explicit(slots+i, memory_order_relaxed);
Here’s the pop implementation:
int queue_mpop(_Atomic uint32_t *q, int exp, uint32_t *save)
{
    uint32_t r = *save = *q;
    int mask = (1u << exp) - 1;
    int head = r     & mask;
    int tail = r>>16 & mask;
    return head == tail ? -1 : tail;
}
So far it’s exactly the same, except it stores a full snapshot of the
queue state in *save
. This is needed for a compare-and-swap (CAS) in the
commit, which checks that the queue hasn’t been modified concurrently
(i.e. by another consumer):
_Bool queue_mpop_commit(_Atomic uint32_t *q, uint32_t save)
{
    return atomic_compare_exchange_strong(q, &save, save+0x10000);
}
As always with CAS, we must be wary of the ABA problem. Imagine that between starting to pop and this CAS that the producer and another consumer looped over the entire queue and ended up back at exactly the same spot as where we started. The queue would look like we expect, and the commit would “succeed” despite reading a garbage value.
Fortunately this matches the entire 32-bit state, and so a small queue capacity is not at a greater risk. The tail counter is always 16 bits, and the head counter is 15 bits (due to keeping the 16th clear for overflow). The chance of them landing at exactly the same count is low. Though if those odds aren’t low enough, as mentioned you can always upgrade to the 64-bit Enterprise Queue with larger counters.
There’s a notable performance defect with this particular design. If the producer concurrently pushes a new value, the commit will fail even if there was no real race since only the head field changed. It would be better if the head field was isolated from the tail field…
You might have noticed that there’s little reason to pack two 16-bit counters into a 32-bit integer. These could just be fields in a structure:
struct queue {
    _Atomic uint16_t head;
    _Atomic uint16_t tail;
};
While this entire structure can be atomically loaded just like the 32-bit integer, C11 (and later) do not permit non-atomic accesses to these atomic fields in an unshared copy loaded from an atomic. So I’d either use compiler-specific built-ins for atomics — much more flexible, and what I prefer anyway — or just load them individually:
int queue_pop(struct queue *q, int exp, uint16_t *save)
{
    int mask = (1u << exp) - 1;
    int head = q->head & mask;
    int tail = (*save = q->tail) & mask;
    return head == tail ? -1 : tail;
}
Technically with two loads this could extract a head
/tail
pair that
were never contemporaneous. The worst case is the queue appears empty even
if it was never actually empty.
_Bool queue_mpop_commit(struct queue *q, uint16_t save)
{
    return atomic_compare_exchange_strong(&q->tail, &save, save+1);
}
Since the head index isn’t part of the CAS, the producer can’t interfere with the commit. (Though there’s still certainly false sharing happening.)
If you want to try it out, especially with my tests: queue.c. It has both single-consumer and multiple-consumer queues, and supports a number of configurations, since I wanted to test across a variety of implementations, especially under Thread Sanitizer (TSan). On a similar note, I also implemented a concurrent queue shared between C and Go: queue.go.
If you want to skip ahead, here’s the full source, tests, and benchmark:
luhn.c
The Luhn algorithm isn’t just for credit card numbers, but they do make a nice target for a SWAR approach. The major payment processors use 16 digit numbers — i.e. 16 ASCII bytes — and typical machines today have 8-byte registers, so the input fits into two machine registers. In this context, the algorithm works like so:
Consider the number as an array of digits, and double every other digit starting with the first. For example, 6543 becomes 12, 5, 8, 3.
Sum individual digits in each element. The example becomes 3 (i.e. 1+2), 5, 8, 3.
Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.
I will implement this algorithm in C with this prototype:
int luhn(const char *s);
It assumes the input is 16 bytes and only contains digits, and it will return the Luhn sum. Callers either validate a number by comparing the result to zero, or use it to compute a check digit when generating a number. (Read: You could use SWAR to rapidly generate valid numbers.)
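As a point of reference, here’s a hedged sketch of the kind of straightforward digit-by-digit implementation that the SWAR version is later benchmarked against; the article’s actual scalar baseline may differ.
// Scalar reference: double every other digit starting with the first,
// fold the tens place into the ones place, and sum modulo 10.
int luhn_scalar(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        int d = s[i] - '0';
        if (i % 2 == 0) {
            d *= 2;
            d = d/10 + d%10;
        }
        sum += d;
    }
    return sum % 10;
}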
The plan is to process the 16-digit number in two halves, and so first
load the halves into 64-bit registers, which I’m calling hi
and lo
:
uint64_t hi =
(uint64_t)(s[ 0]&255) << 0 | (uint64_t)(s[ 1]&255) << 8 |
(uint64_t)(s[ 2]&255) << 16 | (uint64_t)(s[ 3]&255) << 24 |
(uint64_t)(s[ 4]&255) << 32 | (uint64_t)(s[ 5]&255) << 40 |
(uint64_t)(s[ 6]&255) << 48 | (uint64_t)(s[ 7]&255) << 56;
uint64_t lo =
(uint64_t)(s[ 8]&255) << 0 | (uint64_t)(s[ 9]&255) << 8 |
(uint64_t)(s[10]&255) << 16 | (uint64_t)(s[11]&255) << 24 |
(uint64_t)(s[12]&255) << 32 | (uint64_t)(s[13]&255) << 40 |
(uint64_t)(s[14]&255) << 48 | (uint64_t)(s[15]&255) << 56;
This looks complicated and possibly expensive, but it’s really just an idiom for loading a little endian 64-bit integer from a buffer. Breaking it down:
The input, *s
, is char
, which may be signed on some architectures. I
chose this type since it’s the natural type for strings. However, I do
not want sign extension, so I mask the low byte of the possibly-signed
result by ANDing with 255. It’s as though *s
was unsigned char
.
The shifts assemble the 64-bit result in little endian byte order regardless of the host machine byte order. In other words, this will produce correct results even on big endian hosts.
I chose little endian since it’s the natural byte order for all the architectures I care about. Big endian hosts may pay a cost on this load (byte swap instruction, etc.). The rest of the function could just as easily be computed over a big endian load if I was primarily targeting a big endian machine instead.
I could have used unsigned long long
(i.e. at least 64 bits) since
no part of this function requires exactly 64 bits. I chose uint64_t
since it’s succinct, and in practice, every implementation supporting
long long
also defines uint64_t
.
Both GCC and Clang figure this all out and produce perfect code. On x86-64, just one instruction for each statement:
mov rax, [rdi+0]
mov rdx, [rdi+8]
Or, more impressively, loading both using a single instruction on ARM64:
ldp x0, x1, [x0]
The next step is to decode ASCII into numeric values. This is trivial and
common in SWAR, and only requires subtracting '0'
(0x30
). So long
as there is no overflow, this can be done lane-wise.
hi -= 0x3030303030303030;
lo -= 0x3030303030303030;
Each byte of the register now contains values in 0–9. Next, double every other digit. Multiplication in SWAR is not easy, but doubling just means adding the odd lanes to themselves. I can mask out the lanes that are not doubled. Regarding the mask, recall that the least significant byte is the first byte (little endian).
hi += hi & 0x00ff00ff00ff00ff;
lo += lo & 0x00ff00ff00ff00ff;
Each byte of the register now contains values in 0–18. Now for the tricky problem of folding the tens place into the ones place. Unlike 8 or 16, 10 is not a particularly convenient base for computers, especially since SWAR lacks lane-wide division or modulo. Perhaps a lane-wise binary-coded decimal could solve this. However, I have a better trick up my sleeve.
Consider that the tens place is either 0 or 1. In other words, we really only care if the value in the lane is greater than 9. If I add 6 to each lane, the 5th bit (value 16) will definitely be set in any lanes that were previously at least 10. I can use that bit as the tens place.
hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;
lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;
This code adds 6 to the doubled lanes, shifts the 5th bit to the least significant position in the lane, masks for just that bit, and adds it lane-wise to the total. Only applying this to doubled lanes is a style decision, and I could have applied it to all lanes for free.
The astute might notice I’ve strayed from the stated algorithm. A lane that was holding, say, 12 now holds 13 rather than 3. Since the final result of the algorithm is modulo 10, leaving the tens place alone is harmless, so this is fine.
At this point each lane contains values in 0–19. Now that the tens processing is done, I can combine the halves into one register with a lane-wise sum:
hi += lo;
Each lane contains values in 0–38. I would have preferred to do this sooner, but that would have complicated tens place handling. Even if I had rotated the doubled lanes in one register to even out the sums, some lanes may still have had a 2 in the tens place.
The final step is a horizontal sum reduction using the typical SWAR approach. Add the top half of the register to the bottom half, then the top half of what’s left to the bottom half, etc.
hi += hi >> 32;
hi += hi >> 16;
hi += hi >> 8;
Before the sum I said each lane was 0–38, so couldn’t this sum be as high as 304 (8x38)? It would overflow the lane, giving an incorrect result. Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled lanes. That’s a maximum of 224, which fits in the result lane without overflow. Whew! I’ve been tracking the range all along to guard against overflow like this.
Finally mask the result lane and return it modulo 10:
return (hi&255) % 10;
On my machine, SWAR is around 3x faster than a straightforward digit-by-digit implementation.
For example, validating a number and computing a check digit look like so:
int is_valid(const char *s)
{
    return luhn(s) == 0;
}
void random_credit_card(char *s)
{
    sprintf(s, "%015llu0", rand64()%1000000000000000);
    s[15] = '0' + (10 - luhn(s))%10;
}
Conveniently, all the SWAR operations translate directly into SSE2 instructions. If you understand the SWAR version, then this is easy to follow:
int luhn(const char *s)
{
    __m128i r = _mm_loadu_si128((void *)s);
    // decode ASCII
    r = _mm_sub_epi8(r, _mm_set1_epi8(0x30));
    // double every other digit
    __m128i m = _mm_set1_epi16(0x00ff);
    r = _mm_add_epi8(r, _mm_and_si128(r, m));
    // extract and add tens digit
    __m128i t = _mm_set1_epi16(0x0006);
    t = _mm_add_epi8(r, t);
    t = _mm_srai_epi32(t, 4);
    t = _mm_and_si128(t, _mm_set1_epi8(1));
    r = _mm_add_epi8(r, t);
    // horizontal sum
    r = _mm_sad_epu8(r, _mm_set1_epi32(0));
    r = _mm_add_epi32(r, _mm_shuffle_epi32(r, 2));
    return _mm_cvtsi128_si32(r) % 10;
}
On my machine, the SIMD version is around another 3x increase over SWAR, and so nearly an order of magnitude faster than a digit-by-digit implementation.
Update: Const-me on Hacker News suggests a better option for handling the tens digit in the function above, shaving off 7% of the function’s run time on my machine:
// if (digit > 9) digit -= 9
__m128i nine = _mm_set1_epi8(9);
__m128i gt = _mm_cmpgt_epi8(r, nine);
r = _mm_sub_epi8(r, _mm_and_si128(gt, nine));
Update: u/aqrit on reddit has come up with a more optimized SSE2 solution, 12% faster than mine on my machine:
int luhn(const char *s)
{
    __m128i v = _mm_loadu_si128((void *)s);
    __m128i m = _mm_cmpgt_epi8(_mm_set1_epi16('5'), v);
    v = _mm_add_epi8(v, _mm_slli_epi16(v, 8));
    v = _mm_add_epi8(v, m); // subtract 1 if less than 5
    v = _mm_sad_epu8(v, _mm_setzero_si128());
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 2));
    return (_mm_cvtsi128_si32(v) - 4) % 10;
    // (('0' * 24) - 8) % 10 == 4
}
The other day I wanted to try the famous memory reordering experiment for myself. It’s the double-slit experiment of concurrency, where a program can observe an “impossible” result on common hardware, as though a thread had time-traveled. While getting thread timing as tight as possible, I designed a possibly-novel thread barrier. It’s purely spin-locked, the entire footprint is a zero-initialized integer, it automatically resets, it can be used across processes, and the entire implementation is just three to four lines of code.
Here’s the entire barrier implementation for two threads in C11.
// Spin-lock barrier for two threads. Initialize *barrier to zero.
void barrier_wait(_Atomic uint32_t *barrier)
{
    uint32_t v = ++*barrier;
    if (v & 1) {
        for (v &= 2; (*barrier&2) == v;);
    }
}
Or in Go:
func BarrierWait(barrier *uint32) {
    v := atomic.AddUint32(barrier, 1)
    if v&1 == 1 {
        v &= 2
        for atomic.LoadUint32(barrier)&2 == v {
        }
    }
}
Even more, these two implementations are compatible with each other. C threads and Go goroutines can synchronize on a common barrier using these functions. Also note how it only uses two bits.
When I was done with my experiment, I did a quick search online for other spin-lock barriers to see if anyone came up with the same idea. I found a couple of subtly-incorrect spin-lock barriers, and some straightforward barrier constructions using a mutex spin-lock.
Before diving into how this works, and how to generalize it, let’s discuss the circumstance that led to its design.
Here’s the setup for the memory reordering experiment, where w0
and w1
are initialized to zero.
thread#1      thread#2
w0 = 1        w1 = 1
r1 = w1       r0 = w0
Considering all the possible orderings, it would seem that at least one of
r0
or r1
is 1. There seems to be no ordering where r0
and r1
could
both be 0. However, if raced precisely, this is a frequent or possibly
even majority occurrence on common hardware, including x86 and ARM.
How to go about running this experiment? These are concurrent loads and
stores, so it’s tempting to use volatile
for w0
and w1
. However,
this would constitute a data race — undefined behavior in at least C and
C++ — and so we couldn’t really reason much about the results, at least
not without first verifying the compiler’s assembly. These are variables
in a high-level language, not architecture-level stores/loads, even with
volatile
.
So my first idea was to use a bit of inline assembly for all accesses that would otherwise be data races. x86-64:
static int experiment(int *w0, int *w1)
{
    int r1;
    __asm volatile (
        "movl $1, %1\n"
        "movl %2, %0\n"
        : "=r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}
ARM64 (to try on my Raspberry Pi):
static int experiment(int *w0, int *w1)
{
    int r1 = 1;
    __asm volatile (
        "str %w0, %1\n"
        "ldr %w0, %2\n"
        : "+r"(r1), "=m"(*w0)
        : "m"(*w1)
    );
    return r1;
}
This is from the point-of-view of thread#1, but I can swap the arguments
for thread#2. I’m expecting this to be inlined, and encouraging it with
static
.
Alternatively, I could use C11 atomics with a relaxed memory order:
static int experiment(_Atomic int *w0, _Atomic int *w1)
{
    atomic_store_explicit(w0, 1, memory_order_relaxed);
    return atomic_load_explicit(w1, memory_order_relaxed);
}
Since this is a race and I want both threads to run their two experiment instructions as simultaneously as possible, it would be wise to use some sort of starting barrier… exactly the purpose of a thread barrier! It will hold the threads back until they’re both ready.
int w0, w1, r0, r1;

// thread#1                   // thread#2
w0 = w1 = 0;
BARRIER;                      BARRIER;
r1 = experiment(&w0, &w1);    r0 = experiment(&w1, &w0);
BARRIER;                      BARRIER;
if (!r0 && !r1) {
    puts("impossible!");
}
The second thread goes straight into the barrier, but the first thread does a little more work to initialize the experiment and a little more at the end to check the result. The second barrier ensures they’re both done before checking.
Running this only once isn’t so useful, so each thread loops a few million times, hence the re-initialization in thread#1. The barriers keep them lockstep.
On my first attempt, I made the obvious decision for the barrier: I used
pthread_barrier_t
. I was already using pthreads for spawning the
extra thread, including on Windows, so this was convenient.
However, my initial results were disappointing. I only observed an “impossible” result around one in a million trials. With some debugging I determined that the pthreads barrier was just too damn slow, throwing off the timing. This was especially true with winpthreads, bundled with Mingw-w64, which in addition to the per-barrier mutex, grabs a global lock twice per wait to manage the barrier’s reference counter.
All pthreads implementations I used were quick to yield to the system scheduler. The first thread to arrive at the barrier would go to sleep, the second thread would wake it up, and it was rare they’d actually race on the experiment. This is perfectly reasonable for a pthreads barrier designed for the general case, but I really needed a spin-lock barrier. That is, the first thread to arrive spins in a loop until the second thread arrives, and it never interacts with the scheduler. This happens so frequently and quickly that it should only spin for a few iterations.
Spin locking means atomics. By default, atomics have sequentially
consistent ordering and will provide the necessary synchronization for the
non-atomic experiment variables. Stores (e.g. to w0
, w1
) made before
the barrier will be visible to all other threads upon passing through the
barrier. In other words, the initialization will propagate before either
thread exits the first barrier, and results propagate before either thread
exits the second barrier.
I know statically that there are only two threads, simplifying the implementation. The plan: When threads arrive, they atomically increment a shared variable to indicate such. The first to arrive will see an odd number, telling it to atomically read the variable in a loop until the other thread changes it to an even number.
At first with just two threads this might seem like a single bit would suffice. If the bit is set, the other thread hasn’t arrived. If clear, both threads have arrived.
void broken_wait1(_Atomic unsigned *barrier)
{
    ++*barrier;
    while (*barrier&1);
}
Or to avoid an extra load, use the result directly:
void broken_wait2(_Atomic unsigned *barrier)
{
    if (++*barrier & 1) {
        while (*barrier&1);
    }
}
Neither of these work correctly, and the other mutex-free barriers I found all have the same defect. Consider the broader picture: Between atomic loads in the first thread spin-lock loop, suppose the second thread arrives, passes through the barrier, does its work, hits the next barrier, and increments the counter. Both threads see an odd counter simultaneously and deadlock. No good.
To fix this, the wait function must also track the phase. The first barrier is the first phase, the second barrier is the second phase, etc. Conveniently the rest of the integer acts like a phase counter! Writing this out more explicitly:
void barrier_wait(_Atomic unsigned *barrier)
{
    unsigned observed = ++*barrier;
    unsigned thread_count = observed & 1;
    if (thread_count != 0) {
        // not last arrival, watch for phase change
        unsigned init_phase = observed >> 1;
        for (;;) {
            unsigned current_phase = *barrier >> 1;
            if (current_phase != init_phase) {
                break;
            }
        }
    }
}
The key: When the last thread arrives, it overflows the thread counter to zero and increments the phase counter in one operation.
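As a worked illustration for the two-thread case, where bit 0 is the thread counter and bit 1 is the phase:
// 0b00  idle, phase 0
// 0b01  first thread arrives (count = 1) and spins while the phase bit is 0
// 0b10  second thread arrives: bit 0 overflows into bit 1, the phase flips,
//       the spinning thread observes the change, and both proceed
// 0b11  first arrival of the next phase, which now spins on phase bit == 1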
By the way, I’m using unsigned
since it may eventually overflow, and
even _Atomic int
overflow is undefined for the ++
operator. However,
if you use atomic_fetch_add
or C++ std::atomic
then overflow is
defined and you can use int
.
Threads can never be more than one phase apart by definition, so only one
bit is needed for the phase counter, making this effectively a two-phase,
two-bit barrier. In my final implementation, rather than shift (>>
), I
mask (&
) the phase bit with 2.
With this spin-lock barrier, the experiment observes r0 = r1 = 0
in ~10%
of trials on my x86 machines and ~75% of trials on my Raspberry Pi 4.
Two threads required two bits. This generalizes to log2(n)+1
bits for
n
threads, where n
is a power of two. You may have already figured out
how to support more threads: spend more bits on the thread counter.
// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
void barrier_waitn(_Atomic unsigned *barrier, int n)
{
unsigned v = ++*barrier;
if (v & (n - 1)) {
for (v &= n; (*barrier&n) == v;);
}
}
Note: It never makes sense for n
to exceed the logical core count!
If it does, then at least one thread must not be actively running. The
spin-lock ensures it does not get scheduled promptly, and the barrier will
waste lots of resources doing nothing in the meantime.
If the barrier is used little enough that you won’t overflow the overall
barrier integer — maybe just use a uint64_t
— an implementation could
support arbitrary thread counts with the same principle using modular
division instead of the &
operator. The denominator is ideally a
compile-time constant in order to avoid paying for division in the
spin-lock loop.
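As an illustration (my own sketch, not from the article), assuming a 64-bit counter that won't overflow in practice and an n the compiler can turn into something cheaper than a runtime division:
#include <stdint.h>

// Spin-lock barrier for any n > 0. Initialize *barrier to zero. The
// counter tracks total arrivals, so the phase number is arrivals / n.
void barrier_wait_any(_Atomic uint64_t *barrier, int n)
{
    uint64_t v = ++*barrier;
    if (v % n) {                        // not the last arrival this phase
        uint64_t phase = v / n;
        while (*barrier / n == phase);  // spin until the phase advances
    }
}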
While C11 _Atomic
seems like it would be useful, unsurprisingly it is
not supported by one major, stubborn implementation. If you’re
using C++11 or later, then go ahead and use std::atomic<int>
since it’s
well-supported. In real, practical C programs, I will continue using dual
implementations: interlocked functions on MSVC, and GCC built-ins (also
supported by Clang) everywhere else.
#if __GNUC__
# define BARRIER_INC(x) __atomic_add_fetch(x, 1, __ATOMIC_SEQ_CST)
# define BARRIER_GET(x) __atomic_load_n(x, __ATOMIC_SEQ_CST)
#elif _MSC_VER
# define BARRIER_INC(x) _InterlockedIncrement(x)
# define BARRIER_GET(x) _InterlockedOr(x, 0)
#endif
// Spin-lock barrier for n threads, where n is a power of two.
// Initialize *barrier to zero.
static void barrier_wait(int *barrier, int n)
{
int v = BARRIER_INC(barrier);
if (v & (n - 1)) {
for (v &= n; (BARRIER_GET(barrier)&n) == v;);
}
}
This has the nice bonus that the interface does not have the _Atomic
qualifier, nor std::atomic
template. It’s just a plain old int
, making
the interface simpler and easier to use. It’s something I’ve grown to
appreciate from Go.
If you’d like to try the experiment yourself: reorder.c
. If
you’d like to see a test of Go and C sharing a thread barrier:
coop.go
.
I’m intentionally not providing the spin-lock barrier as a library. First, it’s too trivial and small for that, and second, I believe context is everything. Now that you understand the principle, you can whip up your own, custom-tailored implementation when the situation calls for it, just as the one in my experiment is hard-coded for exactly two threads.
]]>Last week one particular QuickBASIC clone, WorDOSle, caught my eye. It embeds its word list despite the dire constraints of its 16-bit platform. The original Wordle list (1, 2) has 12,972 words which, naively stored, would consume 77,832 bytes (5 letters, plus newline). Sadly this exceeds a 16-bit address space. Eliminating the redundant newline delimiter brings it down to 64,860 bytes — just small enough to fit in an 8086 segment, but probably still difficult to manage from QuickBASIC.
The author made a trade-off, reducing the word list to a more manageable, if meager, 2,318 words, wisely excluding delimiters. Otherwise no further effort was made towards reducing the size. The list is sorted, and the program cleverly tests words against the list in place using a binary search.
Before getting into any real compression technologies, there’s low hanging fruit to investigate. Words are exactly five, case-insensitive, English language letters: a–z. To illustrate, here are the first 100 5-letter words from a short Wordle word list.
abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
In ASCII/UTF-8 form it’s 8 bits per letter, 5 bytes per word, but I only
need 5 bits per letter, or more specifically, ~4.7 bits (log2(26)
) per
letter. If I instead treat each word as a base-26 number, I can pack each
word into 3 bytes (26**5
is ~23.5 bits). A 40% savings just by using a
smarter representation.
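For illustration, here's a sketch of that packing in C (my own, not code from the article): treat the word as a base-26 number, and since 26**5 fits in 24 bits, store it little-endian in 3 bytes.
#include <stdint.h>

void pack24(uint8_t out[3], const char word[5])
{
    uint32_t v = 0;
    for (int i = 0; i < 5; i++) {
        v = v*26 + (word[i] - 'a');   // one base-26 digit per letter
    }
    out[0] = v >>  0;                 // 26**5 < 2**24, so 3 bytes suffice
    out[1] = v >>  8;
    out[2] = v >> 16;
}

void unpack24(char word[5], const uint8_t in[3])
{
    uint32_t v = in[0] | (uint32_t)in[1]<<8 | (uint32_t)in[2]<<16;
    for (int i = 4; i >= 0; i--) {
        word[i] = 'a' + v%26;
        v /= 26;
    }
}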
With 12,972 words, that’s 38,916 bytes for the whole list. Any compression I apply must at least beat this size in order to be worth using.
Not all letters occur at the same frequency. Here’s the letter frequency for the original Wordle word list:
a:5990 e:6662 i:3759 m:1976 q: 112 u:2511 y:2074
b:1627 f:1115 j: 291 n:2952 r:4158 v: 694 z: 434
c:2028 g:1644 k:1505 o:4438 s:6665 w:1039
d:2453 h:1760 l:3371 p:2019 t:3295 x: 288
When encoding a word, I can save space by spending fewer bits on frequent
letters like e
at the cost of spending more bits on infrequent letters
like q
. There are multiple approaches, but the simplest is Huffman
coding. It’s not the most efficient, but it’s so easy I can
almost code it in my sleep.
While my ultimate target is C, I did the frequency analysis, explored the problem space, and implemented my compressors in Python. I don’t normally like to use Python, but it is good for one-shot, disposable data science-y stuff like this. The decompressor will be implemented in C, partially via meta-programming: Python code generating my C code. Here’s my letter histogram code:
import collections, heapq, itertools, sys, textwrap  # modules used by the snippets below
words = [line[:5] for line in sys.stdin]
hist = collections.defaultdict(int)
for c in itertools.chain(*words):
hist[c] += 1
To build a Huffman coding tree, I’ll need a min-heap (priority queue) initially filled with nodes representing each letter and its frequency. While the heap has more than one element, I pop off the two lowest frequency nodes, create a new parent node with the sum of their frequencies, and push it into the heap. When the heap has one element, the remaining element is the root of the Huffman coding tree.
def huffman(hist):
heap = [(n, c) for c, n in hist.items()]
heapq.heapify(heap)
while len(heap) > 1:
a, b = heapq.heappop(heap), heapq.heappop(heap)
node = a[0]+b[0], (a[1], b[1])
heapq.heappush(heap, node)
return heap[0][1]
tree = huffman(hist)
(By the way, I love that heapq
operates directly on a plain list
rather than being its own data structure.) This produces the following
Huffman coding tree (via pprint
):
((('e', 's'),
(('t', 'l'), (('g', ('v', 'w')), ('h', 'm')))),
((('i', ('p', 'c')),
('r', ('y', ('f', ('z', ('j', ('q', 'x'))))))),
(('o', ('d', 'u')), ('a', ('n', ('k', 'b'))))))
It would be more useful to actually see the encodings.
def flatten(tree, prefix=""):
if isinstance(tree, tuple):
return flatten(tree[0], prefix+"0") + \
flatten(tree[1], prefix+"1")
else:
return [(tree, prefix)]
I used isinstance
to distinguish leaves (str
) from internal nodes
(tuple
). With sorted(flatten(tree))
, I get something like Morse Code:
[('a', '1110'), ('j', '10111110'), ('s', '001'),
('b', '111111'), ('k', '111110'), ('t', '0100'),
('c', '10011'), ('l', '0101'), ('u', '11011'),
('d', '11010'), ('m', '01111'), ('v', '011010'),
('e', '000'), ('n', '11110'), ('w', '011011'),
('f', '101110'), ('o', '1100'), ('x', '101111111'),
('g', '01100'), ('p', '10010'), ('y', '10110'),
('h', '01110'), ('q', '101111110'), ('z', '1011110'),
('i', '1000'), ('r', '1010')]
In terms of encoded bit length, what is the shortest and longest?
codes = dict(flatten(tree))
lengths = [(sum(len(codes[c]) for c in w), w) for w in words]
min(lengths)
is “esses” at 15 bits, and max(lengths)
is “qajaq” at 34
bits. In other words, the worst case is worse than the compact, 24-bit
representation! However, the total is better: sum(w[0] for w in lengths)
reports 281,956 bits, or 35,245 bytes. Packed appropriately, that shaves
off ~3.5kB, though it comes at the cost of losing random access, and
therefore binary search.
Speaking of bit packing, I’m ready to compress the entire word list into a bit stream:
bits = "".join("".join(codes[c] for c in w) for w in words)
Where bits
begins with:
11101110011100001101011101110010110001000111011101...
On the C side I’ll pack these into 32-bit integers, least significant bit
first. I abused textwrap
to dice it up, and I also need to reverse each
set of bits before converting to an integer.
u32 = [int(b[::-1], 2) for b in textwrap.wrap(bits, width=32)]
I now have my compressed data as a sequence of 32-bit integers. Next, some meta-programming:
print(f"static const uint32_t words[{len(u32)}] =", "{", end="")
for i, u in enumerate(u32):
if i%6 == 0:
print("\n ", end="")
print(f"0x{u:08x},", end="")
print("\n};")
That produces a C table, the beginnings of my decompressor. The array length isn’t necessary since the C compiler can figure it out, but being explicit allows human readers to know the size at a glance, too. Observe how the final 32-bit integer isn’t entirely filled.
static const uint32_t words[8812] = {
0x4eeb0e77,0xb8caee23,0xffb892bb,0x397fddf2,0xddfcbfee,0x5ff7997f,
// ...
0x7b4e66bd,0x35ebcccd,0x8f9af60f,0x0000000c,
};
Now, how to go about building the rest of the decompressor? I have a Huffman coding tree, which is an awful lot like a state machine, eh? I can even have Python generate a state transition table from the Huffman tree:
def transitions(tree, states, state):
if isinstance(tree, tuple):
child = len(states)
states[state] = -child
states.extend((None, None))
transitions(tree[0], states, child+0)
transitions(tree[1], states, child+1)
else:
states[state] = ord(tree)
return states
states = transitions(tree, [None], 0)
The central idea: positive entries are leaves, and negative entries are
internal nodes. The negated value is the index of the left child, with the
right child immediately following. In transitions
, the caller reserves
space in the state table for callees, hence starting with [None]
. I’ll
show the actual table in C form after some more meta-programming:
print(f"static const int8_t states[{len(states)}] =", "{", end="")
for i, s in enumerate(states):
if i%12 == 0:
print("\n ", end="")
print(f"{s:4},", end="")
print("\n};")
I chose int8_t
since I know these values will all fit in an octet, and
it must be signed because of the negatives. The result:
static const int8_t states[51] = {
-1, -3, -19, -5, -7, 101, 115, -9, -11, 116, 108, -13,
-17, 103, -15, 118, 119, 104, 109, -21, -39, -23, -27, 105,
-25, 112, 99, 114, -29, 121, -31, 102, -33, 122, -35, 106,
-37, 113, 120, -41, -45, 111, -43, 100, 117, 97, -47, 110,
-49, 107, 98,
};
The first node is -1, meaning if you read a 0 bit then transition to state 1, else state 2 (i.e. the state immediately following state 1). The decompressor reads one bit at a time, walking the state table until it hits a positive value, which is an ASCII code. I've decided on this function prototype:
int32_t next(char word[5], int32_t n);
The n
is the bit index, which starts at zero. The function decodes the
word at the given index, then returns the bit index for the next word.
Callers can iterate the entire word list without decompressing the whole
list at once. Finally the decompressor code:
int32_t next(char word[5], int32_t n)
{
for (int i = 0; i < 5; i++) {
int state = 0;
for (; states[state] < 0; n++) {
int b = words[n>>5]>>(n&31) & 1; // next bit
state = b - states[state];
}
word[i] = states[state];
}
return n;
}
When compiled, this is about 80 bytes of instructions, both x86-64 and ARM64. This, along with the 51 bytes for the state table, should be counted against the compression size. That’s 35,579 bytes total.
Trying it out, this program indeed reproduces the original word list:
int main(void)
{
int32_t state = 0;
char word[] = ".....\n";
for (int i = 0; i < 12972; i++) {
state = next(word, state);
fwrite(word, 6, 1, stdout);
}
}
Searching 12,972 words linearly isn’t too bad, even for an old 16-bit
machine. However, if you really need to speed it up, you could build a
little run time index to track various bit positions in the list. For
example, the first word starting with b
is at bit offset 15,743. If the
word I’m looking up begins with b
then I can start there and stop at the
first c
, decompressing just 909 words.
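Here's a sketch of such an index (my own illustration, built on the next() decoder above): one pass over the list records, for each starting letter, the bit offset of its first word.
#include <stdint.h>

static int32_t first[27];  // first[c-'a'] = bit offset of the first c-word

void build_index(void)
{
    char word[5];
    int32_t n = 0;
    int letter = 0;
    for (int i = 0; i < 12972; i++) {
        int32_t start = n;
        n = next(word, n);
        while (letter <= word[0] - 'a') {
            first[letter++] = start;  // first word beginning with this letter
        }
    }
    while (letter <= 26) {
        first[letter++] = n;          // sentinel: end of the bit stream
    }
}
A lookup for a word beginning with some letter c then decodes only the span from first[c-'a'] up to first[c-'a'+1].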
Here’s the 100-word word list sample again. The sorting is deliberate:
abbey acute agile album alloy ample apron array attic awful
abide adapt aging alert alone angel arbor arrow audio babes
about added agree algae along anger areas ashes audit backs
above admit ahead alias aloud angle arena aside autos bacon
abuse adobe aided alien alpha angry argue asked avail badge
acids adopt aides align altar ankle arise aspen avoid badly
acorn adult aimed alike alter annex armed asses await baked
acres after aired alive amber apart armor asset awake baker
acted again aisle alley amend apple aroma atlas award balls
actor agent alarm allow among apply arose atoms aware bands
If I look at words column-wise, I see a long run of a
, then a long run
of b
, etc. Even the second column has long runs. I should really exploit
this somehow. The first scheme would have worked equally as well on a
shuffled list as a sorted list, which is an indication that it’s storing
unnecessary information, namely the word list order. (Rule of thumb:
Compression should work better on sorted inputs.)
For this second scheme, I’ll pivot the whole list so that I can encode it in column-order. (This is roughly how one part of bzip2 works, by the way.) I’ll use run-length encoding (RLE) to communicate “91 ‘a’, 135 ‘b’, etc.”, then I’ll encode these RLE tokens using Huffman coding, per the first scheme, since there will be lots of repeated tokens.
First, pivot the word list:
pivot = "".join("".join(w[i] for w in words) for i in range(5))
Next compute the RLE token stream. The stream works in pairs, first indicating a letter (1–26), then the run length.
tokens = []
offset = 0
while offset < len(pivot):
c = pivot[offset]
start = offset
while offset < len(pivot) and pivot[offset] == c:
offset += 1
tokens.append(ord(c) - ord('a') + 1)
tokens.append(offset - start)
I’ve biased the letter representation by 1 — i.e. 1–26 instead of 0–25 — since I’m going to encode all the tokens using the same Huffman tree. (Exercise for the reader: Does compression improve with two distinct Huffman trees, one for letters and the other for runs?) There are no zero-length runs, and I want there to be as few unique tokens as possible.
tokens
looks like so (e.g. 737 ‘a’, 909 ‘b’, …):
[1, 737, 2, 909, 3, 922, 4, 685, 5, 303, 6, 598, ...]
The original Wordle list results in 139 unique tokens. A few tokens appear many times, but most appear only once. Reusing my Huffman coding tree builder from before:
tree = huffman(collections.Counter(tokens))
This makes for a more complex and interesting tree:
(1,
((((18, 20), (25, (((10, 24), (26, 22)), 8))),
(5,
((11,
((23,
((17,
(((35, (46, 76)), ((82, 93), (104, 111))),
(((165, 168), 27), (28, (((30, 39), 31), 38))))),
((((((40, 41), ((44, 48), 45)),
((53, (54, 56)), 55)),
((((57, 59), 58), ((60, 61), (62, 63))),
((64, (65, 66)), ((67, 70), 68)))),
(((((71, 75), 74), (77, (78, 79))),
(((80, 85), 87), 81)),
((((90, 91), (92, 97)), (96, (99, 100))),
(((101, 103), 102),
((105, 106), (109, 110)))))),
((((((113, 114), 117), ((120, 121), (125, 129))),
(((130, 133), (137, 139)), (138, (140, 142)))),
((((144, 145), (147, 153)), (148, (166, 175))),
(((181, 183), (187, 189)),
((193, 202), (220, 242))))),
(((((262, 303), (325, 376)),
((413, 489), (577, 598))),
(((628, 638), (685, 693)),
((737, 815), (859, 909)))),
((((922, 1565), 29), 32), (34, (33, 43)))))))),
6)),
3))),
((19, 2),
((4, (15, (21, 16))), ((14, 9), (12, (13, 7)))))))
Peeking at the first 21 elements of sorted(flatten(tree))
, which chops
off the long tail of large-valued, single-occurrence tokens:
[(1, '0'), (8, '100111'), (15, '111010'),
(2, '1101'), (9, '111101'), (16, '1110111'),
(3, '10111'), (10, '10011000'), (17, '1011010100'),
(4, '11100'), (11, '101100'), (18, '10000'),
(5, '1010'), (12, '111110'), (19, '1100'),
(6, '1011011'), (13, '1111110'), (20, '10001'),
(7, '1111111'), (14, '111100'), (21, '1110110')]
Huffman-encoding the RLE stream is more straightforward:
codes = dict(flatten(tree))
bits = "".join(codes[token] for token in tokens)
This time len(bits)
is 164,958, or 20,620 bytes! A huge difference,
around 40% additional savings!
Slicing and dicing 32-bit integers and printing the table works the same
as before. However, this time the state table has larger values (e.g. that
run of 909), and so the state table will be int16_t
. I copy-pasted the
original meta-programming code and made the appropriate adjustments:
static const int16_t states[277] = {
-1, 1, -3, -5,-257, -7, -21, -9, -11, 18, 20, 25,
-13, -15, 8, -17, -19, 10, 24, 26, 22, 5, -23, -25,
3, 11, -27, -29, 6, 23, -31, -33, -63, 17, -35, -37,
-49, -39, -43, 35, -41, 46, 76, -45, -47, 82, 93, 104,
111, -51, -55, -53, 27, 165, 168, 28, -57, -59, 38, -61,
31, 30, 39, -65,-155, -67,-109, -69, -85, -71, -79, -73,
-75, 40, 41, -77, 45, 44, 48, -81, 55, 53, -83, 54,
56, -87, -99, -89, -93, -91, 58, 57, 59, -95, -97, 60,
61, 62, 63,-101,-105, 64,-103, 65, 66,-107, 68, 67,
70,-111,-129,-113,-123,-115,-119,-117, 74, 71, 75, 77,
-121, 78, 79,-125, 81,-127, 87, 80, 85,-131,-143,-133,
-139,-135,-137, 90, 91, 92, 97, 96,-141, 99, 100,-145,
-149,-147, 102, 101, 103,-151,-153, 105, 106, 109, 110,-157,
-213,-159,-185,-161,-173,-163,-167,-165, 117, 113, 114,-169,
-171, 120, 121, 125, 129,-175,-181,-177,-179, 130, 133, 137,
139, 138,-183, 140, 142,-187,-199,-189,-195,-191,-193, 144,
145, 147, 153, 148,-197, 166, 175,-201,-207,-203,-205, 181,
183, 187, 189,-209,-211, 193, 202, 220, 242,-215,-245,-217,
-231,-219,-225,-221,-223, 262, 303, 325, 376,-227,-229, 413,
489, 577, 598,-233,-239,-235,-237, 628, 638, 685, 693,-241,
-243, 737, 815, 859, 909,-247,-253,-249, 32,-251, 29, 922,
1565, 34,-255, 33, 43,-259,-261, 19, 2,-263,-269, 4,
-265, 15,-267, 21, 16,-271,-273, 14, 9, 12,-275, 13,
7,
};
(Since 277 is prime it will never wrap to a nice rectangle no matter what width I plug in. Ugh.)
With column-wise compression it’s not possible to iterate a word at a
time. The entire list must be decompressed at once. The interface now
looks like so, where the caller supplies a 12972*5
-byte buffer to be
filled:
void decompress(char *);
Exercise for the reader: Modify this to decompress into the 24-bit compact
form, so the caller only needs a 12972*3
-byte buffer.
Here’s my decoder, much like before:
void decompress(char *buf)
{
for (int32_t x = 0, y = 0, i = 0; i < 164958;) {
// Decode letter
int state = 0;
for (; states[state] < 0; i++) {
int b = words[i>>5]>>(i&31) & 1;
state = b - states[state];
}
int c = states[state] + 96;
// Decode run-length
state = 0;
for (; states[state] < 0; i++) {
int b = words[i>>5]>>(i&31) & 1;
state = b - states[state];
}
int len = states[state];
// Fill columns
for (int n = 0; n < len; n++, y++) {
buf[y*5+x] = c;
}
if (y == 12972) {
y = 0;
x++;
}
}
}
And my new test exactly reproduces the original list:
int main(void)
{
char buf[12972*5L];
decompress(buf);
char word[] = ".....\n";
for (int i = 0; i < 12972; i++) {
memcpy(word, buf+i*5, 5);
fwrite(word, 6, 1, stdout);
}
}
Totalling it up: 20,620 bytes of compressed data, 554 bytes for the 277-entry int16_t state table, and around 200 bytes of decompressor code.
That's a total of 21,374 bytes. Surprisingly this beats general purpose compressors!
PROGRAM VERSION SIZE
bzip2 -9 1.0.8 33,752
gzip -9 1.10 30,338
zstd -19 1.4.8 27,098
brotli -9 1.0.9 26,031
xz -9e 5.2.5 16,656
lzip -9 1.22 16,608
Only xz
and lzip
come out ahead on the raw compressed data, but lose
if accounting for an embedded decompressor (on the order of 10kB). Clearly
there’s an advantage to customizing compression to a particular dataset.
Update: Johannes Rudolph has pointed out a compression scheme for a Game Boy Wordle clone last month that gets it down to 17,871 bytes, and supports iteration. I improved on this scheme to further reduce it to 16,659 bytes.
]]>Unix-like systems pass the argv
array directly from parent to child. On
Linux it’s literally copied onto the child’s stack just above the stack
pointer on entry. The runtime just bumps the stack pointer address a few
bytes and calls it argv
. Here’s a minimalist x86-64 Linux runtime in
just 6 instructions (22 bytes):
_start: mov edi, [rsp] ; argc
lea rsi, [rsp+8] ; argv
call main
mov edi, eax
mov eax, 60 ; SYS_exit
syscall
It’s 5 instructions (20 bytes) on ARM64:
_start: ldr w0, [sp] ; argc
add x1, sp, 8 ; argv
bl main
mov w8, 93 ; SYS_exit
svc 0
On Windows, argv
is passed in serialized form as a string. That’s how
MS-DOS did it (via the Program Segment Prefix), because that’s how
CP/M did it. It made more sense when processes were mostly launched
directly by humans: The string was literally typed by a human operator,
and somebody has to parse it after all. Today, processes are nearly
always launched by other programs, but despite this, must still serialize
the argument array into a string as though a human had typed it out.
Windows itself provides an operating system routine for parsing command
line strings: CommandLineToArgvW. Fetch the command line string
with GetCommandLineW, pass it to this function, and you have your
argc
and argv
. Plus maybe LocalFree to clean up. It’s only available
in “wide” form, so if you want to work in UTF-8 you’ll also need
WideCharToMultiByte
. It’s around 20 lines of C rather than 6 lines of
assembly, but it’s not too bad.
GetCommandLineW returns a pointer into static storage, which is why it
doesn’t need to be freed. More specifically, it comes from the Process
Environment Block. This got me thinking: Could I locate this address
myself without the API call? First I needed to find the PEB. After some
research I found a PEB pointer in the Thread Information Block,
itself found via the gs
register (x64, fs
on x86), an old 386 segment
register. Buried in the PEB is a UNICODE_STRING
, with the
command line string address. I worked out all the offsets for both x86 and
x64, and the whole thing is just three instructions:
wchar_t *cmdline_fetch(void)
{
void *cmd = 0;
#if __amd64
__asm ("mov %%gs:(0x60), %0\n"
"mov 0x20(%0), %0\n"
"mov 0x78(%0), %0\n"
: "=r"(cmd));
#elif __i386
__asm ("mov %%fs:(0x30), %0\n"
"mov 0x10(%0), %0\n"
"mov 0x44(%0), %0\n"
: "=r"(cmd));
#endif
return cmd;
}
From Windows XP through Windows 11, this returns exactly the same address
as GetCommandLineW. There’s little reason to do it this way other than to
annoy Raymond Chen, but it’s still neat and maybe has some super niche
use. Technically some of these offsets are undocumented and/or subject to
change, except Microsoft’s own static link CRT also hardcodes all these
offsets. It’s easy to find: disassemble any statically linked program,
look for the gs
register, and you’ll find it using these offsets, too.
If you look carefully at the UNICODE_STRING
you’ll see the length is
given by a USHORT
in units of bytes, despite being a 16-bit wchar_t
string. This is the source of Windows’ maximum command line length
of 32,767 characters (including terminator).
GetCommandLineW is from kernel32.dll
, but CommandLineToArgvW is a bit
more off the beaten path in shell32.dll
. If you wanted to avoid linking
to shell32.dll
for important reasons, you’d need to do the
command line parsing yourself. Many runtimes, including Microsoft’s own
CRTs, don’t call CommandLineToArgvW and instead do their own parsing. It’s
messier than I expected, and when I started digging into it I wasn’t
expecting it to involve a few days of research.
The CommandLineToArgvW documentation has a rough explanation: split arguments on whitespace (not defined), quoting is involved, and there's something about counting backslashes, but only if they stop on a quote. It's not quite enough to implement your own, and if you test against it, it's quickly apparent that this documentation is at best incomplete. It links to a deprecated page about parsing C++ command line arguments with a few more details. Unfortunately the algorithm described on this page is not the algorithm used by CommandLineToArgvW, nor is it used by any runtime I could find. It even varies between Microsoft's own CRTs. There is no canonical command line parsing result, not even a de facto standard.
I eventually came across David Deley’s How Command Line Parameters Are
Parsed, which is the closest there is to an authoritative document on
the matter (also). Unfortunately it focuses on runtimes rather
than CommandLineToArgvW, and so some of those details aren’t captured. In
particular, the first argument (i.e. argv[0]
) follows entirely different
rules, which really confused me for a while. The Wine documentation
was helpful particularly for CommandLineToArgvW. As far as I can tell,
they’ve re-implemented it perfectly, matching it bug-for-bug as they do.
Before finding any of this, I started building my own implementation,
which I now believe matches CommandLineToArgvW. These other documents
helped me figure out what I was missing. In my usual fashion, it’s a
little state machine: cmdline.c
. The interface:
int cmdline_to_argv8(const wchar_t *cmdline, char **argv);
Unlike the others, mine encodes straight into WTF-8, a superset of UTF-8 that can round-trip ill-formed UTF-16. The WTF-8 part is negative lines of code: invisible since it involves not reacting to ill-formed input. If you use the new-ish UTF-8 manifest Win32 feature then your program cannot handle command line strings with ill-formed UTF-16, a problem solved by WTF-8.
As documented, that argv
must be a particular size — a pointer-aligned,
224kB (x64) or 160kB (x86) buffer — which covers the absolute worst case.
That’s not too bad when the command line is limited to 32,766 UTF-16
characters. The worst case argument is a single long sequence of 3-byte
UTF-8. 4-byte UTF-8 requires 2 UTF-16 code points, so there would only be
half as many. The worst case argc
is 16,383 (plus one more argv
slot
for the null pointer terminator), which is one argument for each pair of
command line characters. The second half (roughly) of the argv
is
actually used as a char
buffer for the arguments, so it’s all a single,
fixed allocation. There is no error case since it cannot fail.
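To check the arithmetic (my own back-of-the-envelope): 16,384 pointer slots cost 128kB on x64 or 64kB on x86, and a 32,766-character command line expands to at most about 3 × 32,766 ≈ 96kB of UTF-8 plus terminators, which is where the 224kB and 160kB figures come from.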
int mainCRTStartup(void)
{
static char *argv[CMDLINE_ARGV_MAX];
int argc = cmdline_to_argv8(cmdline_fetch(), argv);
return main(argc, argv);
}
Also: Note the FUZZ
option in my source. It has been pretty thoroughly
fuzz tested. It didn’t find anything, but it does make me more
confident in the result.
I also peeked at some language runtimes to see how others handle it. Just as expected, Mingw-w64 has the behavior of an old (pre-2008) Microsoft CRT. Also expected, CPython implicitly does whatever the underlying C runtime does, so its exact command line behavior depends on which version of Visual Studio was used to build the Python binary. OpenJDK pragmatically calls CommandLineToArgvW. Go (gc) does its own parsing, with behavior mixed between CommandLineToArgvW and some of Microsoft’s CRTs, but not quite matching either.
I’ve always been boggled as to why there’s no complementary inverse to
CommandLineToArgvW. When spawning processes with arbitrary arguments,
everyone is left to implement the inverse of this under-specified and
non-trivial command line format to serialize an argv
. Hopefully the
receiver parses it compatibly! There’s no falling back on a system routine
to help out. This has led to a lot of repeated effort: it's not limited
to high level runtimes, but almost any extensible application (itself a
kind of runtime). Fortunately serializing is not quite as complex as
parsing since many of the edge cases simply don’t come up if done in a
straightforward way.
Naturally, I also wrote my own implementation (same source):
int cmdline_from_argv8(wchar_t *cmdline, int len, char **argv);
Like before, it accepts a WTF-8 argv
, meaning it can correctly pass
through ill-formed UTF-16 arguments. It returns the actual command line
length. Since this one can fail when argv
is too large, it returns
zero for an error.
char *argv[] = {"python.exe", "-c", code, 0};
wchar_t cmd[CMDLINE_CMD_MAX];
if (!cmdline_from_argv8(cmd, CMDLINE_CMD_MAX, argv)) {
return "argv too large";
}
if (!CreateProcessW(0, cmd, /*...*/)) {
return "CreateProcessW failed";
}
How do others handle this?
The aged Emacs implementation is written in C rather than Lisp, steeped in history with vestigial wrong turns. Emacs still only calls the “narrow” CreateProcessA despite having every affordance to do otherwise, and uses the wrong encoding at that. A personal source of headaches.
CPython uses Python rather than C via subprocess.list2cmdline
.
While undocumented, it’s accessible on any platform and easy to
test against various inputs. Try it out!
Go (gc) is just as delightfully boring as I'd expect.
OpenJDK optimistically optimizes for command line strings under 80 bytes, and like Emacs, displays the weathering of long use.
I don’t plan to write a language implementation anytime soon, where this might be needed, but it’s nice to know I’ve already solved this problem for myself!
]]>My approach introduces a new private chunk type atCh
(“attachment”)
which contains a file name, a flag to indicate if the attachment is
compressed, and an optionally DEFLATE-compressed blob of file contents. I
tried to follow the spirit of PNG chunk formatting, but without the
constraints I hoped to avoid. A single PNG can contain multiple
attachments, e.g. source file, Makefile, README, license file, etc. The
protocol places constraints on the file names to keep it simple and to
avoid shenanigans: no control bytes (anything below ASCII space), no
directories, and cannot start with a period (no special hidden files). If
that’s too constraining, you could attach a ZIP or TAR.
PNG files begin with a fixed 8-byte header followed by a series of chunks. Each chunk has an 8-byte header and 4-byte footer. The chunk header is a 32-bit big endian chunk length (not counting header or footer) and a 4-byte tag identifying its type. The length allows implementations to skip chunks they don't recognize.
LLLL TTTT ...chunk... CCCC
The footer is a big endian CRC-32 checksum of the 4-byte type tag and the chunk body itself.
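For reference, this is the standard reflected CRC-32 (polynomial 0xEDB88320, initial and final value of all ones). A minimal bitwise sketch of my own, suitable for computing a chunk footer incrementally:
#include <stddef.h>
#include <stdint.h>

// Start with crc = 0, e.g. crc32(crc32(0, "atCh", 4), body, bodylen).
uint32_t crc32(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ (0xedb88320 & -(crc & 1));
        }
    }
    return ~crc;
}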
Chunk tags are interpreted as 4 ASCII characters, where the capitalization
of each letter encodes 4 additional boolean flags. The flags in my tag,
atCh
, indicate it’s a non-critical private chunk which doesn’t depend on
the image data.
PNG always ends with a zero-length IEND
chunk, which works out to a kind
of 12-byte constant footer.
The PNG standard currently defines three kinds of chunks for storing text
metadata: tEXt
, iTXt
, zTXt
. The first is limited to Latin-1 with LF
newlines, and so cannot store UTF-8 source text. The latter two were
introduced in the PNG 1.2 specification (November 1999), and allow (only)
UTF-8 content with LF newlines. All three have a 1 to 79-byte Latin-1
“key” field, and the latter two some additional fields describing the
language of the text.
The key field is null-terminated, making it 80 bytes maximum when treated as a null-terminated string. I believe this constraint exists to aid implementations, which can rely on this hard upper limit for the key lengths they’re expected to handle. Otherwise a key could have been up to 4GiB in length.
I had considered using part of the key as a file name, prefixed with a
custom namespace (ex. attachment:FILENAME
) to distinguish it from other
text chunks. However, I didn’t like the constraints this placed on the
file name, plus I wanted to support arbitrary file content, not limited to
a particular subformat.
As prior art, there’s a draw.io/diagrams.net format which embeds a source
string without file name. The source string is encoded in base64 (i.e.
unconstrained by PNG), wrapped in XML, then incorrectly encoded as an
iTXt
chunk. The XML alone was enough to keep me away from using this
format.
In my attachment protocol, the file name is an arbitrary length, null-terminated byte string (preferably UTF-8), much like a key field, with the previously-mentioned anti-shenanigans restrictions. The file name is followed by a byte, 0 or 1, indicating if the content is compressed using PNG’s officially-supported compression format. The rest is the arbitrary content bytes, which presumably the recipient will know how to use.
LLLL atCh example.txt 0 F ...contents... CCCC
I expect any experienced programmer could write a basic attachment extractor in their language of choice inside of 30 or so minutes. Hooking up a DEFLATE library for decompression would be the most difficult part.
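As a rough sketch of my own (assuming the atCh layout described above; no CRC check, no decompression, and standard input must be in binary mode on Windows), here's a tiny lister that prints the name of each attachment:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char sig[8], head[8];
    if (!fread(sig, 8, 1, stdin)) return 1;        // skip the PNG signature
    while (fread(head, 8, 1, stdin)) {
        uint32_t len = (uint32_t)head[0]<<24 | (uint32_t)head[1]<<16 |
                       (uint32_t)head[2]<< 8 | head[3];
        if (!memcmp(head+4, "atCh", 4)) {
            while (len) {                          // null-terminated file name
                int c = getchar();
                len--;
                if (c <= 0) break;
                putchar(c);
            }
            putchar('\n');
        }
        for (uint64_t i = 0; i < (uint64_t)len + 4; i++) {
            getchar();                             // skip rest of body + CRC
        }
    }
    return 0;
}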
Since it supports multiple attachments and behaves like an archive format,
my tool supports flags much like tar
: -c
to create attachments
(default and implicit), -t
to list attachments, and -x
to extract
attachments. PNG data is always passed on standard input and standard
output.
For example, to render a Graphviz diagram and attach the source all at once:
$ dot -Tpng graph.dot | pngattach graph.dot >graph.png
Later on someone might extract it and tweak it, like so (-v
verbose,
lists files as they’re extracted, like tar
):
$ pngattach -xv <graph.png
graph.dot
$ vi graph.dot
$ dot -Tpng graph.dot >graph.png
Like tar
, it can also write attachments to standard output with -O
.
For example, to re-render the image as an SVG:
$ pngattach -xO <graph.png | dot -Tsvg >graph.svg
Strictly processing standard input to output, rather than taking the input
as an argument, is something I’ve been trying lately. I’m pretty happy
with my command line design for pngattach
. The real test will
happen in the future, when I’ve forgotten the details and have to figure
it out again from my own documentation.
Curiously, lots of common software refuses to handle PNGs containing large chunks, and so your PNG may not display if you attach a file even as small as a few MiB. A defense against denial of service?
I haven’t gone back and embedded attachments in any older articles, but I may do so in future articles. If you wanted to try it out for yourself, either with my tool or writing your own for fun, this PNG contains a compressed attachment:
I produced it like so (with the help of ImageMagick):
$ echo P3 1 1 1 0 1 0 |
convert ppm:- -resize 200 png:- |
pngattach message.txt >atch-test.png
Another technique I’ve been trying is Go-style error value returns in C
programs, where the errors-as-values are const char *
pointers to static
string buffers. The contents contain an error message to be displayed to
the user, and errors may be wrapped in more context (what file, what
operation, etc.) as the stack unwinds. A null pointer means no error, i.e.
nil. I’ve used this extensively in pngattach
. Examples of the style:
int *p;
if (nelem > (size_t)-1/sizeof(*p)) {
return "out of memory"; // overflow
}
p = malloc(nelem*sizeof(*p));
if (!p) {
return "out of memory";
}
// ...
if (!fwrite(buf, len, 1, stdout)) {
free(p);
return "write error";
}
An errwrap()
function builds a new error string in a static buffer. This
simple solution wouldn’t work in a multi-threaded program, but that’s not
the case here. Mine toggles between two static buffers so that it can wrap
recursively.
const char *
errwrap(const char *pre, const char *post)
{
static char errtmp[2][256], i;
int n = i = !i; // toggle between two static buffers
snprintf(errtmp[n], sizeof(errtmp[n]), "%s: %s", pre, post);
return errtmp[n];
}
Then I can do stuff like:
FILE *f = fopen(path, "wb");
if (!f) {
return errwrap("failed to open file", path);
}
And that can keep being wrapped on the way up:
err = png_write(path);
if (err) {
return errwrap("writing PNG", err);
}
So that ultimately the user sees something like:
pngattach: writing PNG: failed to open file: example.png
That’s always printed by a single error printout block at the top level, where all errors are ultimately routed.
int main(int argc, char **argv)
{
// ...
err = run(options);
if (err) {
fprintf(stderr, "pngattach: %s\n", err);
return 1;
}
return 0;
}
There are multiple C implementations, so how could they all be bad, even the early ones? Microsoft's C runtime has defined how the standard library should work on the platform, and everyone else followed along for the sake of compatibility. I'm excluding Cygwin and its major fork, MSYS2, even though they don't inherit any of these flaws. They change so much that they're effectively whole new platforms, not truly "native" to Windows.
In practice, C++ standard libraries are implemented on top of a C standard library, which is why C++ shares the same problems. CPython dodges these issues: Though written in C, on Windows it bypasses the broken C standard library and directly calls the proprietary interfaces. Other language implementations, such as "gc" Go, simply aren't built on C at all, and instead do things correctly in the first place — the behaviors the C runtimes should have had all along.
If you’re just working on one large project, bypassing the C runtime isn’t such a big deal, and you’re likely already doing so to access important platform functionality. You don’t really even need a C runtime. However, if you write many small programs, as I do, writing the same special Windows support for each one ends up being most of the work, and honestly makes properly supporting Windows not worth the trouble. I end up just accepting the broken defaults most of the time.
Before diving into the details, if you’re looking for a quick-and-easy solution for the Mingw-w64 toolchain, including w64devkit, which magically makes your C and C++ console programs behave well on Windows, I’ve put together a “library” named libwinsane. It solves all problems discussed in this article, except for one. No source changes required, simply link it into your program.
The Windows API comes in two flavors: narrow with an “A” (“ANSI”) suffix, and wide (Unicode, UTF-16) with a “W” suffix. The former is the legacy API, where an active code page maps 256 bytes onto (up to) 256 specific characters. On typical machines configured for European languages, this means code page 1252. Roughly speaking, Windows internally uses UTF-16, and calls through the narrow interface use the active code page to translate the narrow strings to wide strings. The result is that calls through the narrow API have limited access to the system.
The UTF-8 encoding was invented in 1992 and standardized by January 1993. UTF-8 was adopted by the unix world over the following years due to its backwards-compatibility with its existing interfaces. Programs could read and write Unicode data, access Unicode paths, pass Unicode arguments, and get and set Unicode environment variables without needing to change anything. Today UTF-8 has become the dominant text encoding format in the world, in large part due to the world wide web.
In July 1993, Microsoft introduced the wide Windows API with the release of Windows NT 3.1, placing all their bets on UCS-2 (later UTF-16) rather than UTF-8. This turned out to be a mistake, since UTF-16 is inferior to UTF-8 in practically every way, though admittedly some problems weren’t so obvious at the time.
The major problem: The C and C++ standard libraries only hook up to the narrow Windows interfaces. The standard library, and therefore typical portable software on Windows, cannot handle anything but ASCII. The effective result is that these programs:
Cannot accept non-ASCII command line arguments
Cannot read or set non-ASCII environment variables
Cannot open files with non-ASCII names
Cannot read or write non-ASCII text on the console
Doing any of these requires calling proprietary functions, treating Windows as a special target. It's part of what makes correctly porting software to Windows so painful.
The sensible solution would have been for the C runtime to speak UTF-8 and connect to the wide API. Alternatively, the narrow API could have been changed over to UTF-8, phasing out the old code page concept. In theory this is what the UTF-8 “code page” is about, though it doesn’t always work. There would have been compatibility problems with abruptly making such a change, but until very recently, this wasn’t even an option. Why couldn’t there be a switch I could flip to get sane behavior that works like every other platform?
In 2019, Microsoft introduced a feature to allow programs to request UTF-8 as their active code page on start, along with supporting UTF-8 on more narrow API functions. This is like the magic switch I wanted, except that it involves embedding some ugly XML into your binary in a particular way. At least it’s now an option.
For Mingw-w64, that means writing a resource file like so:
#include <winuser.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "utf8.xml"
Compiling it with windres
:
$ windres -o manifest.o manifest.rc
Then linking that into your program. Amazingly it mostly works! Programs
can access Unicode arguments, Unicode environment variables, and Unicode
paths, including with fopen
, just as it’s worked on other platforms for
decades. Since the active code page is set at load time, it happens before
argv
is constructed (from GetCommandLineA
), which is why that works
out.
Alternatively you could create a “side-by-side assembly” placing that XML
in a file with the same name as your EXE but with .manifest
suffix
(after the .exe
suffix), then placing that next to your EXE. Just be
mindful that there’s a “side-by-side” cache (WinSxS), and so it might not
immediately pick up your changes.
What doesn’t work is console input and output since the console is external to the process, and so isn’t covered by the process’s active code page. It must be configured separately using a proprietary call:
SetConsoleOutputCP(CP_UTF8);
Annoying, but at least it’s not that painful. This only covers output, though, meaning programs can only print UTF-8. Unfortunately UTF-8 input still doesn’t work, and setting the input code page doesn’t do anything despite reporting success:
SetConsoleCP(CP_UTF8); // doesn't work
If you care about reading interactive Unicode input, you’re stuck bypassing the C runtime since it’s still broken.
Another long-standing issue is that C and C++ on Windows have distinct "text" and "binary" streams, which they inherited from DOS. Mainly this means automatic newline conversion between CRLF and LF. The C standard explicitly allows for this, though unix-like platforms have never actually distinguished between text and binary streams.
The standard also specifies that standard input, output, and error are all open as text streams, and there’s no portable method to change the stream mode to binary — a serious deficiency with the standard. On unix-likes this doesn’t matter, but on Windows it means programs can’t read or write binary data on standard streams without calling a non-standard function. It also means reading and writing standard streams is slow, frequently a bottleneck unless I route around it.
Personally, I like writing binary data to standard output,
including video, and sometimes writing binary filters that also read
binary input. I do it so often that in probably half my C programs I have
this snippet in main
just so they work correctly on Windows:
#ifdef _WIN32
int _setmode(int, int);
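// 0x8000 is _O_BINARY; 0 and 1 are standard input and output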
_setmode(0, 0x8000);
_setmode(1, 0x8000);
#endif
That incantation sets standard input and output in the C runtime to binary mode without the need to include a header, making it compact, simple, and self-contained.
This built-in newline translation, along with the Windows standard text editor, Notepad, lagging decades behind, meant that many other programs, including Git, grew their own, annoying, newline conversion misfeatures that cause other problems.
I introduced libwinsane at the beginning of the article, which fixes all
this simply by being linked into a program. It includes the magic XML
manifest .rsrc
section, configures the console for UTF-8 output, and
sets standard streams to binary before main
(via a GCC constructor). I
called it a “library”, but it’s actually a single object file. It can’t be
a static library since it must be linked into the program despite not
actually being referenced by the program.
So normally this program:
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
char *arg = argv[argc-1];
size_t len = strlen(arg);
printf("%zu %s\n", len, arg);
}
Compiled and run:
C:\>cc -o example example.c
C:\>example π
1 p
As usual, the Unicode argument is silently mangled into one byte. Linked with libwinsane, it just works like everywhere else:
C:\>gcc -o example example.c libwinsane.o
C:\>example π
2 π
If you’re maintaining a substantial program, you probably want to copy and integrate the necessary parts of libwinsane into your project and build, rather than always link against this loose object file. This is more for convenience and for succinctly capturing the concept. You may even want to enable ANSI escape processing in your version.
]]>I recently learned of csvquote, a tool that encodes troublesome CSV characters such that unix tools can correctly process them. It reverses the encoding at the end of the pipeline, recovering the original input. The original implementation handles CSV quotes using the straightforward, naive method. However, there’s a better approach that is not only simpler, but around 3x faster on modern hardware. Even more, there’s yet another approach using SIMD intrinsics, plus some bit twiddling tricks, which increases the processing speed by an order of magnitude. My csvquote implementation includes both approaches.
Records in CSV data are separated by line feeds, and fields are separated by commas. Fields may be quoted.
aaa,bbb,ccc
xxx,"yyy",zzz
Fields containing a line feed (U+000A), quotation mark (U+0022), or comma (U+002C), must be quoted, otherwise they would be ambiguous with the CSV formatting itself. Quoted quotation marks are turned into a pair of quotes. For example, here are two records with two fields apiece:
"George Herman ""Babe"" Ruth","1919–1921, 1923, 1926"
"Frankenstein;
or, The Modern Prometheus",Mary Shelley
A CSV-unaware tool splitting on commas and line feeds (ex. awk
) would
process these records improperly. So csvquote translates quoted line feeds
into record separators (U+001E) and commas into unit separators (U+001F).
These control characters rarely appear in normal text data, and can be
trivially processed in UTF-8-encoded text without decoding or encoding.
The above records become:
"George Herman ""Babe"" Ruth","1919–1921\x1f 1923\x1f 1926"
"Frankenstein;\x1eor\x1f The Modern Prometheus",Mary Shelley
I’ve used \x1e
and \x1f
here to illustrate the control characters.
The data is exactly the same length since it’s a straight byte-for-byte replacement. Quotes are left entirely untouched. The challenge is parsing the quotes to track whether the two special characters fall inside or outside pairs of quotes.
The original csvquote walks the input a byte at a time and is in one of three states:
Outside a quoted field
Inside a quoted field
Just saw a quote inside a quoted field (possibly the first " in a "")
Since I love state machines so much, here it is translated into a switch-based state machine:
// Return the next state given an input character.
int next(int state, int c)
{
switch (state) {
case 1: return c == '"' ? 2 : 1;
case 2: return c == '"' ? 3 : 2;
case 3: return c == '"' ? 2 : 1;
}
}
The real program also has more conditions for potentially making a replacement. It’s an awful lot of performance-killing branching.
However, this context is about finding "in" and "out" — not validating the CSV — so the "escape" state is unnecessary. I need only match up pairs of quotes. An "escaped" quote can be treated as terminating a quoted region and immediately starting a new quoted region. That means there's just the first two states in a trivial arrangement:
int next(int state, int c)
{
switch (state) {
case 1: return c == '"' ? 2 : 1;
case 2: return c == '"' ? 1 : 2;
}
}
Since the text can be processed as bytes, there are only 256 possible inputs. With 2 states and 256 inputs, this state machine, with replacement machinery, can be implemented with a 512-byte table and no branches. Here’s the table initialization:
unsigned char table[2][256];
void init(void)
{
for (int i = 0; i < 256; i++) {
table[0][i] = i;
table[1][i] = i;
}
table[1]['\n'] = 0x1e;
table[1][','] = 0x1f;
}
In the first state, characters map onto themselves. In the second state, characters map onto their replacements. This is the entire encoder and decoder:
void encode(unsigned char *buf, size_t len)
{
int state = 0;
for (size_t i = 0; i < len; i++) {
state ^= (buf[i] == '"');
buf[i] = table[state][buf[i]];
}
}
Well, strictly speaking, the decoder need not process quotes. By my
benchmark (csvdump
in my implementation) this processes at ~1 GiB/s on
my laptop — 3x faster than the original. However, there’s still
low-hanging fruit to be picked!
Any decent SIMD implementation is going to make use of masking. Find the quotes, compute a mask over quoted regions, compute another mask for replacement matches, combine the masks, then use that mask to blend the input with the replacements. Roughly:
quotes = find_quoted_regions(input)
linefeeds = input == '\n'
commas = input == ','
output = blend(input, '\n', quotes & linefeeds)
output = blend(output, ',', quotes & commas)
The hard part is computing the quote mask, and also somehow handle quoted regions straddling SIMD chunks (not pictured), and do all that without resorting to slow byte-at-time operations. Fortunately there are some bitwise tricks that can resolve each issue.
Imagine I load 32 bytes into a SIMD register (e.g. AVX2), and I compute a 32-bit mask where each bit corresponds to one byte. If that byte contains a quote, the corresponding bit is set.
"George Herman ""Babe"" Ruth","1
10000000000000011000011000001010
That last/lowest 1 corresponds to the beginning of a quoted region. For my mask, I’d like to set all bits following that bit. I can do this by subtracting 1.
"George Herman ""Babe"" Ruth","1
10000000000000011000011000001001
Using the Kernighan technique I can also remove this bit from the original input by ANDing them together.
"George Herman ""Babe"" Ruth","1
10000000000000011000011000001000
Now I’m left with a new bottom bit. If I repeat this, I build up layers of masks, one for each input quote.
10000000000000011000011000001001
10000000000000011000011000000111
10000000000000011000010111111111
10000000000000011000001111111111
10000000000000010111111111111111
10000000000000001111111111111111
01111111111111111111111111111111
Remember how I use XOR in the state machine above to toggle between states? If I XOR all these together, I toggle the quotes on and off, building up quoted regions:
"George Herman ""Babe"" Ruth","1
01111111111111100111100111110001
However, for reasons I’ll explain shortly, it’s critical that the opening quote is included in this mask. If I XOR the pre-subtracted value with the mask when I compute the mask, I can toggle the remaining quotes on and off such that the opening quotes are included. Here’s my function:
uint32_t find_quoted_regions(uint32_t x)
{
uint32_t r = 0;
while (x) {
r ^= x;
r ^= x - 1;
x &= x - 1;
}
return r;
}
Which gives me exactly what I want:
"George Herman ""Babe"" Ruth","1
11111111111111101111101111110011
It’s important that the opening quote is included because it means a region that begins on the last byte will have that last bit set. I can use that last bit to determine if the next chunk begins in a quoted state. If a region begins in a quoted state, I need only NOT the whole result to reverse the quoted regions.
How can I “sign extend” a 1 into all bits set, or do nothing for zero? Negate it!
uint32_t carry = -(prev & 1);
uint32_t quotes = find_quoted_regions(input) ^ carry;
// ...
prev = quotes;
That takes care of computing quoted regions and chaining them between chunks. The loop will unfortunately cause branch prediction penalties if the input has lots of quotes, but I couldn’t find a way around this.
However, I’ve made a serious mistake. I’m using _mm256_movemask_epi8
and
it puts the first byte in the lowest bit. Doh! That means it looks like
this:
1","htuR ""ebaB"" namreH egroeG"
01010000011000011000000000000001
There’s no efficient way to flip the bits around, so I just need to find a way to work in the other direction. To flip the bits to the left of a set bit, negate it.
00000000000000000000000010000000 = +0x00000080
11111111111111111111111110000000 = -0x00000080
Unlike before, this keeps the original bit set, so I need to XOR the original value into the input to flip the quotes. This is as simple as initializing to the input rather than zero. The new loop:
uint32_t find_quoted_regions(uint32_t x)
{
uint32_t r = x;
while (x) {
r ^= -x ^ x;
x &= x - 1;
}
return r;
}
The result:
1","htuR ""ebaB"" namreH egroeG"
11001111110111110111111111111111
The carry now depends on the high bit rather than the low bit:
uint32_t carry = -(prev >> 31);
The next problem: for reasons I don’t understand, AVX2 does not include
the inverse of _mm256_movemask_epi8
. Converting the bit-mask back into a
byte-mask requires some clever shuffling. Fortunately I’m not the first
to have this problem, and so I didn’t have to figure it out from
scratch.
First fill the 32-byte register with repeated copies of the 32-bit mask.
abcdabcdabcdabcdabcdabcdabcdabcd
Shuffle the bytes so that the first 8 register bytes have the same copy of the first bit-mask byte, etc.
aaaaaaaabbbbbbbbccccccccdddddddd
In byte 0, I care only about bit 0, in byte 1 I care only about bit 1,
… in byte N I care only about bit N%8
. I can pre-compute a mask to
isolate each of these bits and produce a proper byte-wise mask from the
bit-mask. Fortunately all this isn’t too bad: four instructions instead of
the one I had wanted. It looks like a lot of code, but it’s really only a
few instructions.
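Here's roughly what that expansion looks like (a sketch of my own using AVX2 intrinsics; it's one of the well-known variations of this trick, not necessarily the exact code in my implementation):
#include <immintrin.h>
#include <stdint.h>

// Expand a 32-bit mask into a 32-byte mask: 0xff where the bit was set.
static __m256i mask_expand(uint32_t m)
{
    __m256i v    = _mm256_set1_epi32(m);          // repeat the 32-bit mask
    __m256i shuf = _mm256_setr_epi8(
        0, 0, 0, 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, 2, 2, 2, 2, 2, 2,  3, 3, 3, 3, 3, 3, 3, 3);
    v = _mm256_shuffle_epi8(v, shuf);             // byte N holds mask byte N/8
    __m256i bits = _mm256_set1_epi64x(0x8040201008040201ULL);
    return _mm256_cmpeq_epi8(_mm256_and_si256(v, bits), bits);  // test bit N%8
}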
In my benchmark, which includes randomly occurring quoted fields, the SIMD version processes at ~4 GiB/s — 10x faster than the original. I haven’t profiled, but I expect mispredictions on the bit-mask loop are the main obstacle preventing the hypothetical 32x speedup.
My version also optionally rejects inputs containing the two special control characters since the encoding would be irreversible. This is implemented in SIMD when available, and it slows processing by around 10%.
Geoff Langdale and others have graciously pointed out PCLMULQDQ, which can compute the quote masks using carryless multiplication (also) entirely in SIMD and without a loop. I haven’t yet quite worked out exactly how to apply it, but it should be much faster.
]]>Years ago, OpenBSD gained two new security system calls, pledge(2)
(originally tame(2)
) and unveil
. In both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where type
safety isn’t an issue, such as Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?
As discussed previously, it’s quite easy to access C APIs from
Python through its ctypes
package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: openbsd.py
.
I’ve chosen these extra constraints:
As extra safety features, unnecessary for correctness, attempts to call these functions on systems where they don’t exist will silently do nothing, as though they succeeded. They’re provided as a best effort.
Systems other than OpenBSD may support these functions, now or in the future, and it would be nice to automatically make use of them when available. This means no checking for OpenBSD specifically but instead feature sniffing for their presence.
The interfaces should be Pythonic as though they were implemented in Python itself. Raise exceptions for errors, and accept strings since they’re more convenient than bytes.
For reference, here are the function prototypes:
int pledge(const char *promises, const char *execpromises);
int unveil(const char *path, const char *permissions);
The string-oriented interface of pledge
will make this a whole
lot easier to implement.
The first step is to grab functions through ctypes
. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.
I came up with passing None to ctypes.CDLL
, which ultimately just passes
NULL
to dlopen(3)
. That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
pledge
attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for ctypes
.
_pledge = None
try:
    _pledge = ctypes.CDLL(None, use_errno=True).pledge
    _pledge.restype = ctypes.c_int
    _pledge.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _pledge = None
Catching a broad Exception isn’t great, but it’s the best we can do since the documentation is incomplete. From this block I’ve seen TypeError, AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if there are more possibilities, and I don’t want to risk missing them.
Note that I’m catching Exception rather than using a bare except
. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.
The same story for unveil
:
_unveil = None
try:
    _unveil = ctypes.CDLL(None, use_errno=True).unveil
    _unveil.restype = ctypes.c_int
    _unveil.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _unveil = None
The next and final step is to wrap the low-level calls in interfaces that hide their C and ctypes nature.
Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be NULL
, so None is allowed.
def pledge(promises: Optional[str], execpromises: Optional[str]):
    if not _pledge:
        return  # unimplemented
    r = _pledge(None if promises is None else promises.encode(),
                None if execpromises is None else execpromises.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
As usual, a return of -1 means there was an error, in which case we fetch
errno
and raise the appropriate OSError.
unveil
works a little differently since the first argument is a path.
Python functions that accept paths, such as open
, generally accept
either strings or bytes. On unix-like systems, paths are fundamentally
bytestrings and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, they accept both.
The unveil
wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.
def unveil(path: Union[str, bytes, None], permissions: Optional[str]):
    if not _unveil:
        return  # unimplemented
    r = _unveil(path.encode() if isinstance(path, str) else path,
                None if permissions is None else permissions.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
That’s it!
Let’s start with unveil
. Initially a process has access to the whole
file system with the usual restrictions. On the first call to unveil
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final NULL
which locks it in place for the rest of
the process’s existence.
Suppose a program has been tricked into accessing your shell history, perhaps by mishandling a path:
def hackme():
    try:
        with open(pathlib.Path.home() / ".bash_history"):
            print("You've been hacked!")
    except FileNotFoundError:
        print("Blocked by unveil.")

hackme()
If you’re a Bash user, this prints:
You've been hacked!
Using our new feature to restrict the program’s access first:
# restrict access to static program data
unveil("/usr/share", "r")
unveil(None, None)
hackme()
On OpenBSD this now prints:
Blocked by unveil.
Working just as it should!
With pledge
we declare what abilities we’d like to keep by supplying a
list of promises, pledging to use only those abilities afterward. A
common case is the stdio
promise which allows reading and writing of
open files, but not opening files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.
An invalid or unknown promise is an error. Does that work?
>>> pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument
So far so good. How about the functionality itself?
pledge("stdio", None)
hackme()
The program is instantly killed when making the disallowed system call:
Abort trap (core dumped)
If you want something a little softer, include the error
promise:
pledge("stdio error", None)
hackme()
Instead it’s an exception:
OSError: [Errno 78] Function not implemented
The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about pledge
in complex runtimes like Python’s which may
reasonably need to do many arbitrary, undocumented things at any time.
Mission code names are built using “adjective noun”. Some examples from the game’s word list:
To generate a code name, we could select a random adjective and a random noun, but as discussed it wouldn’t take long for a collision. The naive approach is to keep a database of previously-generated names, and to consult this database when generating new names. That works, but there’s an even better solution: use a random permutation. Done well, we don’t need to keep track of previous names, and the generator won’t repeat until it’s exhausted all possibilities.
Further, the total number of possible code names, 4,028, is suspiciously
shy of 4,096, a power of two (2**12
). That makes designing and
implementing an efficient permutation that much easier.
A classic, obvious solution is a linear congruential generator (LCG). A full-period, 12-bit LCG is nothing more than a permutation of the numbers 0 to 4,095. When generating names, we can skip over the extra 68 values and pretend it’s a permutation of 4,028 elements. An LCG is constructed like so:
f(n) = (f(n-1)*A + C) % M
Typically the seed is used for f(0)
. M is selected based on the problem
space or implementation efficiency, and usually a power of two. In this
case it will be 4,096. Then there are some rules for choosing A and C.
Simply choosing a random f(0)
per game isn’t great. The code name order
will always be the same, and we’re only choosing where in the cycle to
start. It would be better to vary the permutation itself, which we can do
by also choosing unique A and C constants per game.
Choosing C is easy: It must be relatively prime with M, i.e. it must be
odd. Since it’s addition modulo M, there’s no reason to choose C >= M
since the results are identical to a smaller C. If we think of C as a
12-bit integer, 1 bit is locked in, and the other 11 bits are free to
vary:
xxxxxxxxxxx1
Choosing A is more complicated: it must be odd, A-1 must be divisible by 4, and for better results A should be congruent to 5 modulo 8 (A-1 divisible by 4 but not by 8). Again, thinking of
this in terms of a 12-bit number, this locks in 3 bits and leaves 9 bits
free:
xxxxxxxxx101
This ensures all the must and should properties of A.
Finally 0 <= f(0) < M
. Because of modular arithmetic, larger values are
redundant, and all possible values are valid since the LCG, being
full-period, will cycle through all of them. This is just choosing the
starting point in a particular permutation cycle. As a 12-bit number, all
12 bits are free:
xxxxxxxxxxxx
That’s 9 + 11 + 12 = 32
free bits to fill randomly: again, how
incredibly convenient! Every 32-bit integer defines some unique code name
permutation… almost. Any 32-bit descriptor where f(0) >= 4028
will
collide with at least one other due to skipping, and so around 1.7% of the
state space is redundant. A small loss that should shrink with slightly
better word list planning. I don’t think anyone will notice.
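If you want to convince yourself that constants chosen by these rules really produce a full period, a brute-force check is cheap at 12 bits. This test harness is my own sketch, with arbitrary constants that obey the rules:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0x32d;  // a % 8 == 5
    uint32_t c = 0x777;  // odd
    unsigned char seen[4096] = {0};
    long count = 0;
    uint32_t s = 0;
    do {
        s = (s*a + c) & 0xfff;
        count += !seen[s]++;     // count first visits
    } while (s != 0);            // full period returns to the start
    printf("%ld distinct values\n", count);  // expect 4096
}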
I love compact state machines, and this is an opportunity to put one to good use. My code name generator will be just one function:
uint32_t codename(uint32_t state, char *buf);
This takes one of those 32-bit permutation descriptors, writes the first
code name to buf
, and returns a descriptor for another permutation that
starts with the next name. All we have to do is keep track of that 32-bit
number and we’ll never need to worry about repeating code names until all
have been exhausted.
First, let's extract A, C, and f(0)
, which I’m calling S. The low bits
are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to
lock in the hard-set bits.
long a = (state << 3 | 5) & 0xfff; // 9 bits
long c = (state >> 8 | 1) & 0xfff; // 11 bits
long s = state >> 20; // 12 bits
Next iterate the LCG until we have a number in range:
do {
    s = (s*a + c) & 0xfff;
} while (s >= 4028);
Once we have an appropriate LCG state, compute the adjective/noun indexes and build a code name:
int i = s % 53;
int j = s / 53;
sprintf(buf, "%s %s", adjvs[i], nouns[j]);
Finally assemble the next 32-bit state. Since A and C don’t change, these are passed through while the old S is masked out and replaced with the new S.
return (state & 0xfffff) | (uint32_t)s<<20;
Putting it all together:
static const char *adjvs[] = { /* ... */ };
static const char *nouns[] = { /* ... */ };
uint32_t codename(uint32_t state, char *buf)
{
    long a = (state << 3 | 5) & 0xfff;  // 9 bits
    long c = (state >> 8 | 1) & 0xfff;  // 11 bits
    long s = state >> 20;               // 12 bits
    do {
        s = (s*a + c) & 0xfff;
    } while (s >= COUNTOF(adjvs)*COUNTOF(nouns));
    int i = s % COUNTOF(adjvs);
    int j = s / COUNTOF(adjvs);
    sprintf(buf, "%s %s", adjvs[i], nouns[j]);
    return (state & 0xfffff) | (uint32_t)s<<20;
}
The caller just needs to generate an initial 32-bit integer. Any 32-bit
integer is valid — even zero — so this could just be, say, the unix epoch
(time(2)
), but adjacent values will have similar-ish permutations. I
intentionally placed S in the high bits, which are least likely to vary,
since it only affects where the cycle begins, while A and C have a much
more dramatic impact and so are placed at more variable locations.
Regardless, it would be better to hash such an input so that adjacent time values map to distant states. It also helps hide poorer (less random) choices for A multipliers. I happen to have designed some great functions for exactly this purpose. Here’s one of my best:
static uint32_t
hash32(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 15;
    x *= 0xd168aaadU; x ^= x >> 15;
    x *= 0xaf723597U; x ^= x >> 15;
    return x;
}
This would be perfectly reasonable for generating all possible names in a random order:
uint32_t state = hash32(time(0));
for (int i = 0; i < 4028; i++) {
    char buf[32];
    state = codename(state, buf);
    puts(buf);
}
To further help cover up poorer A multipliers, it’s better for the word list to be pre-shuffled in its static storage. If that underlying order happens to show through, at least it will be less obvious (i.e. not in alphabetical order). Shuffling the string list in my source is just a few keystrokes in Vim, so this is easy enough.
If you’re set on making the codename
function easier to use such that
consumers don’t need to think about hashes, you could “encode” and
“decode” the descriptor going in and out of the function:
uint32_t codename(uint32_t state, char *buf)
{
    state += 0x3243f6a8U; state ^= state >> 17;
    state *= 0x9e485565U; state ^= state >> 16;
    state *= 0xef1d6b47U; state ^= state >> 16;
    // ...
    state = (state & 0xfffff) | (uint32_t)s<<20;
    state ^= state >> 16; state *= 0xeb00ce77U;
    state ^= state >> 16; state *= 0x88ccd46dU;
    state ^= state >> 17; state -= 0x3243f6a8U;
    return state;
}
This permutes the state coming in, and reverses that permutation on the way out (read: inverse hash). This breaks up similar starting points.
Of course this isn’t the only way to build a permutation. I recently picked up another trick: Kensler permutation. The key insight is cycle-walking, allowing for random-access to a permutation of a smaller domain (e.g. 4,028 elements) through permutation of a larger domain (e.g. 4,096 elements).
Here’s such a code name generator built around a bespoke 12-bit xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply is less effective the smaller the permutation.
// Generate the nth code name for this seed.
void codename_n(char *buf, uint32_t seed, int n)
{
    uint32_t i = n;
    do {
        i ^= i >> 6;  i ^= seed >>  0;  i *= 0x325;  i &= 0xfff;
        i ^= i >> 6;  i ^= seed >>  8;  i *= 0x3f5;  i &= 0xfff;
        i ^= i >> 6;  i ^= seed >> 16;  i *= 0xa89;  i &= 0xfff;
        i ^= i >> 6;  i ^= seed >> 24;  i *= 0x85b;  i &= 0xfff;
        i ^= i >> 6;
    } while (i >= COUNTOF(adjvs)*COUNTOF(nouns));
    int a = i % COUNTOF(adjvs);
    int b = i / COUNTOF(adjvs);
    snprintf(buf, 22, "%s %s", adjvs[a], nouns[b]);
}
While this is more flexible, avoids poorer permutations, and doesn’t have state space collisions, I still have a soft spot for my LCG-based state machine generator.
You can find the complete, working source code with both generators here:
codename.c
. I used real US Secret Service code names for
my word list. Some sample outputs:
However, I’ve long struggled with architecture diversity. My work and testing has been almost entirely on x86, with ARM as a distant second (Raspberry Pi and friends). Big endian hosts are particularly rare. However, I recently learned a trick for quickly and conveniently accessing many different architectures without even leaving my laptop: QEMU User Emulation. Debian and its derivatives support this very well and require almost no setup or configuration.
While there are many options, my main cross-testing architecture has been PowerPC. It’s 32-bit big endian, while I’m generally working on 64-bit little endian, which is exactly the sort of mismatch I’m going for. I use a Debian-supplied cross-compiler and qemu-user tools. The binfmt support is especially slick, so that’s how I usually use it.
# apt install gcc-powerpc-linux-gnu qemu-user-binfmt
binfmt_misc
is a kernel module that teaches Linux how to recognize
arbitrary binary formats. For instance, there’s a Wine binfmt so that
Linux programs can transparently exec(3)
Windows .exe
binaries. In the
case of QEMU User Mode, binaries for foreign architectures are loaded into
a QEMU virtual machine configured in user mode. In user mode there’s no
guest operating system, and instead the virtual machine translates guest
system calls to the host operating system.
The first package gives me powerpc-linux-gnu-gcc
. The prefix is the
architecture tuple describing the instruction set and system ABI.
To try this out, I have a little test program that inspects its execution
environment:
#include <stdio.h>

int main(void)
{
    char *w = "?";
    switch (sizeof(void *)) {
    case 1: w = "8";  break;
    case 2: w = "16"; break;
    case 4: w = "32"; break;
    case 8: w = "64"; break;
    }

    char *b = "?";
    switch (*(char *)(int []){1}) {
    case 0: b = "big";    break;
    case 1: b = "little"; break;
    }

    printf("%s-bit, %s endian\n", w, b);
}
When I run this natively on x86-64:
$ gcc test.c
$ ./a.out
64-bit, little endian
Running it on PowerPC via QEMU:
$ powerpc-linux-gnu-gcc -static test.c
$ ./a.out
32-bit, big endian
Thanks to binfmt, I could execute it as though the PowerPC binary were a native binary. With just a couple of environment variables in the right place, I could pretend I’m developing on PowerPC — aside from emulation performance penalties of course.
However, you might have noticed I pulled a sneaky on ya: -static
. So far
what I’ve shown only works with static binaries. There’s no dynamic loader
available to run dynamically-linked binaries. Fortunately this is easy to
fix in two steps. The first step is to install the dynamic linker for
PowerPC:
# apt install libc6-powerpc-cross
The second is to tell QEMU where to find it since, unfortunately, it cannot currently do so on its own.
$ export QEMU_LD_PREFIX=/usr/powerpc-linux-gnu
Now I can leave out the -static
:
$ powerpc-linux-gnu-gcc test.c
$ ./a.out
32-bit, big endian
A practical example: Remember binitools? I’m now ready to run its fuzz-generated test suite on this cross-testing platform.
$ git clone https://github.com/skeeto/binitools
$ cd binitools/
$ make check CC=powerpc-linux-gnu-gcc
...
PASS: 668/668
Or if I’m going to be running make
often:
$ export CC=powerpc-linux-gnu-gcc
$ make -e check
Recall: make’s -e flag lets environment variables override macro assignments inside the Makefile, so I don’t need to pass CC=... on the command line each time.
When setting up a test suite for your own programs, consider how difficult it would be to run the tests under customized circumstances like this. The easier it is to run your tests, the more they’re going to be run. I’ve run into many projects with such overly-complex test builds that even enabling sanitizers in the tests suite was a pain, let alone cross-architecture testing.
Dependencies? There might be a way to use Debian’s multiarch support to install these packages, but I haven’t been able to figure it out. You likely need to build dependencies yourself using the cross compiler.
None of this is limited to C (or even C++). I’ve also successfully used
this to test Go libraries and programs cross-architecture. This isn’t
nearly as important since it’s harder to write unportable Go than C — e.g.
dumb pointer tricks are literally labeled “unsafe”. However, Go
(gc) trivializes cross-compilation and is statically compiled, so it’s
incredibly simple. Once you’ve installed qemu-user-binfmt
it’s entirely
transparent:
$ GOARCH=mips64 go test
That’s all there is to cross-platform testing. If for some reason binfmt
doesn’t work (WSL) or you don’t want to install it, there’s just one extra
step (package named example
):
$ GOARCH=mips64 go test -c
$ qemu-mips64-static example.test
The -c
option builds a test binary but doesn’t run it, instead allowing
you to choose where and how to run it.
It even works with cgo — if you’re willing to jump through the same hoops as with C of course:
package main
// #include <stdint.h>
// uint16_t v = 0x1234;
// char *hi = (char *)&v + 0;
// char *lo = (char *)&v + 1;
import "C"
import "fmt"
func main() {
	fmt.Printf("%02x %02x\n", *C.hi, *C.lo)
}
With go run
on x86-64:
$ CGO_ENABLED=1 go run example.go
34 12
Via QEMU User Mode:
$ export CGO_ENABLED=1
$ export GOARCH=mips64
$ export CC=mips64-linux-gnuabi64-gcc
$ export QEMU_LD_PREFIX=/usr/mips64-linux-gnuabi64
$ go run example.go
12 34
I was pleasantly surprised how well this all works.
Despite the variety, all these architectures are still “running” the same operating system, Linux, and so they only vary on one dimension. For most programs primarily targeting x86-64 Linux, PowerPC Linux is practically the same thing, while x86-64 OpenBSD is foreign territory despite sharing an architecture and ABI (System V). Testing across operating systems still requires spending the time to install, configure, and maintain these extra hosts. That’s an article for another time.
]]>The strcpy function is a common sight in typical C programs.
It’s also a source of buffer overflow defects, so linters and code
reviewers commonly recommend alternatives such as strncpy
(difficult to use correctly; mismatched semantics), strlcpy
(non-standard, flawed), or C11’s optional strcpy_s
(no correct or
practical implementations). Besides their individual shortcomings,
these answers are incorrect. strcpy
and friends are, at best, incredibly
niche, and the correct replacement is memcpy
.
If strcpy
is not easily replaced with memcpy
then the code is
fundamentally wrong. Either it’s not using strcpy
correctly or it’s
doing something dumb and should be rewritten. Highlighting such problems
is part of what makes memcpy
such an effective replacement.
Note: Everything here applies just as much to strcat
and
friends.
Clarification update: This article is about correctness (objective), not safety (subjective). If the word “safety” comes to mind then you’ve missed the point.
Buffer overflows arise when the destination is smaller than the source.
Safe use of strcpy
requires a priori knowledge of the source string length. Usually this knowledge is the exact source string
length. If so, memcpy
is not only a trivial substitute, it’s faster
since it will not simultaneously search for a null terminator.
char *my_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        strcpy(c, s);  // BAD
    }
    return c;
}

char *my_strdup_v2(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        memcpy(c, s, len);  // GOOD
    }
    return c;
}
A more benign case is a static source string, i.e. trusted input.
struct err {
    char message[16];
};

void set_oom(struct err *err)
{
    strcpy(err->message, "out of memory");  // BAD
}
The size is a compile time constant, so exploit it as such! Even more, a static assertion (C11) can catch mistakes at compile time rather than run time.
void set_oom_v2(struct err *err)
{
    static const char oom[] = "out of memory";
    static_assert(sizeof(err->message) >= sizeof(oom));
    memcpy(err->message, oom, sizeof(oom));
}

// Or using a macro:
void set_oom_v3(struct err *err)
{
    #define OOM "out of memory"
    static_assert(sizeof(err->message) >= sizeof(OOM));
    memcpy(err->message, OOM, sizeof(OOM));
}

// Or assignment (implicit memcpy):
void set_oom_v4(struct err *err)
{
    static const struct err oom = {"out of memory"};
    *err = oom;
}
This covers the vast majority of cases of already-correct strcpy
.
strcpy
can still be correct without knowing the exact source string
length. It is enough to know its upper bound does not exceed the
destination length. In this example — assuming the input is guaranteed to
be null-terminated — this strcpy
is correct without ever knowing the
source string length:
struct reply {
    char message[32];
    int x, y;
};

struct log {
    time_t timestamp;
    char message[32];
};

void log_reply(struct log *e, const struct reply *r)
{
    e->timestamp = time(0);
    strcpy(e->message, r->message);
}
This is a rare case where strncpy
has the right semantics. It zeros out
unused destination bytes, destroying any previous contents.
strncpy(e->message, r->message, sizeof(e->message));
// In this case, same as:
memset(e->message, 0, sizeof(e->message));
strcpy(e->message, r->message);
It’s not a general strcpy
replacement because strncpy
might not write
a null terminator. If the source string does not null-terminate within the
destination length, then neither will the destination string.
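For example (a tiny illustration of my own), when the source exactly fills the destination, strncpy writes no terminator at all:
#include <string.h>

void example(void)
{
    char dst[4];
    strncpy(dst, "abcd", sizeof(dst));  // dst = {'a','b','c','d'}, no '\0'
}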
As before, we can do better with memcpy
!
static_assert(sizeof(e->message) >= sizeof(r->message));
memcpy(e->message, r->message, sizeof(r->message));
This unconditionally copies 32 bytes. But doesn’t it waste time copying
bytes it won’t need? No! On modern hardware it’s far better to copy a
large, fixed number of bytes than a small, variable number of bytes. After
all, branching is expensive. Searching for and handling that null
terminator has a cost. This fixed-size copy is literally two instructions
on x86-64 (output of clang -march=x86-64-v3 -O3
):
vmovups ymm0, [rsi]
vmovups [rdi + 8], ymm0
It’s faster and there’s no strcpy
to attract complaints.
So where is strcpy
useful? Only where all of the following apply:
You only know the upper bound of the source string.
It’s undesirable to read beyond that length. Maybe storage is limited to the exact length of the string, or the upper bound is very large so an unconditional copy is too expensive.
The source string is so long, and the function so hot, that it’s worth
avoiding two passes: strlen
followed by memcpy
.
These circumstances are very unusual which makes strcpy
a niche function
you probably don’t need. This is the best case I can imagine, and it’s
pretty dumb:
struct doc {
    unsigned long long id;
    char body[1L<<20];
};

// Create a new document from a buffer.
//
// If body is more than 1MiB, the behavior is undefined.
struct doc *doc_create(const char *body)
{
    struct doc *c = calloc(1, sizeof(*c));
    if (c) {
        c->id = id_gen();
        assert(strlen(body) < sizeof(c->body));
        strcpy(c->body, body);
    }
    return c;
}
If you’re dealing with such large null-terminated strings that (2) and (3) apply then you’re already doing something fundamentally wrong and self-contradictory. The pointer and length should be kept and passed together. It’s especially essential for a hot function.
struct doc_v2 {
    unsigned long long id;
    size_t len;
    char body[];
};
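To sketch what that looks like — doc_create_v2 and its allocation strategy are my own illustration, not from the original — the length travels with the pointer and the copy is a single memcpy with no strlen pass:
// Hypothetical companion constructor: caller supplies pointer + length.
struct doc_v2 *doc_create_v2(const char *body, size_t len)
{
    struct doc_v2 *d = malloc(sizeof(*d) + len);
    if (d) {
        d->id = id_gen();  // same id_gen() as doc_create above
        d->len = len;
        memcpy(d->body, body, len);
    }
    return d;
}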
*_s isn’t helping you
C11 introduced “safe” string functions as an optional “Annex K”, each named with a _s suffix to its “unsafe” counterpart. Here is the prototype for strcpy_s:
errno_t strcpy_s(char *restrict s1,
                 rsize_t s1max,
                 const char *restrict s2);
The rsize_t
is a size_t
with a “restricted” range (RSIZE_MAX
,
probably SIZE_MAX/2
) intended to catch integer underflows. If you
accidentally compute a negative length, it will be a very large
number in unsigned form. (An indicator that size_t
should have
originally been defined as signed.) This will be outside the
restricted range, and so the operation isn’t attempted due to a likely
underflow.
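To make the underflow scenario concrete (a small illustration of my own):
#include <stdio.h>

int main(void)
{
    size_t avail = 2, need = 3;
    size_t n = avail - need;  // intended -1, but size_t wraps to SIZE_MAX
    printf("%zu\n", n);       // 18446744073709551615 on a 64-bit system
    // n is far above RSIZE_MAX, so a conforming strcpy_s(dst, n, src) must
    // report a runtime-constraint violation instead of writing anything.
}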
These “safe” functions were modeled after functions of the same name in MSVC. However, as noted, there are no practical implementations of Annex K. The functions in MSVC have different semantics and behavior, and they do not attempt to implement the standard.
Worse, they don’t even do what’s promised in their documentation.
The following program should cause a runtime-constraint violation since
-1
is an invalid rsize_t
in any reasonable implementation:
#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>
#include <string.h>
int main(void)
{
    char buf[8] = {0};
    errno_t r = strcpy_s(buf, -1, "hello");
    printf("%d %s\n", (int)r, buf);
}
With the latest MSVC as of this writing (VS 2019), this program prints “0
hello
”. Using strcpy_s
did not make my program any safer than had I
just used strcpy
. If anything, it’s less safe due to a false sense of
security. Don’t use these functions.
The primary Go implementation, confusingly named “gc”, is an incredible piece of software engineering. This is apparent when building the Go toolchain itself, a process that is fast, reliable, easy, and simple. It was originally written in C, but was re-written in Go starting with Go 1.5. The C compiler in w64devkit can build the original C implementation which then can be used to bootstrap any more recent version. It’s so easy that I personally never use official binary releases and always bootstrap from source.
You will need the Go 1.4 source, go1.4-bootstrap-20171003.tar.gz. This “bootstrap” tarball is the last Go 1.4 release plus a few additional bugfixes. You will also need the source of the actual version of Go you want to use, such as Go 1.16.5 (latest version as of this writing).
Start by building Go 1.4 using w64devkit. On Windows, Go is built using a
batch script and no special build system is needed. Since it shouldn’t be
invoked with the BusyBox ash shell, I use cmd.exe
explicitly.
$ tar xf go1.4-bootstrap-20171003.tar.gz
$ mv go/ bootstrap
$ (cd bootstrap/src/ && cmd /c make)
In about 30 seconds you’ll have a fully-working Go 1.4 toolchain. Next use it to build the desired toolchain. You can move this new toolchain after it’s built if necessary.
$ export GOROOT_BOOTSTRAP="$PWD/bootstrap"
$ tar xf go1.16.5.src.tar.gz
$ (cd go/src/ && cmd /c make)
At this point you can delete the bootstrap toolchain. You probably also want to put Go on your PATH.
$ rm -rf bootstrap/
$ printf 'PATH="$PATH;%s/go/bin"\n' "$PWD" >>~/.profile
$ source ~/.profile
Not only is Go now available, so is the full power of cgo. (Including its costs if used.)
Since w64devkit is oriented so much around Vim, here’s my personal Vim
configuration for Go. I don’t need or want fancy plugins, just access to
goimports
and a couple of corrections to Vim’s built-in Go support ([[
and ]]
navigation). The included ctags
understands Go, so tags
navigation works the same as it does with C. \i
saves the current
buffer, runs goimports
, and populates the quickfix list with any errors.
Similarly :make
invokes go build
and, as expected, populates the
quickfix list.
autocmd FileType go setlocal makeprg=go\ build
autocmd FileType go map <silent> <buffer> <leader>i
\ :update \|
\ :cexpr system("goimports -w " . expand("%")) \|
\ :silent edit<cr>
autocmd FileType go map <buffer> [[
\ ?^\(func\\|var\\|type\\|import\\|package\)\><cr>
autocmd FileType go map <buffer> ]]
\ /^\(func\\|var\\|type\\|import\\|package\)\><cr>
Go only comes with gofmt
but goimports
is just one command away, so
there’s little excuse not to have it:
$ go install golang.org/x/tools/cmd/goimports@latest
Thanks to GOPROXY, all Go dependencies are accessible without (or before) installing Git, so this tool installation works with nothing more than w64devkit and a bootstrapped Go toolchain.
The intricacies of cgo are beyond the scope of this article, but the gist
is that a Go source file contains C source in a comment followed by
import "C"
. The imported C
object provides access to C types and
functions. Go functions marked with an //export
comment, as well as the
commented C code, are accessible to C. The latter means we can use Go to
implement a C interface in a DLL, and the caller will have no idea they’re
actually talking to Go.
To illustrate, here’s a little C interface. To keep it simple, I’ve specifically sidestepped some more complicated issues, particularly involving memory management.
// Which DLL am I running?
int version(void);
// Generate 64 bits from a CSPRNG.
unsigned long long rand64(void);
// Compute the Euclidean norm.
float dist(float x, float y);
Here’s a C implementation which I’m calling “version 1”.
#include <math.h>
#include <windows.h>
#include <ntsecapi.h>

__declspec(dllexport)
int
version(void)
{
    return 1;
}

__declspec(dllexport)
unsigned long long
rand64(void)
{
    unsigned long long x;
    RtlGenRandom(&x, sizeof(x));
    return x;
}

__declspec(dllexport)
float
dist(float x, float y)
{
    return sqrtf(x*x + y*y);
}
As discussed in the previous article, each function is exported using
__declspec
so that they’re available for import. As before:
$ cc -shared -Os -s -o hello1.dll hello1.c
Side note: This could be trivially converted into a C++ implementation
just by adding extern "C"
to each declaration. It disables C++ features
like name mangling, and follows the C ABI so that the C++ functions appear
as C functions. Compiling the C++ DLL is exactly the same.
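For instance, the shared header could wrap its declarations so the same prototypes serve C and C++ alike (hello.h and this guard are my own illustration of the pattern, not from the original):
// hello.h -- hypothetical shared interface header
#ifdef __cplusplus
extern "C" {
#endif

int version(void);
unsigned long long rand64(void);
float dist(float x, float y);

#ifdef __cplusplus
}
#endif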
Suppose we wanted to implement this in Go instead of C. We already have all the tools needed to do so. Here’s a Go implementation, “version 2”:
package main

import "C"

import (
	"crypto/rand"
	"encoding/binary"
	"math"
)

//export version
func version() C.int {
	return 2
}

//export rand64
func rand64() C.ulonglong {
	var buf [8]byte
	rand.Read(buf[:])
	r := binary.LittleEndian.Uint64(buf[:])
	return C.ulonglong(r)
}

//export dist
func dist(x, y C.float) C.float {
	return C.float(math.Sqrt(float64(x*x + y*y)))
}

func main() {
}
Note the use of C types for all arguments and return values. The main
function is required since this is the main package, but it will never be
called. The DLL is built like so:
$ go build -buildmode=c-shared -o hello2.dll hello2.go
Without the -o
option, the DLL will lack an extension. This works fine
since it’s mostly only convention on Windows, but it may be confusing
without it.
What if we need an import library? This will be required when linking with
the MSVC toolchain. In the previous article we asked Binutils to generate
one using --out-implib
. For Go we have to handle this ourselves via
gendef
and dlltool
.
$ gendef hello2.dll
$ dlltool -l hello2.lib -d hello2.def
The only way anyone upgrading would know version 2 was implemented in Go is that the DLL is a lot bigger (a few MB vs. a few kB) since it now contains an entire Go runtime.
We could also go the other direction and implement the DLL using plain assembly. It won’t even require linking against a C runtime.
w64devkit includes two assemblers: GAS (Binutils) which is used by GCC, and NASM which has friendlier syntax. I prefer the latter whenever possible — exactly why I included NASM in the distribution. So here’s how I implemented “version 3” in NASM assembly.
bits 64
section .text

global DllMainCRTStartup
export DllMainCRTStartup
DllMainCRTStartup:
        mov eax, 1
        ret

global version
export version
version:
        mov eax, 3
        ret

global rand64
export rand64
rand64:
        rdrand rax
        ret

global dist
export dist
dist:
        mulss xmm0, xmm0
        mulss xmm1, xmm1
        addss xmm0, xmm1
        sqrtss xmm0, xmm0
        ret
The global
directive is common in NASM assembly and causes the named
symbol to have the external linkage needed when linking the DLL. The
export
directive is Windows-specific and is equivalent to dllexport
in
C.
Every DLL must have an entrypoint, usually named DllMainCRTStartup
. The
return value indicates if the DLL successfully loaded. So far this has
been handled automatically by the C implementation, but at this low level
we must define it explicitly.
Here’s how to assemble and link the DLL:
$ nasm -fwin64 -o hello3.o hello3.s
$ ld -shared -s -o hello3.dll hello3.o
Python has a nice, built-in C interop, ctypes
, that allows Python to
call arbitrary C functions in shared libraries, including DLLs, without
writing C to glue it together. To tie this all off, here’s a Python
program that loads all of the DLLs above and invokes each of the
functions:
import ctypes

def load(version):
    hello = ctypes.CDLL(f"./hello{version}.dll")
    hello.version.restype = ctypes.c_int
    hello.version.argtypes = ()
    hello.dist.restype = ctypes.c_float
    hello.dist.argtypes = (ctypes.c_float, ctypes.c_float)
    hello.rand64.restype = ctypes.c_ulonglong
    hello.rand64.argtypes = ()
    return hello

for hello in load(1), load(2), load(3):
    print("version", hello.version())
    print("rand ", f"{hello.rand64():016x}")
    print("dist ", hello.dist(3, 4))
After loading the DLL with CDLL
the program defines each function
prototype so that Python knows how to call it. Unfortunately it’s not
possible to build Python with w64devkit, so you’ll also need to install
the standard CPython distribution in order to run it. Here’s the output:
$ python finale.py
version 1
rand b011ea9bdbde4bdf
dist 5.0
version 2
rand f7c86ff06ae3d1a2
dist 5.0
version 3
rand 2a35a05b0482c898
dist 5.0
That output is the result of four different languages interfacing in one process: C, Go, x86-64 assembly, and Python. Pretty neat if you ask me!
]]>In this article, all commands and examples are being run in the context of w64devkit (1.8.0).
If all you care about is the GNU toolchain then DLLs are straightforward,
working mostly like shared objects on other platforms. To illustrate,
let’s build a “square” library with one “exported” function, square
,
that returns the square of its input (square.c
):
long square(long x)
{
    return x * x;
}
The header file (square.h
):
#ifndef SQUARE_H
#define SQUARE_H
long square(long);
#endif
To build a stripped, size-optimized DLL, square.dll
:
$ cc -shared -Os -s -o square.dll square.c
Now a test program to link against it (main.c
), which “imports” square
from square.dll
:
#include <stdio.h>
#include "square.h"

int main(void)
{
    printf("%ld\n", square(2));
}
Linking and testing it:
$ cc -Os -s main.c square.dll
$ ./a
4
It’s that simple. Or more traditionally, using the -l
flag:
$ cc -Os -s -L. main.c -lsquare
Given -lxyz
GCC will look for xyz.dll
in the library path.
Given a DLL, printing a list of its exported functions is not so
straightforward. For ELF shared objects there’s nm -D
, but despite what
the internet will tell you, this tool does not support DLLs. objdump
will print the exports as part of the “private” headers (-p
). A bit of
awk
can cut this down to just a list of exports. Since we’ll need this a
few times, here’s a script, exports.sh
, that composes objdump
and
awk
into the tool I want:
#!/bin/sh
set -e
printf 'LIBRARY %s\nEXPORTS\n' "$1"
objdump -p "$1" | awk '/^$/{t=0} {if(t)print$NF} /^\[O/{t=1}'
Running this on square.dll
above:
$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
square
This can be helpful when debugging. It also works outside of Windows, such
as on Linux. By the way, the output format is no accident: This is the
.def
file format, which will be particularly
useful in a moment.
Mingw-w64 has a gendef
tool to produce the above output, and this tool
is now included in w64devkit. To print the exports to standard output:
$ gendef - square.dll
LIBRARY "square.dll"
EXPORTS
square
Alternatively Visual Studio provides dumpbin
. It’s not as concise as
exports.sh
but it’s a lot less verbose than objdump -p
.
$ dumpbin /nologo /exports square.dll
...
1 0 000012B0 square
...
You can get by without knowing anything more, which is usually enough for those looking to support Windows as a secondary platform, even just as a cross-compilation target. However, with a bit more work we can do better. Imagine doing the above with a non-trivial program. GCC doesn’t know which functions are part of the API and which are not. Obviously static functions should not be exported, but what about non-static functions visible between translation units (i.e. object files)?
For instance, suppose square.c
also has this function which is not part
of its API but may be called by another translation unit.
void internal_func(void) {}
Now when I build:
$ ./exports.sh square.dll
LIBRARY square.dll
EXPORTS
internal_func
square
On the other side, when I build main.c
how does it know which functions
are imported from a DLL and which will be found in another translation
unit? GCC makes it work regardless, but it can generate more efficient
code if it knows at compile time (vs. link time).
On Windows both are solved by adding __declspec
notation on both sides.
In square.c
the exports are marked as dllexport
:
__declspec(dllexport)
long square(long x)
{
    return x * x;
}

void internal_func(void) {}
In the header, it’s marked as an import:
__declspec(dllimport)
long square(long);
The mere presence of dllexport
tells the linker to only export those
functions marked as exports, and so internal_func
disappears from the
exports list. Convenient!
On the import side, during compilation of the original program, GCC
assumed square
wasn’t an import and generated a local function call.
When the linker later resolved the symbol to the DLL, it generated a
trampoline to fill in as that local function (like a PLT). With
dllimport
, GCC knows it’s an imported function and so doesn’t go through
a trampoline.
While generally unnecessary for the GNU toolchain, it’s good hygiene to
use __declspec
. It’s also mandatory when using MSVC, in case you
care about that as well.
Mingw-w64-compiled DLLs will work with LoadLibrary
out of the box, which
is sufficient in many cases, such as for dynamically-loaded plugins. For
example (loadlib.c
):
#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE h = LoadLibrary("square.dll");
    long (*square)(long) = GetProcAddress(h, "square");
    printf("%ld\n", square(2));
}
Compiled with MSVC cl
(via vcvars.bat
):
$ cl /nologo loadlib.c
$ ./loadlib
4
However, the MSVC toolchain is rather primitive and, unlike the GNU
toolchain, cannot link directly with DLLs. It requires an import
library. Conventionally this matches the DLL name but has a .lib
extension — square.lib
in this case. The most convenient way to get an
import library is to ask GCC to generate one at link-time via
--out-implib
:
$ cc -shared -Wl,--out-implib,square.lib -o square.dll square.c
Back to cl
, just add square.lib
as another input. You don’t actually
need square.dll
present at link time.
$ cl /nologo /Os main.c square.lib
$ ./main
4
What if you already have the DLL and you just need an import library? GNU
Binutils’ dlltool
can do this, though not without help. It cannot
generate an import library from a DLL alone since it requires a .def
file enumerating the exports. (Why?) What luck that we have a tool for
this!
$ ./exports.sh square.dll >square.def
$ dlltool --input-def square.def --output-lib square.lib
Going the other way, building a DLL with MSVC and linking it with
Mingw-w64, is nearly as easy as the pure Mingw-w64 case, though it
requires that all exports are tagged with dllexport
. The /LD
(case
sensitive) is just like GCC’s -shared
.
$ cl /nologo /LD /Os square.c
$ cc -Os -s main.c square.dll
$ ./a
4
cl
outputs three files: square.dll
, square.lib
, and square.exp
.
The last can be discarded, and the second will be needed if linking with
MSVC, but as before, Mingw-w64 requires only the first.
This all demonstrates that Mingw-w64 and MSVC are quite interoperable — at least for C interfaces that don’t share CRT objects.
If your program is designed to be portable, those __declspec
will get in
the way. That can be tidied up with some macros, but even better, those
macros can be used to control ELF symbol visibility so that the library
has good hygiene on, say, Linux as well.
The strategy will be to mark all API functions with SQUARE_API
and
expand that to whatever is necessary at the time. When building a library,
it will expand to dllexport
, or default visibility on unix-likes. When
consuming a library it will expand to dllimport
, or nothing outside of
Windows. The new square.h
:
#ifndef SQUARE_H
#define SQUARE_H
#if defined(SQUARE_BUILD)
# if defined(_WIN32)
# define SQUARE_API __declspec(dllexport)
# elif defined(__ELF__)
# define SQUARE_API __attribute__ ((visibility ("default")))
# else
# define SQUARE_API
# endif
#else
# if defined(_WIN32)
# define SQUARE_API __declspec(dllimport)
# else
# define SQUARE_API
# endif
#endif
SQUARE_API
long square(long);
#endif
The new square.c
:
#define SQUARE_BUILD
#include "square.h"
SQUARE_API
long square(long x)
{
    return x * x;
}
main.c
remains the same. When compiling on unix-like systems, add the
-fvisibility=hidden
flag to hide all symbols by default so that this macro
can reveal them.
$ cc -shared -Os -fvisibility=hidden -s -o libsquare.so square.c
$ cc -Os -s main.c ./libsquare.so
$ ./a.out
4
While Mingw-w64 hides a lot of the differences between Windows and
unix-like systems, when it comes to dynamic libraries it can only do so
much, especially if you care about import libraries. If I were maintaining
a dynamic library — unlikely since I strongly prefer embedding or static
linking — I’d probably just use different Makefiles per toolchain
and target. Aside from the SQUARE_API
type of macros, the source code
can fortunately remain fairly agnostic about it.
Here’s what I might use as NMakefile
for MSVC nmake
:
CC = cl /nologo
CFLAGS = /Os

all: main.exe square.dll square.lib

main.exe: main.c square.h square.lib
	$(CC) $(CFLAGS) main.c square.lib

square.dll: square.c square.h
	$(CC) /LD $(CFLAGS) square.c

square.lib: square.dll

clean:
	-del /f main.exe square.dll square.lib square.exp
Usage:
nmake /nologo /f NMakefile
For w64devkit and cross-compiling, Makefile.w64
, which includes
import library generation for the sake of MSVC consumers:
CC = cc
CFLAGS = -Os
LDFLAGS = -s
LDLIBS =

all: main.exe square.dll square.lib

main.exe: main.c square.dll square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c square.dll $(LDLIBS)

square.dll: square.c square.h
	$(CC) -shared -Wl,--out-implib,$(@:dll=lib) \
	    $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

square.lib: square.dll

clean:
	rm -f main.exe square.dll square.lib
Usage:
make -f Makefile.w64
And a Makefile
for everyone else:
CC = cc
CFLAGS = -Os -fvisibility=hidden
LDFLAGS = -s
LDLIBS =

all: main libsquare.so

main: main.c libsquare.so square.h
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ main.c ./libsquare.so $(LDLIBS)

libsquare.so: square.c square.h
	$(CC) -shared $(CFLAGS) $(LDFLAGS) -o $@ square.c $(LDLIBS)

clean:
	rm -f main libsquare.so
Now that I have this article, I’m glad I won’t have to figure this all out again next time I need it!
]]>My small development distribution for Windows, w64devkit, is my own little way of pushing back against this trend where it affects me most. Following in the footsteps of projects like Handmade Hero and Making a Video Game from Scratch, this is my guide to no-nonsense software development using my development kit. It’s an overview of the tooling and development workflow, and I’ve tried not to assume too much knowledge of the reader. Being a guide rather than manual, it is incomplete on its own, and I link to substantial external resources to fill in the gaps. The guide is capped with a small game I wrote entirely using my development kit, serving as a demonstration of what sorts of things are not only possible, but quite reasonably attainable.
Game repository: https://github.com/skeeto/asteroids-demo
Guide to source: Understanding Asteroids
Of course you cannot use the development kit if you don’t have it yet. Go
to the releases section and download the latest release. It will be
a .zip file named w64devkit-x.y.z.zip
where x.y.z
is the version.
You will need to unzip the development kit before using it. Windows has
built-in support for .zip files, so you can either right-click to access
“Extract All…” or navigate into it as a folder then drag-and-drop the
w64devkit
directory somewhere outside the .zip file. It doesn’t care
where it’s unzipped (aka it’s “portable”), so put it wherever is
convenient: your desktop, user profile directory, a thumb drive, etc. You
can move it later if you change your mind just so long as you’re not
actively running it. If you decide you don’t need it anymore then delete
it.
There is a w64devkit.exe
in the unzipped w64devkit
directory. This is
the easiest way to enter the development environment, and will not require
system configuration changes. This program puts the kit’s programs in the
PATH
environment variable then runs a Bourne shell — the standard unix
shell. Aside from the text editor, this is the primary interface for
developing software. In time you may even extend this environment with
your own tools.
If you want an additional “terminal” window, run w64devkit.exe
again. If
you use it a lot, you may want to create a shortcut and even pin it to
your task bar.
Whether on Windows or unix-like systems, when you type a command into the
system shell it uses the PATH
environment variable to locate the actual
program to run for that command. In practice, the PATH
variable is a
concatenation of multiple directories, and the shell searches these
directories in order. On unix-like systems, PATH
elements are separated
by colons. However, Windows uses colons to delimit drive letters, so its
PATH
elements are separated by semicolons.
# Prepending to PATH on unix
PATH="$HOME/bin:$PATH"
# Prepending to PATH on Windows (w64devkit)
PATH="$HOME/bin;$PATH"
For more advanced users: Rather than use w64devkit.exe
, you could “Edit
environment variables for your account” and manually add w64devkit’s bin
directory to your PATH
, making the tools generally available everywhere
on your system. If you’ve gone this route, you can start a Bourne shell at
any time with sh -l
. (The -l
option requests a login shell.)
Also borrowed from the unix world is the concept of a home directory,
specified by the HOME
environment variable. By default this will be your
user profile directory, typically C:/Users/$USER
. Login shells always
start in the home directory. This directory is often indicated by tilde
(~
), and many programs automatically expand a leading tilde to the home
directory.
The shell is a command interpreter. It’s named such because it was
originally a shell around the operating system kernel — the user
interface to the kernel. Your system’s graphical interface — Windows
Explorer, or Explorer.exe
— is really just a kind of shell, too. That
shell is oriented around the mouse and graphics. This is fine for some
tasks, but a keyboard-oriented command shell is far better suited for
development tasks. It’s more efficient, but more importantly its features
are composable: Complex operations and processes can be constructed
from simple, easy-to-understand tools. Embrace it!
In the shell you can navigate between directories with cd
, make
directories with mkdir
, remove files with rm
, regular expression text
searches with grep
, etc. Run busybox
to see a listing of the available
standard commands. Unfortunately there are no manual pages, but you can
access basic usage information for any command with busybox CMD --help
.
Windows’ standard command shell is cmd.exe
. Unfortunately this shell is
terrible and exists mostly for legacy compatibility. The intended
replacement is PowerShell for users who regularly use a shell. However,
PowerShell is fundamentally broken, does virtually everything incorrectly,
and manages to be even worse than cmd.exe
. Besides, sticking to POSIX
shell conventions significantly improves build portability, and unix tool
knowledge is transferable to basically every other operating system.
Unix’s standard shell was the Bourne shell, sh
. The shells in use today
are Bourne shell clones with a superset of its features. The most popular
interactive shells are Bash and Zsh. On Linux, dash (Debian Almquist
shell) has become popular for non-interactive use (scripting). The shell
included with w64devkit is the BusyBox fork of the Almquist shell (ash
),
closely related to dash. The Almquist shell has almost no non-interactive
features beyond the standard Bourne shell, and so as far as scripts are
concerned can be regarded as a plain Bourne shell clone. That’s why I
typically refer to it by the name sh
.
However, BusyBox’s Almquist shell has interactive features much like Bash, and Bash users should be quite comfortable. It’s not just tab-completion but a slew of Emacs-like keybindings:
Take special note of Ctrl-r, which is the most important and powerful shortcut of the bunch. Frequent use is a good habit. Don’t mash the up arrow to search through the command history.
Special note for Cygwin and MSYS2 users: the shell is aware of Windows paths and does not present a virtual unix file system scheme. This has important consequences for scripting, both good and bad. The shell even supports backslash as a directory separator, though you should of course prefer forward slashes.
Login shells (-l
) evaluate the contents of ~/.profile
on startup. This
is your chance to customize the shell configuration, such as setting
environment variables or defining aliases and functions. For instance, if
you wanted the prompt to show the working directory in green you’d set
PS1
in your ~/.profile
:
PS1="$(printf '\x1b[32;1m\\w\x1b[0m$ ')"
If you find yourself using the same command sequences or set of options
again and again, you might consider putting those commands into a script,
and then installing that script somewhere on your PATH
so that you can
run it as a new command. First make a directory to hold your scripts, say
in ~/bin
:
mkdir ~/bin
In ~/.profile
prepend it to your PATH
:
PATH="$HOME/bin;$PATH"
If you don’t want to start a fresh shell to try it out, then load the new configuration in your current shell:
source ~/.profile
Suppose you keep getting the tar
switches mixed up and you’d like to
just have an untar
command that does the right thing. Create a file
named untar
or untar.sh
in ~/bin
with these contents:
#!/bin/sh
set -e
tar -xaf "$@"
Now a command like untar something.tar.gz
will extract the archive
contents.
To learn more about Bourne shell scripting, the POSIX shell command language specification is a good reference. All of the features listed in that document are available to your shell scripts.
The development kit includes the powerful and popular text editor Vim. It takes effort to learn, but is well worth the investment. It’s packed with features, but since you only need a small number of them on a regular basis it’s not as daunting as it might appear. Using Vim effectively, you will write and edit text so much more quickly than before. That includes not just code, but prose: READMEs, documentation, etc.
(The catch: Non-modal editing will forever feel frustratingly inefficient. That’s not because you will become unpracticed at it, or even have trouble code switching between input styles, but because you’ll now be aware how bad it is. Ignorance is bliss.)
Vim includes its own tutorial for absolute beginners which you can access
with the vimtutor
command. It will run in the console window and guide
you through the basics in about half an hour. Do not be afraid to return
to the tutorial at any time since this is the stuff you need to know by
heart.
When it comes time to actually use Vim to write code, you can continue
writing code via the terminal interface (vim
), or you can run the
graphical interface (gvim
). The latter is recommended since it has some
nice quality-of-life features, but it’s not strictly necessary. When
starting the GUI, put an ampersand (&
) on the command so that it runs in
the background. For instance this brings up the editor with two files open
but leaves the shell running in the foreground so you can continue using
it while you edit:
gvim main.c Makefile &
Vim’s defaults are good but imperfect. Before getting started with
actually editing code you should establish at least the following minimal
configuration in ~/_vimrc
. (To understand these better, use :help
to
jump the built-in documentation.)
set hidden encoding=utf-8 shellslash
filetype plugin indent on
syntax on
The graphical interface defaults to a white background. Many people prefer
“dark mode” when editing code, so inverting this is simply a matter of
choosing a dark color scheme. Vim comes with a handful of color schemes,
around half of which have dark backgrounds. Use :colorscheme
to change
it, and put it in your ~/_vimrc
to persist it.
colorscheme slate
The default graphical interface includes a menu bar and tool bar. There are better ways to accomplish all these operations, none of which require touching the mouse, so consider removing all that junk:
set guioptions=ac
Finally, since the development kit is oriented around C and C++, here’s my own entire Vim configuration for C which makes it obey my own style:
set cinoptions+=t0,l1,:0 cinkeys-=0#
Once you’re comfortable with the basics, the best next step is to read Practical Vim: Edit Text at the Speed of Thought by Drew Neil. It’s an opinionated guide to Vim that instills good habits. If you want something cost-free to whet your appetite, check out Seven habits of effective text editing.
We’ve established a shell and text editor. Next is the development workflow for writing an actual application. Ultimately you will invoke a compiler from within Vim, which will parse compiler messages and take you directly to the parts of your source code that need attention. Before we get that far, let’s start with the basics.
The classic example is the “hello world” program, which we’ll suppose is
in a file called hello.c
:
#include <stdio.h>

int main(void)
{
    puts("Hello, world!");
}
While this development kit provides a version of the GNU compiler, gcc
,
this guide mostly speaks of it in terms of the generic unix C compiler
name, cc
. Unix-like systems install cc
as an alias for the system’s
default C compiler, and w64devkit is no exception.
cc -o hello.exe hello.c
This command creates hello.exe
from hello.c
. Since this is not (yet?)
on your PATH
, you must invoke it via a path name (i.e. the command must
include a slash), since otherwise the shell will search for it via the
PATH
variable. Typically this means putting ./
in front of the program
name, meaning “run the program in the current directory”. As a convenience
you do not need to include the .exe
extension:
./hello
Unlike the untar
shell script from before, this hello.exe
is entirely
independent of w64devkit. You can share it with anyone running Windows and
they’ll be able to execute it. There’s a little bit of runtime embedded in
the executable, but the bulk of the runtime is in the operating system
itself. I want to highlight this point because most programming languages
don’t work like this, or at least doing so is unnatural with lots of
compromises. The users of your software do not need to install a runtime
or other supporting software. They just run the executable you give them!
That executable is probably pretty small, less than 50kB — basically a
miracle by today’s standards. Sure, it’s hardly doing anything right now,
but you can add a whole lot more functionality without that executable
getting much bigger. In fact, it’s entirely unoptimized right now and
could be even smaller. Passing the -Os
flag tells the compiler to
optimize for size and -s
flag tells the linker to strip out unneeded
information.
cc -Os -s -o hello.exe hello.c
That cuts the program down to around a third of its previous size. If necessary you can still do even better than this, but that’s outside the scope of this guide.
So far the program could still be valid enough to compile but contain
obvious mistakes. The compiler can warn about many of these mistakes, and
so it’s always worth enabling these warnings. This requires two flags:
-Wall
(“all” warnings) and -Wextra
(extra warnings).
cc -Wall -Wextra -o hello.exe hello.c
When you’re working on a program, you often don’t want optimization
enabled since it makes it more difficult to debug. However, some warnings
aren’t fired unless optimization is enabled. Fortunately there’s an
optimization level to resolve this, -Og
(optimize for debugging).
Combine this with -g3
to embed debug information in the program. This
will be handy later.
cc -Wall -Wextra -Og -g3 -o hello.exe hello.c
These are the compiler flags you typically want to enable while developing
your software. When you distribute it, you’d use either -Os -s
(optimize
for size) or -O3 -s
(optimize for speed).
I mentioned running the compiler from Vim. This isn’t done directly but
via a special build script called a Makefile. You invoke the make
program
from Vim, which invokes the compiler as above. The simplest Makefile would
look like this, in a file literally named Makefile
:
hello.exe: hello.c
cc -Wall -Wextra -Og -g3 -o hello.exe hello.c
This tells make
that the file named hello.exe
is derived from another
file called hello.c
, and the tab-indented line is the recipe for doing
so. Running the make
command will run the compiler command if and only
if hello.c
is newer than hello.exe
.
To run make
from Vim, use the :make
command inside Vim. It will not
only run make
but also capture its output in an internal buffer called
the quickfix list. If there is any warning or error, Vim will jump to
it. Use :cn
(next) and :cp
(prev) to move between issues and correct
them, or :cc
to re-display the current issue. When you’re done fixing
the issues, run :make
again to start the cycle over.
Try that now by changing the printed message and recompiling from within Vim. Intentionally create an error (bad syntax, too many arguments, etc.) and see what happens.
Makefiles are a powerful and conventional way to build C and C++ software. Since the development kit includes the standard set of unix utilities, it’s very easy to write portable Makefiles that work across a variety of operating systems and environments. Your software isn’t necessarily tied to Windows just because you’re using a Windows-based development environment. If you want to learn how Makefiles work and how to use them effectively, read A Tutorial on Portable Makefiles. From here on I’ll assume you’ve read that tutorial.
Ultimately I’d probably write my “hello world” Makefile like so:
.POSIX:
CC = cc
CFLAGS = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS =
EXE = .exe
hello$(EXE): hello.c
$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)
When building a release, optimize for size or speed:
make CFLAGS=-Os LDFLAGS=-s
This is very much a Windows-first style of Makefile, but still allows it
to be comfortably used on other systems. On Linux this make
invocation
strips away the .exe
extension:
make EXE=
For a Windows-second Makefile, remove the line with EXE = .exe
. This
allows EXE
to come from the environment. So, for instance, I already
define the EXE
environment variable in my w64devkit ~/.profile
:
export EXE=.exe
On Linux running make
does the right thing, as does running make
on
Windows. No special configuration required.
If my software is truly limited to Windows, I’m likely still interested in
supporting cross-compilation. A common convention for GNU toolchains is a
CROSS
Makefile macro. For example:
.POSIX:
CROSS =
CC = $(CROSS)gcc
CFLAGS = -Wall -Wextra -Og -g3
LDFLAGS =
LDLIBS =
hello.exe: hello.c
$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c $(LDLIBS)
On Windows I just run make
, but on Linux I’d set CROSS
appropriately.
make CROSS=x86_64-w64-mingw32-
What happens if you’re working on a larger program and you need to jump to
the definition of a function, macro, or variable? It would be tedious to
use grep
all the time to find definitions. The development kit includes
a solid implementation of ctags
for building a tags database that lists the
locations for various kinds of definitions, and Vim knows how to read this
database. Most often you’ll want to run it recursively like so:
ctags -R
You can of course do this from Vim, too: :!ctags -R
With the cursor over an identifier, press CTRL-]
to jump to a definition
for that name. Use :tn
and :tp
to move between different definitions
(e.g. when the name is overloaded). Or if you have a tag in mind rather
than a name listed in the buffer, use the :tag
command to jump by name.
Vim maintains a tag stack and jump list for going back and forth, like the
backward and forward buttons in a browser.
I had mentioned that the -g3
option embeds extra information in the
executable. This is for debuggers, and the development kit includes the
GNU Debugger, gdb
, to help you debug your programs. To use it, invoke
GDB on your executable:
gdb hello.exe
From here you can set breakpoints and such, then run the program with
start
or run
, then step
through it line by line. See Beej’s Quick
Guide to GDB for a guide. During development, always run your
program through GDB, and never exit GDB. See also: Assertions should be
more debugger-oriented.
So far this guide hasn’t actually assumed any C knowledge. One of the best ways to learn C is by reading the highly-regarded The C Programming Language and doing the exercises. Alternatively, cost-free options are Beej’s Guide to C Programming and Modern C (more advanced). You can use the development kit to go through any of these.
I’ve focused on C, but everything above also applies to C++. To learn C++ A Tour of C++ is a safe bet.
To illustrate how much you can do with nothing beyond this 76MB development kit, here’s a taste in the form of a weekend project: an Asteroids Clone for Windows. That’s the game in the video at the top of this guide.
The development kit doesn’t include Git so you’d need to install it separately in order to clone the repository, but you could at least skip that and download a .zip snapshot of the source. It has no third-party dependencies yet it includes hardware-accelerated graphics, real-time sound mixing, and gamepad input. Building a larger and more complex game is much less about tooling and more about time and skill. That’s what I mean about w64devkit being (almost) everything you need.
That solution is a small C source file, alias.c
. This article is
about why it’s necessary and how it works.
Some alias commands are for convenience, such as a cc
alias for gcc
so
that build systems need not assume any particular C compiler. Others are
essential, such as an sh
alias for “busybox sh
” so that it’s available
as a shell for make
. These aliases are usually created with links, hard
or symbolic. A GCC installation might include (roughly) a symbolic link
created like so:
ln -s gcc cc
BusyBox looks at its argv[0]
on startup, and if it names an applet
(ls
, sh
, awk
, etc.), it behaves like that applet. Typically BusyBox
aliases are installed as hard links to the original binary, and there’s
even a busybox --install
to set these up. Both kinds of aliases are
cheap and effective.
ln busybox sh
ln busybox ls
ln busybox awk
Unfortunately links are not supported by .zip files on Windows. They’d
need to be created by a dedicated installer. As a result, I’ve strongly
recommended that users run “busybox --install
” at some point to
establish the BusyBox alias commands. While w64devkit works without them,
it works better with them. Still, that’s an installation step!
An alternative option is to simply include a full copy of the BusyBox binary for each applet — all 150 of them — simulating hard links. BusyBox is small, around 4kB per applet on average, but it’s not quite that small. Since the .zip format doesn’t use block compression — files are compressed individually — this duplication will appear in the .zip itself. My 573kB BusyBox build duplicated 150 times would double the distribution size and increase the installation footprint by 25%. It’s not worth the cost.
Since .zip is so limited, perhaps I should use a different distribution format that supports links. However, another w64devkit goal is making no assumptions about what other tools are installed. Windows natively supports .zip, even if that support isn’t so great (poor performance, low composability, missing features, etc.). With nothing more than the w64devkit .zip on a fresh, offline Windows installation, you can begin efficiently developing professional, native applications in under a minute.
With links off the table, the next best option is a shell script. On
unix-like systems shell scripts are an effective tool for creating complex
alias commands. Unlike links, they can manipulate the argument list. For
instance, w64devkit includes a c99
alias to invoke the C compiler
configured to use the C99 standard. To do this with a shell script:
#!/bin/sh
exec cc -std=c99 "$@"
This prepends -std=c99
to the argument list and passes through the rest
untouched via the Bourne shell’s special case "$@"
. Because I used
exec
, the shell process becomes the compiler in place. The shell
doesn’t hang around in the background. It’s just gone. This is really quite
elegant and powerful.
The closest available on Windows is a .bat batch file. However, like some other parts of DOS and Windows, the Batch language was designed as though its designer once glimpsed at someone using a unix shell, perhaps looking over their shoulder, then copied some of the ideas without understanding them. As a result, it’s not nearly as useful or powerful. Here’s the Batch equivalent:
@cc -std=c99 %*
The @
is necessary because Batch prints its commands by default (Bourne
shell’s -x
option), and @
disables it. Windows lacks the concept of
exec(3)
, so the Batch file interpreter cmd.exe
continues running alongside
the compiler. A little wasteful but that hardly matters. What does matter
though is that cmd.exe
doesn’t behave itself! If you, say, Ctrl+C to
cancel compilation, you will get the infamous “Terminate batch job (Y/N)?”
prompt which interferes with other programs running in the same console.
The so-called “batch” script isn’t a batch job at all: It’s interactive.
I tried to use Batch files for BusyBox applets, but this issue came up
constantly and made this approach impractical. Nearly all BusyBox applets
are non-interactive, and lots of things break when they aren’t. Worst of
all, you can easily end up with layers of cmd.exe
clobbering each other
to ask if they should terminate. It was frustrating.
The prompt is hardcoded in cmd.exe
and cannot be disabled. Since so much
depends on cmd.exe
remaining exactly the way it is, Microsoft will never
alter this behavior either. After all, that’s why they made PowerShell a
new, separate tool.
Speaking of PowerShell, could we use that instead? Unfortunately not:
1. It’s installed by default on Windows, but is not necessarily enabled. One of my own use cases for w64devkit involves systems where PowerShell is disabled by policy. A common policy is that it can be used interactively but not run scripts (“Running scripts is disabled on this system”).
2. PowerShell is not a first class citizen on Windows, and will likely never be. Even under the friendliest policy it’s not normally possible to put a PowerShell script on the PATH and run it by name. (I’m sure there are ways to make this work via system-wide configuration, but that’s off the table.)
3. Everything in PowerShell is broken. For example, it does not support input redirection with files, and instead you must use the cat-like command, Get-Content, to pipe file contents. However, Get-Content translates its input and quietly damages your data. There is no way to disable this “feature” in the version of PowerShell that ships with Windows, meaning it cannot accomplish the simplest of tasks. This is just one of many ways that PowerShell is broken beyond usefulness.
Item (2) also affects w64devkit. It has a Bourne shell, but shell scripts are still not first class citizens since Windows doesn’t know what to do with them. Fixing this would require system-wide configuration, antithetical to the philosophy of the project.
My working solution is inspired by an insanely clever hack used by my
favorite media player, mpv. The Windows build is strange at first
glance, containing two binaries, mpv.exe
(large) and mpv.com
(tiny).
Is that COM as in an old-school 16-bit DOS binary? No, that’s just
a trick that works around a Windows limitation.
The Windows technology is broken up into subsystems. Console programs run
in the Console subsystem. Graphical programs run in the Windows subsystem.
The original WSL was a subsystem. Unfortunately this design means
that a program must statically pick a subsystem, hardcoded into the binary
image. The program cannot select a subsystem dynamically. For example,
this is why Java installations have both java.exe
and javaw.exe
, and
Emacs has emacs.exe
and runemacs.exe
. Different binaries for different
subsystems.
On Linux, a program that wants to do graphics just talks to the Xorg
server or Wayland compositor. It can dynamically choose to be a terminal
application or a graphical application. Or even both at once. This is
exactly the behavior of mpv
, and it faces a dilemma on Windows: With
subsystems, how can it be both?
The trick is based on the environment variable PATHEXT
which tells
Windows how to prioritize executables with the same base name but
different file extensions. If I type mpv
and it finds both mpv.exe
and
mpv.com
, which binary will run? It will be the first listed in
PATHEXT
, and by default that starts with:
PATHEXT=.COM;.EXE;.BAT;...
So it will run mpv.com
, which is actually a plain old PE+ .exe
in disguise. The Windows subsystem mpv.exe
gets the shortcut and file
associations while Console subsystem mpv.com
catches command line
invocations and serves as console liaison as it invokes the real
mpv.exe
. Ingenious!
I realized I can pull a similar trick to create command aliases — not the
.com
trick, but the miniature flagger program. If only I could compile
each of those Batch files to tiny, well-behaved .exe
files so that it
wouldn’t rely on the badly-behaved cmd.exe
…
Years ago I wrote about tiny, freestanding Windows executables.
That research paid off here since that’s exactly what I want. The alias
command program need only manipulate its command line, invoke another
program, then wait for it to finish. This doesn’t require the C library,
just a handful of kernel32.dll
calls. My alias command programs can be
so small that it would no longer matter that I have 150 of them, and I get
complete control over their behavior.
To compile, I use -nostdlib
and -ffreestanding
to disable all system
libraries, -lkernel32
to pull that one back in, -Os
(optimize for
size), and -s
(strip) all to make the result as small as possible.
I don’t want to write a little program for each alias command. Instead
I’ll use a couple of C defines, EXE
and CMD
, to inject the target
command at compile time. So this Batch file:
@target arg1 arg2 %*
Is equivalent to this alias compilation:
gcc -DEXE="target.exe" -DCMD="target arg1 arg2" \
-s -Os -nostdlib -ffreestanding -o alias.exe alias.c -lkernel32
The EXE
string is the actual module name, so the .exe
extension is
required. The CMD
string replaces the first complete token of the
command line string (think argv[0]
) and may contain arbitrary additional
arguments (e.g. -std=c99
). Both are handled as wide strings (L"..."
)
since the alias program uses the wide Win32 API in order to be fully
transparent. Though unfortunately at this time it makes no difference: All
currently aliased programs use the “ANSI” API since the underlying C and
C++ standard libraries only use the ANSI API. (As far as I know, nobody
has ever written fully-functional C and C++ standard libraries for
Windows, not even Microsoft.)
You might wonder why the heck I’m gluing strings together for the arguments. These will need to be parsed (word split, etc.) by someone else, so shouldn’t I construct an argv array instead? That’s not how it works on Windows: Programs receive a flat command string and are expected to parse it themselves following the format specification. When you write a C program, the C runtime does this for you to provide the usual argv array.
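For illustration, here’s a tiny program showing both views of the same invocation. It’s not part of alias.c, just a demonstration of the flat command line versus the CRT’s parsed argv:
#include <windows.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* The flat string Windows actually hands to the process */
    printf("raw: %ls\n", GetCommandLineW());

    /* The same string after the C runtime has parsed it for us */
    for (int i = 0; i < argc; i++) {
        printf("argv[%d]: %s\n", i, argv[i]);
    }
}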
This is upside down. The caller creating the process already has arguments
split into an argv array — or something like it — but Win32 requires the
caller to encode the argv array as a string following a special format so
that the recipient can immediately decode it. Why marshal rather than
pass structured data in the first place? Why does Win32 only supply a
decoder (CommandLineToArgv
) and not an encoder (e.g. the missing
ArgvToCommandLine
)? Hey, I don’t make the rules; I just have to live
with them.
You can look at the original source for the details, but the summary is
that I supply my own xstrlen()
, xmemcpy()
, and partial Win32 command
line parser — just enough to identify the first token, even if that token
is quoted. It glues the strings together, calls CreateProcessW
, waits
for it to exit (WaitForSingleObject
), retrieves the exit code
(GetExitCodeProcess
), and exits with the same status. (The stuff that
comes for free with exec(3)
.)
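To give a feel for the shape of it, here’s a rough sketch of that approach. This is not the actual alias.c: the EXE and CMD values are hardcoded stand-ins for the compile-time defines, the quoted-token case is ignored, and there’s no error handling.
#include <windows.h>

#define EXE L"target.exe"        /* stand-in for -DEXE */
#define CMD L"target arg1 arg2"  /* stand-in for -DCMD */

void mainCRTStartup(void)
{
    /* Skip the first token of our own command line (unquoted case only) */
    wchar_t *args = GetCommandLineW();
    while (*args && *args != L' ') args++;

    /* Build the new command line: CMD followed by the remaining arguments */
    static wchar_t cmd[32768] = CMD;
    lstrcatW(cmd, args);

    /* Launch the real program and wait for it */
    STARTUPINFOW si = {sizeof(si)};
    PROCESS_INFORMATION pi;
    CreateProcessW(EXE, cmd, 0, 0, TRUE, 0, 0, 0, &si, &pi);
    WaitForSingleObject(pi.hProcess, INFINITE);

    /* Propagate the child's exit status as our own */
    DWORD ret;
    GetExitCodeProcess(pi.hProcess, &ret);
    ExitProcess(ret);
}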
This all compiles to a 4kB executable, mostly padding, which is small enough for my purposes. These compress to an acceptable 1kB each in the .zip file. Smaller would be nicer, but this would require at minimum a custom linker script, and even smaller would require hand-crafted assembly.
This lingering issue solved, w64devkit now works better than ever. The
alias.c
source is included in the kit in case you need to make any of
your own well-behaved alias commands.
The tool itself is named xxtea (lowercase), and it’s supported on all unix-like and Windows systems. It’s trivial to compile, even on the latter. The code should be easy to follow from top to bottom, with commentary about specific decisions along the way, though I’ll quote the most important stuff inline here.
The command line options follow the usual conventions. The two
modes of operation are encrypt (-E
) and decrypt (-D
). It defaults to
using standard input and standard output so it works great in pipelines.
Supplying -o
sends output elsewhere (automatically deleted if something
goes wrong), and the optional positional argument indicates an alternate
input source.
usage: xxtea <-E|-D> [-h] [-o FILE] [-p PASSWORD] [FILE]
examples:
$ xxtea -E -o file.txt.xxtea file.txt
$ xxtea -D -o file.txt file.txt.xxtea
If no password is provided (-p
), it prompts for a UTF-8-encoded
password. Of course it’s not normally a good idea to supply a
password via command line argument, but it’s been useful for testing.
TEA stands for Tiny Encryption Algorithm and XXTEA is the second attempt at fixing weaknesses in the cipher — with partial success. The remaining weaknesses should not be an issue for this particular application. XXTEA supports a variable block size, but I’ve hardcoded my implementation to a 128-bit block size, along with some unrolling. I’ve also discarded the unneeded decryption function. There are no data-dependent lookups or branches so it’s immune to speculation attacks.
XXTEA operates on 32-bit words and has a 128-bit key, meaning both block and key are four words apiece. My implementation is about a dozen lines long. Its prototype:
// Encrypt a 128-bit block using 128-bit key
void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4]);
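For reference, here’s a minimal sketch of what such a function might look like: standard XXTEA (Corrected Block TEA) specialized to four 32-bit words, i.e. 6 + 52/4 = 19 rounds. The actual implementation in the tool is unrolled differently, so treat this as illustrative:
#include <stdint.h>

void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4])
{
    uint32_t y, z = block[3], sum = 0, e;
    for (int round = 0; round < 19; round++) {   /* 6 + 52/n rounds, n = 4 */
        sum += 0x9e3779b9;                       /* golden-ratio delta */
        e = (sum >> 2) & 3;
        for (int p = 0; p < 4; p++) {
            y = block[(p + 1) % 4];
            z = block[p] += (((z >> 5 ^ y << 2) + (y >> 3 ^ z << 4)) ^
                             ((sum ^ y) + (key[(p & 3) ^ e] ^ z)));
        }
    }
}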
All cryptographic operations are built from this function. Another way to think about it is that it accepts two 128-bit inputs and returns a 128-bit result:
uint128 r = f(uint128 key, uint128 block);
Tuck that away in the back of your head since this will be important later.
If I tossed the decryption function, how are messages decrypted? I’m sure many have already guessed: XXTEA will be used in counter mode, or CTR mode. Rather than encrypt the plaintext directly, encrypt a 128-bit block counter and treat it like a stream cipher. The message is XORed with the encrypted counter values for both encryption and decryption.
A 128-bit increment with 32-bit limbs is easy:
void
increment(uint32_t ctr[4])
{
/* 128-bit increment, first word changes fastest */
if (!++ctr[0]) if (!++ctr[1]) if (!++ctr[2]) ++ctr[3];
}
In xxtea, words are always marshalled in little endian byte order (least significant byte first). With the first word as the least significant limb, the entire 128-bit counter is itself little endian.
The counter doesn’t start at zero, but at some randomly-selected 128-bit nonce called the initialization vector (IV), wrapping around to zero if necessary (incredibly unlikely). The IV will be included with the message in the clear. This nonce allows one key (password) to be used with multiple messages, as they’ll all be encrypted using different, randomly-chosen regions of an enormous keystream. It also provides semantic security: encrypt the same file more than once and the ciphertext will always be completely different.
for (/* ... */) {
uint32_t cover[4] = {ctr[0], ctr[1], ctr[2], ctr[3]};
xxtea128_encrypt(key, cover);
block[i+0] ^= cover[0];
block[i+1] ^= cover[1];
block[i+2] ^= cover[2];
block[i+3] ^= cover[3];
increment(ctr);
}
That’s encryption, but there’s still a matter of authentication and key derivation function (KDF). To deal with both I’ll need to devise a hash function. Since I’m only using the one primitive, somehow I need to build a hash function from a block cipher. Fortunately there’s a tool for doing just that: the Merkle–Damgård construction.
Recall that xxtea128_encrypt
accepts two 128-bit inputs and returns a
128-bit result. In other words, it compresses 256 bits into 128 bits: a
compression function. The two 128-bit inputs are cryptographically
combined into one 128-bit result. I can repeat this operation to fold an
arbitrary number of 128-bit inputs into a 128-bit hash result.
uint32_t *input = /* ... */;
uint32_t hash[4] = {0, 0, 0, 0};
xxtea128_encrypt(input + 0, hash);
xxtea128_encrypt(input + 4, hash);
xxtea128_encrypt(input + 8, hash);
xxtea128_encrypt(input + 12, hash);
// ...
Note how the input is the key, not the block. The hash state is repeatedly encrypted using the hash inputs as the key, mixing hash state and input. When the input is exhausted, that block is the result. Sort of.
I used zero for the initial hash state in my example, but it will be more challenging to attack if the starting input is something random. Like Blowfish, in xxtea I chose the first 128 bits of the fractional part of pi:
void
xxtea128_hash_init(uint32_t ctx[4])
{
/* first 32 hexadecimal digits of pi */
ctx[0] = 0x243f6a88; ctx[1] = 0x85a308d3;
ctx[2] = 0x13198a2e; ctx[3] = 0x03707344;
}
/* Mix one block into the hash state. */
void
xxtea128_hash_update(uint32_t ctx[4], const uint32_t block[4])
{
xxtea128_encrypt(block, ctx);
}
There are still a couple of problems. First, what if the input isn’t a
multiple of the block size? This time I do need a padding scheme to fill
out that last block. In this case I pad it with bytes where each byte is
the number of padding bytes. For instance, helloworld
becomes, roughly
speaking, helloworld666666
.
That creates a different problem: This will have the same hash result as an input that actually ends with these bytes. So the second rule is that there is always a padding block, even if that block is 100% padding.
Another problem is that the Merkle–Damgård construction is prone to length-extension attacks. Anyone can take my hash result and continue appending additional data without knowing what came before. If I’m using this hash to authenticate the ciphertext, someone could, for example, use this attack to append arbitrary data to the end of messages.
Some important hash functions, such as the most common forms of SHA-2, are vulnerable to length-extension attacks. Keeping this issue in mind, I could address it later using HMAC, but I have an idea for nipping this in the bud now. Before mixing the padding block into the hash state, I swap the two middle words:
/* Append final raw-byte block to hash state. */
void
xxtea128_hash_final(uint32_t ctx[4], const void *buf, int len)
{
assert(len < 16);
unsigned char tmp[16];
memset(tmp, 16-len, 16);
memcpy(tmp, buf, len);
uint32_t k[4] = {
loadu32(tmp + 0), loadu32(tmp + 4),
loadu32(tmp + 8), loadu32(tmp + 12),
};
/* swap middle words to break length extension attacks */
uint32_t swap = ctx[1];
ctx[1] = ctx[2];
ctx[2] = swap;
xxtea128_encrypt(k, ctx);
}
This operation “ties off” the last block so that the hash can’t be extended with more input. Or so I hope. This is my own invention, and so it may not actually work right. Again, this is for fun and learning!
Update: Aristotle Pagaltzis pointed out that when these two words are identical the hash result will be unchanged, leaving it vulnerable to length extension attack. This occurs about once every 2^32 messages, which is far too small a security margin.
Despite all that care, there are still two more potential weaknesses.
First, XXTEA was never designed to be used with the Merkle–Damgård construction. I assume attackers can modify files I will decrypt, and so the hash input is largely under the control of attackers, meaning they control the cipher key. Ciphers are normally designed assuming the key is not under hostile control. This might be vulnerable to related-key attacks.
As will be discussed below, I use this custom hash function in two ways. In one the input is not controlled by attackers, so this is a non-issue. In the second, the hash state is completely unknown to the attacker before they control the input, which I believe mitigates any issues.
Second, a 128-bit hash state is a bit small these days. For very large inputs, the chance of collision via the birthday paradox is a practical issue (collisions become likely around 2^64 blocks).
In xxtea, digests are only computed over a few megabytes of input at a time at most, even when encrypting giant files, so a 128-bit state should be fine.
The user will supply a password and somehow I need to turn that into a 128-bit key.
The first three can be resolved by running the passphrase through the hash
function, using it as key derivation function. What about the last item?
Rather than hash the password once, I concatenate it, including null
terminator, repeatedly until it reaches a certain number of bytes
(hardcoded to 64 MiB, see COST
), and hash that. That’s a computational
workload that attackers must repeat when guessing passwords.
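As a rough sketch of that loop, using the hash interface above and glossing over the IV salt and the constant-time block precomputation described next — the names here are illustrative, not the tool’s actual code:
#include <stdint.h>
#include <string.h>

#define COST (64L * 1024 * 1024)   /* 64 MiB of hash input */

/* Stretch the password to COST bytes by repeating it (null terminator
 * included), hashing 16-byte blocks as we go. The final hash state is
 * the 128-bit key. */
void derive_key(uint32_t key[4], const char *password)
{
    size_t len = strlen(password) + 1;   /* include the null terminator */
    uint32_t ctx[4];
    xxtea128_hash_init(ctx);

    unsigned char buf[16];
    int fill = 0;
    for (long i = 0; i < COST; i++) {
        buf[fill++] = password[i % len];
        if (fill == 16) {
            uint32_t block[4] = {
                loadu32(buf + 0), loadu32(buf + 4),
                loadu32(buf + 8), loadu32(buf + 12),
            };
            xxtea128_hash_update(ctx, block);
            fill = 0;
        }
    }
    xxtea128_hash_final(ctx, buf, fill);  /* final (possibly empty) block */
    memcpy(key, ctx, 16);                 /* hash state becomes the key */
}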
To avoid timing attacks based on the password length, I precompute all possible block arrangements before starting the hash — all the different ways the password might appear concatenated across 16-byte blocks. Blocks may be redundantly computed if necessary to make this part constant time. The hash is fed entirely from these precomputed blocks.
To defend against rainbow tables and the like — as well as make it harder to attack other parts of the message construction — the initialization vector is used as a salt, fed into the hash before the password concatenation.
Unfortunately this KDF isn’t memory-hard, and attackers can use economy of scale to strengthen their attacks (GPUs, custom hardware). However, a memory-hard KDF requires lots of memory to compute the key, making memory an expensive and limiting factor for attackers. Memory-hard KDFs are complex and difficult to design, and I made the trade-off for simplicity.
When I say the encryption is authenticated I mean that it should not be possible for anyone to tamper with the ciphertext undetected without already knowing the key. This is typically accomplished by computing a keyed hash digest and appending it to the message, message authentication code (MAC). Since it’s keyed, only someone who knows the key can compute the digest, and so attackers can’t spoof the MAC.
This is where length-extension attacks come into play: With an improperly constructed MAC, an attacker could append input without knowing the key. Fortunately my hash function isn’t vulnerable to length-extension attacks!
An alternative is to use an authenticated block mode such as GCM, which is still CTR mode at its core. Unfortunately, this is complicated, and, unlike plain CTR, it would take me a long time to convince myself I got it right. So instead I used CTR mode and my hash function in a straightforward way.
At this point there’s a question of what exactly you input into the hash function. Do you hash the plaintext or do you hash the ciphertext? It’s tempting to do the former since it’s (generally) not available to attackers, and would presumably make it harder to attack. This is a mistake. Always compute the MAC over the ciphertext, a.k.a. encrypt then authenticate.
This is called the Doom Principle. Computing the MAC on the plaintext means that recipients must decrypt untrusted ciphertext before authenticating it. This is bad because messages should be authenticated before decryption. So that’s exactly what xxtea does. It also happens to be the simplest option.
We have a hash function, but to compute a MAC we need a keyed hash function. Again, I do the simplest thing that I believe isn’t broken: concatenate the key with the ciphertext. Or more specifically:
MAC = hash(key || ctr || ciphertext)
Update: Dimitrije Erdeljan explains why this is broken and how to fix it. Given a valid MAC, attackers can forge arbitrary messages.
The counter is because xxtea uses chunked authentication with one megabyte chunks. It can authenticate a chunk at a time, which allows it to decrypt, with authentication, arbitrary amounts of ciphertext in a fixed amount of memory. The worst that can happen is truncation between chunks — an acceptable trade-off. The counter ensures each chunk MAC is uniquely keyed and that chunks appear in order.
It’s also important to note that the counter is appended after the key. The counter is under hostile control — they can choose the IV — and having the key there first means they have no information about the hash state.
All chunks are one megabyte except for the last chunk, which is always shorter, signaling the end of the message. It may even be just a MAC and zero-length ciphertext. This avoids nasty issues with parsing potentially unauthenticated length fields and whatnot. Just stop successfully at the first short, authenticated chunk.
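Putting that together, a chunk check might look roughly like this. It reuses the hash functions above; the names and exact counter encoding are illustrative guesses rather than the tool’s actual code:
/* Verify one chunk: MAC = hash(key || ctr || ciphertext), where ctr is
 * the CTR-mode counter at the start of the chunk. Not constant time. */
int chunk_is_valid(const uint32_t key[4], const uint32_t ctr[4],
                   const unsigned char *ciphertext, int len,
                   const unsigned char mac[16])
{
    uint32_t ctx[4];
    xxtea128_hash_init(ctx);
    xxtea128_hash_update(ctx, key);   /* key first: attacker-unknown state */
    xxtea128_hash_update(ctx, ctr);   /* then the chunk's counter */
    for (; len >= 16; len -= 16, ciphertext += 16) {
        uint32_t block[4] = {
            loadu32(ciphertext + 0), loadu32(ciphertext + 4),
            loadu32(ciphertext + 8), loadu32(ciphertext + 12),
        };
        xxtea128_hash_update(ctx, block);
    }
    xxtea128_hash_final(ctx, ciphertext, len);   /* short tail + padding */
    return ctx[0] == loadu32(mac + 0) && ctx[1] == loadu32(mac + 4) &&
           ctx[2] == loadu32(mac + 8) && ctx[3] == loadu32(mac + 12);
}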
Some will likely have spotted it, but a potential weakness is that I’m using the same key for both encryption and authentication. These are normally two different keys. This is disastrous in certain cases like CBC-MAC, but I believe it’s alright here. It would be easy to compute a separate MAC key, but I opted for simple.
In my usual style, encrypted files have no distinguishing headers or
fields. They just look like a random block of data. A file begins with the
16-byte IV, then a sequence of zero or more one megabyte chunks, ending
with a short chunk. It’s indistinguishable from /dev/random
.
[IV][1MiB || MAC][1MiB || MAC][<1MiB || MAC]
If the user types the incorrect password, it will be discovered when authenticating the first chunk (read: immediately). This saves on a dedicated check at the beginning of the file, though it means it’s not possible to distinguish between a bad password and a modified file.
I know my design has weaknesses as a result of artificial, self-imposed constraints and deliberate trade-offs, but I’m curious if I’ve made any glaring mistakes with practical consequences.
I love when my current problem can be solved with a state machine. They’re fun to design and implement, and I have high confidence about correctness. They tend to:
State machines are perhaps one of those concepts you heard about in college but never put into practice. Maybe you use them regularly. Regardless, you certainly run into them regularly, from regular expressions to traffic lights.
Inspired by a puzzle, I came up with this deterministic state
machine for decoding Morse code. It accepts a dot ('.'
), dash
('-'
), or terminator (0) one at a time, advancing through a state
machine step by step:
int morse_decode(int state, int c)
{
static const unsigned char t[] = {
0x03, 0x3f, 0x7b, 0x4f, 0x2f, 0x63, 0x5f, 0x77, 0x7f, 0x72,
0x87, 0x3b, 0x57, 0x47, 0x67, 0x4b, 0x81, 0x40, 0x01, 0x58,
0x00, 0x68, 0x51, 0x32, 0x88, 0x34, 0x8c, 0x92, 0x6c, 0x02,
0x03, 0x18, 0x14, 0x00, 0x10, 0x00, 0x00, 0x00, 0x0c, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x1c, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x24,
0x00, 0x28, 0x04, 0x00, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35,
0x36, 0x37, 0x38, 0x39, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,
0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a
};
int v = t[-state];
switch (c) {
case 0x00: return v >> 2 ? t[(v >> 2) + 63] : 0;
case 0x2e: return v & 2 ? state*2 - 1 : 0;
case 0x2d: return v & 1 ? state*2 - 2 : 0;
default: return 0;
}
}
It typically compiles to under 200 bytes (table included), requires only a few bytes of memory to operate, and will fit on even the smallest of microcontrollers. The full source listing, documentation, and comprehensive test suite:
https://github.com/skeeto/scratch/blob/master/parsers/morsecode.c
The state machine is trie-shaped, and the 100-byte table t
is the static
encoding of the Morse code trie:
Dots traverse left, dashes right, terminals emit the character at the current node (terminal state). Stopping on red nodes, or attempting to take an unlisted edge is an error (invalid input).
Each node in the trie is a byte in the table. Dot and dash each have a bit
indicating if their edge exists. The remaining bits index into a 1-based
character table (at the end of t
), and a 0 “index” indicates an empty
(red) node. The nodes themselves are laid out as a binary heap in an
array: the left and right children of the node at i
are found at
i*2+1
and i*2+2
. No need to waste memory storing edges!
Since C sadly does not have multiple return values, I’m using the sign bit
of the return value to create a kind of sum type. A negative return value
is a state — which is why the state is negated internally before use. A
positive result is a character output. If zero, the input was invalid.
Only the initial state is non-negative (zero), which is fine since it’s,
by definition, not possible to traverse to the initial state. No c
input
will produce a bad state.
In the original problem the terminals were missing. Despite being a state
machine, morse_decode
is a pure function. The caller can save their
position in the trie by saving the state integer and trying different
inputs from that state.
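A quick usage sketch: feed symbols one at a time, then a terminating zero. For example, dot then dash (“.-”) walks the trie to ‘A’:
int state = 0;                      /* initial state */
state = morse_decode(state, '.');   /* negative: partial input, keep going */
state = morse_decode(state, '-');   /* still negative */
int c = morse_decode(state, 0);     /* terminator: returns 'A' */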
The classic UTF-8 decoder state machine is Bjoern Hoehrmann’s Flexible and Economical UTF-8 Decoder. It packs the entire state machine into a relatively small table using clever tricks. It’s easily my favorite UTF-8 decoder.
I wanted to try my own hand at it, so I re-derived the same canonical UTF-8 automaton:
Then I encoded this diagram directly into a much larger (2,064-byte), less elegant table, too large to display inline here:
https://github.com/skeeto/scratch/blob/master/parsers/utf8_decode.c
However, the trade-off is that the executable code is smaller, faster, and branchless again (by accident, I swear!):
int utf8_decode(int state, long *cp, int byte)
{
static const signed char table[8][256] = { /* ... */ };
static const unsigned char masks[2][8] = { /* ... */ };
int next = table[state][byte];
*cp = (*cp << 6) | (byte & masks[!state][next&7]);
return next;
}
Like Bjoern’s decoder, there’s a code point accumulator. The real state machine has 1,109,950 terminal states, and many more edges and nodes. The accumulator is an optimization to track exactly which edge was taken to which node without having to represent such a monstrosity.
Despite the huge table I’m pretty happy with it.
Here’s another state machine I came up with a while back for counting words one Unicode code point at a time while accounting for Unicode’s various kinds of whitespace. If your input is bytes, then plug this into the above UTF-8 state machine to convert bytes to code points! This one uses a switch instead of a lookup table since the table would be sparse (i.e. let the compiler figure it out).
/* State machine counting words in a sequence of code points.
*
* The current word count is the absolute value of the state, so
* the initial state is zero. Code points are fed into the state
* machine one at a time, each call returning the next state.
*/
long word_count(long state, long codepoint)
{
switch (codepoint) {
case 0x0009: case 0x000a: case 0x000b: case 0x000c: case 0x000d:
case 0x0020: case 0x0085: case 0x00a0: case 0x1680: case 0x2000:
case 0x2001: case 0x2002: case 0x2003: case 0x2004: case 0x2005:
case 0x2006: case 0x2007: case 0x2008: case 0x2009: case 0x200a:
case 0x2028: case 0x2029: case 0x202f: case 0x205f: case 0x3000:
return state < 0 ? -state : state;
default:
return state < 0 ? state : -1 - state;
}
}
I’m particularly happy with the edge-triggered state transition mechanism. The sign of the state tracks whether the “signal” is “high” (inside of a word) or “low” (outside of a word), and so it counts rising edges.
The counter is not technically part of the state machine — though it eventually overflows for practical reasons, it isn’t really “finite” — but is rather an external count of the times the state machine transitions from low to high, which is the actual, useful output.
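For example, feeding in a few code points one at a time looks like this, with the running count being the absolute value of the state:
long state = 0;
state = word_count(state, 'H');   /* -1: entered the first word */
state = word_count(state, 'i');   /* -1: still inside it */
state = word_count(state, ' ');   /*  1: left the word, count is 1 */
state = word_count(state, '!');   /* -2: entered a second word */
long count = state < 0 ? -state : state;   /* 2 words */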
Reader challenge: Find a slick, efficient way to encode all those code
points as a table rather than rely on whatever the compiler generates for
the switch
(chain of branches, jump table?).
In languages that support them, state machines can be implemented using coroutines, including generators. I do particularly like the idea of compiler-synthesized coroutines as state machines, though this is a rare treat. The state is implicit in the coroutine at each yield, so the programmer doesn’t have to manage it explicitly. (Though often that explicit control is powerful!)
Unfortunately in practice it always feels clunky. The following implements the word count state machine (albeit in a rather un-Pythonic way). The generator returns the current count and is continued by sending it another code point:
WHITESPACE = {
0x0009, 0x000a, 0x000b, 0x000c, 0x000d,
0x0020, 0x0085, 0x00a0, 0x1680, 0x2000,
0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
}
def wordcount():
count = 0
while True:
while True:
# low signal
codepoint = yield count
if codepoint not in WHITESPACE:
count += 1
break
while True:
# high signal
codepoint = yield count
if codepoint in WHITESPACE:
break
However, the generator ceremony dominates the interface, so you’d probably want to wrap it in something nicer — at which point there’s really no reason to use the generator in the first place:
wc = wordcount()
next(wc) # prime the generator
wc.send(ord('A')) # => 1
wc.send(ord(' ')) # => 1
wc.send(ord('B')) # => 2
wc.send(ord(' ')) # => 2
Same idea in Lua, which famously has full coroutines:
local WHITESPACE = {
[0x0009]=true,[0x000a]=true,[0x000b]=true,[0x000c]=true,
[0x000d]=true,[0x0020]=true,[0x0085]=true,[0x00a0]=true,
[0x1680]=true,[0x2000]=true,[0x2001]=true,[0x2002]=true,
[0x2003]=true,[0x2004]=true,[0x2005]=true,[0x2006]=true,
[0x2007]=true,[0x2008]=true,[0x2009]=true,[0x200a]=true,
[0x2028]=true,[0x2029]=true,[0x202f]=true,[0x205f]=true,
[0x3000]=true
}
function wordcount()
local count = 0
while true do
while true do
-- low signal
local codepoint = coroutine.yield(count)
if not WHITESPACE[codepoint] then
count = count + 1
break
end
end
while true do
-- high signal
local codepoint = coroutine.yield(count)
if WHITESPACE[codepoint] then
break
end
end
end
end
Except for initially priming the coroutine, at least coroutine.wrap()
hides the fact that it’s a coroutine.
wc = coroutine.wrap(wordcount)
wc() -- prime the coroutine
wc(string.byte('A')) -- => 1
wc(string.byte(' ')) -- => 1
wc(string.byte('B')) -- => 2
wc(string.byte(' ')) -- => 2
Finally, a couple more examples not worth describing in detail here. First a Unicode case folding state machine:
https://github.com/skeeto/scratch/blob/master/misc/casefold.c
It’s just an interface to do a lookup into the official case folding table. It was an experiment, and I probably wouldn’t use it in a real program.
Second, I’ve mentioned my UTF-7 encoder and decoder before. It’s not obvious from the interface, but internally it’s just a state machine for both encoder and decoder, which is what allows it to “pause” between any pair of input/output bytes.
Machine learning is a trendy topic, so naturally it’s often used for inappropriate purposes where a simpler, more efficient, and more reliable solution suffices. The other day I saw an illustrative and fun example of this: Neural Network Cars and Genetic Algorithms. The video demonstrates 2D cars driven by a neural network with weights determined by a genetic algorithm. However, the entire scheme can be replaced by a first-degree polynomial without any loss in capability. The machine learning part is overkill.
Above demonstrates my implementation using a polynomial to drive the cars. My wife drew the background. There’s no path-finding; these cars are just feeling their way along the track, “following the rails” so to speak.
My intention is not to pick on this project in particular. The likely motivation in the first place was a desire to apply a neural network to something. Many of my own projects are little more than a vehicle to try something new, so I can sympathize. Though a professional setting is different, where machine learning should be viewed with a more skeptical eye than it’s usually given. For instance, don’t use active learning to select sample distribution when a quasirandom sequence will do.
In the video, the car has a limited turn radius, and minimum and maximum speeds. (I’ve retained these constraints in my own simulation.) There are five sensors — forward, forward-diagonals, and sides — each sensing the distance to the nearest wall. These are fed into a 3-layer neural network, and the outputs determine throttle and steering. Sounds pretty cool!
A key feature of neural networks is that the outputs are a nonlinear function of the inputs. However, steering a 2D car is simple enough that a linear function is more than sufficient, and neural networks are unnecessary. Here are my equations:
steering = C0*input1 - C0*input3
throttle = C1*input2
I only need three of the original inputs — forward for throttle, and
diagonals for steering — and the driver has just two parameters, C0
and
C1
, the polynomial coefficients. Optimal values depend on the track
layout and car configuration, but for my simulation, most values above 0
and below 1 are good enough in most cases. It’s less a matter of crashing
and more about navigating the course quickly.
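A minimal sketch of how those two equations might drive the simulation each tick; the sensor and limit names here are illustrative assumptions, not the actual program’s interface:
#include <math.h>

#define MAX_TURN  0.1f   /* illustrative limits, not the real values */
#define MIN_SPEED 1.0f
#define MAX_SPEED 5.0f

struct car { float x, y, angle, speed; };

/* left/ahead/right are sensor distances to the nearest wall along the
 * diagonal, forward, and other diagonal directions */
void drive(struct car *c, float left, float ahead, float right,
           float c0, float c1, float dt)
{
    float steering = c0*left - c0*right;   /* first-degree polynomial */
    float throttle = c1*ahead;

    /* respect the limited turn radius and speed range */
    steering = fmaxf(-MAX_TURN, fminf(MAX_TURN, steering));
    c->speed = fmaxf(MIN_SPEED, fminf(MAX_SPEED, throttle));

    c->angle += steering * dt;
    c->x += cosf(c->angle) * c->speed * dt;
    c->y += sinf(c->angle) * c->speed * dt;
}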
The lengths of the red lines below are the driver’s three inputs:
These polynomials are obviously much faster than a neural network, but they’re also easy to understand and debug. I can confidently reason about the entire range of possible inputs rather than worry about a trained neural network responding strangely to untested inputs.
Instead of doing anything fancy, my program generates the coefficients at random to explore the space. If I wanted to generate a good driver for a course, I’d run a few thousand of these and pick the coefficients that complete the course in the shortest time. For instance, these coefficients make for a fast, capable driver for the course featured at the top of the article:
C0 = 0.896336973, C1 = 0.0354805067
Many constants can complete the track, but some will be faster than others. If I was developing a racing game using this as the AI, I’d not just pick constants that successfully complete the track, but the ones that do it quickly. Here’s what the spread can look like:
If you want to play around with this yourself, here’s my C source code that implements this driving AI and generates the videos and images above:
Racetracks are just images drawn in your favorite image editing program using the colors documented in the source header.
Pixelmusement produces videos about MS-DOS games and software. Each video ends with a short, randomly-selected listing of financial backers. In ADG Filler #57, Kris revealed the selection process, and it absolutely fits the channel’s core theme: a QBasic program. His program relies on QBasic’s built-in pseudo random number generator (PRNG). Even accounting for the platform’s limitations, the PRNG is much poorer quality than it could be. Let’s discuss these weaknesses and figure out how to make the selection more fair.
Kris’s program seeds the PRNG with the system clock (RANDOMIZE TIMER
, a
QBasic idiom), populates an array with the backers represented as integers
(indices), continuously shuffles the list until the user presses a key, then
finally prints out a random selection from the array. Here’s a simplified
version of the program (note: QBasic comments start with apostrophe '
):
CONST ntickets = 203 ' input parameter
CONST nresults = 12
RANDOMIZE TIMER
DIM tickets(0 TO ntickets - 1) AS LONG
FOR i = 0 TO ntickets - 1
tickets(i) = i
NEXT
CLS
PRINT "Press any key to stop shuffling..."
DO
i = INT(RND * ntickets)
j = INT(RND * ntickets)
SWAP tickets(i), tickets(j)
LOOP WHILE INKEY$ = ""
FOR i = 0 TO nresults - 1
PRINT tickets(i)
NEXT
This should be readable even if you don’t know QBasic. Note: In the real program, backers at higher tiers get multiple tickets in order to weight the results. This is accounted for in the final loop such that nobody appears more than once. It’s mostly irrelevant to the discussion here, so I’ve omitted it.
The final result is ultimately a function of just three inputs:
1. The PRNG seed, i.e. the time of day the program was started (TIMER)
2. The number of tickets (backers)
3. The number of shuffle iterations before a key is pressed
The second item has the nice property that by becoming a backer you influence the result.
QBasic’s PRNG is this 24-bit Linear Congruential Generator (LCG):
uint32_t
rnd24(uint32_t *s)
{
*s = (*s*0xfd43fd + 0xc39ec3) & 0xffffff;
return *s;
}
The result is the entire 24-bit state. RND
divides this by 2^24 and
returns it as a single precision float so that the caller receives a value
between 0 and 1 (exclusive).
Needless to say, this is a very poor PRNG. The LCG constants are reasonable, but the choice to limit the state to 24 bits is strange. According to the QBasic 16-bit assembly (note: the LCG constants listed here are wrong), the implementation is a full 32-bit multiply using 16-bit limbs, and it allocates and writes a full 32 bits when storing the state. As expected for the 8086, there was nothing gained by using only the lower 24 bits.
To illustrate how poor it is, here’s a randogram for this PRNG, which shows obvious structure. (This is a small slice of a 4096x4096 randogram where each of the 2^23 24-bit samples is plotted as two 12-bit coordinates.)
Admittedly this far overtaxes the PRNG. With a 24-bit state, it’s only good for 4,096 (2^12) outputs, after which it no longer follows the birthday paradox: No outputs are repeated even though we should start seeing some. However, as I’ll soon show, this doesn’t actually matter.
Instead of discarding the high 8 bits — the highest quality output bits — QBasic’s designers should have discarded the low 8 bits for the output, turning it into a truncated 32-bit LCG:
uint32_t
rnd32(uint32_t *s)
{
*s = *s*0xfd43fd + 0xc39ec3;
return *s >> 8;
}
This LCG would have the same performance, but significantly better quality. Here’s the randogram for this PRNG, and it is also heavily overtaxed (more than 65,536, 2^16 outputs).
It’s a solid upgrade, completely for free!
That’s not the end of our troubles. The RANDOMIZE
statement accepts a
double precision (i.e. 64-bit) seed. The high 16 bits of its IEEE 754
binary representation are XORed with the next highest 16 bits. The high 16
bits of the PRNG state is set to this result. The lowest 8 bits are
preserved.
To make this clearer, here’s a C implementation, verified against QBasic 7.1:
uint32_t s;
void
randomize(double seed)
{
uint64_t x;
memcpy(&x, &seed, 8);
s = (x>>24 ^ x>>40) & 0xffff00 | (s & 0xff);
}
In other words, RANDOMIZE
only sets the PRNG to one of 65,536 possible
states.
As the final piece, here’s how RND
is implemented, also verified against
QBasic 7.1:
float
rnd(float arg)
{
if (arg < 0) {
memcpy(&s, &arg, 4);
s = (s & 0xffffff) + (s >> 24);
}
if (arg != 0.0f) {
s = (s*0xfd43fd + 0xc39ec3) & 0xffffff;
}
return s / (float)0x1000000;
}
The TIMER
function returns the single precision number of
seconds since midnight with ~55ms precision (i.e. the 18.2Hz timer
interrupt counter). This is strictly time of day, and the current date is
not part of the result, unlike, say, the unix epoch.
This means there are only 1,572,480 distinct values returned by TIMER (86,400 seconds per day times 18.2 ticks per second).
That’s small even before considering that these map onto only 65,536
possible seeds with RANDOMIZE
— all of which are fortunately
realizable via TIMER
.
Of the three inputs to random selection, this first one is looking pretty bad.
Kris’s idea of continuously mixing the array until he presses a key makes
up for many of the QBasic PRNG’s weaknesses. He lets it run for over 200,000
array swaps — traversing over 2% of the PRNG’s period — and the array
itself acts like an extended PRNG state, supplementing the 24-bit RND
state.
Since iterations fly by quickly, the exact number of iterations becomes another source of entropy. The results will be quite different if it runs 214,600 iterations versus 273,500 iterations.
Possible improvement: Only exit the loop when a certain key is pressed. If
any other key is pressed then that input and the TIMER
are mixed into
the PRNG state. Mashing the keyboard during the loop introduces more
entropy.
Since the built-in PRNG is so poor, we could improve the situation by
implementing a new one in QBasic itself. The challenge is that
QBasic has no unsigned integers, not even unsigned integer operators (e.g.
Java and JavaScript’s >>>
), and signed overflow is a run-time error. We
can’t even re-implement QBasic’s own LCG without doing long multiplication
in software, since the intermediate result overflows its 32-bit LONG
.
Popular choices under these constraints are the Park–Miller generator (as we saw in Bash) or a lagged Fibonacci generator (as used by Emacs, which was for a long time constrained to 29-bit integers).
However, I have a better idea: a PRNG based on RC4. Specifically,
my own design called Sponge4, a sponge construction
built atop RC4. In short: Mixing in more input is just a matter of running
the key schedule again. Implementing this PRNG requires just two simple
operations: modular addition over 2^8, and array swap. QBasic has a SWAP
statement, so it’s a natural fit!
Sponge4 (RC4) has much higher quality output than the 24-bit LCG, and I can mix in more sources of entropy. With its 1,700-bit state, it can absorb quite a bit of entropy without loss.
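Since the rest of this article is QBasic, here’s the same construction sketched in C for reference. It’s a direct translation of the QBasic routines shown below, relying on uint8_t wraparound instead of MOD 256:
#include <stdint.h>

struct sponge4 { uint8_t i, j, k, s[256]; };

void init(struct sponge4 *r)
{
    r->i = r->j = r->k = 0;
    for (int i = 0; i < 256; i++) r->s[i] = i;
}

void absorb(struct sponge4 *r, uint8_t b)   /* one step of the key schedule */
{
    uint8_t t;
    r->j += r->s[r->i] + b;
    t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;
    r->i++;
    r->k++;
}

void absorbstop(struct sponge4 *r)          /* "stop" symbol between inputs */
{
    r->j++;
}

uint8_t squeeze(struct sponge4 *r)          /* mix pending input, then RC4 */
{
    if (r->k) {
        absorbstop(r);
        while (r->k) absorb(r, r->k);
    }
    r->j += r->i;
    r->i++;
    uint8_t t = r->s[r->i]; r->s[r->i] = r->s[r->j]; r->s[r->j] = t;
    return r->s[(uint8_t)(r->s[r->i] + r->s[r->j])];
}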
Until this past weekend, I had not touched QBasic for about 23 years and had to learn it essentially from scratch. Though within a couple of hours I probably already understood it better than I ever had. That’s in large part because I’m far more experienced, but also probably because QBasic tutorials are universally awful. Not surprisingly they’re written for beginners, but they all seem to be written by beginners, too. I soon got the impression that the QBasic community has usually been another case of the blind leading the blind.
There’s little direct information for experienced programmers, and even the official documentation tends to be thin in important places. I wanted documentation that started with the core language semantics:
The basic types are INTEGER (int16), LONG (int32), SINGLE (float32), DOUBLE (float64), and two flavors of STRING, fixed-width and variable-width. Late versions also had incomplete support for a 64-bit, 10,000x fixed-point CURRENCY type.
Variables are SINGLE by default and do not need to be declared ahead of time. Arrays have 11 elements by default.
Variables, constants, and functions may have a suffix if their type is
not SINGLE: INTEGER %
, LONG &
, SINGLE !
, DOUBLE #
, STRING $
,
and CURRENCY @
. For functions, this is the return type.
Each variable type has its own namespace, i.e. i%
is distinct from
i&
. Arrays are also their own namespace, i.e. i%
is distinct from
i%(0)
is distinct from i&(0)
.
Variables may be declared explicitly with DIM
. Declaring a variable
with DIM
allows the suffix to be omitted. It also locks that name out
of the other type namespaces, i.e. DIM i AS LONG
makes any use of i%
invalid in that scope. Though arrays and scalars can still have the same
name even with DIM
declarations.
Numeric operations with mixed types implicitly promote like C.
Functions and subroutines have a single, common namespace regardless of function suffix. As a result, the suffix can (usually) be omitted at function call sites. Built-in functions are special in this case.
Despite initial appearances, QBasic is statically-typed.
The default is pass-by-reference. Use BYVAL
to pass by value.
In array declarations, the parameter is not the size but the largest
index. Multidimensional arrays are supported. Arrays need not be indexed
starting at zero (e.g. (x TO y)
), though this is the default.
Strings are not arrays, but their own special thing with special accessor statements and functions.
Scopes are module, subroutine, and function. “Global” variables must be
declared with SHARED
.
Users can define custom structures with TYPE
. Functions cannot return
user-defined types and instead rely on pass-by-reference.
A crude kind of dynamic allocation is supported with REDIM
to resize
$DYNAMIC
arrays at run-time. ERASE
frees allocations.
These are the semantics I wanted to know getting started. Throw in some illustrative examples, and then it’s a tutorial for experienced developers. (Future article perhaps?) Anyway, that’s enough to follow along below.
Like RC4, I need a 256-element byte array, and two 1-byte indices, i
and
j
. Sponge4 also keeps a third 1-byte counter, k
, to count input.
TYPE sponge4
i AS INTEGER
j AS INTEGER
k AS INTEGER
s(0 TO 255) AS INTEGER
END TYPE
QBasic doesn’t have a “byte” type. A fixed-size 256-byte string would
normally be a good match here, but since they’re not arrays, strings are
not compatible with SWAP
and are not indexed efficiently. So instead I
accept some wasted space and use 16-bit integers for everything.
There are four “methods” for this structure. Three are subroutines since
they don’t return a value, but mutate the sponge. The last, squeeze
,
returns the next byte as an INTEGER (%
).
DECLARE SUB init (r AS sponge4)
DECLARE SUB absorb (r AS sponge4, b AS INTEGER)
DECLARE SUB absorbstop (r AS sponge4)
DECLARE FUNCTION squeeze% (r AS sponge4)
Initialization follows RC4:
SUB init (r AS sponge4)
r.i = 0
r.j = 0
r.k = 0
FOR i% = 0 TO 255
r.s(i%) = i%
NEXT
END SUB
Absorbing a byte means running the RC4 key schedule one step. Absorbing a “stop” symbol, for separating inputs, transforms the state in a way that absorbing a byte cannot.
SUB absorb (r AS sponge4, b AS INTEGER)
r.j = (r.j + r.s(r.i) + b) MOD 256
SWAP r.s(r.i), r.s(r.j)
r.i = (r.i + 1) MOD 256
r.k = (r.k + 1) MOD 256
END SUB
SUB absorbstop (r AS sponge4)
r.j = (r.j + 1) MOD 256
END SUB
Squeezing a byte may involve mixing the state first, then it runs the RC4 generator normally.
FUNCTION squeeze% (r AS sponge4)
IF r.k > 0 THEN
absorbstop r
DO WHILE r.k > 0
absorb r, r.k
LOOP
END IF
r.j = (r.j + r.i) MOD 256
r.i = (r.i + 1) MOD 256
SWAP r.s(r.i), r.s(r.j)
squeeze% = r.s((r.s(r.i) + r.s(r.j)) MOD 256)
END FUNCTION
That’s the entire generator in QBasic! A couple more helper functions will be useful, though. One absorbs entire strings, and the second emits 24-bit results.
SUB absorbstr (r AS sponge4, s AS STRING)
FOR i% = 1 TO LEN(s)
absorb r, ASC(MID$(s, i%))
NEXT
END SUB
FUNCTION squeeze24& (r AS sponge4)
b0& = squeeze%(r)
b1& = squeeze%(r)
b2& = squeeze%(r)
squeeze24& = b2& * &H10000 + b1& * &H100 + b0&
END FUNCTION
QBasic doesn’t have bit-shift operations, so we must make do with
multiplication. The &H
is hexadecimal notation.
One of the problems with the original program is that only the time of day
was a seed. Even were it mixed better, if we run the program at exactly
the same instant on two different days, we get the same seed. The DATE$
function returns the current date, which we can absorb into the sponge to
make the whole date part of the input.
DIM sponge AS sponge4
init sponge
absorbstr sponge, DATE$
absorbstr sponge, MKS$(TIMER)
absorbstr sponge, MKI$(ntickets)
I follow this up with the timer. It’s converted to a string with MKS$
,
which returns the little-endian, single precision binary representation as
a 4-byte string. MKI$
does the same for INTEGER, as a 2-byte string.
One of the problems with the original program was bias: Multiplying RND
by a constant, then truncating the result to an integer is not uniform in
most cases. Some numbers are selected slightly more often than others
because 2^24 inputs cannot map uniformly onto, say, 10 outputs. With all
the shuffling in the original it probably doesn’t make a practical
difference, but I’d like to avoid it.
In my program I account for it by generating another number if it happens to fall into that extra “tail” part of the input distribution (very unlikely for small ntickets). The squeezen function uniformly generates a number from 0 (inclusive) to N (exclusive).
FUNCTION squeezen% (r AS sponge4, n AS INTEGER)
DO
x& = squeeze24&(r) - &H1000000 MOD n
LOOP WHILE x& < 0
squeezen% = x& MOD n
END FUNCTION
Finally a Fisher–Yates shuffle, then print the first N elements:
FOR i% = ntickets - 1 TO 1 STEP -1
j% = squeezen%(sponge, i% + 1)
SWAP tickets(i%), tickets(j%)
NEXT
FOR i% = 1 TO nresults
PRINT tickets(i%)
NEXT
Though if you really love Kris’s loop idea:
PRINT "Press Esc to finish, any other key for entropy..."
DO
c& = c& + 1
LOCATE 2, 1
PRINT "cycles ="; c&; "; keys ="; k%
FOR i% = ntickets - 1 TO 1 STEP -1
j% = squeezen%(sponge, i% + 1)
SWAP tickets(i%), tickets(j%)
NEXT
k$ = INKEY$
IF k$ = CHR$(27) THEN
EXIT DO
ELSEIF k$ <> "" THEN
k% = k% + 1
absorbstr sponge, k$
END IF
absorbstr sponge, MKS$(TIMER)
LOOP
If you want to try it out for yourself in, say, DOSBox, here’s the full
source: sponge4.bas
British Square is a 1978 abstract strategy board game which I recently discovered from a YouTube video. It’s well-suited to play by pencil-and-paper, so my wife and I played a few rounds to try it out. Curious about strategies, I searched online for analysis and found nothing whatsoever, meaning I’d have to discover strategies for myself. This is exactly the sort of problem that nerd-snipes me, and so I sunk a couple of evenings building an analysis engine in C — enough to fully solve the game and play perfectly.
Repository: British Square Analysis Engine (and prebuilt binaries)
The game is played on a 5-by-5 grid with two players taking turns placing pieces of their color. Pieces may not be placed on tiles 4-adjacent to an opposing piece, and as a special rule, the first player may not play the center tile on the first turn. Players pass when they have no legal moves, and the game ends when both players pass. The score is the difference between the piece counts for each player.
In the default configuration, my engine takes a few seconds to explore the full game tree, then presents the minimax values for the current game state along with the list of perfect moves. The UI allows manually exploring down the game tree. It’s intended for analysis, but there’s enough UI present to “play” against the AI should you so wish. For some of my analysis I made small modifications to the program to print or count game states matching certain conditions.
Not accounting for symmetries, there are 4,233,789,642,926,592 possible playouts. In these playouts, the first player wins 2,179,847,574,830,592 (~51%), the second player wins 1,174,071,341,606,400 (~28%), and the remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious the first player has a huge advantage.
Accounting for symmetries, there are 8,659,987 total game states. Of these, 6,955 are terminal states, of which the first player wins 3,599 (~52%) and the second player wins 2,506 (~36%). This small number of states is what allows the engine to fully explore the game tree in a few seconds.
Most importantly: The first player can always win by two points. In other words, it’s not like Tic-Tac-Toe where perfect play by both players results in a tie. Due to the two-point margin, the first player also has more room for mistakes and usually wins even without perfect play. There are fewer opportunities to blunder, and a single blunder usually results in a lower win score. The second player has a narrow lane of perfect play, making it easy to blunder.
Below is the minimax analysis for the first player’s options. The number is the first player’s score given perfect play from that point — i.e. perfect play starts on the tiles marked “2”, and the tiles marked “0” are blunders that lead to ties.
11111
12021
10-01
12021
11111
The special center rule probably exists to reduce the first player’s obvious advantage, but in practice it makes little difference. Without the rule, the first player has an additional (fifth) branch for a win by two points:
11111
12021
10201
12021
11111
Improved alternative special rule: Bias the score by two in favor of the second player. This fully eliminates the first player’s advantage, perfect play by both sides results in a tie, and both players have a narrow lane of perfect play.
The four tie openers are interesting because the reasoning does not require computer assistance. If the first player opens on any of those tiles, the second player can mirror each of the first player’s moves, guaranteeing a tie. Note: The first player can still make mistakes that result in a second player win if the second player knows when to stop mirroring.
One of my goals was to develop a heuristic so that even human players can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was not able to develop any such heuristic, though I was able to prove that a greedy heuristic — always claim as much territory as possible — is often incorrect and, in some cases, leads to blunders.
As I’ve done before, my engine represents the game using bitboards. Each player has a 25-bit bitboard representing their pieces. To make move validation more efficient, it also sometimes tracks a “mask” bitboard where invalid moves have been masked. Updating all bitboards is cheap (place(), mask()), as is validating moves against the mask (valid()).
The longest possible game is 32 moves. This would just fit in 5 bits, except that I needed a special “invalid” turn, making 33 distinct values in total. So I use 6 bits to store the turn counter.
Besides generally being unnecessary, the validation masks can be derived from the main bitboards, so I don’t need to store them in the game tree. That means I need 25 bits per player, and 6 bits for the counter: 56 bits total. I pack these into a 64-bit integer. The first player’s bitboard goes in the bottom 25 bits, the second player in the next 25 bits, and the turn counter in the topmost 6 bits. The turn counter starts at 1, so an all zero state is invalid. I exploit this in the hash table so that zeroed slots are empty (more on this later).
In other words, the empty state is 0x4000000000000 (INIT) and zero is the null (invalid) state.
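As a rough sketch of that layout (these accessor names are hypothetical, not the engine's own):

#include <stdint.h>

#define INIT ((uint64_t)1 << 50)   /* turn counter = 1, both boards empty */

/* Hypothetical accessors illustrating the 25 + 25 + 6 bit layout. */
static uint64_t player1(uint64_t s) { return s & 0x1ffffff; }           /* bits  0-24 */
static uint64_t player2(uint64_t s) { return (s >> 25) & 0x1ffffff; }   /* bits 25-49 */
static int      turn(uint64_t s)    { return (int)((s >> 50) & 0x3f); } /* bits 50-55 */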
Since the state is so small, rather than passing a pointer to a state to be acted upon, bitboard functions return a new bitboard with the requested changes… functional style.
// Compute bitboard+mask where first play is tile 6
// -----
// -X---
// -----
// -----
// -----
uint64_t b = INIT;
uint64_t m = INIT;
b = place(b, 6);
m = mask(m, 6);
The engine uses minimax to propagate information up the tree. Since the search extends to the very bottom of the tree, the minimax “heuristic” evaluation function is the actual score, not an approximation, which is why it’s able to play perfectly.
When I’ve used minimax before, I built an actual tree data structure in memory, linking states by pointer / reference. In this engine there is no such linkage, and instead the links are computed dynamically via the validation masks. Storing the pointers is more expensive than computing their equivalents on the fly, so I don’t store them. Therefore my game tree only requires 56 bits per node — or 64 bits in practice since I’m using a 64-bit integer. With only 8,659,987 nodes to store, that’s a mere 66MiB of memory! This analysis could have easily been done on commodity hardware two decades ago.
What about the minimax values? Game scores range from -10 to 11: 22 distinct values. (That the first player can score up to 11 and the second player at most 10 is another advantage to going first.) That’s 5 bits of information. However, I didn’t have this information up front, and so I assumed a range from -25 to 25, which requires 6 bits.
There are still 8 spare bits left in the 64-bit integer, so I use 6 of them for the minimax score. Rather than worry about two’s complement, I bias the score to eliminate negative values before storing it. So the minimax score rides along for free above the state bits.
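A minimal sketch of that trick, assuming the -25 to 25 range and using hypothetical helper names:

#include <stdint.h>

#define STATE_MASK 0xffffffffffffff   /* low 56 bits: both boards + turn */
#define SCORE_BIAS 25                 /* shifts -25..25 into 0..50 */

/* Hypothetical helpers: stash a biased minimax score in bits 56-61. */
static uint64_t with_score(uint64_t state, int score)
{
    return (state & STATE_MASK) | ((uint64_t)(score + SCORE_BIAS) << 56);
}

static int score_of(uint64_t state)
{
    return (int)((state >> 56) & 0x3f) - SCORE_BIAS;
}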
The vast majority of game tree branches are redundant. Even without taking symmetries into account, nearly all states are reachable from multiple branches. Exploring all these redundant branches would take centuries. If I run into a state I’ve seen before, I don’t want to recompute it.
Once I’ve computed a result, I store it in a hash table so that I can find it later. Since the state is just a 64-bit integer, I use an integer hash function to compute a starting index from which to linearly probe an open addressing hash table. The entire hash table implementation is literally a dozen lines of code:
uint64_t *
lookup(uint64_t bitboard)
{
static uint64_t table[N];
uint64_t mask = 0xffffffffffffff; // sans minimax
uint64_t hash = bitboard;
hash *= 0xcca1cee435c5048f;
hash ^= hash >> 32;
for (size_t i = hash % N; ; i = (i + 1) % N) {
if (!table[i] || (table[i] & mask) == bitboard) {
return &table[i];
}
}
}
If the bitboard is not found, it returns a pointer to the (zero-valued) slot where it should go so that the caller can fill it in.
Memoization eliminates nearly all redundancy, but there’s still a major optimization left. Many states are equivalent by symmetry or reflection. Taking that into account, about 7/8th of the remaining work can still be eliminated.
Multiple different states that are identical by symmetry must somehow be “folded” into a single, canonical state to represent them all. I do this by visiting all 8 rotations and reflections and choosing the one with the smallest 64-bit integer representation.
I only need two operations to visit all 8 symmetries, and I chose transpose (flip around the diagonal) and vertical flip. Alternating between these operations visits each symmetry. Since they’re bitboards, transforms can be implemented using fancy bit-twiddling hacks. Chess boards, with their power-of-two dimensions, have useful properties which these British Square boards lack, so this is the best I could come up with:
// Transpose a board or mask (flip along the diagonal).
uint64_t
transpose(uint64_t b)
{
return ((b >> 16) & 0x00000020000010) |
((b >> 12) & 0x00000410000208) |
((b >> 8) & 0x00008208004104) |
((b >> 4) & 0x00104104082082) |
((b >> 0) & 0xfe082083041041) |
((b << 4) & 0x01041040820820) |
((b << 8) & 0x00820800410400) |
((b << 12) & 0x00410000208000) |
((b << 16) & 0x00200000100000);
}
// Flip a board or mask vertically.
uint64_t
flipv(uint64_t b)
{
return ((b >> 20) & 0x0000003e00001f) |
((b >> 10) & 0x000007c00003e0) |
((b >> 0) & 0xfc00f800007c00) |
((b << 10) & 0x001f00000f8000) |
((b << 20) & 0x03e00001f00000);
}
These transform both players’ bitboards in parallel while leaving the turn counter intact. The logic here is quite simple: Shift the bitboard a little bit at a time while using a mask to deposit bits in their new home once they’re lined up. It’s like a coin sorter. Vertical flip is analogous to byte-swapping, though with 5-bit “bytes”.
Canonicalizing a bitboard now looks like this:
uint64_t
canonicalize(uint64_t b)
{
uint64_t c = b;
b = transpose(b); c = c < b ? c : b;
b = flipv(b); c = c < b ? c : b;
b = transpose(b); c = c < b ? c : b;
b = flipv(b); c = c < b ? c : b;
b = transpose(b); c = c < b ? c : b;
b = flipv(b); c = c < b ? c : b;
b = transpose(b); c = c < b ? c : b;
return c;
}
Callers need only use canonicalize() on values they pass to lookup() or store in the table (via the returned pointer).
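Putting the pieces together, a memoized evaluation might be shaped roughly like this (evaluate() is a stand-in for the recursive minimax, and with_score() is the hypothetical helper sketched earlier):

uint64_t key   = canonicalize(board);
uint64_t *slot = lookup(key);
if (!*slot) {
    int score = evaluate(board);     /* hypothetical: recurse over legal moves */
    *slot = with_score(key, score);  /* memoize for every symmetric duplicate */
}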
If you can come up with a perfect play heuristic, especially one that can be reasonably performed by humans, I’d like to hear it. My engine has a built-in heuristic tester, so I can test it against perfect play at all possible game positions to check that it actually works. It’s currently programmed to test the greedy heuristic and print out the millions of cases where it fails. Even a heuristic that fails in only a small number of cases would be pretty reasonable.
This past May I put together my own C and C++ development distribution for Windows called w64devkit. The entire release weighs under 80MB and requires no installation. Unzip and run it in-place anywhere. It’s also entirely offline. It will never automatically update, or even touch the network. In mere seconds any Windows system can become a reliable development machine. (To further increase reliability, disconnect it from the internet.) Despite its simple nature and small packaging, w64devkit is almost everything you need to develop any professional desktop application, from a command line utility to an AAA game.
I don’t mean this in some useless Turing-complete sense, but in a practical, get-stuff-done sense. It’s much more a matter of know-how than of tools or libraries. So then what is this “almost” about?
The distribution does not have WinAPI documentation. It’s notoriously difficult to obtain and, besides, unfriendly to redistribution. It’s essential for interfacing with the operating system and difficult to work without. Even a dead tree reference book would suffice.
Depending on what you’re building, you may still need specialized tools. For instance, game development requires tools for editing art assets.
There is no formal source control system. Git is excluded per the issues noted in the announcement, and my next option, Quilt, has similar limitations. However, diff and patch are included, and are sufficient for a kind of old-school, patch-based source control. I’ve used it successfully when dogfooding w64devkit in a fresh Windows installation.
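For example, a crude but serviceable patch-based flow (the paths are purely illustrative):

$ cp -r project project.orig
$ vi project/src/main.c                          # hack away
$ diff -urN project.orig project > feature.patch
$ cd copy-on-other-machine && patch -p1 < ../feature.patch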
As I said in my announcement, w64devkit includes a powerful text editor that fulfills all text editing needs, from code to documentation. The editor includes a tutorial (vimtutor) and complete, built-in manual (:help) in case you’re not yet familiar with it.
What about navigation? Use the included ctags to generate a tags database (ctags -R), then jump instantly to any definition at any time. No need for that Language Server Protocol rubbish. This does not mean you must laboriously type identifiers as you work. Use built-in completion!
Build system? That’s also covered, via a Windows-aware unix-like environment that includes make. Learning how to use it is a breeze. Software is by its nature unavoidably complicated, so don’t make it more complicated than necessary.
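A minimal Makefile covers most small projects. As a sketch, assuming a hypothetical single-file hello.c:

CC      = cc
CFLAGS  = -Os -Wall -Wextra
LDFLAGS = -s

hello.exe: hello.c
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ hello.c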
What about debugging? Use the debugger, GDB. Performance problems? Use the profiler, gprof. Inspect compiler output either by asking for it (-S) or via the disassembler (objdump -d). No need to go online for the Godbolt Compiler Explorer, as slick as it is. If the compiler output is insufficient, use SIMD intrinsics. In the worst case there are two different assemblers available. Real time graphics? Use an operating system API like OpenGL, DirectX, or Vulkan.
w64devkit really is nearly everything you need in a single, no nonsense, fully-offline package! It’s difficult to emphasize this point as much as I’d like. When interacting with the broader software ecosystem, I often despair that software development has lost its way. This distribution is my way of carving out an escape from some of the insanity. As a C and C++ toolchain, w64devkit by default produces lean, sane, trivially-distributable, offline-friendly artifacts. All runtime components in the distribution are static link only, so no need to distribute DLLs with your application either.
While most users would likely stick to my published releases, building w64devkit is a two-step process with a single build dependency, Docker. Anyone can easily customize it for their own needs. Don’t care about C++? Toss it to shave 20% off the distribution. Need to tune the runtime for a specific microarchitecture? Tweak the compiler flags.
One of the intended strengths of open source is users can modify software to suit their needs. With w64devkit, you own the toolchain itself. It is one of your dependencies after all. Unfortunately the build initially requires an internet connection even when working from source tarballs, but at least it’s a one-time event.
If you choose to take on dependencies, and you build those dependencies using w64devkit, all the better! You can tweak them to your needs and choose precisely how they’re built. You won’t be relying on the goodwill of internet randos nor the generosity of a free package registry.
Building existing software using w64devkit is probably easier than expected, particularly since much of it has already been “ported” to MinGW and Mingw-w64. Just don’t bother with GNU Autoconf configure scripts. They never work in w64devkit despite having everything they technically need. So other than that, here’s a demonstration of building some popular software.
One of my coworkers uses his own version of PuTTY patched to play more nicely with Emacs. If you wanted to do the same, grab the source tarball, unpack it using the provided tools, then in the unpacked source:
$ make -C windows -f Makefile.mgw
You’ll have a custom-built putty.exe, as well as the other tools. If you have any patches, apply those first!
Would you like to embed an extension language in your application? Lua is a solid choice, in part because it’s such a well-behaved dependency. After unpacking the source tarball:
$ make PLAT=mingw
This produces a complete Lua compiler, runtime, and library. It’s not even necessary to use the Makefile, as it’s nearly as simple as “cc *.c” — painless to integrate or embed into any project.
Do you enjoy NetHack? Perhaps you’d like to try a few of the custom patches. This one is a little more complicated, but I was able to build NetHack 3.6.6 like so:
$ sys/winnt/nhsetup.bat
$ make -C src -f Makefile.gcc cc="cc -fcommon" link="cc"
NetHack has a bug necessitating -fcommon. If you have any patches, apply them with patch before the last step. I won’t belabor it here, but with just a little more effort I was also able to produce a NetHack binary with curses support via PDCurses — statically-linked of course.
How about my archive encryption tool, Enchive? The one that even works with 16-bit DOS compilers. It requires nothing special at all!
$ make
w64devkit can also host parts of itself: Universal Ctags, Vim, and NASM. This means you can modify and recompile these tools without going through the Docker build. Sadly busybox-w32 cannot host itself, though it’s close. I’d love if w64devkit could fully host itself, and so Docker — and therefore an internet connection and such — would only be needed to bootstrap, but unfortunately that’s not realistic given the state of the GNU components.
Software development has increasingly become dependent on a constant internet connection. Robust, offline tooling and development is undervalued.
Consider: Does your current project depend on an external service? Do you pay for this service to ensure that it remains up? If you pull your dependencies from a repository, how much do you trust those who maintain the packages? Do you even know their names? What would be your project’s fate if that service went down permanently? It will someday, though hopefully only after your project is dead and forgotten. If you have the ability to work permanently offline, then you already have happy answers to all these questions.
The usual way to work around the lack of operating system support for a particular asynchronous operation is to dedicate threads to waiting on those operations. By using a thread pool, we can even avoid the overhead of spawning threads when we need them. Plus asyncio is designed to play nicely with thread pools anyway.
Before we get started, we’ll need some way to test that it’s working. We
need a slow file system. One thought is to use ptrace to intercept the
relevant system calls, though this isn’t quite so simple. The
other threads need to continue running while the thread waiting on open(2) is paused, but ptrace pauses the whole process. Fortunately there’s a simpler solution anyway: LD_PRELOAD.
Setting the LD_PRELOAD
environment variable to the name of a shared
object will cause the loader to load this shared object ahead of
everything else, allowing that shared object to override other
libraries. I’m on x86-64 Linux (Debian), and so I’m looking to override open64(2) in glibc. Here’s my open64.c:
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <unistd.h>
int
open64(const char *path, int flags, int mode)
{
if (!strncmp(path, "/tmp/", 5)) {
sleep(3);
}
int (*f)(const char *, int, int) = dlsym(RTLD_NEXT, "open64");
return f(path, flags, mode);
}
Now Python must go through my C function when it opens files. If the file resides under /tmp/, opening the file will be delayed by 3 seconds. Since I still want to actually open a file, I use dlsym() to access the real open64() in glibc. I build it like so:
$ cc -shared -fPIC -o open64.so open64.c -ldl
And to test that it works with Python, let’s time how long it takes to open /tmp/x:
$ touch /tmp/x
$ time LD_PRELOAD=./open64.so python3 -c 'open("/tmp/x")'
real 0m3.021s
user 0m0.014s
sys 0m0.005s
Perfect! (Note: It’s a little strange putting time before setting the environment variable, but that’s because I’m using Bash, and its time is special since this is the shell’s version of the command.)
Python’s standard open() is most commonly used as a context manager so that the file is automatically closed no matter what happens.
with open('output.txt', 'w') as out:
    print('hello world', file=out)
I’d like my asynchronous open to follow this pattern using async with. It’s like with, but the context manager is acquired and released asynchronously. I’ll call my version aopen():
async with aopen('output.txt', 'w') as out:
    ...
So aopen() will need to return an asynchronous context manager, an object with methods __aenter__ and __aexit__ that both return awaitables. Usually this is by virtue of these methods being coroutine functions, but a normal function that directly returns an awaitable also works, which is what I’ll be doing for __aenter__.
class _AsyncOpen():
    def __init__(self, args, kwargs):
        ...
    def __aenter__(self):
        ...
    async def __aexit__(self, exc_type, exc, tb):
        ...
Ultimately we have to call open(). The arguments for open() will be given to the constructor to be used later. This will make more sense when you see the definition for aopen().
def __init__(self, args, kwargs):
    self._args = args
    self._kwargs = kwargs
When it’s time to actually open the file, Python will call __aenter__. We can’t call open() directly since that will block, so we’ll use a thread pool to wait on it. Rather than create a thread pool, we’ll use the one that comes with the current event loop. The run_in_executor() method runs a function in a thread pool — where None means use the default pool — returning an asyncio future representing the future result, in this case the opened file object.
def __aenter__(self):
    def thread_open():
        return open(*self._args, **self._kwargs)
    loop = asyncio.get_event_loop()
    self._future = loop.run_in_executor(None, thread_open)
    return self._future
Since this __aenter__ is not a coroutine function, it returns the future directly as its awaitable result. The caller will await it.
The default thread pool is limited to one thread per core, which I suppose is the most obvious choice, though not ideal here. That’s fine for CPU-bound operations but not for I/O-bound operations. In a real program we may want to use a larger thread pool.
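For instance (a sketch, not part of the program above; the worker count is arbitrary), a larger pool can be installed as the loop's default executor so that the run_in_executor(None, ...) calls pick it up:

import asyncio
import concurrent.futures

# Sketch: a larger, dedicated pool for I/O-bound file operations.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=64)

async def main():
    loop = asyncio.get_running_loop()
    loop.set_default_executor(pool)  # used by run_in_executor(None, ...)
    ...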
Closing a file may block, so we’ll do that in a thread pool as well. First pull the file object from the future, then close it in the thread pool, waiting until the file has actually closed:
async def __aexit__(self, exc_type, exc, tb):
    file = await self._future
    def thread_close():
        file.close()
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, thread_close)
The open and close are paired in this context manager, but it may be
concurrent with an arbitrary number of other _AsyncOpen context managers. There will be some upper limit to the number of open files, so
we need to be careful not to use too many of these things
concurrently, something which easily happens when using unbounded
queues. Lacking back pressure, all it takes is for tasks to be
opening files slightly faster than they close them.
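One easy form of back pressure is a semaphore capping how many opens are in flight at once. A sketch (the limit of 128 is arbitrary, and the task name is only illustrative):

import asyncio

open_limit = asyncio.Semaphore(128)

async def bounded_write(i):
    async with open_limit:                       # back pressure: at most 128 open files
        async with aopen(f'/tmp/{i}', 'w') as out:
            print(i, file=out)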
With all the hard work done, the definition for aopen() is trivial:
def aopen(*args, **kwargs):
    return _AsyncOpen(args, kwargs)
That’s it! Let’s try it out with the LD_PRELOAD test.
First define a “heartbeat” task that will tell us the asyncio loop is still chugging away while we wait on opening the file.
async def heartbeat():
    while True:
        await asyncio.sleep(0.5)
        print('HEARTBEAT')
Here’s a test function for aopen() that asynchronously opens a file under /tmp/ named by an integer, (synchronously) writes that integer to the file, then asynchronously closes it.
async def write(i):
    async with aopen(f'/tmp/{i}', 'w') as out:
        print(i, file=out)
The main() function creates the heartbeat task and opens 4 files concurrently through the intercepted file opening routine:
async def main():
    beat = asyncio.create_task(heartbeat())
    tasks = [asyncio.create_task(write(i)) for i in range(4)]
    await asyncio.gather(*tasks)
    beat.cancel()

asyncio.run(main())
The result:
$ LD_PRELOAD=./open64.so python3 aopen.py
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
HEARTBEAT
$ cat /tmp/{1,2,3,4}
1
2
3
4
As expected, 6 heartbeats corresponding to the 3 seconds that all 4 tasks spent concurrently waiting on the intercepted open(). Here’s the full source if you want to try it out for yourself:
https://gist.github.com/skeeto/89af673a0a0d24de32ad19ee505c8dbd
Only opening and closing the file is asynchronous. Reads and writes are unchanged, still fully synchronous and blocking, so this is only a half solution. A full solution is not nearly as simple because asyncio is async/await. Asynchronous reads and writes would require all new APIs with different coloring. You’d need an aprint() to complement print(), and so on, each returning an awaitable to be awaited.
This is one of the unfortunate downsides of async/await. I strongly prefer conventional, preemptive concurrency, but we don’t always have that luxury.
Command line interfaces have varied throughout their brief history but have largely converged to some common, sound conventions. The core originates from unix, and the Linux ecosystem extended it, particularly via the GNU project. Unfortunately some tools initially appear to follow the conventions, but subtly get them wrong, usually for no practical benefit. I believe in many cases the authors simply didn’t know any better, so I’d like to review the conventions.
The simplest case is the short option flag. An option is a hyphen — specifically HYPHEN-MINUS U+002D — followed by one alphanumeric character. Capital letters are acceptable. The letters themselves have conventional meanings and are worth following if possible.
program -a -b -c
Flags can be grouped together into one program argument. This is both convenient and unambiguous. It’s also one of those often missed details when programs use hand-coded argument parsers, and the lack of support irritates me.
program -abc
program -acb
The next simplest case are short options that take arguments. The argument follows the option.
program -i input.txt -o output.txt
The space is optional, so the option and argument can be packed together into one program argument. Since the argument is required, this is still unambiguous. This is another often-missed feature in hand-coded parsers.
program -iinput.txt -ooutput.txt
This does not prohibit grouping. When grouped, the option accepting an argument must be last.
program -abco output.txt
program -abcooutput.txt
This technique is used to create another category, optional option arguments. The option’s argument can be optional but still unambiguous so long as the space is always omitted when the argument is present.
program -c # omitted
program -cblue # provided
program -c blue # omitted (blue is a new argument)
program -c -x # two separate flags
program -c-x # -c with argument "-x"
Optional option arguments should be used judiciously since they can be surprising, but they have their uses.
Options can typically appear in any order — something parsers often achieve via permutation — but non-options typically follow options.
program -a -b foo bar
program -b -a foo bar
GNU-style programs usually allow options and non-options to be mixed, though I don’t consider this to be essential.
program -a foo -b bar
program foo -a -b bar
program foo bar -a -b
If a non-option looks like an option because it starts with a hyphen, use -- to demarcate options from non-options.
program -a -b -- -x foo bar
An advantage of requiring that non-options follow options is that the first non-option demarcates the two groups, so -- is less often needed.
# note: without argument permutation
program -a -b foo -x bar # 2 options, 3 non-options
Since short options can be cryptic, and there are such a limited number of them, more complex programs support long options. A long option starts with two hyphens followed by one or more alphanumeric, lowercase words. Hyphens separate words. Using two hyphens prevents long options from being confused for grouped short options.
program --reverse --ignore-backups
Occasionally flags are paired with a mutually exclusive inverse flag that begins with --no-. This avoids a future flag day where the default is changed in the release that also adds the flag implementing the original behavior.
program --sort
program --no-sort
Long options can similarly accept arguments.
program --output output.txt --block-size 1024
These may optionally be connected to the argument with an equals sign =, much like omitting the space for a short option argument.
program --output=output.txt --block-size=1024
Like before, this opens the door to optional option arguments. Due to the required = this is still unambiguous.
program --color --reverse
program --color=never --reverse
The -- retains its original behavior of disambiguating option-like non-option arguments:
program --reverse -- --foo bar
Some programs, such as Git, have subcommands each with their own options. The main program itself may still have its own options distinct from subcommand options. The program’s options come before the subcommand and subcommand options follow the subcommand. Options are never permuted around the subcommand.
program -a -b -c subcommand -x -y -z
program -abc subcommand -xyz
Above, the -a, -b, and -c options are for program, and the others are for subcommand. So, really, the subcommand is another command line of its own.
There’s little excuse for getting these conventions wrong, assuming you’re interested in following them. Short options can be parsed correctly in just ~60 lines of C code. Long options are just slightly more complex.
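To make that concrete, here’s a rough sketch of such a parser for short options only (flags -a and -b, plus -o taking an argument); it’s illustrative, not a library:

#include <stdio.h>
#include <string.h>

/* Sketch: flags -a and -b, and -o which takes an argument. Handles
 * grouping (-ab) and attached arguments (-oARG), but not long options. */
int main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++) {
        char *arg = argv[i];
        if (arg[0] != '-' || !arg[1]) {
            break;                              /* first non-option */
        }
        if (!strcmp(arg, "--")) {
            i++;                                /* explicit end of options */
            break;
        }
        for (char *p = arg + 1; *p; p++) {
            switch (*p) {
            case 'a': puts("flag: -a"); break;
            case 'b': puts("flag: -b"); break;
            case 'o': {
                char *optarg = p[1] ? p + 1 : argv[++i];
                if (!optarg) {
                    fprintf(stderr, "missing argument for -o\n");
                    return 1;
                }
                printf("option: -o %s\n", optarg);
                p = 0;                          /* argument consumed the rest */
                break;
            }
            default:
                fprintf(stderr, "unknown option: -%c\n", *p);
                return 1;
            }
            if (!p) {
                break;                          /* -o ended this group */
            }
        }
    }
    for (; i < argc; i++) {
        printf("non-option: %s\n", argv[i]);
    }
    return 0;
}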
GNU’s getopt_long() supports long option abbreviation — with no way to disable it (!) — but this should be avoided.
Go’s flag package intentionally deviates from the conventions.
It only supports long option semantics, via a single hyphen. This makes
it impossible to support grouping even if all options are only one
letter. Also, the only way to combine option and argument into a single
command line argument is with =. It’s sound, but I miss both features
every time I write programs in Go. That’s why I wrote my own argument
parser. Not only does it have a nicer feature set, I like the API a
lot more, too.
Python’s primary option parsing library is argparse, and I just can’t stand it. Despite appearing to follow convention, it actually breaks convention and its behavior is unsound. For instance, the following program has two options, --foo and --bar. The --foo option accepts an optional argument, and the --bar option is a simple flag.
import argparse
import sys
parser = argparse.ArgumentParser()
parser.add_argument('--foo', type=str, nargs='?', default='X')
parser.add_argument('--bar', action='store_true')
print(parser.parse_args(sys.argv[1:]))
Here are some example runs:
$ python parse.py
Namespace(bar=False, foo='X')
$ python parse.py --foo
Namespace(bar=False, foo=None)
$ python parse.py --foo=arg
Namespace(bar=False, foo='arg')
$ python parse.py --bar --foo
Namespace(bar=True, foo=None)
$ python parse.py --foo arg
Namespace(bar=False, foo='arg')
Everything looks good except the last. If the --foo argument is optional then why did it consume arg? What happens if I follow it with --bar? Will it consume it as the argument?
$ python parse.py --foo --bar
Namespace(bar=True, foo=None)
Nope! Unlike arg, it left --bar alone, so instead of following the unambiguous conventions, it has its own ambiguous semantics and attempts to remedy them with a “smart” heuristic: “If an optional argument looks like an option, then it must be an option!” Non-option arguments can never follow an option with an optional argument, which makes that feature pretty useless. Since argparse does not properly support --, that does not help.
$ python parse.py --foo -- arg
usage: parse.py [-h] [--foo [FOO]] [--bar]
parse.py: error: unrecognized arguments: -- arg
Please, stick to the conventions unless you have really good reasons to break them!
Each project is in a ready-to-run state: compile, then run with the output piped into a media player or a video encoder. The header includes the exact commands you need. Since that’s probably inconvenient for most readers, I’ve included a pre-recorded sample of each. Though in a few cases, especially those displaying random data, video encoding really takes something away from the final result, and it may be worth running yourself.
The projects are not in any particular order.
Source: randu.c
This is a little demonstration of the poor quality of the RANDU pseudorandom number generator. Note how the source embeds a monospace font so that it can render the text in the corner. For the 3D effect, it includes an orthographic projection function. This function will appear again later since I tend to cannibalize my own projects.
Source: colorsort.c
The original idea came from an old reddit post.
Source: animaze.c
This effect was invented by my current mentee student while
working on maze / dungeon generation late last year. This particular
animation is my own implementation. It outputs Netpbm by default, but,
for both fun and practice, also includes an entire implementation in
OpenGL. It’s enabled at compile time with -DENABLE_GL so long as you have GLFW and GLEW (even on Windows!).
Source: rooks.c
I wanted to watch an animated solution to the sliding rooks puzzle. This program solves the puzzle using a bitboard, then animates the solution. The rook images are embedded in the program, compressed using a custom run-length encoding (RLE) scheme with a tiny palette.
Source: magnet.c
My own animation of Glauber’s dynamics using a totally unoriginal color palette.
Source: fire.c
This is the classic Doom fire animation. I later implemented it in WebGL with a modified algorithm.
Source: mtvisualize.c
A visualization of the Mersenne Twister pseudorandom number generator. Not terribly interesting, so I almost didn’t include it.
Source: pixelsort.c
Another animation inspired by a reddit post. Starting from the top-left corner, swap the current pixel to the one most like its neighbors.
Source: walkers.c
Another reproduction of a reddit post. This is recent enough that I’m using a disposable LCG.
Source: voronoi.c
Another reddit post, though I think my version looks a lot nicer. I like to play this one over and over on repeat with different seeds.
Source: walk3d.c
Another personal take (read: stolen idea) on a reddit post. This
features the orthographic projection function from the RANDU animation.
Video encoding makes a real mess of this one, and I couldn’t work out
encoding options to make it look nice, so this one looks a lot better
“in person.”
Source: lorenz.c
A 3D animation I adapted from the 3D random walk above, meaning it uses the same orthographic projection. I have a WebGL version of this one, but I like that I could do this in such a small amount of code and without an existing rendering engine. Like before, this is really damaged by video encoding and is best seen live.
Bonus: I made an obfuscated version just to show how small this can get!
As a computer engineer, my job is to use computers to solve important problems. Ideally my solutions will be efficient, and typically that means making the best use of the resources at hand. Quite often these resources are machines running Windows and, despite my misgivings about the platform, there is much to be gained by properly and effectively leveraging it.
Sometimes targeting Windows while working from another platform is sufficient, but other times I must work on the platform itself. There are various options available for C development, and I’ve finally formalized my own development kit: w64devkit.
For most users, the value is in the 78MiB .zip available in the “Releases” on GitHub. This (relatively) small package includes a state-of-the-art C and C++ compiler (latest GCC), a powerful text editor, debugger, a complete x86 assembler, and miniature unix environment. It’s “portable” in that there’s no installation. Just unzip it and start using it in place. With w64devkit, it literally takes a few seconds on any Windows to get up and running with a fully-featured, fully-equipped, first-class development environment.
The development kit is cross-compiled entirely from source using Docker, though Docker is not needed to actually use it. The repository is just a Dockerfile and some documentation. The only build dependency is Docker itself. It’s also easy to customize it for your own personal use, or to audit and build your own if, for whatever reason, you didn’t trust my distribution. This is in stark contrast to Windows builds of most open source software where the build process is typically undocumented, under-documented, obtuse, or very complicated.
Publishing this is not necessarily a commitment to always keep w64devkit up to date, but this Dockerfile is derived from (and replaces) a shell script I’ve been using continuously for over two years now. In this period, every time GCC has made a release, I’ve built myself a new development kit, so I’m already in the habit.
I’ve been using Docker on and off for about 18 months now. It’s an oddball in that it’s something I learned on the job rather than my own time. I formed an early impression that still basically holds: The main purpose of Docker is to contain and isolate misbehaved software to improve its reliability. Well-behaved, well-designed software benefits little from containers.
My unusual application of Docker here is no exception. Most software
builds are needlessly complicated and fragile, especially
Autoconf-based builds. Ironically, the worst configure scripts I’ve
dealt with come from GNU projects. They waste time on superfluous checks
(“Does your compiler define size_t
?”) then produce a build that
doesn’t work anyway because you’re doing something slightly unusual.
Worst of all, despite my best efforts, the build will be contaminated by
the state of the system doing the build.
My original build script was fragile by extension. It would work on one system, but not another due to some subtle environment change — a slightly different system header that reveals a build system bug (example in GCC), or the system doesn’t have a file at a certain hard-coded absolute path that shouldn’t be hard-coded. Converting my script to a Dockerfile locks these problems in place and makes builds much more reliable and repeatable. The misbehavior is contained and isolated by Docker.
Unfortunately it’s not completely contained. In each case I use make’s -j option to parallelize the build since otherwise it would take
hours. Some of the builds have subtle race conditions, and some bad luck
in timing can cause a build to fail. Docker is good about picking up
where it left off, so it’s just a matter of trying again.
In one case a build failed because Bison and flex were not installed
even though they’re not normally needed. Some dependency isn’t expressed
correctly, and unlucky ordering leads to an unused .y file having the
wrong timestamp. Ugh. I’ve had this happen a lot more in Docker than
out, probably because file system operations are slow inside Docker and
it creates greater timing variance.
The README explains some of my decisions, but I’ll summarize a few here:
Git. Important and useful, so I’d love to have it. But it has a weird installation (many .zip-unfriendly symlinks) tightly-coupled with msys2, and its build system does not support cross-compilation. I’d love to see a clean, straightforward rewrite of Git in a single, appropriate implementation language. Imagine installing the latest Git with go get git-scm.com/git. (Update: libgit2 is working on it!)
Bash. It’s a much nicer interactive shell than BusyBox-w32 ash. But the build system doesn’t support cross-compilation, and I’m not sure it supports Windows without some sort of compatibility layer anyway.
Emacs. Another powerful editor. But the build system doesn’t support cross-compilation. It’s also way too big.
Go. Tempting to toss it in, but Go already does this all correctly and effectively. It simply doesn’t require a specialized distribution. It’s trivial to manage a complete Go toolchain with nothing but Go itself on any system. People may say its language design comes from the 1970s, but the tooling is decades ahead of everyone else.
For a long, long time Cygwin filled this role for me. However, I never liked its bulky nature, the complete opposite of portable. Cygwin processes always felt second-class on Windows, particularly in that it has its own view of the file system compared to other Windows processes. They could never fully cooperate. I also don’t like that there’s no toolchain for cross-compiling with Cygwin as a target — e.g. compile Cygwin binaries from Linux. Finally it’s been essentially obsoleted by WSL which matches or surpasses it on every front.
There’s msys and msys2, which are a bit lighter. However, I’m still in an isolated, second-class environment with weird path translation issues. These tools do have important uses, and it’s the only way to compile most open source software natively on Windows. For those builds that don’t support cross-compilation, it’s the only path for producing Windows builds. It’s just not what I’m looking for when developing my own software.
Update: llvm-mingw is an eerily similar project using Docker the same way, but instead builds LLVM.
I also converted my GnuPG build script to a Dockerfile. Of course I don’t plan to actually use GnuPG on Windows. I just need it for passphrase2pgp, which I test against GnuPG. This tests the Windows build.
In the future I may extend this idea to a few other tools I don’t intend to include with w64devkit. If you have something in mind, you could use my Dockerfiles as a kind of starter template.
Suppose you’re writing a command line program that prompts the user for a password or passphrase, and Windows is one of the supported platforms (even very old versions). This program uses UTF-8 for its string representation, as it should, and so ideally it receives the password from the user encoded as UTF-8. On most platforms this is, for the most part, automatic. However, on Windows finding the correct answer to this problem is a maze where all the signs lead towards dead ends. I recently navigated this maze and found the way out.
I knew it was possible because my passphrase2pgp tool has been using the golang.org/x/crypto/ssh/terminal package, which gets it very nearly perfect. Though they were still fixing subtle bugs as recently as 6 months ago.
The first step is to ignore just everything you find online, because it’s either wrong or it’s solving a slightly different problem. I’ll discuss the dead ends later and focus on the solution first. Ultimately I want to implement this on Windows:
// Display prompt then read zero-terminated, UTF-8 password.
// Return password length with terminator, or zero on error.
int read_password(char *buf, int len, const char *prompt);
I chose int for the length rather than size_t because it’s a password and should not even approach INT_MAX.
For the impatient: complete, working, ready-to-use example
On a unix-like system, the program would:
1. open(2) the special /dev/tty file for reading and writing
2. write(2) the prompt
3. tcgetattr(3) and tcsetattr(3) to disable ECHO
4. read(2) a line of input
5. tcsetattr(3) to restore the original terminal settings
6. close(2) the file
Despite some tempting shortcuts that don’t work, the steps on Windows are basically the same but with different names. There are a couple subtleties and extra steps. I’ll be ignoring errors in my code snippets below, but the complete example has full error handling.
Instead of /dev/tty, the program opens two files: CONIN$ and CONOUT$ using CreateFileA(). Note: The “A” stands for ANSI, as opposed to “W” for wide (Unicode). This refers to the encoding of the file name, not to how the file contents are encoded. CONIN$ is opened for both reading and writing because write permissions are needed to change the console’s mode.
HANDLE hi = CreateFileA(
"CONIN$",
GENERIC_READ | GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0
);
HANDLE ho = CreateFileA(
"CONOUT$",
GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0
);
To write the prompt, call WriteConsoleA() on the output handle. On its own, this assumes the prompt is plain ASCII (i.e. "password: "), not UTF-8 (i.e. "contraseña: "):
WriteConsoleA(ho, prompt, strlen(prompt), 0, 0);
If the prompt may contain UTF-8 data, perhaps because it displays a username or isn’t in English, you have two options:
1. Convert the prompt to UTF-16 and call WriteConsoleW() instead.
2. Call SetConsoleOutputCP() with CP_UTF8 (65001). This is a global (to the console) setting and should be restored when done.
Next use GetConsoleMode() and SetConsoleMode() to disable echo. The console usually has ENABLE_PROCESSED_INPUT already set, which tells the console to handle CTRL-C and such, but I set it explicitly just in case. I also set ENABLE_LINE_INPUT so that the user can use backspace and so that the entire line is delivered at once.
DWORD orig = 0;
GetConsoleMode(hi, &orig);
DWORD mode = orig;
mode |= ENABLE_PROCESSED_INPUT;
mode &= ~ENABLE_ECHO_INPUT;
SetConsoleMode(hi, mode);
There are reports that ENABLE_LINE_INPUT limits reads to 254 bytes, but I was unable to reproduce it. My full example can read huge passwords without trouble.
The old mode is saved in orig so that it can be restored later.
Here’s where you have to pay the piper. As of the date of this article, the Windows API offers no method for reading UTF-8 input from the console. Give up on that hope now. If you use the “ANSI” functions to read input under any configuration, they will do the usual Windows thing of silently mangling your input.
So you must use the UTF-16 API, ReadConsoleW(), and then encode it yourself. Fortunately Win32 provides a UTF-8 encoder, WideCharToMultiByte(), which will even handle surrogate pairs for all those people who like putting PILE OF POO (U+1F4A9) in their passwords:
SIZE_T wbuf_len = (len - 1 + 2)*sizeof(*wbuf);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread;
ReadConsoleW(hi, wbuf, len - 1 + 2, &nread, 0);
wbuf[nread-2] = 0; // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0);
SecureZeroMemory(wbuf, wbuf_len);
HeapFree(GetProcessHeap(), 0, wbuf);
I use SecureZeroMemory() to erase the UTF-16 version of the password before freeing the buffer. The + 2 in the allocation is for the CRLF line ending that will later be chopped off. The error handling version checks that the input did indeed end with CRLF. Otherwise it was truncated (too long).
Finally print a newline since the user-typed one wasn’t echoed, restore the old console mode, close the console handles, and return the final encoded length:
WriteConsoleA(ho, "\n", 1, 0, 0);
SetConsoleMode(hi, orig);
CloseHandle(ho);
CloseHandle(hi);
return r;
The error checking version doesn’t check for errors from any of these functions since either they cannot fail, or there’s nothing reasonable to do in the event of an error.
If you look around the Win32 API you might notice SetConsoleCP(). A reasonable person might think that setting the “code page” to UTF-8 (CP_UTF8) might configure the console to encode input in UTF-8. The good news is Windows will no longer mangle your input as before. The bad news is that it will be mangled differently.
You might think you can use the CRT function _setmode() with _O_U8TEXT on the FILE * connected to the console. This does nothing useful. (The only use for _setmode() is with _O_BINARY, to disable braindead character translation on standard input and output.) The best you’ll be able to do with the CRT is the same sort of wide character read using non-standard functions, followed by conversion to UTF-8.
CredUICmdLinePromptForCredentials() promises to be both a mouthful of a function name and a prepacked solution to this problem. It only delivers on the first. This function seems to have broken some time ago and nobody at Microsoft noticed — probably because nobody has ever used this function. I couldn’t find a working example, nor a use in any real application. When I tried to use it, I got a nonsense error code; it never worked. There’s a GUI version of this function that does work, and it’s a viable alternative for certain situations, though not mine.
At my most desperate, I hoped ENABLE_VIRTUAL_TERMINAL_PROCESSING would be a magical switch. On Windows 10 it magically enables some ANSI escape sequences. The documentation in no way suggests it would work, and I confirmed by experimentation that it does not. Pity.
I spent a lot of time searching down these dead ends until finally settling on ReadConsoleW() above. I hoped it would be more automatic, but I’m glad I have at least some solution figured out.
I’ve noticed a small pattern across a few of my projects where I had vectorized and parallelized some code. The original algorithm had a “push” approach; the optimized version instead took a “pull” approach. In this article I’ll describe what I mean, though it’s mostly just so I can show off some pretty videos, pictures, and demos.
A good place to start is the Abelian sandpile model, which, like many before me, completely captured my attention for a while. It’s a cellular automaton where each cell is a pile of grains of sand — a sandpile. At each step, any sandpile with four or more grains of sand spills one grain into each of its four 4-connected neighbors, regardless of the number of grains in those neighboring cells. Cells at the edge spill their grains into oblivion, and those grains no longer exist.
With excess sand falling over the edge, the model eventually hits a stable state where all piles have three or fewer grains. However, until it reaches stability, all sorts of interesting patterns ripple though the cellular automaton. In certain cases, the final pattern itself is beautiful and interesting.
Numberphile has a great video describing how to form a group over recurrent configurations (also). In short, for any given grid size, there’s a stable identity configuration that, when “added” to any other element in the group will stabilize back to that element. The identity configuration is a fractal itself, and has been a focus of study on its own.
Computing the identity configuration is really just about running the simulation to completion a couple times from certain starting configurations. Here’s an animation of the process for computing the 64x64 identity configuration:
As a fractal, the larger the grid, the more self-similar patterns there are to observe. There are lots of samples online, and the biggest I could find was this 3000x3000 on Wikimedia Commons. But I wanted to see one that’s even bigger, damnit! So, skipping to the end, I eventually computed this 10000x10000 identity configuration:
This took 10 days to compute using my optimized implementation:
https://github.com/skeeto/scratch/blob/master/animation/sandpiles.c
I picked an algorithm described in a code golf challenge:
f(ones(n)*6 - f(ones(n)*6))
Where f() is the function that runs the simulation to a stable state.
I used OpenMP to parallelize across cores, and SIMD to parallelize within a thread. Each thread operates on 32 sandpiles at a time. To compute the identity sandpile, each sandpile only needs 3 bits of state, so this could potentially be increased to 85 sandpiles at a time on the same hardware. The output format is my old mainstay, Netpbm, including the video output.
So, what do I mean about pushing and pulling? The naive approach to simulating sandpiles looks like this:
for each i in sandpiles {
if input[i] < 4 {
output[i] = input[i]
} else {
output[i] = input[i] - 4
for each j in neighbors {
output[j] = output[j] + 1
}
}
}
As the algorithm examines each cell, it pushes results into neighboring cells. If we’re using concurrency, that means multiple threads of execution may be mutating the same cell, which requires synchronization — locks, atomics, etc. That much synchronization is the death knell of performance. The threads will spend all their time contending for the same resources, even if it’s just false sharing.
The solution is to pull grains from neighbors:
for each i in sandpiles {
if input[i] < 4 {
output[i] = input[i]
} else {
output[i] = input[i] - 4
}
for each j in neighbors {
if input[j] >= 4 {
output[i] = output[i] + 1
}
}
}
Each thread only modifies one cell — the cell it’s in charge of updating — so no synchronization is necessary. It’s shader-friendly and should sound familiar if you’ve seen my WebGL implementation of Conway’s Game of Life. It’s essentially the same algorithm. If you chase down the various Abelian sandpile references online, you’ll eventually come across a 2017 paper by Cameron Fish about running sandpile simulations on GPUs. He cites my WebGL Game of Life article, bringing everything full circle. We had spoken by email at the time, and he shared his interactive simulation with me.
Vectorizing this algorithm is straightforward: Load multiple piles at once, one per SIMD channel, and use masks to implement the branches. In my code I’ve also unrolled the loop. To avoid bounds checking in the SIMD code, I pad the state data structure with zeros so that the edge cells have static neighbors and are no longer special.
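Stripped of the SIMD and unrolling, the pull-style update amounts to something like this sketch, where the grid is padded with a permanently zero border so the interior needs no bounds checks:

/* One pull-style step over a (w+2) x (h+2) grid whose outer ring stays
 * zero, so interior cells never need bounds checks. in and out must not
 * alias. */
static void step(const unsigned char *in, unsigned char *out, int w, int h)
{
    int pitch = w + 2;
    for (int y = 1; y <= h; y++) {
        for (int x = 1; x <= w; x++) {
            int i = y*pitch + x;
            int v = in[i] >= 4 ? in[i] - 4 : in[i];
            v += in[i - 1]     >= 4;   /* left  */
            v += in[i + 1]     >= 4;   /* right */
            v += in[i - pitch] >= 4;   /* above */
            v += in[i + pitch] >= 4;   /* below */
            out[i] = (unsigned char)v;
        }
    }
}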
Back in the old days, one of the cool graphics tricks was fire animations. It was so easy to implement on limited hardware. In fact, the most obvious way to compute it was directly in the framebuffer, such as in the VGA buffer, with no outside state.
There’s a heat source at the bottom of the screen, and the algorithm runs from bottom up, propagating that heat upwards randomly. Here’s the algorithm using traditional screen coordinates (top-left corner origin):
func rand(min, max) // random integer in [min, max]
for each x, y from bottom {
buf[y-1][x+rand(-1, 1)] = buf[y][x] - rand(0, 1)
}
As a push algorithm it works fine with a single-thread, but it doesn’t translate well to modern video hardware. So convert it to a pull algorithm!
for each x, y {
sx = x + rand(-1, 1)
sy = y + rand(1, 2)
output[y][x] = input[sy][sx] - rand(0, 1)
}
Cells pull the fire upward from the bottom. Though this time there’s a catch: This algorithm will have subtly different results.
In the original, there’s a single state buffer and so a flame could propagate upwards multiple times in a single pass. I’ve compensated here by allowing flames to propagate further at once.
In the original, a flame only propagates to one other cell. In this version, two cells might pull from the same flame, cloning it.
In the end it’s hard to tell the difference, so this works out.
There’s still potential contention in that rand() function, but this can be resolved with a hash function that takes x and y as inputs.
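For example (a sketch; the constants are arbitrary mixing constants), hash the coordinates and a frame counter and slice the random draws out of the result:

#include <stdint.h>

/* Sketch: a stateless per-cell "rand" derived from a hash of the cell
 * coordinates and the frame number, so parallel threads never share
 * generator state. */
static uint32_t cell_hash(uint32_t x, uint32_t y, uint32_t frame)
{
    uint32_t h = x*0x9e3779b1u ^ y*0x85ebca77u ^ frame*0xc2b2ae3du;
    h ^= h >> 16;  h *= 0x7feb352du;
    h ^= h >> 15;  h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

/* e.g. rand(-1, 1) becomes (int)(cell_hash(x, y, frame) % 3) - 1 */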
Linux has a mechanism like this, madvise(2), that allows processes to provide hints to the kernel on how memory is expected to be used. The flag of interest is MADV_FREE:
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page.
So, given this, I built a proof of concept / toy on top of MADV_FREE
that provides this functionality for Linux:
https://github.com/skeeto/purgeable
It allocates anonymous pages using mmap(2)
. When the allocation
is “unlocked” — i.e. the process isn’t actively using it — its pages are
marked with MADV_FREE
so that the kernel can reclaim them at any time.
To lock the allocation so that the process can safely make use of them,
the MADV_FREE
is canceled. This is all a little trickier than it sounds,
and that’s the subject of this article.
Note: There’s also MADV_DONTNEED
which seems like it would fit the
bill, but it’s implemented incorrectly in Linux. It immediately
frees the pages, and so it’s useless for implementing purgeable memory.
Before diving into the implementation, here’s the API. It’s just four
functions with no structure definitions. The pointer used by the
API is the memory allocation itself. All the bookkeeping associated
with that pointer is hidden away, out of sight from the API’s
consumer. The full documentation is in purgeable.h
.
void *purgeable_alloc(size_t);
void purgeable_unlock(void *);
void *purgeable_lock(void *);
void purgeable_free(void *);
The semantics are much like a C++ weak_ptr
in that locking both
validates that the allocation is still available and creates a “strong”
reference to it that prevents it from being purged. Though unlike a weak
reference, the allocation is stickier. It will remain until the system is
actually under pressure, not just when the garbage collector happens to
run or the last strong reference is gone.
Here’s how it might be used to, say, store decoded PNG data that can be decoded again if needed:
uint32_t *texture = 0;
struct png *png = png_load("texture.png");
if (!png) die();
/* ... */
for (;;) {
    if (!texture) {
        texture = purgeable_alloc(png->width * png->height * 4);
        if (!texture) die();
        png_decode_rgba(png, texture);
    } else if (!purgeable_lock(texture)) {
        purgeable_free(texture);
        texture = 0;
        continue;
    }
    glTexImage2D(
        GL_TEXTURE_2D, 0,
        GL_RGBA, png->width, png->height, 0,
        GL_RGBA, GL_UNSIGNED_BYTE, texture
    );
    purgeable_unlock(texture);
    break;
}
Memory is allocated in a locked state since it’s very likely to be
immediately filled with data. The application should unlock it before
moving on with other tasks. The purgeable memory must always be freed
using purgeable_free()
, even if purgeable_lock()
failed. This not only
frees the bookkeeping, but also releases the now-zero pages and the
mapping itself. Originally I had purgeable_lock()
free the purgeable
memory on failure, but I felt this was clearer. There’s no technical
reason it couldn’t, though.
The main challenge is that the kernel doesn’t necessarily treat the
MADV_FREE
range contiguously. It might reclaim just some pages, and do
so in an arbitrary order. In order to lock the region, each page must be
handled individually. Per the man page quoted above, reversing
MADV_FREE
requires a write to each page — to either trigger a page
fault or set a dirty bit.
The only way to tell if a page has been purged is to check if it’s been filled with zeros. That’s easy if we’re sure a particular byte in the page should be zero, but, since this is a library, the caller might just store anything on these pages.
So here’s my solution: To unlock a page, look at the first byte on the
page. Remember whether or not it’s zero. If it’s zero, write a 1 into
that byte. Once this has been done for all pages, use madvise(2)
to
mark them all MADV_FREE
.
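A minimal sketch of that unlock pass, assuming one bookkeeping bit per page; the helper name and layout are illustrative, not the library's actual internals:
#include <stddef.h>
#include <sys/mman.h>
static void
unlock_pages(unsigned char *base, size_t numpages, size_t pagesize,
             unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = base + i*pagesize;
        if (*first) {
            bits[i/8] &= ~(1u << (i%8));  /* remember: was non-zero */
        } else {
            bits[i/8] |= 1u << (i%8);     /* remember: was zero */
            *first = 1;                   /* make the page detectably non-zero */
        }
    }
    madvise(base, numpages*pagesize, MADV_FREE);
}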
With this approach, the library only needs to track one bit of information per page regardless of the page’s contents. Assuming 4kB pages, each 32kB of allocation has 1 byte of overhead (amortized) — or ~0.003% overhead. Not too bad!
Locking purgeable memory is a little trickier. Again, each page must be visited in turn, and if any page was purged, then the whole allocation is considered lost. If the first byte was non-zero when unlocking, the library checks that it’s still non-zero. If the first byte was zero when unlocking, then it prepares to write a zero back into that byte, which must currently be non-zero.
In either case, the MADV_FREE
needs to be canceled using a write, so
the library does an atomic compare-and-swap (CAS) to write the
correct byte into the page, even if it’s the same value in the
non-zero case. The atomic CAS is essential because it ensures the page
wasn’t purged between the check and the write, as both are done
together, atomically. If every page has the expected first byte, and
every CAS succeeded, then the purgeable memory has been successfully
locked.
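And a sketch of the corresponding lock pass using the GCC/Clang atomic builtins, under the same illustrative bookkeeping as above (returns 0 if any page was purged):
#include <stddef.h>
static int
lock_pages(unsigned char *base, size_t numpages, size_t pagesize,
           const unsigned char *bits)
{
    for (size_t i = 0; i < numpages; i++) {
        unsigned char *first = base + i*pagesize;
        unsigned char expect, desire;
        if (bits[i/8] >> (i%8) & 1) {
            expect = 1;        /* we wrote a 1 at unlock time */
            desire = 0;        /* restore the original zero */
        } else {
            expect = *first;   /* must still be non-zero */
            desire = expect;   /* write it back just to dirty the page */
            if (!expect) {
                return 0;      /* purged: byte became zero */
            }
        }
        if (!__atomic_compare_exchange_n(first, &expect, desire, 0,
                                         __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
            return 0;          /* purged between the check and the write */
        }
    }
    return 1;
}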
As an optimization, the library could consider more than just the first
byte, and look at, say, the first long int
on each page. The library
does less work when the page contains a non-zero value, and the chance of
an arbitrary 8-byte value being zero is much lower. However, I wanted to
avoid potential aliasing issues, especially if this library were
to be embedded, so I passed on the idea.
The bookkeeping data is stored just before the buffer returned as the
purgeable memory, and it’s never marked with MADV_FREE
. Assuming 4kB
pages, for each 128MB of purgeable memory the library allocates one extra
anonymous page to track it. The number of pages in the allocation is
stored just before the purgeable memory as a size_t
, and the rest is the
per-page bit table described above.
size_t *p = purgeable_alloc(1<<14);
size_t numpages = p[-1];
So the library can immediately find it starting from the purgeable memory address. Here’s an illustration:
      ,--- p
      |
      v
----------------------------------------------
|...Z|    |    |    |    |    |    |    |    |
----------------------------------------------
 ^  ^
 |  |
 |  `--- size_t numpages
 |
 `--- bit table
The downside is that buffer underflows in the application would easily
trample the numpages
value because it’s located immediately adjacent. It
would be safer to move it to the beginning of the first page before the
purgeable memory, but this would have made bit table access more
complicated. While the region is locked, the contents of the bit table
don’t matter, so an underflow trampling it does no harm. Another idea: put a
checksum alongside numpages
. It could just be a simple integer
hash.
This makes for a really slick API since the consumer doesn’t need to track anything more than a single pointer, the address of the purgeable memory allocation itself.
I’m not quite sure how often I’d actually use purgeable memory in real programs, especially in software intended to be portable. Each operating system needs its own implementation, and this library is not portable since it relies on interfaces and behaviors specific to Linux.
It also has a not-so-unlikely pathological case: Imagine a program that makes two purgeable memory allocations, and they’re large enough that one always evicts the other. The program would thrash back and forth fighting itself as it used each allocation. Detecting this situation might be difficult, especially as the number of purgeable memory allocations increases.
Regardless, it’s another tool for my software toolbelt.
The same advice also often applies to compilers.
Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of data. The simplest approach would be to do it a byte at a time:
/* XOR src into dst */
void
xor512a(void *dst, void *src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}
Maybe you benchmark it or you look at the assembly output, and the
results are disappointing. Your compiler did exactly what you asked
of it and produced code that performs 64 single-byte XOR operations
(GCC 9.2.0, x86-64, -Os
):
xor512a:
xor eax, eax
.L0: mov cl, [rsi+rax]
xor [rdi+rax], cl
inc rax
cmp rax, 64
jne .L0
ret
The target architecture has wide registers so it could be doing at least 8 bytes at a time. Since your compiler isn’t doing it, you decide to chunk the work into 8-byte blocks yourself. Here’s some real world code that does so:
/* WARNING: Broken, do not use! */
void
xor512b(void *dst, void *src)
{
    uint64_t *pd = dst;
    uint64_t *ps = src;
    for (int i = 0; i < 8; i++) {
        pd[i] ^= ps[i];
    }
}
You check the assembly output of this function, and it looks much better. It’s now processing 8 bytes at a time, so it should be about 8 times faster than before.
xor512b:
xor eax, eax
.L0: mov rcx, [rsi+rax*8]
xor [rdi+rax*8], rcx
inc rax
cmp rax, 8
jne .L0
ret
Still, this machine has 16-byte wide registers (SSE2 xmm
), so there
could be another doubling in speed. Oh well, this is good enough, so you
plug it into your program. But something strange happens: The output
is now wrong!
int
main(void)
{
    uint32_t dst[32] = {
        1, 2, 3, 4, 5, 6, 7, 8,
        9, 10, 11, 12, 13, 14, 15, 16
    };
    uint32_t src[32] = {
        1, 4, 9, 16, 25, 36, 49, 64,
        81, 100, 121, 144, 169, 196, 225, 256,
    };
    xor512b(dst, src);
    for (int i = 0; i < 16; i++) {
        printf("%d\n", (int)dst[i]);
    }
}
Your program prints 1..16 as if xor512b()
was never called. You check
over everything a dozen times, and you can’t find anything wrong. Even
crazier, if you disable optimizations then the bug goes away. It must be
some kind of compiler bug!
Investigating a bit more, you learn that the -fno-strict-aliasing
option also fixes the bug. That’s because this program violates C strict
aliasing rules. An array of uint32_t
was accessed as a uint64_t
. As
an important optimization, compilers are allowed to assume such
variables do not alias and generate code accordingly. Otherwise every
memory store could potentially modify any variable, which limits the
compiler’s ability to produce decent code.
The original version is fine because char *
, including both signed
and unsigned
, has a special exemption and may alias with anything. For
the same reason, using char *
unnecessarily can also make your
programs slower.
What could you do to keep the chunking operation while not running afoul
of strict aliasing? Counter-intuitively, you could use memcpy()
. Copy
the chunks into legitimate, local uint64_t
variables, do the work, and
copy the result back out.
void
xor512c(void *dst, void *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t buf[2];
        memcpy(buf + 0, (char *)dst + i*8, 8);
        memcpy(buf + 1, (char *)src + i*8, 8);
        buf[0] ^= buf[1];
        memcpy((char *)dst + i*8, buf, 8);
    }
}
Since memcpy()
is a built-in function, your compiler knows its
semantics and can ultimately elide all that copying. The assembly
listing for xor512c
is identical to xor512b
, but it won’t go haywire
when integrated into a real program.
It works and it’s correct, but you can still do much better than this!
The problem is you’re forcing the knife and not letting it do the work. There’s a constraint on your compiler that hasn’t been considered: It must work correctly for overlapping inputs.
char buf[74] = {...};
xor512a(buf, buf + 10);
In this situation, the byte-by-byte and chunked versions of the function will have different results. That’s exactly why your compiler can’t do the chunking operation itself. However, you don’t care about this situation because the inputs never overlap.
Let’s revisit the first, simple implementation, but this time being
smarter about it. The restrict
keyword indicates that the inputs
will not overlap, freeing your compiler of this unwanted concern.
void
xor512d(void *restrict dst, void *restrict src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}
(Side note: Adding restrict
to the manually chunked function,
xor512b()
, will not fix it. Using restrict
can never make an
incorrect program correct.)
Compiled with GCC 9.2.0 and -O3
, the resulting unrolled code
processes 16-byte chunks at a time (pxor
):
xor512d:
movdqu xmm0, [rdi+0x00]
movdqu xmm1, [rsi+0x00]
movdqu xmm2, [rsi+0x10]
movdqu xmm3, [rsi+0x20]
pxor xmm0, xmm1
movdqu xmm4, [rdi+0x30]
movups [rdi+0x00], xmm0
movdqu xmm0, [rdi+0x10]
pxor xmm0, xmm2
movups [rdi+0x10], xmm0
movdqu xmm0, [rdi+0x20]
pxor xmm0, xmm3
movups [rdi+0x20], xmm0
movdqu xmm0, [rsi+0x30]
pxor xmm0, xmm4
movups [rdi+0x30], xmm0
ret
Compiled with Clang 9.0.0 with AVX-512 enabled in the target
(-mavx512bw
), it does the entire operation in a single, big chunk!
xor512d:
vmovdqu64 zmm0, [rdi]
vpxorq zmm0, zmm0, [rsi]
vmovdqu64 [rdi], zmm0
vzeroupper
ret
“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.
/tmp
. For Emacs Lisp, the equivalent is the
*scratch*
buffer. These are places where I can make a mess, and the
mess usually gets cleaned up before it becomes a problem. A lot of my
established projects (ex.) start out in volatile storage and
only graduate to more permanent storage once the concept has proven
itself.
Throughout my whole career, this sort of throwaway experimentation has been an important part of my personal growth, and I try to encourage it in others. Even if the idea I’m trying doesn’t pan out, I usually learn something new, and occasionally it translates into an article here.
I also enjoy small programming challenges. One of the most abused tools in my mental toolbox is the Monte Carlo method, and I readily apply it to solve toy problems. Even beyond this, random number generators are frequently a useful tool (1, 2), so I find myself reaching for one all the time.
Nearly every programming language comes with a pseudo-random number generation function or library. Unfortunately the language’s standard PRNG is usually a poor choice (C, C++, C#, Go). It’s probably mediocre quality, slower than it needs to be (also), lacks reliable semantics or behavior between implementations, or is missing some other property I want. So I’ve long been a fan of BYOPRNG: Bring Your Own Pseudo-random Number Generator. Just embed a generator with the desired properties directly into the program. The best non-cryptographic PRNGs today are tiny and exceptionally friendly to embedding. Though, depending on what you’re doing, you might need to be creative about seeding.
On occasion I don’t have an established, embeddable PRNG in reach, and I have yet to commit xoshiro256** to memory. Or maybe I want to use a totally unique PRNG for a particular project. In these cases I make one up. With just a bit of know-how it’s not too difficult.
Probably the easiest decent PRNG to code from scratch is the venerable Linear Congruential Generator (LCG). It’s a simple recurrence relation:
x[1] = (x[0] * A + C) % M
That’s trivial to remember once you know the details. You only need to
choose appropriate values for A
, C
, and M
. Done correctly, it
will be a full-period generator — a generator that visits each of
the numbers between 0 and M - 1 exactly once, in some shuffled order,
before repeating. The seed — the value of x[0] — chooses a starting
position in this (looping) permutation.
M
has a natural, obvious choice: a power of two matching the range of
operands, such as 2^32 or 2^64. With this the modulo operation is free
as a natural side effect of the computer architecture.
Choosing C
also isn’t difficult. It must be co-prime with M
, and
since M
is a power of two, any odd number is valid. Even 1. In
theory choosing a small value like 1 is faster since the compiler
won’t need to embed a large integer in the code, but this difference
doesn’t show up in any micro-benchmarks I tried. If you want a cool,
unique generator, then choose a large random integer. More on that
below.
The tricky value is A
, and getting it right is the linchpin of the
whole LCG. It must be coprime with M
(i.e. not even), and, for a
full-period generator, A-1
must be divisible by four. For better
results, A-1
should not be divisible by 8. A good choice is a prime
number that satisfies these properties.
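For example, a quick sanity check for a candidate multiplier, assuming M is a power of two (a sketch of mine; it doesn’t test primality):
#include <stdint.h>
static int
good_multiplier(uint64_t a)
{
    return (a & 1) &&            /* odd, so coprime with M */
           (a - 1) % 4 == 0 &&   /* required for a full period */
           (a - 1) % 8 != 0;     /* the "better results" property */
}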
If your operands are 64-bit integers, or larger, how are you going to generate a prime number?
Emacs Calc can solve this problem. I’ve noted before how
featureful it is. It has arbitrary precision, random number
generation, and primality testing. It’s everything we need to choose
A
. (In fact, this is nearly identical to the process I used to
implement RSA.) For this example I’m going to generate a 64-bit
LCG for the C programming language, but it’s easy to use whatever
width you like and mostly whatever language you like. If you wanted a
minimal standard 128-bit LCG, this will still work.
Start by opening up Calc with M-x calc, then:
1. 2 on the stack
2. 64 on the stack
3. ^, computing 2^64 and pushing it on the stack
4. k r to generate a random number in this range
5. d r 16 to switch to hexadecimal display
6. k n to find the next prime following the random value
7. Repeat the previous step until the last hexadecimal digit is 5 or D
8. k p a few times to avoid false positives.
What’s left on the stack is your A! If you want a random value for
C, you can follow a similar process. Heck, make it prime, too!
The reason for using hexadecimal (step 5) and looking for 5
or D
(step 7) is that such numbers satisfy both of the important properties
for A-1
.
Calc doesn’t try to factor your random integer. Instead it uses the Miller–Rabin primality test, a probabilistic test that, itself, requires random numbers. It has false positives but no false negatives. The false positives can be mitigated by repeating the test multiple times, hence step 8.
Trying this all out right now, I got this implementation (in C):
uint64_t lcg1(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    return s;
}
However, we can still do a little better. Outputting the entire state doesn’t have great results, so instead it’s better to create a truncated LCG and only return some portion of the most significant bits.
uint32_t lcg2(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    return s >> 32;
}
This won’t quite pass BigCrush in 64-bit form, but the results are pretty reasonable for most purposes.
But we can still do better without needing to remember much more than this.
A Permuted Congruential Generator (PCG) is really just a truncated LCG with a permutation applied to its output. Like LCGs themselves, there are arbitrarily many variations. The “official” implementation has a data-dependent shift, for which I can never remember the details. Fortunately a couple of simple, easy-to-remember transformations are sufficient. Basically anything I used while prospecting for hash functions. I love xorshifts, so let’s add one of those:
uint32_t pcg1(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    uint32_t r = s >> 32;
    r ^= r >> 16;
    return r;
}
This is a big improvement, but it still fails one BigCrush test. As they say, when xorshift isn’t enough, use xorshift-multiply! Below I generated a 32-bit prime for the multiply, but any odd integer is a valid permutation.
uint32_t pcg2(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    uint32_t r = s >> 32;
    r ^= r >> 16;
    r *= UINT32_C(0x60857ba9);
    return r;
}
This passes BigCrush, and I can reliably build a new one entirely from scratch using Calc any time I need it.
Sometimes it’s not so straightforward to adapt this technique to other languages. For example, JavaScript has limited support for 32-bit integer operations (enough for a poor 32-bit LCG) and no 64-bit integer operations. Though BigInt is now a thing, and should make a great 96- or 128-bit LCG easy to build.
function lcg(seed) {
    let s = BigInt(seed);
    return function() {
        s *= 0xef725caa331524261b9646cdn;
        s += 0x213734f2c0c27c292d814385n;
        s &= 0xffffffffffffffffffffffffn;
        return Number(s >> 64n);
    }
}
Java doesn’t have unsigned integers, so how could you build the above
PCG in Java? Easy! First, remember that Java has two’s complement
semantics, including wrap around, and that two’s complement doesn’t
care about unsigned or signed for multiplication (or addition, or
subtraction). The result is identical. Second, the oft-forgotten >>>
operator does an unsigned right shift. With these two tips:
long s = 0;
int pcg2() {
    s = s*0x7c3c3267d015ceb5L + 0x24bd2d95276253a9L;
    int r = (int)(s >>> 32);
    r ^= r >>> 16;
    r *= 0x60857ba9;
    return r;
}
So, in addition to the Calc step list above, you may need to know some of the finer details of your target language.
In software development there are many concepts that at first glance seem useful and sound, but, after considering the consequences of their implementation and use, are actually horrifying. Examples include thread cancellation, variable length arrays, and memory aliasing. GCC’s closure extension to C is another, and this little feature compromises the entire GNU toolchain.
GCC has its own dialect of C called GNU C. One feature unique to GNU C is nested functions, which allow C programs to define functions inside other functions:
void intsort1(int *base, size_t nmemb)
{
    int cmp(const void *a, const void *b)
    {
        return *(int *)a - *(int *)b;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}
The nested function above is straightforward and harmless. It’s nothing
groundbreaking, and it is trivial for the compiler to implement. The
cmp
function is really just a static function whose scope is limited
to the containing function, no different than a local static variable.
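Roughly speaking, the compiler can lower it by hoisting it to a file-scope static with a hidden name — a sketch of the idea, not GCC’s literal output:
static int intsort1_cmp(const void *a, const void *b)
{
    return *(int *)a - *(int *)b;
}
void intsort1(int *base, size_t nmemb)
{
    qsort(base, nmemb, sizeof(*base), intsort1_cmp);
}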
With one slight variation the nested function turns into a closure. This is where things get interesting:
void intsort2(int *base, size_t nmemb, _Bool invert)
{
    int cmp(const void *a, const void *b)
    {
        int r = *(int *)a - *(int *)b;
        return invert ? -r : r;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}
The invert
variable from the outer scope is accessed from the inner
scope. This has clean, proper closure semantics and works
correctly just as you’d expect. It fits quite well with traditional C
semantics. The closure itself is re-entrant and thread-safe. It’s
automatically (read: stack) allocated, and so it’s automatically freed
when the function returns, including when the stack is unwound via
longjmp()
. It’s a natural progression to support closures like this
via nested functions. The eventual caller, qsort
, doesn’t even know
it’s calling a closure!
While this seems so useful and easy, its implementation has serious consequences that, in general, outweigh its benefits. In fact, in order to make this work, the whole GNU toolchain has been specially rigged!
How does it work? The function pointer, cmp
, passed to qsort
must
somehow be associated with its lexical environment, specifically the
invert
variable. A static address won’t do. When I implemented
closures as a toy library, I talked about the function address for
each closure instance somehow needing to be unique.
GCC accomplishes this by constructing a trampoline on the stack. That
trampoline has access to the local variables stored adjacent to it, also
on the stack. GCC also generates a normal cmp
function, like the
simple nested function before, that accepts invert
as an additional
argument. The trampoline calls this function, passing the local variable
as this additional argument.
To illustrate this, I’ve manually implemented intsort2()
below for
x86-64 (System V ABI) without using GCC’s nested function
extension:
int cmp(const void *a, const void *b, _Bool invert)
{
    int r = *(int *)a - *(int *)b;
    return invert ? -r : r;
}
void intsort3(int *base, size_t nmemb, _Bool invert)
{
    unsigned long fp = (unsigned long)cmp;
    volatile unsigned char buf[] = {
        // mov edx, invert
        0xba, invert, 0x00, 0x00, 0x00,
        // mov rax, cmp
        0x48, 0xb8, fp >> 0, fp >> 8, fp >> 16, fp >> 24,
        fp >> 32, fp >> 40, fp >> 48, fp >> 56,
        // jmp rax
        0xff, 0xe0
    };
    int (*trampoline)(const void *, const void *) = (void *)buf;
    qsort(base, nmemb, sizeof(*base), trampoline);
}
Here’s a complete example you can try yourself on nearly any x86-64 unix-like system: trampoline.c. It even works with Clang. The two notable systems where stack trampolines won’t work are OpenBSD and WSL.
(Note: The volatile
is necessary because C compilers rightfully do
not see the contents of buf
as being consumed. Execution of the
contents isn’t considered.)
In case you hadn’t already caught it, there’s a catch. The linker needs
to link a binary that asks the loader for an executable stack (-z
execstack
):
$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c
That’s because buf
contains x86 code implementing the trampoline:
mov edx, invert ; assign third argument
mov rax, cmp ; store cmp address in RAX register
jmp rax ; jump to cmp
(Note: The absolute jump through a 64-bit register is necessary because
the trampoline on the stack and the jump target will be very far apart.
Further, these days the program will likely be compiled as a Position
Independent Executable (PIE), so cmp
might itself have a high
address rather than load into the lowest 32 bits of the address
space.)
However, executable stacks were phased out ~15 years ago because it makes buffer overflows so much more dangerous! Attackers can inject and execute whatever code they like, typically shellcode. That’s why we need this unusual linker option.
You can see that the stack will be executable using our old friend,
readelf
:
$ readelf -l a.out
...
GNU_STACK 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 RWE 0x10
...
Note the “RWE” at the bottom right, meaning read-write-execute. This is a really bad sign in a real binary. Do any binaries installed on your system right now have an executable stack? I found one on mine. (Update: A major one was found in the comments by Walter Misar.)
When compiling the original version using a nested function there’s no need for that special linker option. That’s because GCC saw that it would need an executable stack and used this option automatically.
Or, more specifically, GCC stopped requesting a non-executable stack in the object file it produced. For the GNU Binutils linker, the default is an executable stack.
Since this is the default, the only way to get a non-executable stack is
if every object file input to the linker explicitly declares that it
does not need an executable stack. To request a non-executable stack, an
object file must contain the (empty) section .note.GNU-stack
.
If even a single object file fails to do this, then the final program
gets an executable stack.
Not only does one contaminated object file infect the binary, everything
dynamically linked with it also gets an executable stack. Entire
processes are infected! This occurs even via dlopen()
, where the stack
is dynamically made executable to accommodate the new shared object.
I’ve been bit myself. In Baking Data with Serialization I did it completely by accident, and I didn’t notice my mistake until three years later. The GNU linker outputs object files without the special note by default even though the object file only contains data.
$ echo hello world >hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$
This is fixed with -z noexecstack
:
$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
[ 2] .note.GNU-stack PROGBITS 00000000 0000004c
$
This may happen any time you link object files not produced by GCC, such as output from the NASM assembler or hand-crafted object files.
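If you’re producing such object files from hand-written assembly, the usual fix is to emit the empty note section yourself. With GNU as, that’s a one-line directive (NASM has an equivalent section declaration); this is the common idiom rather than anything specific to this article:
.section .note.GNU-stack,"",@progbits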
Nested C closures are super slick, but they’re just not worth the risk of an executable stack, and they’re certainly not worth an entire toolchain being fail open about it.
Update: A rebuttal. My short response is that the issue discussed in my article isn’t really about C the language but rather about an egregious issue with one particular toolchain. The problem doesn’t even arise if you use only C, but instead when linking in object files specifically not derived from C code.
Yesterday I wrote about a legitimate use for variable length
arrays. While recently discussing this topic with a
co-worker, I also thought of a semi-legitimate use for
alloca()
, a non-standard “function” for dynamically allocating
memory on the stack.
void *alloca(size_t);
I say “function” in quotes because it’s not truly a function and cannot be implemented as a function or by a library. It’s implemented in the compiler and is essentially part of the language itself. It’s a tool allowing a function to manipulate its own stack frame.
Like VLAs, it has the problem that if you’re able to use alloca()
safely, then you really don’t need it in the first place. Allocation
failures are undetectable and once they happen it’s already too late.
To set the scene, let’s talk about opaque structs. Suppose you’re writing a C library with a clean interface. It’s set up so that changing your struct fields won’t break the Application Binary Interface (ABI), and callers are largely unable to depend on implementation details, even by accident. To achieve this, it’s likely you’re making use of opaque structs in your interface. Callers only ever receive pointers to library structures, which are handed back into the interface when they’re used. The internal details are hidden away.
/* opaque float stack API */
struct stack *stack_create(void);
void stack_destroy(struct stack *);
int stack_push(struct stack *, float v);
float stack_pop(struct stack *);
Callers can use the API above without ever knowing the layout or even
the size of struct stack
. Only a pointer to the struct is ever needed.
However, in order for this to work, the library must allocate the struct
itself. If this is a concern, then the library will typically allow the
caller to supply an allocator via function pointers. To see a really
slick version of this in practice, check out Lua’s lua_Alloc
, a
single function allocator API.
Suppose we wanted to support something simpler: The library will advertise the size of the struct so the caller can allocate it.
/* API additions */
size_t stack_sizeof(void);
void stack_init(struct stack *); // like stack_create()
void stack_free(struct stack *); // like stack_destroy()
The implementation of stack_sizeof() would literally just be
return sizeof(struct stack). The caller might use it like so:
size_t len = stack_sizeof();
struct stack *s = malloc(len);
if (s) {
    stack_init(s);
    /* ... */
    stack_free(s);
    free(s);
}
However, that’s still a heap allocation. If this wasn’t an opaque
struct, the caller could very naturally use automatic (i.e. stack)
allocation, which is likely even preferred in this case. Is this still
possible? Idea: Allocate it via a generic char
array (VLA in this
case).
size_t len = stack_sizeof();
char buf[len];
struct stack *s = (struct stack *)buf;
stack_init(s);
However, this is technically undefined behavior. While a char
pointer
is special and permitted to alias with anything, the inverse isn’t true.
Pointers to other types don’t get a free pass to alias with a char
array. Accessing a char
value as if it were a different type just
isn’t allowed. Why? Because the standard says so. If you want one of the
practical reasons: the alignment might be incorrect.
Hmmm, is there another option? Maybe with alloca()
!
size_t len = stack_sizeof();
struct stack *s = alloca(len);
stack_init(s);
Since len
is expected to be small, it’s not any less safe than the
non-opaque alternative. It doesn’t undermine the type system, either,
since alloca()
has the same semantics as malloc()
. The downsides
are:
1. alloca() is only a common extension, never standardized, and for good reason.
2. How much memory should the caller alloca()?
The second issue can possibly be resolved if the size is available as a compile time constant. This starts to break the abstraction provided by opaque structs, but they’re still mostly opaque. For example:
/* API additions */
#define STACK_SIZE 24
/* In practice, this would likely be horrific #ifdef spaghetti! */
The caller might use it like this:
struct stack *s = alloca(STACK_SIZE);
stack_init(s);
Now the compiler can see the allocation size, and potentially optimize
away the alloca()
. As of this writing, Clang (all versions) can
optimize these fixed-size alloca()
usages, but GCC (9.2) still does
not. Here’s a simple example:
#include <alloca.h>
void
foo(void)
{
#ifdef ALLOCA
    volatile char *s = alloca(64);
#else
    volatile char s[64];
#endif
    s[63] = 0;
}
With the char
array version, both GCC and Clang produce optimal code:
0000000000000000 <foo>:
0: c6 44 24 f8 00 mov BYTE PTR [rsp-0x1],0x0
5: c3 ret
Side note: This is on x86-64 Linux, which uses the System V ABI. The entire array falls within the red zone, so it doesn’t need to be explicitly allocated.
With -DALLOCA
, Clang does the same, but GCC does the allocation
inefficiently as if it were dynamic:
0000000000000000 <foo>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 50 sub rsp,0x50
8: 48 8d 44 24 0f lea rax,[rsp+0xf]
d: 48 83 e0 f0 and rax,0xfffffffffffffff0
11: c6 40 3f 00 mov BYTE PTR [rax+0x3f],0x0
15: c9 leave
16: c3 ret
It would make a slightly better case for alloca()
here if GCC was
better about optimizing it. Regardless, this is another neat little
trick that I probably wouldn’t use in practice.
The C99 (ISO/IEC 9899:1999) standard of C introduced a new, powerful
feature called Variable Length Arrays (VLAs). The size of an array with
automatic storage duration (i.e. stack allocated) can be determined at
run time. Each instance of the array may even have a different length.
Unlike alloca()
, they’re a sanctioned form of dynamic stack
allocation.
At first glance, VLAs seem convenient, useful, and efficient. Heap allocations have a small cost because the allocator needs to do some work to find or request some free memory, and typically the operation must be synchronized since there may be other threads also making allocations. Stack allocations are trivial and fast by comparison: Allocation is a matter of bumping the stack pointer, and no synchronization is needed.
For example, here’s a function that non-destructively finds the median of a buffer of floats:
/* note: nmemb must be non-zero */
float
median(const float *a, size_t nmemb)
{
    float copy[nmemb];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}
It uses a VLA, copy
, as a temporary copy of the input for sorting. The
function doesn’t know at compile time how big the input will be, so it
cannot just use a fixed size. With a VLA, it efficiently allocates
exactly as much memory as needed on the stack.
Well, sort of. If nmemb
is too large, then the VLA will silently
overflow the stack. By silent I mean that the program has no way to
detect it and avoid it. In practice, it can be a lot louder, from a
segmentation fault in the best case, to an exploitable vulnerability in
the worst case: stack clashing. If an attacker can control
nmemb
, they might choose a value that causes copy
to overlap with
other allocations, giving them control over those values as well.
If there’s any risk that nmemb
is too large, it must be guarded.
#define COPY_MAX 4096
float
median(const float *a, size_t nmemb)
{
    if (nmemb > COPY_MAX)
        abort(); /* or whatever */
    float copy[nmemb];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}
However, if median
is expected to safely accommodate COPY_MAX
elements, it may as well always allocate an array of this size. If it
can’t, then that’s not a safe maximum.
float
median(const float *a, size_t nmemb)
{
    if (nmemb > COPY_MAX)
        abort();
    float copy[COPY_MAX];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}
And rather than abort, you might still want to support arbitrary input sizes:
float
median(const float *a, size_t nmemb)
{
    float buf[COPY_MAX];
    float *copy = buf;
    if (nmemb > COPY_MAX)
        copy = malloc(sizeof(a[0]) * nmemb);
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    float result = copy[nmemb / 2];
    if (copy != buf)
        free(copy);
    return result;
}
Then small inputs are fast, but large inputs still work correctly. This is called small size optimization.
If the correct solution ultimately didn’t use a VLA, then what good are they? In general, VLAs are not useful. They’re time bombs. VLAs are nearly always the wrong choice. You must be careful to check that they don’t exceed some safe maximum, and there’s no reason not to always use the maximum. This problem was realized for the C11 standard (ISO/IEC 9899:2011) where VLAs were made optional. A program containing a VLA will not necessarily compile on a C11 compiler.
Some purists also object to a special exception required for VLAs: The
sizeof
operator may evaluate its operand, and so it does not always
evaluate to a compile-time constant. If the operand contains a VLA, then
the result depends on a run-time value.
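A tiny illustration of that exception (my own example, not from the article):
#include <stdio.h>
int
main(int argc, char **argv)
{
    (void)argv;
    float copy[argc + 1];               /* a VLA: length known only at run time */
    printf("%zu\n", sizeof(copy));      /* (argc + 1) * sizeof(float) */
    return 0;
}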
Because they’re optional, it’s best to avoid even trivial VLAs like this:
float
median(const float *a, size_t nmemb)
{
    int max = 4096;
    if (nmemb > max)
        abort();
    float copy[max];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}
It’s easy to prove that the array length is always 4096, but technically
this is still a VLA. That would still be true even if max
were const
int
, because the array length still isn’t a constant integral
expression.
Finally, there’s also the problem that VLAs just aren’t as efficient as you might hope. A function that does dynamic stack allocation requires additional stack management. It must track additional memory addresses and will require extra instructions.
void
fixed(int n)
{
    if (n <= 1<<14) {
        volatile char buf[1<<14];
        buf[n - 1] = 0;
    }
}
void
dynamic(int n)
{
    if (n <= 1<<14) {
        volatile char buf[n];
        buf[n - 1] = 0;
    }
}
Compiled with gcc -Os
and viewed with objdump -d -Mintel
:
0000000000000000 <fixed>:
0: 81 ff 00 40 00 00 cmp edi,0x4000
6: 7f 19 jg 21 <fixed+0x21>
8: ff cf dec edi
a: 48 81 ec 88 3f 00 00 sub rsp,0x3f88
11: 48 63 ff movsxd rdi,edi
14: c6 44 3c 88 00 mov BYTE PTR [rsp+rdi*1-0x78],0x0
19: 48 81 c4 88 3f 00 00 add rsp,0x3f88
20: c3 ret
21: c3 ret
0000000000000022 <dynamic>:
22: 81 ff 00 40 00 00 cmp edi,0x4000
28: 7f 23 jg 4d <dynamic+0x2b>
2a: 55 push rbp
2b: 48 63 c7 movsxd rax,edi
2e: ff cf dec edi
30: 48 83 c0 0f add rax,0xf
34: 48 63 ff movsxd rdi,edi
37: 48 83 e0 f0 and rax,0xfffffffffffffff0
3b: 48 89 e5 mov rbp,rsp
3e: 48 89 e2 mov rdx,rsp
41: 48 29 c4 sub rsp,rax
44: c6 04 3c 00 mov BYTE PTR [rsp+rdi*1],0x0
48: 48 89 d4 mov rsp,rdx
4b: c9 leave
4c: c3 ret
4d: c3 ret
Note the use of a base pointer, rbp
and leave
, in the second
function in order to dynamically track the stack frame. (Hmm, in both
cases GCC could easily shave off the extra ret
at the end of each
function. Missed optimization?)
The story is even worse when stack clash protection is enabled
(-fstack-clash-protection
). The compiler generates extra code to probe
every page of allocation in case one of those pages is a guard page.
That’s also more complex when the allocation is dynamic. The VLA version
more than doubles in size (from 44 bytes to 101 bytes)!
There is one convenient, useful, and safe form of VLAs: a pointer to a VLA. It’s convenient and useful because it makes some expressions simpler. It’s safe because there’s no arbitrary stack allocation.
Pointers to arrays are a rare sight in C code, whether variable length or not. That’s because, the vast majority of the time, C programmers implicitly rely on array decay: arrays quietly “decay” into pointers to their first element the moment you do almost anything with them. Also because they’re really awkward to use.
For example, the function sum3
takes a pointer to an array of exactly
three elements.
int
sum3(int (*array)[3])
{
    return (*array)[0] + (*array)[1] + (*array)[2];
}
The parentheses are necessary because, without them, array
would be an
array of pointers — a type far more common than a pointer to an array.
To index into the array, first the pointer to the array must be
dereferenced to the array value itself, then this intermediate array is
indexed triggering array decay. Conceptually there’s quite a bit to it,
but, in practice, it’s all as efficient as the conventional approach to
sum3
that accepts a plain int *
.
The caller must take the address of an array of exactly the right length:
int buf[] = {1, 2, 4};
int r = sum3(&buf);
Or if dynamically allocating the array:
int (*array)[3] = malloc(sizeof(*array));
(*array)[0] = 1;
(*array)[1] = 2;
(*array)[2] = 4;
int r = sum3(array);
free(array);
The mandatory parentheses and strict type requirements make this awkward and rarely useful. However, with VLAs perhaps it’s worth the trouble! Consider an NxN matrix expressed using a pointer to a VLA:
int n = /* run-time value */;
/* TODO: Check for integer overflow. See note. */
float (*identity)[n][n] = malloc(sizeof(*identity));
if (identity) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            (*identity)[y][x] = x == y;
        }
    }
}
When indexing, the parentheses are weird, but the indices have the
convenient [y][x]
format. The non-VLA alternative is to compute a 1D
index manually from 2D indices (y*n+x
):
int n = /* run-time value */;
/* TODO: Check for integer overflow. */
float *identity = malloc(sizeof(*identity) * n * n);
if (identity) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            identity[y*n + x] = x == y;
        }
    }
}
Note: What’s the behavior in the VLA version when n
is so large that
sizeof(*identity)
doesn’t fit in a size_t
? I couldn’t find anything
in the standard about it, though I bet it’s undefined behavior. Neither
GCC nor Clang checks for overflow and, when it occurs, the overflow is
silent. Neither the undefined behavior sanitizer nor the address
sanitizer complains when this happens.
Update: bru del pointed out that these multi-dimensional VLAs can be simplified such that the parentheses may be omitted when indexing. The trick is to omit the first dimension from the VLA expression:
float (*identity)[n] = malloc(sizeof(*identity) * n);
if (identity) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            identity[y][x] = x == y;
        }
    }
}
So VLAs might be worth the trouble when using pointers to multi-dimensional, dynamically-allocated arrays. However, I’m still judicious about their use due to reduced portability. As a practical example, MSVC famously does not, and likely never will, support VLAs.
One of the frequent challenges in C is that pointers are nothing but a
memory address. A callee who is passed a pointer doesn’t truly know
anything other than the type of object being pointed at, which says some
things about alignment and how that pointer can be used… maybe. If it’s
a pointer to void (void *
) then not even that much is known.
The number of consecutive elements being pointed at is also not known. It could be as few as zero, so dereferencing would be illegal. This can be true even when the pointer is not null. Pointers can go one past the end of an array, at which point it points to zero elements. For example:
void foo(int *);
void bar(void)
{
    int array[4];
    foo(array + 4); // pointer one past the end
}
In some situations, the number of elements is known, at least to the programmer. For example, the function might have a contract that says it must be passed at least N elements, or exactly N elements. This could be communicated through documentation.
/** Foo accepts 4 int values. */
void foo(int *);
Or it could be implied by the function’s prototype. Despite the following function appearing to accept an array, that’s actually a pointer, and the “4” isn’t relevant to the prototype.
void foo(int[4]);
C99 introduced a feature to make this a formal part of the prototype, though, unfortunately, I’ve never seen a compiler actually use this information.
void foo(int[static 4]); // >= 4 elements, cannot be null
Another common pattern is for the callee to accept a count parameter.
For example, the POSIX write()
function:
ssize_t write(int fd, const void *buf, size_t count);
The necessary information describing the buffer is split across two arguments. That can become tedious, and it’s also a source of serious bugs if the two parameters aren’t in agreement (buffer overflow, information disclosure, etc.). Wouldn’t it be nice if this information was packed into the pointer itself? That’s essentially the definition of a fat pointer.
If we assume some things about the target platform, we can encode fat pointers inside a plain pointer with some dirty pointer tricks, exploiting unused bits in the pointer value. For example, currently on x86-64, only the lower 48 bits of a pointer are actually used. The other 16 bits could carefully be used for other information, like communicating the number of elements or bytes:
// NOTE: x86-64 only!
unsigned char buf[1000];
uintptr_t addr = (uintptr_t)buf & 0xffffffffffff;
uintptr_t pack = (sizeof(buf) << 48) | addr;
void *fatptr = (void *)pack;
The other side can unpack this to get the components back out. Obviously 16 bits for the count will often be insufficient, so this would more likely be used for baggy bounds checks.
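For instance, unpacking on the receiving side might look like this (same x86-64 assumptions as above; a sketch, not a robust scheme):
uintptr_t pack = (uintptr_t)fatptr;
size_t len = pack >> 48;                        /* the element count */
void *ptr = (void *)(pack & 0xffffffffffff);    /* the real address */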
Further, if we know something about the alignment — say, that it’s 16-byte aligned — then we can also encode information in the least significant bits, such as a type tag.
That’s all fragile, non-portable, and rather limited. A more robust approach is to lift pointers up into a richer, heavier type, like a structure.
struct fatptr {
    void *ptr;
    size_t len;
};
Functions accepting these fat pointers no longer need to accept a count parameter, and they’d generally accept the fat pointer by value.
fatptr_write(int fd, struct fatptr);
In typical C implementations, the structure fields would be passed practically, if not exactly, the same way as the individual parameters would have been passed, so it’s really no less efficient.
To help keep this straight, we might employ some macros:
#define COUNTOF(array) \
    (sizeof(array) / sizeof(array[0]))
#define FATPTR(ptr, count) \
    (struct fatptr){ptr, count}
#define ARRAYPTR(array) \
    FATPTR(array, COUNTOF(array))
/* ... */
unsigned char buf[40];
fatptr_write(fd, ARRAYPTR(buf));
There are obvious disadvantages of this approach, like type confusion
due to that void pointer, the inability to use const
, and just being
weird for C. I wouldn’t use it in a real program, but bear with me.
Before I move on, I want to add one more field to that fat pointer struct: capacity.
struct fatptr {
    void *ptr;
    size_t len;
    size_t cap;
};
This communicates not how many elements are present (len
), but how
much additional space is left in the buffer. This lets callees know
how much room is left for, say, appending new elements.
// Fill the remainder of an int buffer with a value.
void
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
}
Since the callee modifies the fat pointer, it should be returned:
struct fatptr
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
    ptr.len = ptr.cap;
    return ptr;
}
Congratulations, you’ve got slices! Except that in Go they’re a proper
part of the language and so don’t rely on hazardous hacks or tedious
bookkeeping. The fatptr_write()
function above is nearly functionally
equivalent to the Writer.Write()
method in Go, which accepts a slice:
type Writer interface {
    Write(p []byte) (n int, err error)
}
The buf
and count
parameters are packed together as a slice, and
fd
parameter is instead the receiver (the object being acted upon by
the method).
Go famously has pointers, including internal pointers, but not pointer
arithmetic. You can take the address of (nearly) anything, but
you can’t make that pointer point at anything else, even if you took the
address of an array element. Pointer arithmetic would undermine Go’s
type safety, so it can only be done through special mechanisms in the
unsafe
package.
But pointer arithmetic is really useful! It’s handy to take an address
of an array element, pass it to a function, and allow that function to
modify a slice (wink, wink) of the array. Slices are pointers that
support exactly this sort of pointer arithmetic, but safely. Unlike
the &
operator which creates a simple pointer, the slice operator
derives a fat pointer.
func fill([]int, int) []int
var array [8]int
// len == 0, cap == 8, like &array[0]
fill(array[:0], 1)
// array is [1, 1, 1, 1, 1, 1, 1, 1]
// len == 0, cap == 4, like &array[4]
fill(array[4:4], 2)
// array is [1, 1, 1, 1, 2, 2, 2, 2]
The fill
function could take a slice of the slice, effectively moving
the pointer around with pointer arithmetic, but without violating memory
safety due to the additional “fat pointer” information. In other words,
fat pointers can be derived from other fat pointers.
Slices aren’t as universal as pointers, at least at the moment. You can
take the address of any variable using &
, but you can’t take a slice
of any variable, even if it would be logically sound.
var foo int
// attempt to make len = 1, cap = 1 slice backed by foo
var fooslice []int = foo[:] // compile-time error!
That wouldn’t be very useful anyway. However, if you really wanted to
do this, the unsafe
package can accomplish it. I believe the resulting
slice would be perfectly safe to use:
// Convert to one-element array, then slice
fooslice = (*[1]int)(unsafe.Pointer(&foo))[:]
Update: Chris Siebenmann speculated about why this requires
unsafe
.
Of course, slices are super flexible and have many more uses that look less like fat pointers, but this is still how I tend to reason about slices when I write Go.
rand()
function for some of these purposes.
int r = rand();
There are some problems with this. Typically the implementation is a rather poor PRNG, and we can do much better. It’s a poor choice for Monte Carlo simulations, and outright dangerous for cryptography. Furthermore, it’s usually a dynamic function call, which has a high overhead compared to how little the function actually does. In glibc, it’s also synchronized, adding even more overhead.
But, more importantly, this function returns the same sequences of values each time the program runs. If we want different numbers each time the program runs, it needs to be seeded — but seeded with what? Regardless of what PRNG we ultimately use, we need inputs unique to this particular execution.
On any modern unix-like system, the classical approach is to open
/dev/urandom
and read some bytes. It’s not part of POSIX but it is a
de facto standard. These random bits are seeded from the physical
world by the operating system, making them highly unpredictable and
uncorrelated. They’re suitable for keying a CSPRNG and, from
there, generating all the secure random bits you will ever
need (perhaps with fast-key-erasure). Why not
/dev/random
? Because on Linux it’s pointlessly
superstitious, which has basically ruined that path for
everyone.
/* Returns zero on failure. */
int
getbits(void *buf, size_t len)
{
    int result = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        result = fread(buf, len, 1, f);
        fclose(f);
    }
    return result;
}
int
main(void)
{
    unsigned seed;
    if (getbits(&seed, sizeof(seed))) {
        srand(seed);
    } else {
        die();
    }
    /* ... */
}
Note how there are two different places getbits()
could fail, with
multiple potential causes.
It could fail to open the file. Perhaps the program isn’t running on a
modern unix-like system. Perhaps it’s running in a chroot and
/dev/urandom
wasn’t created. Perhaps there are too many file
descriptors already open. Perhaps there isn’t enough memory available
to open a file. Perhaps the file permissions disallow it or it’s
blocked by Mandatory Access Control (MAC).
It could fail to read the file. This essentially can’t happen unless the system is severely misconfigured, in which case a successful read would be suspect anyway. In this case it’s probably still a good idea to check the result.
The need for creating a file descriptor is a serious issue for libraries.
Libraries that quietly create and close file descriptors can interfere
with the main program, especially if it’s asynchronous. The main program
might rely on file descriptors being consecutive, predictable, or
monotonic (example). File descriptors are also a limited resource,
so it may exhaust a file descriptor slot needed for the main program.
For a network service, a remote attacker could perhaps open enough
sockets to deny a file descriptor to getbits()
, blocking the program
from gathering entropy.
/dev/urandom
is simple, but it’s not an ideal API.
Wouldn’t it be nicer if our program could just directly ask the
operating system to fill a buffer with random bits? That’s what the
OpenBSD folks thought, so they introduced a getentropy(2)
system call. When called correctly it cannot fail!
int getentropy(void *buf, size_t buflen);
Other operating systems followed suit, including Linux, though
on Linux getentropy(2)
is a library function implemented using
getrandom(2)
, the actual system call. It’s been in the Linux
kernel since version 3.17 (October 2014), but the libc wrapper didn’t
appear in glibc until version 2.25 (February 2017). So as of this
writing, there are still many systems where it’s still not practical
to use even if their kernel is new enough.
For now on Linux you may still want to check, and have a strategy in
place, for an ENOSYS
result. Some systems are still running kernels
that are 5 years old, or older.
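For instance, a fallback strategy might look something like this sketch, reusing the getbits() helper from earlier (the wrapper name getseedbits is mine):
#include <errno.h>
#include <unistd.h>
int
getseedbits(void *buf, size_t len)
{
    if (getentropy(buf, len) == 0) {
        return 1;
    }
    if (errno == ENOSYS) {
        return getbits(buf, len);  /* fall back to /dev/urandom */
    }
    return 0;
}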
OpenBSD also has another trick up its trick-filled sleeves: the
.openbsd.randomdata
section. Just as the .bss
section is
filled with zeros, the .openbsd.randomdata
section is filled with
securely-generated random bits. You could put your PRNG state in this
section and it will be seeded as part of loading the program. Cool!
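A sketch of how that might look, using the usual GCC/Clang section attribute (OpenBSD-specific, and the variable is just an example of mine):
#include <stdint.h>
static uint64_t prng_state[4]
    __attribute__((section(".openbsd.randomdata")));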
Windows doesn’t have /dev/urandom
. Instead it has:
CryptGenRandom()
CryptAcquireContext()
CryptReleaseContext()
Though in typical Win32 fashion, the API is ugly, overly-complicated, and has multiple possible failure points. It’s essentially impossible to use without referencing documentation. Ugh.
However, Windows 98 and later has RtlGenRandom()
,
which has a much more reasonable interface. Looks an awful lot like
getentropy(2)
, eh?
BOOLEAN RtlGenRandom(
    PVOID RandomBuffer,
    ULONG RandomBufferLength
);
The problem is that it’s not quite an official API, and no promises
are made about it. In practice, far too much software now depends on
it that the API is unlikely to ever break. Despite the prototype
above, this function is actually named SystemFunction036()
, and
you have to supply your own prototype. Here’s my little drop-in
snippet that turns it nearly into getentropy(2)
:
#ifdef _WIN32
# define WIN32_LEAN_AND_MEAN
# include <windows.h>
# pragma comment(lib, "advapi32.lib")
BOOLEAN NTAPI SystemFunction036(PVOID, ULONG);
# define getentropy(buf, len) (SystemFunction036(buf, len) ? 0 : -1)
#endif
It works in Wine, too, where, at least in my version, it reads from
/dev/urandom
.
That’s all well and good, but suppose we’re masochists. We want our
program to be maximally portable so we’re sticking strictly to
functionality found in the standard C library. That means no
getentropy(2)
and no RtlGenRandom()
. We can still try to open
/dev/urandom
, but it might fail, or it might not actually be useful,
so we’ll want a backup.
The usual approach found in a thousand tutorials is time(3)
:
srand(time(NULL));
It would be better to use an integer hash function to mix up the
result from time(0)
before using it as a seed. Otherwise two programs
started close in time may have similar initial sequences.
srand(triple32(time(NULL)));
The more pressing issue is that time(3)
has a resolution of one
second. If the program is run twice inside of a second, they’ll both
have the same sequence of numbers. It would be better to use a higher
resolution clock, but, standard C doesn’t provide a clock with greater
than one second resolution. That normally requires calling into POSIX
or Win32.
So, we need to find some other sources of entropy unique to each execution of the program.
Before we get into that, we need a way to mix these different sources
together. Here’s a small, 32-bit “string” hash function. The loop
is the same algorithm as Java’s hashCode()
, and I appended my own
integer hash as a finalizer for much better diffusion.
uint32_t
hash32s(const void *buf, size_t len, uint32_t h)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    h ^= h >> 17;
    h *= UINT32_C(0xed5ad4bb);
    h ^= h >> 11;
    h *= UINT32_C(0xac4c1b51);
    h ^= h >> 15;
    h *= UINT32_C(0x31848bab);
    h ^= h >> 14;
    return h;
}
It accepts a starting hash value, which is essentially a “context” for the digest that allows different inputs to be appended together. The finalizer acts as an implicit “stop” symbol in between inputs.
I used fixed-width integers, but it could be written nearly as concisely
using only unsigned long
and some masking to truncate to 32-bits. I
leave this as an exercise to the reader.
Some of the values to be mixed in will be pointers themselves. These could instead be cast to integers and passed through an integer hash function, but using string hash avoids various caveats. Besides, one of the inputs will be a string, so we’ll need this function anyway.
Attackers can use predictability to their advantage, so modern systems use unpredictability to improve security. Memory addresses for various objects and executable code are randomized since some attacks require an attacker to know their addresses. We can skim entropy from these pointers to seed our PRNG.
Address Space Layout Randomization (ASLR) is when executable code and its associated data is loaded to a random offset by the loader. Code designed for this is called Position Independent Code (PIC). This has long been used when loading dynamic libraries so that all of the libraries on a system don’t have to coordinate with each other to avoid overlapping.
To improve security, it has more recently been extended to programs themselves. On both modern unix-like systems and Windows, position-independent executables (PIE) are now the default.
To skim entropy from ASLR, we just need the address of one of our
functions. All the functions in our program will have the same relative
offset, so there’s no reason to use more than one. An obvious choice is
main()
:
uint32_t h = 0; /* initial hash value */
int (*mainptr)() = main;
h = hash32s(&mainptr, sizeof(mainptr), h);
Notice I had to store the address of main()
in a variable, and then
treat the pointer itself as a buffer for the hash function? It’s not
hashing the machine code behind main
, just its address. The symbol
main
doesn’t store an address, so it can’t be given to the hash
function to represent its address. This is analogous to an array
versus a pointer.
On a typical x86-64 Linux system, and when this is a PIE, that’s about 3 bytes worth of entropy. On 32-bit systems, virtual memory is so tight that it’s worth a lot less. We might want more entropy than that, and we want to cover the case where the program isn’t compiled as a PIE.
On unix-like systems, programs are typically dynamically linked against
the C library, libc. Each shared object gets its own ASLR offset, so we
can skim more entropy from each shared object by picking a function or
variable from each. Let’s do malloc(3)
for libc ASLR:
void *(*mallocptr)() = malloc;
h = hash32s(&mallocptr, sizeof(mallocptr), h);
Allocators themselves often randomize the addresses they return so that
data objects are stored at unpredictable addresses. In particular, glibc
uses different strategies for small (brk(2)
) versus big (mmap(2)
)
allocations. That’s two different sources of entropy:
void *small = malloc(1); /* 1 byte */
h = hash32s(&small, sizeof(small), h);
free(small);
void *big = malloc(1UL << 20); /* 1 MB */
h = hash32s(&big, sizeof(big), h);
free(big);
Finally the stack itself is often mapped at a random address, or at least started with a random gap, so that local variable addresses are also randomized.
void *ptr = &ptr;
h = hash32s(&ptr, sizeof(ptr), h);
We haven’t used time(3)
yet! Let’s still do that, using the full
width of time_t
this time around:
time_t t = time(0);
h = hash32s(&t, sizeof(t), h);
We do have another time source to consider: clock(3)
. It returns an
approximation of the processor time used by the program. There’s a
tiny bit of noise and inconsistency between repeated calls. We can use
this to extract a little bit of entropy over many repeated calls.
Naively we might try to use it like this:
/* Note: don't use this */
for (int i = 0; i < 1000; i++) {
    clock_t c = clock();
    h = hash32s(&c, sizeof(c), h);
}
The problem is that the resolution for clock()
is typically rough
enough that modern computers can execute multiple instructions between
ticks. On Windows, where CLOCKS_PER_SEC
is low, that entire loop
will typically complete before the result from clock()
increments
even once. With that arrangement we’re hardly getting anything from
it! So here’s a better version:
for (int i = 0; i < 1000; i++) {
    unsigned long counter = 0;
    clock_t start = clock();
    while (clock() == start)
        counter++;
    h = hash32s(&start, sizeof(start), h);
    h = hash32s(&counter, sizeof(counter), h);
}
The counter makes the resolution of the clock no longer important. If
it’s low resolution, then we’ll get lots of noise from the counter. If
it’s high resolution, then we get noise from the clock value itself.
Running the hash function an extra time between overall clock(3)
samples also helps with noise.
We’ve got one more source of entropy available: tmpnam(3)
. This
function generates a unique, temporary file name. It’s dangerous to
use as intended because it doesn’t actually create the file. There’s a
race between generating the name for the file and actually creating
it.
Fortunately we don’t actually care about the name as a filename. We’re using this to sample entropy not directly available to us. In attempt to get a unique name, the standard C library draws on its own sources of entropy.
char buf[L_tmpnam] = {0};
tmpnam(buf);
h = hash32s(buf, sizeof(buf), h);
The rather unfortunate downside is that lots of modern systems produce
a linker warning when they see tmpnam(3)
being linked, even though in
this case it's completely harmless.
So what goes into a temporary filename? It depends on the implementation.
Both get a high resolution timestamp and generate the filename directly
from the timestamp (no hashing, etc.). Unfortunately glibc does a very
poor job of also mixing getpid(2)
into the timestamp before using it,
and probably makes things worse by doing so.
On these platforms, this is a way to sample a high resolution timestamp without calling anything non-standard.
In the latest release as of this writing it uses rand(3)
, which makes
this useless. It’s also a bug since the C library isn’t allowed to
affect the state of rand(3)
outside of rand(3)
and srand(3)
. I
submitted a bug report and this has since been fixed.
In the next release it will use a generator seeded by the ELF
AT_RANDOM
value if available, or ASLR otherwise. This makes
it moderately useful.
Generated from getpid(2)
alone, with a counter to handle multiple
calls. It’s basically a way to sample the process ID without actually
calling getpid(2)
.
Actually gathers real entropy from the operating system (via
arc4random(2)
), which means we’re getting a lot of mileage out of this
one.
Its implementation is obviously forked from glibc. However, it first
tries to read entropy from /dev/urandom
, and only if that fails does
it fallback to glibc’s original high resolution clock XOR getpid(2)
method (still not hashing it).
Finally, still use /dev/urandom
if it’s available. This doesn’t
require us to trust that the output is anything useful since it’s just
being mixed into the other inputs.
char rnd[4];
FILE *f = fopen("/dev/urandom", "rb");
if (f) {
    if (fread(rnd, sizeof(rnd), 1, f))
        h = hash32s(rnd, sizeof(rnd), h);
    fclose(f);
}
When we’re all done gathering entropy, set the seed from the result.
srand(h); /* or whatever you're seeding */
That’s bound to find some entropy on just about any host. Though definitely don’t rely on the results for cryptography.
I recently tackled this problem in Lua. It has a no-batteries-included design, demanding very little of its host platform: nothing more than an ANSI C implementation. Because of this, a Lua program has even fewer options for gathering entropy than C. But it’s still not impossible!
To further complicate things, Lua code is often run in a sandbox with
some features removed. For example, Lua has os.time()
and os.clock()
wrapping the C equivalents, allowing for the same sorts of entropy
sampling. When run in a sandbox, os
might not be available. Similarly,
io
might not be available for accessing /dev/urandom
.
Have you ever printed a table, though? Or a function? It evaluates to a string containing the object’s address.
$ lua -e 'print(math)'
table: 0x559577668a30
$ lua -e 'print(math)'
table: 0x55e4a3679a30
Since the raw pointer values are leaked to Lua, we can skim allocator entropy like before. Here’s the same hash function in Lua 5.3:
local function hash32s(buf, h)
    for i = 1, #buf do
        h = h * 31 + buf:byte(i)
    end
    h = h & 0xffffffff
    h = h ~ (h >> 17)
    h = h * 0xed5ad4bb
    h = h & 0xffffffff
    h = h ~ (h >> 11)
    h = h * 0xac4c1b51
    h = h & 0xffffffff
    h = h ~ (h >> 15)
    h = h * 0x31848bab
    h = h & 0xffffffff
    h = h ~ (h >> 14)
    return h
end
Now hash a bunch of pointers in the global environment:
local h = hash32s({}, 0) -- hash a new table
for varname, value in pairs(_G) do
    h = hash32s(varname, h)
    h = hash32s(tostring(value), h)
    if type(value) == 'table' then
        for k, v in pairs(value) do
            h = hash32s(tostring(k), h)
            h = hash32s(tostring(v), h)
        end
    end
end
math.randomseed(h)
Unfortunately this doesn’t actually work well on one platform I tested: Cygwin. Cygwin has few security features, notably lacking ASLR, and having a largely deterministic allocator.
In practice it’s not really necessary to use these sorts of tricks of gathering entropy from odd places. It’s something that comes up more in coding challenges and exercises than in real programs. I’m probably already making platform-specific calls in programs substantial enough to need it anyway.
On a few occasions I have thought about these things when debugging. ASLR makes return pointers on the stack slightly randomized on each run, which can change the behavior of some kinds of bugs. Allocator and stack randomization does similar things to most of your pointers. GDB tries to disable some of these features during debugging, but it doesn’t get everything.
The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.
That’s why, when I recently revisited the Fibers API, I was pleasantly surprised. It’s one of the exceptions — much cleaner than the optional, deprecated, and now obsolete POSIX equivalent. It’s not quite an apples-to-apples comparison since the POSIX version is slightly more powerful, and more complicated as a result. I’ll cover the difference in this article.
For the last part of this article, I'll walk through an async/await framework built on top of fibers. The framework allows coroutines in C programs to await on arbitrary kernel objects.
Windows fibers are really just stackful, symmetric coroutines.
From a different point of view, they’re cooperatively scheduled threads,
which is the source of the analogous name, fibers. They’re symmetric
because all fibers are equal, and no fiber is the “main” fiber. If any
fiber returns from its start routine, the program exits. (Older versions
of Wine will crash when this happens, but it was recently fixed.) It’s
equivalent to the process’ main thread returning from main()
. The
initial fiber is free to create a second fiber, yield to it, then the
second fiber destroys the first.
For now I’m going to focus on the core set of fiber functions. There are some additional capabilities I’m going to ignore, including support for fiber local storage. The important functions are just these five:
void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void SwitchToFiber(void *fiber);
bool ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void DeleteFiber(void *fiber);
To emphasize its simplicity, I’ve shown them here with more standard
prototypes than seen in their formal documentation. That documentation
uses the clunky Windows API typedefs still burdened with its 16-bit
heritage — e.g. LPVOID
being a “long pointer” from the segmented memory
of the 8086:
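For comparison, the formal declarations look roughly like this — reconstructed from the Win32 headers, so the exact spellings of the annotations may differ from your SDK:
LPVOID WINAPI CreateFiber(SIZE_T dwStackSize,
                          LPFIBER_START_ROUTINE lpStartAddress,
                          LPVOID lpParameter);
VOID   WINAPI SwitchToFiber(LPVOID lpFiber);
BOOL   WINAPI ConvertFiberToThread(VOID);
LPVOID WINAPI ConvertThreadToFiber(LPVOID lpParameter);
VOID   WINAPI DeleteFiber(LPVOID lpFiber);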
Fibers are represented using opaque, void pointers. Maybe that’s a little
too simple since it’s easy to misuse in C, but I like it. The return
values for CreateFiber()
and ConvertThreadToFiber()
are void pointers
since these both create fibers.
The fiber start routine returns nothing and takes a void “user pointer”.
That’s nearly what I’d expect, except that it would probably make more
sense for a fiber to return int
, which is more in line with
main
/ WinMain
/ mainCRTStartup
/ WinMainCRTStartup
. As I said,
when any fiber returns from its start routine, it’s like returning from
the main function, so it should probably have returned an integer.
A fiber may delete itself, which is the same as exiting the thread.
However, a fiber cannot yield (e.g. SwitchToFiber()
) to itself. That’s
undefined behavior.
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void
coup(void *king)
{
    puts("Long live the king!");
    DeleteFiber(king);
    ConvertFiberToThread(); /* seize the main thread */
    /* ... */
}

int
main(void)
{
    void *king = ConvertThreadToFiber(0);
    void *pretender = CreateFiber(0, coup, king);
    SwitchToFiber(pretender);
    abort(); /* unreachable */
}
Only fibers can yield to fibers, but when the program starts up, there
are no fibers. At least one thread must first convert itself into a
fiber using ConvertThreadToFiber()
, which returns the fiber object
that represents itself. It takes one argument analogous to the last
argument of CreateFiber()
, except that there’s no start routine to
accept it. The process is reversed with ConvertFiberToThread()
.
Fibers don’t belong to any particular thread and can be scheduled on any thread if properly synchronized. Obviously one should never yield to the same fiber in two different threads at the same time.
The POSIX equivalent was context switching. It's also stackful
and symmetric, but it has just three important functions:
getcontext(3)
, makecontext(3)
, and
swapcontext
.
int getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int swapcontext(ucontext_t *oucp, const ucontext_t *ucp);
These are roughly equivalent to GetCurrentFiber()
,
CreateFiber()
, and SwitchToFiber()
. There is no need for
ConvertFiberToThread()
since threads can context switch without
preparation. There’s also no DeleteFiber()
because the resources are
managed by the program itself. That’s where POSIX contexts are a little
bit more powerful.
The first argument to CreateFiber()
is the desired stack size, with
zero indicating the default stack size. The stack is allocated and freed
by the operating system. The downside is that the caller doesn’t have a
choice in managing the lifetime of this stack and how it’s allocated. If
you’re frequently creating and destroying coroutines, those stacks are
constantly being allocated and freed.
In makecontext(3)
, the caller allocates and supplies the stack. Freeing
that stack is equivalent to destroying the context. A program that
frequently creates and destroys contexts can maintain a stack pool or
otherwise more efficiently manage their allocation. This makes it more
powerful, but it also makes it a little more complicated. It would be hard
to remember how to do all this without a careful reading of the
documentation:
/* Create a context */
ucontext_t ctx;
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;
getcontext(&ctx);
makecontext(&ctx, proc, 0);
/* Destroy a context */
free(ctx.uc_stack.ss_sp);
Note how makecontext(3)
is variadic (...
), passing its arguments on
to the start routine of the context. This seems like it might be better
than a user pointer. Unfortunately it’s not, since those arguments are
strictly limited to integers.
Ultimately I like the fiber API better. The first time I tried it out, I could guess my way through it without looking closely at the documentation.
Why was I looking at the Fiber API? I’ve known about coroutines for years but I didn’t understand how they could be useful. Sure, the function can yield, but what other coroutine should it yield to? It wasn’t until I was recently bit by the async/await bug that I finally saw a “killer feature” that justified their use. Generators come pretty close, though.
Windows fibers are a coroutine primitive suitable for async/await in C programs, where it can also be useful. To prove that it’s possible, I built async/await on top of fibers in 95 lines of code.
The alternatives are to use a third-party coroutine library or to do it myself with some assembly programming. However, having it built into the operating system is quite convenient! It’s unfortunate that it’s limited to Windows. Ironically, though, everything I wrote for this article, including the async/await demonstration, was originally written on Linux using Mingw-w64 and tested using Wine. Only after I was done did I even try it on Windows.
Before diving into how it works, there’s a general concept about the
Windows API that must be understood: All kernel objects can be in
either a signaled or unsignaled state. The API provides functions that
block on a kernel object until it is signaled. The two important ones
are WaitForSingleObject()
and WaitForMultipleObjects()
.
The latter behaves very much like poll(2)
in POSIX.
Usually the signal is tied to some useful event, like a process or
thread exiting, the completion of an I/O operation (i.e. asynchronous
overlapped I/O), a semaphore being incremented, etc. It’s a generic way
to wait for some event. However, instead of blocking the thread,
wouldn’t it be nice to await on the kernel object? In my aio
library for Emacs, the fundamental “wait” object was a promise. For this
API it’s a kernel object handle.
So, the await function will take a kernel object, register it with the scheduler, then yield to the scheduler. The scheduler — which is a global variable, so there’s only one scheduler per process — looks like this:
struct {
    void *main_fiber;
    HANDLE handles[MAXIMUM_WAIT_OBJECTS];
    void *fibers[MAXIMUM_WAIT_OBJECTS];
    void *dead_fiber;
    int count;
} async_loop;
While fibers are symmetric, coroutines in my async/await implementation
are not. One fiber is the scheduler, main_fiber
, and the other fibers
always yield to it.
There is an array of kernel object handles, handles
, and an array of
fibers
. The elements in these arrays are paired with each other, but
it’s convenient to store them separately, as I’ll show soon. fibers[0]
is waiting on handles[0]
, and so on.
The array is a fixed size, MAXIMUM_WAIT_OBJECTS
(64), because there’s
a hard limit on the number of fibers that can wait at once. This
pathetically small limitation is an unfortunate, hard-coded restriction
of the Windows API. It kills most practical uses of my little library.
Fortunately there’s no limit on the number of handles we might want to
wait on, just the number of co-existing fibers.
When a fiber is about to return from its start routine, it yields one
last time and registers itself on the dead_fiber
member. The scheduler
will delete this fiber as soon as it’s given control. Fibers never
truly return since that would terminate the program.
With this, the await function, async_await()
, is pretty simple. It
registers the handle with the scheduler, then yields to the scheduler
fiber.
void
async_await(HANDLE h)
{
    async_loop.handles[async_loop.count] = h;
    async_loop.fibers[async_loop.count] = GetCurrentFiber();
    async_loop.count++;
    SwitchToFiber(async_loop.main_fiber);
}
Caveat: The scheduler destroys this handle with CloseHandle()
after it
signals, so don’t try to reuse it. This made my demonstration simpler,
but it might be better to not do this.
A fiber can exit at any time. Such an exit is inserted implicitly before a fiber actually returns:
void
async_exit(void)
{
    async_loop.dead_fiber = GetCurrentFiber();
    SwitchToFiber(async_loop.main_fiber);
}
The start routine given to async_start()
is actually wrapped in the
real start routine. This is how async_exit()
is injected:
struct fiber_wrapper {
    void (*func)(void *);
    void *arg;
};

static void
fiber_wrapper(void *arg)
{
    struct fiber_wrapper *fw = arg;
    fw->func(fw->arg);
    async_exit();
}

int
async_start(void (*func)(void *), void *arg)
{
    if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
        return 0;
    } else {
        struct fiber_wrapper fw = {func, arg};
        SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
        return 1;
    }
}
The library provides a single awaitable function, async_sleep()
. It
creates a “waitable timer” object, starts the countdown, and returns it.
(Notice how SetWaitableTimer()
is a typically-ugly Win32 function with
excessive parameters.)
HANDLE
async_sleep(double seconds)
{
    HANDLE promise = CreateWaitableTimer(0, 0, 0);
    LARGE_INTEGER t;
    t.QuadPart = (long long)(seconds * -10000000.0);
    SetWaitableTimer(promise, &t, 0, 0, 0, 0);
    return promise;
}
A more realistic example would be overlapped I/O. For example, you’d
open a file (CreateFile()
) in overlapped mode, then when you, say,
read from that file (ReadFile()
) you create an event object
(CreateEvent()
), populate an overlapped I/O structure with the event,
offset, and length, then finally await on the event object. The fiber
will be resumed when the operation is complete.
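A hedged sketch of that sequence, assuming a file handle opened with FILE_FLAG_OVERLAPPED and the async_await() defined earlier (the file and buf variables, and all error handling, are left as assumptions):
HANDLE ev = CreateEvent(0, 0, 0, 0);      /* auto-reset, initially unsignaled */
OVERLAPPED ov = {0};
ov.hEvent = ev;
ov.Offset = 0;                            /* low 32 bits of the file offset */
ReadFile(file, buf, sizeof(buf), 0, &ov); /* returns immediately, ERROR_IO_PENDING */
async_await(ev);                          /* resumes once the read has completed */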
Side note: Unfortunately overlapped I/O doesn’t work correctly for files, and many operations can’t be done asynchronously, like opening files. When it comes to files, you’re better off using dedicated threads as libuv does instead of overlapped I/O. You can still await on these operations. You’d just await on the signal from the thread doing synchronous I/O, not from overlapped I/O.
The most complex part is the scheduler, and it’s really not complex at all:
void
async_run(void)
{
    while (async_loop.count) {
        /* Wait for next event */
        DWORD nhandles = async_loop.count;
        HANDLE *handles = async_loop.handles;
        DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);

        /* Remove event and fiber from waiting array */
        void *fiber = async_loop.fibers[r];
        CloseHandle(async_loop.handles[r]);
        async_loop.handles[r] = async_loop.handles[nhandles - 1];
        async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
        async_loop.count--;

        /* Run the fiber */
        SwitchToFiber(fiber);

        /* Destroy the fiber if it exited */
        if (async_loop.dead_fiber) {
            DeleteFiber(async_loop.dead_fiber);
            async_loop.dead_fiber = 0;
        }
    }
}
This is why the handles are in their own array. The array can be passed
directly to WaitForMultipleObjects()
. The return value indicates which
handle was signaled. The handle is closed, the entry removed from the
scheduler, and then the fiber is resumed.
That WaitForMultipleObjects()
is what limits the number of fibers.
It’s not possible to wait on more than 64 handles at once! This is
hard-coded into the API. How? A return value of 64 is an error code, and
changing this would break the API. Remember what I said about being
locked into bad design decisions of the past?
To be fair, WaitForMultipleObjects()
was a doomed API anyway, just
like select(2)
and poll(2)
in POSIX. It scales very poorly since the
entire array of objects being waited on must be traversed on each call.
That’s terribly inefficient when waiting on large numbers of objects.
This sort of problem is solved by interfaces like kqueue (BSD), epoll
(Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this
particular problem well — awaiting on kernel objects — so I
couldn’t use it.
When the awaiting fiber count is zero and the scheduler has control, all fibers must have completed and there’s nothing left to do. However, the caller can schedule more fibers and then restart the scheduler if desired.
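To make the flow concrete, here's a minimal usage sketch built from the snippets above. The article never shows the initialization step, so converting the main thread into the scheduler fiber inline here is my own assumption, as is the hello() fiber:
#include <stdio.h>

static void
hello(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        async_await(async_sleep(1.0));  /* suspend this fiber for ~1 second */
        puts("tick");
    }
}

int
main(void)
{
    async_loop.main_fiber = ConvertThreadToFiber(0);  /* become the scheduler */
    async_start(hello, 0);                            /* schedule a fiber */
    async_run();                                      /* run until all fibers finish */
    return 0;
}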
That’s all there is to it. Have a look at demo.c
to see how
the API looks in some trivial examples. On Linux you can see it in
action with make check
. On Windows, you just need to compile
it, then run it like a normal program. If there was a better
function than WaitForMultipleObjects()
in the Windows API, I would
have considered turning this demonstration into a real library.
I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.
The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:
This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):
$ make
$ ./endlessh &
$ ssh -p2222 localhost
Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.
How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:
SSH-protoversion-softwareversion SP comments CR LF
The RFC also notes:
The server MAY send other lines of data before sending the version string.
There is no limit on the number of lines, just that these lines must not begin with “SSH-“ since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.
This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.
Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)
Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.
The most straightforward way to implement something like Endlessh is a
fork server: accept a connection, fork, and the child simply alternates
between sleep(3)
and write(2)
:
for (;;) {
    ssize_t r;
    char line[256];
    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}
A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.
A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.
For Endlessh I went for an even more lightweight version: a
single-threaded poll(2)
server, analogous to stackless green threads.
The overhead per connection is about as low as it gets.
Clients that are being delayed are not registered in poll(2)
. Their
only overhead is the socket object in the kernel, and another 78 bytes
to track them in Endlessh. Most of those bytes are used only for
accurate logging. Only those clients that are overdue for a new line
are registered for poll(2)
.
When clients are waiting, but no clients are overdue, poll(2)
is
essentially used in place of sleep(3)
. Though since it still needs
to manage the accept server socket, it (almost) never actually waits
on nothing.
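A rough sketch of that arrangement — this is not Endlessh's actual code, and epochms() and next_deadline() are hypothetical helpers:
long long now = epochms();               /* current time in milliseconds */
long long due = next_deadline(clients);  /* when the next line must be sent */
int timeout = due > now ? (int)(due - now) : 0;
/* fds[] holds the accept socket plus only the overdue clients */
int r = poll(fds, nfds, timeout);        /* doubles as the sleep */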
There’s an option to limit the total number of client connections so
that it doesn’t get out of hand. In this case it will stop polling the
accept socket until a client disconnects. I probably shouldn’t have
bothered with this option and instead relied on ulimit
, a feature
already provided by the operating system.
I could have used epoll (Linux) or kqueue (BSD), which would be much
more efficient than poll(2)
. The problem with poll(2)
is that it’s
constantly registering and unregistering Endlessh on each of the
overdue sockets each time around the main loop. This is by far the
most CPU-intensive part of Endlessh, and it’s all inflicted on the
kernel. Most of the time, even with thousands of clients trapped in
the tarpit, only a small number of them are polled at once, so I opted
for better portability instead.
One consequence of not polling connections that are waiting is that
disconnections aren’t noticed in a timely fashion. This makes the logs
less accurate than I like, but otherwise it’s pretty harmless.
Unfortunately even if I wanted to fix this, the poll(2)
interface
isn’t quite equipped for it anyway.
With a poll(2)
server, the biggest overhead remaining is in the
kernel, where it allocates send and receive buffers for each client
and manages the proper TCP state. The next step to reducing this
overhead is Endlessh opening a raw socket and speaking TCP itself,
bypassing most of the operating system’s TCP/IP stack.
Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.
Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.
If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.
I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.
Since writing Endless I’ve learned about Python’s asyncio
, and
it’s actually a near perfect fit for this problem. I should have just
used it in the first place. The hard part is already implemented within
asyncio
, and the problem isn’t CPU-bound, so being written in Python
doesn’t matter.
Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:
import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())
Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:
import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())
Try it out for yourself. Firefox and Chrome will spin on that server
for hours before giving up. I have yet to see curl actually timeout on
its own in the default settings (--max-time
/-m
does work
correctly, though).
Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.
In 2007 I wrote a pair of modding tools, binitools, for a space trading and combat simulation game named Freelancer. The game stores its non-art assets in the format of “binary INI” files, or “BINI” files. The motivation for the binary format over traditional INI files was probably performance: it’s faster to load and read these files than it is to parse arbitrary text in INI format.
Much of the in-game content can be changed simply by modifying these files — changing time names, editing commodity prices, tweaking ship statistics, or even adding new ships to the game. The binary nature makes them unsuitable to in-place modification, so the natural approach is to convert them to text INI files, make the desired modifications using a text editor, then convert back to the BINI format and replace the file in the game’s installation.
I didn’t reverse engineer the BINI format, nor was I the first person the create tools to edit them. The existing tools weren’t to my tastes, and I had my own vision for how they should work — an interface more closely following the Unix tradition despite the target being a Windows game.
When I got started, I had just learned how to use yacc (really Bison) and lex (really flex), as well as Autoconf, so I went all-in with these newly-discovered tools. It was exciting to try them out in a real-world situation, though I slavishly aped the practices of other open source projects without really understanding why things were the way they were. Due to the use of yacc/lex and the configure script build, compiling the project required a full, Unix-like environment. This is all visible in the original version of the source.
The project was moderately successful in two ways. First, I was able to use the tools to modify the game. Second, other people were using the tools, since the binaries I built show up in various collections of Freelancer modding tools online.
That’s the way things were until mid-2018 when I revisited the project. Ever look at your own old code and wonder what they heck you were thinking? My INI format was far more rigid and strict than necessary, I was doing questionable things when writing out binary data, and the build wasn’t even working correctly.
With an additional decade of experience under my belt, I knew I could do way better if I were to rewrite these tools today. So, over the course of a few days, I did, from scratch. That’s what’s visible in the master branch today.
I like to keep things simple which meant no more Autoconf, and
instead a simple, portable Makefile. No more yacc or lex, and
instead a hand-coded parser. Using only conforming, portable C. The
result was so simple that I can build using Visual Studio in a
single, short command, so the Makefile isn’t all that necessary. With
one small tweak (replace stdint.h
with a typedef
), I can even build
and run binitools in DOS.
The new version is faster, leaner, cleaner, and simpler. It’s far more flexible about its INI input, so it’s easier to use. But is it more correct?
I’ve been interested in fuzzing for years, especially american fuzzy lop, or afl. However, I wasn’t having success with it. I’d fuzz some of the tools I use regularly, and it wouldn’t find anything of note, at least not before I gave up. I fuzzed my JSON library, and somehow it turned up nothing. Surely my JSON parser couldn’t be that robust already, could it? Fuzzing just wasn’t accomplishing anything for me. (As it turns out, my JSON library is quite robust, thanks in large part to various contributors!)
So I’ve got this relatively new INI parser, and while it can successfully parse and correctly re-assemble the game’s original set of BINI files, it hasn’t really been exercised that much. Surely there’s something in here for a fuzzer to find. Plus I don’t even have to write a line of code in order to run afl against it. The tools already read from standard input by default, which is perfect.
Assuming you’ve got the necessary tools installed (make, gcc, afl), here’s how easy it is to start fuzzing binitools:
$ make CC=afl-gcc
$ mkdir in out
$ echo '[x]' > in/empty
$ afl-fuzz -i in -o out -- ./bini
The bini
utility takes INI as input and produces BINI as output, so
it’s far more interesting to fuzz than its inverse, unbini
. Since
unbini
parses relatively simple binary data, there are (probably) no
bugs for the fuzzer to find. I did try anyway just in case.
In my example above, I swapped out the default compiler for afl’s GCC
wrapper (CC=afl-gcc
). It calls GCC in the background, but in doing so
adds its own instrumentation to the binary. When fuzzing, afl-fuzz
uses that instrumentation to monitor the program’s execution path. The
afl whitepaper explains the technical details.
I also created input and output directories, placing a minimal, working example into the input directory, which gives afl a starting point. As afl runs, it mutates a queue of inputs and observes the changes on the program’s execution. The output directory contains the results and, more importantly, a corpus of inputs that cause unique execution paths. In other words, the fuzzer output will be lots of inputs that exercise many different edge cases.
The most exciting and dreaded result is a crash. The first time I ran it
against binitools, bini
had many such crashes. Within minutes, afl
was finding a number of subtle and interesting bugs in my program, which
was incredibly useful. It even discovered an unlikely stale pointer
bug by exercising different orderings for various memory
allocations. This particular bug was the turning point that made me
realize the value of fuzzing.
Not all the bugs it found led to crashes. I also combed through the outputs to see what sorts of inputs were succeeding, what was failing, and observe how my program handled various edge cases. It was rejecting some inputs I thought should be valid, accepting some I thought should be invalid, and interpreting some in ways I hadn’t intended. So even after I fixed the crashing inputs, I still made tweaks to the parser to fix each of these troublesome inputs.
Once I combed out all the fuzzer-discovered bugs, and I agreed with the parser on how all the various edge cases should be handled, I turned the fuzzer’s corpus into a test suite — though not directly.
I had run the fuzzer in parallel — a process that is explained in the
afl documentation — so I had lots of redundant inputs. By redundant I
mean that the inputs are different but have the same execution path.
Fortunately afl has a tool to deal with this: afl-cmin
, the corpus
minimization tool. It eliminates all the redundant inputs.
Second, many of these inputs were longer than necessary in order to
invoke their unique execution path. There’s afl-tmin
, the test case
minimizer, which I used to further shrink my test corpus.
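For anyone reproducing that workflow, the invocations look something like this — the directory names are illustrative, not the ones I actually used:
$ afl-cmin -i out/queue -o corpus -- ./bini
$ for t in corpus/*; do
      afl-tmin -i "$t" -o "min/${t##*/}" -- ./bini
  done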
I sorted the valid from invalid inputs and checked them into the repository. Have a look at all the wacky inputs invented by the fuzzer starting from my single, minimal input:
This essentially locks down the parser, and the test suite ensures a particular build behaves in a very specific way. This is most useful for ensuring that builds on other platforms and by other compilers are indeed behaving identically with respect to their outputs. My test suite even revealed a bug in diet libc, as binitools doesn’t pass the tests when linked against it. If I were to make non-trivial changes to the parser, I’d essentially need to scrap the current test suite and start over, having afl generate an entire new corpus for the new parser.
Fuzzing has certainly proven itself to be a powerful technique. It found a number of bugs that I likely wouldn’t have otherwise discovered on my own. I’ve since gotten more savvy on its use and have used it on other software — not just software I’ve written myself — and discovered more bugs. It’s got a permanent slot on my software developer toolbelt.
What's the value of r in this bit of JavaScript?
let array = new Uint8Array([255]);
let r = ++array[0];
The increment and decrement operators originated in the B programming
language. Its closest living relative today is C, and, as far as these
operators are concerned, C can be considered an ancestor of JavaScript.
So what is the value of r
in this similar C code?
uint8_t array[] = {255};
int r = ++array[0];
Of course, if they were the same then there would be nothing to write
about, so that should make it easier to guess if you aren’t sure. The
answer: In JavaScript, r
is 256. In C, r
is 0.
What happened to me was that I wrote an 80-bit integer increment routine in C like this:
uint8_t array[10];
/* ... */
for (int i = 9; i >= 0; i--)
    if (++array[i])
        break;
But I was getting the wrong result over in JavaScript from essentially the same code:
let array = new Uint8Array(10);
/* ... */
for (let i = 9; i >= 0; i--)
    if (++array[i])
        break;
So what’s going on here?
The ES5 specification says this about the prefix increment operator:
1. Let expr be the result of evaluating UnaryExpression.
2. Throw a SyntaxError exception if the following conditions are all true: [omitted]
3. Let oldValue be ToNumber(GetValue(expr)).
4. Let newValue be the result of adding the value 1 to oldValue, using the same rules as for the + operator (see 11.6.3).
5. Call PutValue(expr, newValue).
6. Return newValue.
So, oldValue is 255. This is a double precision float because all numbers in JavaScript (outside of the bitwise operations) are double precision floating point. Add 1 to this value to get 256, which is newValue. When newValue is stored in the array via PutValue(), it’s converted to an unsigned 8-bit integer, which truncates it to 0.
However, newValue is returned, not the value that was actually stored in the array!
Since JavaScript is dynamically typed, this difference did not actually matter until typed arrays are involved. I suspect if typed arrays were in JavaScript from the beginning, the specified behavior would be more in line with C.
This behavior isn’t limited to the prefix operators. Consider assignment:
let array = new Uint8Array([255]);
let r = (array[0] = array[0] + 1);
let s = (array[0] += 1);
Both r
and s
will still be 256. The result of the assignment
operators is a similar story:
LeftHandSideExpression = AssignmentExpression is evaluated as follows:
1. Let lref be the result of evaluating LeftHandSideExpression.
2. Let rref be the result of evaluating AssignmentExpression.
3. Let rval be GetValue(rref).
4. Throw a SyntaxError exception if the following conditions are all true: [omitted]
5. Call PutValue(lref, rval).
6. Return rval.
Again, the result of the expression is independent of how it was stored with PutValue().
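One workaround sketch for the 80-bit increment — not necessarily the only fix — is to re-read the element after the increment so the test sees the stored, truncated value rather than the expression result:
let array = new Uint8Array(10);
/* ... */
for (let i = 9; i >= 0; i--) {
    array[i]++;           // PutValue() stores the truncated value
    if (array[i]) break;  // re-read: only 0 on wrap-around
}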
I’ll be referencing the original C89/C90 standard. The C specification requires a little more work to get to the bottom of the issue. Starting with 3.3.3.1 (Prefix increment and decrement operators):
The value of the operand of the prefix ++ operator is incremented. The result is the new value of the operand after incrementation. The expression ++E is equivalent to (E+=1).
Later in 3.3.16.2 (Compound assignment):
A compound assignment of the form E1 op = E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.
Then finally in 3.3.16 (Assignment operators):
An assignment operator stores a value in the object designated by the left operand. An assignment expression has the value of the left operand after the assignment, but is not an lvalue.
So the result is explicitly the value after assignment. Let’s look at this step by step after rewriting the expression.
int r = (array[0] = array[0] + 1);
In C, all integer operations are performed with at least int
precision. Smaller integers are implicitly promoted to int
before the
operation. The value of array[0] is 255, and, since uint8_t
is smaller
than int
, it gets promoted to int
. Additionally, the literal
constant 1 is also an int
, so there are actually two reasons for this
promotion.
So since these are int
values, the result of the addition is 256, like
in JavaScript. To store the result, this value is then demoted to
uint8_t
and truncated to 0. Finally, this post-assignment 0 is the
result of the expression, not the right-hand result as in JavaScript.
These situations are why I prefer programming languages that have a formal and approachable specification. If there’s no specification and I’m observing undocumented, idiosyncratic behavior, is this just some subtle quirk of the current implementation — e.g. something that might change without notice in the future — or is it intended behavior that I can rely upon for correctness?
Nearly every Bourne-like shell provides a RANDOM environment
variable that evaluates to a random value between 0 and 32,767 (i.e.
15 bits). Assignment to the variable seeds the generator. This variable
is an extension and did not appear in the original Unix Bourne
shell. Despite this, the different Bourne-like shells that implement
it have converged to the same interface, but only the interface.
Each implementation differs in interesting ways. In this article we’ll
explore how $RANDOM
is implemented in various Bourne-like shells.
Unfortunately I was unable to determine the origin of $RANDOM.
Nobody was doing a good job tracking source code changes before the
mid-1990s, so that history appears to be lost. Bash was first released
in 1989, but the earliest version I could find was 1.14.7, released in 1996.
KornShell was first released in 1983, but the earliest source I could
find was from 1993. In both cases $RANDOM already existed. My
guess is that it first appeared in one of these two shells, probably
KornShell.
Update: Quentin Barnes has informed me that his 1986 copy of
KornShell (a.k.a. ksh86) implements $RANDOM
. This predates Bash and
makes it likely that this feature originated in KornShell.
Of all the shells I’m going to discuss, Bash has the most interesting
history. It never made use of srand(3)
/ rand(3)
and instead
uses its own generator — which is generally what I prefer. Prior
to Bash 4.0, it used the crummy linear congruential generator (LCG)
found in the C89 standard:
static unsigned long rseed = 1;

static int
brand ()
{
    rseed = rseed * 1103515245 + 12345;
    return ((unsigned int)((rseed >> 16) & 32767));
}
For some reason it was naïvely decided that $RANDOM
should never
produce the same value twice in a row. The caller of brand()
filters
the output and discards repeats before returning to the shell script.
This actually reduces the quality of the generator further since it
increases correlation between separate outputs.
When the shell starts up, rseed
is seeded from the PID and the current
time in seconds. These values are literally summed and used as the seed.
/* Note: not the literal code, but equivalent. */
rseed = getpid() + time(0);
Subshells, which fork and initally share an rseed
, are given similar
treatment:
rseed = rseed + getpid() + time(0);
Notice there’s no hashing or mixing of these values, so there’s no avalanche effect. That would have prevented shells that start around the same time from having related initial random sequences.
With Bash 4.0, released in 2009, the algorithm was changed to a Park–Miller multiplicative LCG from 1988:
static int
brand ()
{
    long h, l;

    /* can't seed with 0. */
    if (rseed == 0)
        rseed = 123459876;
    h = rseed / 127773;
    l = rseed % 127773;
    rseed = 16807 * l - 2836 * h;
    return ((unsigned int)(rseed & 32767));
}
There’s actually a subtle mistake in this implementation compared to the generator described in the paper. This function will generate different numbers than the paper, and it will generate different numbers on different hosts! More on that later.
This algorithm is a much better choice than the previous LCG. There were many more options available in 2009 compared to 1989, but, honestly, this generator is pretty reasonable for this application. Bash is so slow that you’re never practically going to generate enough numbers for the small state to matter. Since the Park–Miller algorithm is older than Bash, they could have used this in the first place.
I considered submitting a patch to switch to something more modern.
However, given Bash’s constraints, it’s harder said than done.
Portability to weird systems is still a concern, and I expect they’d
reject a patch that started making use of long long
in the PRNG.
They still support pre-ANSI C compilers that don’t have 64-bit
arithmetic.
However, what still really could be improved is seeding. In Bash 4.x here’s what it looks like:
static void
seedrand ()
{
    struct timeval tv;

    gettimeofday (&tv, NULL);
    sbrand (tv.tv_sec ^ tv.tv_usec ^ getpid ());
}
Seeding is both better and worse. It’s better that it’s seeded from a higher resolution clock (milliseconds), so two shells started close in time have more variation. However, it’s “mixed” with XOR, which, in this case, is worse than addition.
For example, imagine two Bash shells started one millisecond apart. Both
tv_usec
and getpid()
are incremented by one. Those increments are
likely to cancel each other out by an XOR, and they end up with the same
seed.
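To make that concrete with made-up numbers, suppose the first shell sees tv_usec = 1000 and PID 5000, and the second sees 1001 and 5001:
unsigned a = 1000 ^ 5000;  /* == 4192 */
unsigned b = 1001 ^ 5001;  /* == 4192, an identical seed */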
Instead, each of those quantities should be hashed before mixing. Here’s
a rough example using my triple32()
hash (adapted to glorious
GNU-style pre-ANSI C):
static unsigned long
hash32 (x)
     unsigned long x;
{
    x ^= x >> 17;
    x *= 0xed5ad4bbUL;
    x &= 0xffffffffUL;
    x ^= x >> 11;
    x *= 0xac4c1b51UL;
    x &= 0xffffffffUL;
    x ^= x >> 15;
    x *= 0x31848babUL;
    x &= 0xffffffffUL;
    x ^= x >> 14;
    return x;
}

static void
seedrand ()
{
    struct timeval tv;

    gettimeofday (&tv, NULL);
    sbrand (hash32 (tv.tv_sec) ^
            hash32 (hash32 (tv.tv_usec) ^ getpid ()));
}
I had said there’s there’s a mistake in the Bash implementation of Park–Miller. Take a closer look at the types and the assignment to rseed:
/* The variables */
long h, l;
unsigned long rseed;
/* The assignment */
rseed = 16807 * l - 2836 * h;
The result of the subtraction can be negative, and that negative
value is converted to unsigned long
. The C standard says
ULONG_MAX + 1
is added to make the value positive. ULONG_MAX
varies by platform — typically long
is either 32 bits or 64 bits —
so the results also vary. Here’s how the paper defined it:
long test;

test = 16807 * l - 2836 * h;
if (test > 0)
    rseed = test;
else
    rseed = test + 2147483647;
As far as I can tell, this mistake doesn’t hurt the quality of the generator.
$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634
$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
In contrast to Bash, Zsh is the most straightforward: defer to
rand(3)
. Its $RANDOM
can return the same value twice in a row,
assuming that rand(3)
does.
zlong
randomgetfn(UNUSED(Param pm))
{
    return rand() & 0x7fff;
}

void
randomsetfn(UNUSED(Param pm), zlong v)
{
    srand((unsigned int)v);
}
A cool feature is that this means you could override it with a custom generator if you wanted.
int
rand(void)
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}
Usage:
$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4
This trick also applies to the rest of the shells below.
KornShell originated in 1983, but it was finally released under an open source license in 2005. There’s a clone of KornShell called Public Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but I’ll get to that next.
KornShell defers to rand(3)
, but it does some additional naïve
filtering on the output. When the shell starts up, it generates 10
values from rand()
. If any of them are larger than 32,767 then it will
shift all generated numbers right by three.
#define RANDMASK 0x7fff

for (n = 0; n < 10; n++) {
    // Don't use lower bits when rand() generates large numbers.
    if (rand() > RANDMASK) {
        rand_shift = 3;
        break;
    }
}
Why not just look at RAND_MAX
? I guess they didn’t think of it.
Update: Quentin Barnes pointed out that RAND_MAX
didn’t exist
until POSIX standardization in 1988. The constant first appeared in
Unix in 1990. This KornShell code either predates the standard
or needed to work on systems that predate the standard.
Like Bash, repeated values are not allowed. I suspect one shell got this idea from the other.
do {
    cur = (rand() >> rand_shift) & RANDMASK;
} while (cur == last);
Who came up with this strange idea first?
I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I
ever touch in practice, and its $RANDOM
is the most interesting of the
pdksh forks — at least since 2014.
Like Zsh, pdksh simply defers to rand(3)
. However, OpenBSD’s rand(3)
is infamously and proudly non-standard. By default it returns
non-deterministic, cryptographic-quality results seeded from system
entropy (via the misnamed arc4random(3)
), à la /dev/urandom
.
Its $RANDOM
inherits this behavior.
setint(vp, (int64_t) (rand() & 0x7fff));
However, if a value is assigned to $RANDOM
in order to seed it, it
reverts to its old pre-2014 deterministic generation via
srand_deterministic(3)
.
srand_deterministic((unsigned int)intval(vp));
OpenBSD’s deterministic rand(3)
is the crummy LCG from the C89
standard, just like Bash 3.x. So if you assign to $RANDOM
, you’ll get
nearly the same results as Bash 3.x and earlier — the only difference
being that it can repeat numbers.
That’s a slick upgrade to the old interface without breaking anything,
making it my favorite version of $RANDOM for any shell.
Most widely-used programming languages have at least one regular conference dedicated to discussing it. Heck, even Lisp has one. It’s a place to talk about the latest developments of the language, recent and upcoming standards, and so on. However, C is a notable exception. Despite its role as the foundation of the entire software ecosystem, there aren’t any regular conferences about C. I have a couple of theories about why.
First, C is so fundamental and ubiquitous that a conference about C would be too general. There are so many different uses ranging across embedded development, operating system kernels, systems programming, application development, and, most recently, web development (WebAssembly). It’s just not a cohesive enough topic. Any conference that might be about C is instead focused on some particular subset of its application. It’s not a C conference, it’s a database conference, or an embedded conference, or a Linux conference, or a BSD conference, etc.
Second, C has a tendency to be conservative, changing and growing very slowly. This is a feature, and one that is often undervalued by developers. (In fact, I’d personally like to see a future revision that makes the C language specification smaller and simpler, rather than accumulate more features.) The last major revision to C happened in 1999 (C99). There was a minor revision in 2011 (C11), and an even smaller revision in 2018 (C17). If there was a C conference, recent changes to the language wouldn’t be a very fruitful topic.
However, the tooling has advanced significantly in recent years, especially with the advent of LLVM and Clang. This is largely driven by the C++ community, and C has significantly benefited as a side effect due to its overlap. Those are topics worthy of conferences, but these are really C++ conferences.
The closest thing we have to a C conference every year is CppCon. A lot of CppCon isn’t really just about C++, and the subjects of many of the talks are easily applied to C, since C++ builds so much upon C. In a sense, a subset of CppCon could be considered a C conference. That’s what I’m looking for when I watch the CppCon presentations each year on YouTube.
Starting last year, I began a list of all the talks that I thought would be useful to C programmers. Some are entirely relevant to C, others just have significant portions that are relevant to C. When someone asks about where they can find a C conference, I send them my list.
I’m sharing them here so you can bookmark this page and never return again.
Here’s the list for CppCon 2017. These are roughly ordered from highest to lowest recommendation:
The final CppCon 2018 videos were uploaded this week, so my 2018 listing can be complete:
There were three talks strictly about C++ that I thought were interesting from a language design perspective. So I think they’re worth recommending, too. (In fact, they’re a sort of ammo against using C++ due to its insane complexity.)
Only three this year. The last is about C++, but I thought it was interesting.
Four more worthwhile talks in 2020. The first is about the C++ abstract machine, but is nearly identical to the C abstract machine. The second is a proverbial warning about builds. The rest are about performance, and while the context is C++ the concepts are entirely applicable to C.
CppCon’s current sponsor interferes with scheduling and video releases, deliberately reducing accessibility to the outside (unlisted videos, uploading talks multiple times, etc.). Since it’s too time-consuming to track it all myself, I’ve given up on following CppCon, at least until they get a better-behaved sponsor.
Finally, here are a few more good presentations from other C++ conferences which you can just pretend are about C:
Once upon a time I wrote a fancy data conversion utility. The input was a complex binary format defined by a data dictionary supplied at run time by the user alongside the input data. Since the converter was typically used to process massive quantities of input, and the nature of that input wasn’t known until run time, I wrote an x86-64 JIT compiler to speed it up. The converter generated a fast, native binary parser in memory according to the data dictionary specification. Processing data now took much less time and everyone rejoiced.
Then along came SELinux, Sheriff of Pedantry. Not liking all the
shenanigans with page protections, SELinux huffed and puffed and made
mprotect(2)
return EACCES
(“Permission denied”). Believing I was
following all the rules and so this would never happen, I foolishly
did not check the result and the converter was now crashing for its
users. What made SELinux so unhappy, and could this somehow be
resolved?
Before going further, let’s back up and review how this works. Suppose I want to generate code at run time and execute it. In the old days this was as simple as writing some machine code into a buffer and jumping to that buffer — e.g. by converting the buffer to a function pointer and calling it.
#include <stdio.h>
#include <stdlib.h>

typedef int (*jit_func)(void);

/* NOTE: This doesn't work anymore! */
jit_func
jit_compile(int retval)
{
    unsigned char *buf = malloc(6);
    if (buf) {
        /* mov eax, retval */
        buf[0] = 0xb8;
        buf[1] = retval >>  0;
        buf[2] = retval >>  8;
        buf[3] = retval >> 16;
        buf[4] = retval >> 24;
        /* ret */
        buf[5] = 0xc3;
    }
    return (jit_func)buf;
}

int
main(void)
{
    jit_func f = jit_compile(1001);
    printf("f() = %d\n", f());
    free(f);
}
This situation was far too easy for malicious actors to abuse. An attacker could supply instructions of their own choosing — i.e. shell code — as input and exploit a buffer overflow vulnerability to execute the input buffer. These exploits were trivial to craft.
Modern systems have hardware checks to prevent this from happening. Memory containing instructions must have their execute protection bit set before those instructions can be executed. This is useful both for making attackers work harder and for catching bugs in programs — no more executing data by accident.
This is further complicated by the fact that memory protections have
page granularity. You can’t adjust the protections for a 6-byte
buffer. You do it for the entire surrounding page — typically 4kB, but
sometimes as large as 2MB. This requires replacing that malloc(3)
with a more careful allocation strategy. There are a few ways to go
about this.
The most common and most sensible is to create an anonymous memory
mapping: a file memory map that’s not actually backed by a file. The
mmap(2)
function has a flag specifically for this purpose:
MAP_ANONYMOUS
.
#include <sys/mman.h>
void *
anon_alloc(size_t len)
{
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
void *p = mmap(0, len, prot, flags, -1, 0);
return p != MAP_FAILED ? p : 0;
}
void
anon_free(void *p, size_t len)
{
munmap(p, len);
}
Unfortunately, MAP_ANONYMOUS
is not part of POSIX. If you’re being super
strict with your includes — as I tend to be — this flag won’t be
defined, even on systems where it’s supported.
#define _POSIX_C_SOURCE 200112L
#include <sys/mman.h>
// MAP_ANONYMOUS undefined!
To get the flag, you must use the _BSD_SOURCE
, or, more recently,
the _DEFAULT_SOURCE
feature test macro to explicitly enable that
feature.
#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#include <sys/mman.h>
The POSIX way to do this is to instead map /dev/zero
. So, wanting to
be Mr. Portable, this is what I did in my tool. Take careful note of
this.
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
void *
anon_alloc(size_t len)
{
int fd = open("/dev/zero", O_RDWR);
if (fd == -1)
return 0;
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_PRIVATE;
void *p = mmap(0, len, prot, flags, fd, 0);
close(fd);
return p != MAP_FAILED ? p : 0;
}
Another, less common (and less portable) strategy is to lean on the
existing C memory allocator, being careful to allocate on page
boundaries so that the page protections don’t affect other allocations.
The classic allocation functions, like malloc(3)
, don’t allow for this
kind of control. However, there are a couple of aligned allocation
alternatives.
The first is posix_memalign(3)
:
int posix_memalign(void **ptr, size_t alignment, size_t size);
By choosing page alignment and a size that’s a multiple of the page
size, it’s guaranteed to return whole pages. When done, pages are freed
with free(3)
. Though, unlike unmapping, the original page protections
must first be restored since those pages may be reused.
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>
void *
anon_alloc(size_t len)
{
void *p;
long pagesize = sysconf(_SC_PAGE_SIZE); // TODO: cache this
size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
return posix_memalign(&p, pagesize, roundup) ? 0 : p;
}
If you’re using C11, there’s also aligned_alloc(3)
. This is the most
uncommon of all since most C programmers refuse to switch to a new
standard until it’s at least old enough to drive a car.
So we’ve allocated our memory, but it’s not going to start in an executable state. Why? Because a W^X (“write xor execute”) policy is becoming increasingly common. Attempting to set both write and execute protections at the same time may be denied. (In fact, there’s an SELinux policy for this.)
As a JIT compiler, we need to write to a page and execute it. Again, there are two strategies. The complicated strategy is to map the same memory at two different places, one with the execute protection, one with the write protection. This allows the page to be modified as it’s being executed without violating W^X.
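Here’s a rough sketch of that dual-mapping approach on Linux using memfd_create(2); it’s an illustration of the idea, not what my tool actually does:
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct dualmap {
    unsigned char *rw;   /* writable view   */
    unsigned char *rx;   /* executable view */
};

/* Map one anonymous file twice: writes through rw become visible
 * through rx, so code can be patched while it runs without ever
 * holding write and execute permission on the same mapping. */
static int
dualmap_alloc(struct dualmap *m, size_t len)
{
    int fd = memfd_create("jit", 0);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, len) == -1) {
        close(fd);
        return -1;
    }
    m->rw = mmap(0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    m->rx = mmap(0, len, PROT_READ|PROT_EXEC,  MAP_SHARED, fd, 0);
    close(fd);  /* the mappings keep the memory alive */
    return m->rw != MAP_FAILED && m->rx != MAP_FAILED ? 0 : -1;
}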
The simpler and more secure strategy is to write the machine
instructions, then swap the page over to executable using mprotect(2)
once it’s ready. This is what I was doing in my tool.
unsigned char *buf = anon_alloc(len);
/* ... write instructions into the buffer ... */
mprotect(buf, len, PROT_EXEC);
jit_func func = (jit_func)buf;
func();
At a high level, that’s pretty close to what I was actually doing. That
includes neglecting to check the result of mprotect(2)
. This worked
fine and dandy for several years, when suddenly (shown here in the style
of strace):
mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)
Then the program would crash trying to execute the buffer. Suddenly it wasn’t allowed to make this buffer executable. My program hadn’t changed. What had changed was the SELinux security policy on this particular system.
The problem is that I don’t administer this (Red Hat) system. I can’t access the logs and I didn’t set the policy. I don’t have any insight on why this call was suddenly being denied. To make this more challenging, the folks that manage this system didn’t have the necessary knowledge to help with this either.
So to figure this out, I need to treat it like a black box and probe at system calls until I can figure out just what SELinux policy I’m up against. I only have practical experience administrating Debian systems (and its derivatives like Ubuntu), which means I’ve hardly ever had to deal with SELinux. I’m flying fairly blind here.
Since my real application is large and complicated, I code up a
minimal example, around a dozen lines of code: allocate a single page
of memory, write a single return (ret
) instruction into it, set it
as executable, and call it. The program checks for errors, and I can
run it under strace if that’s not insightful enough. This program is
also something simple I could provide to the system administrators,
since they were willing to turn some of the knobs to help narrow down
the problem.
However, here’s where I made a major mistake. Assuming the problem
was solely in mprotect(2)
, and wanting to keep this as absolutely
simple as possible, I used posix_memalign(3)
to allocate that page. I
saw the same EACCES
as before, and assumed I was demonstrating the
same problem. Take note of this, too.
Eventually I’d need to figure out what policy was blocking my JIT compiler, then see if there was an alternative route. The system loader still worked after all, and I could plainly see that with strace. So it wasn’t a blanket policy that completely blocked the execute protection. Perhaps the loader was given an exception?
However, the very first order of business was to actually check the
result from mprotect(2)
and do something more graceful rather than
crash. In my case, that meant falling back to executing a byte-code
virtual machine. I added the check, and now the program ran slower
instead of crashing.
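The check itself is tiny; something along these lines, where run_interpreted() stands in for a hypothetical byte-code fallback:
if (mprotect(buf, len, PROT_EXEC) == -1) {
    /* Cannot run native code here; fall back to the slower
     * byte-code virtual machine (run_interpreted is hypothetical). */
    return run_interpreted(program);
}
jit_func func = (jit_func)buf;
return func();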
The program runs on both Linux and Windows, and the allocation and
page protection management is abstracted. On Windows it uses
VirtualAlloc()
and VirtualProtect()
instead of mmap(2)
and
mprotect(2)
. Neither implementation checked that the protection
change succeeded, so I fixed the Windows implementation while I was at
it.
Thanks to Mingw-w64, I actually do most of my Windows
development on Linux. And, thanks to Wine, I mean
everything, including running and debugging. Calling
VirtualProtect()
in Wine would ultimately call mprotect(2)
in the
background, which I expected would be denied. So running the Windows
version with Wine under this SELinux policy would be the perfect test.
Right?
Except that mprotect(2)
succeeded under Wine! The Windows version
of my JIT compiler was working just fine on Linux. Huh?
This system doesn’t have Wine installed. I had built and packaged it myself. This Wine build definitely has no SELinux exceptions. Not only did the Wine loader work correctly, it can change page protections in ways my own Linux programs could not. What’s different?
Debugging this with all these layers is starting to look silly, but this is exactly why doing Windows development on Linux is so useful. I run my program under Wine under strace:
$ strace wine ./mytool.exe
I study the system calls around mprotect(2)
. Perhaps there’s some
stricter alignment issue? No. Perhaps I need to include PROT_READ
?
No. The only difference I can find is they’re using the
MAP_ANONYMOUS
flag. So, armed with this knowledge, I modify my
minimal example to allocate 1024 pages instead of just one, and
suddenly it works correctly. I was most of the way to figuring this
all out.
Why did increasing the allocation size change anything? This is a typical Linux system, so my program is linked against the GNU C library, glibc. This library allocates memory from two places depending on the allocation size.
For small allocations, glibc uses brk(2)
to extend the executable
image — i.e. to extend the .bss
section. These resources are not
returned to the operating system after they’re freed with free(3)
.
They’re reused.
For large allocations, glibc uses mmap(2)
to create a new, anonymous
mapping for that allocation. When freed with free(3)
, that memory is
unmapped and its resources are returned to the operating system.
By increasing the allocation size, it became a “large” allocation and
was backed by an anonymous mapping. Even though I didn’t use mmap(2)
,
to the operating system this would be indistinguishable from what Wine was
doing (and succeeding at).
Consider this little example program:
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    printf("%p\n", malloc(1));
    printf("%p\n", malloc(1024 * 1024));
}
When not compiled as a Position Independent Executable (PIE), here’s what the output looks like. The first pointer is near where the program was loaded, low in memory. The second pointer is a randomly selected address high in memory.
0x1077010
0x7fa9b998e010
And if you run it under strace, you’ll see that the first allocation
comes from brk(2)
and the second comes from mmap(2)
.
With a little bit of research, I found the two SELinux policies
at play here. In my minimal example, I was blocked by allow_execheap
.
/selinux/booleans/allow_execheap
This prohibits programs from setting the execute protection on any “heap” page.
The POSIX specification does not permit it, but the Linux implementation of
mprotect
allows changing the access protection of memory on the heap (e.g., allocated using malloc
). This error indicates that heap memory was supposed to be made executable. Doing this is really a bad idea. If anonymous, executable memory is needed it should be allocated using mmap
which is the only portable mechanism.
Obviously this is pretty loose since I was still able to do it with
posix_memalign(3)
, which, technically speaking, allocates from the
heap. So this policy applies to pages mapped by brk(2)
.
The second policy was allow_execmod
.
/selinux/booleans/allow_execmod
The program mapped from a file with
mmap
and the MAP_PRIVATE
flag and write permission. Then the memory region has been written to, resulting in copy-on-write (COW) of the affected page(s). This memory region is then made executable […]. The mprotect
call will fail with EACCES
in this case.
I don’t understand what purpose this policy serves, but this is what
was causing my original problem. Pages mapped to /dev/zero
are not
actually considered anonymous by Linux, at least as far as this
policy is concerned. I think this is a mistake, and that mapping the
special /dev/zero
device should result in effectively anonymous
pages.
From this I learned a little lesson about baking assumptions — that
mprotect(2)
was solely at fault — into my minimal debugging examples.
And the fix was ultimately easy: I just had to suck it up and use the
slightly less pure MAP_ANONYMOUS
flag.
I recently got an itch to design my own non-cryptographic integer hash function. Firstly, I wanted to better understand how hash functions work, and the best way to learn is to do. For years I’d been treating them like magic, shoving input into it and seeing random-looking, but deterministic, output come out the other end. Just how is the avalanche effect achieved?
Secondly, could I apply my own particular strengths to craft a hash function better than the handful of functions I could find online? Especially the classic ones from Thomas Wang and Bob Jenkins. Instead of struggling with the mathematics, maybe I could software engineer my way to victory, working from the advantage of access to the excessive computational power of today.
Suppose, for example, I wrote a tool to generate a random hash function definition, then JIT compile it to a native function in memory, then execute that function across various inputs to evaluate its properties. My tool could rapidly repeat this process in a loop until it stumbled upon an incredible hash function the world had never seen. That’s what I actually did. I call it the Hash Prospector:
https://github.com/skeeto/hash-prospector
It only works on x86-64 because it uses the same JIT compiling technique I’ve discussed before: allocate a page of memory, write some machine instructions into it, set the page to executable, cast the page pointer to a function pointer, then call the generated code through the function pointer.
My focus is on integer hash functions: a function that accepts an n-bit integer and returns an n-bit integer. One of the important properties of an integer hash function is that it maps its inputs to outputs 1:1. In other words, there are no collisions. If there’s a collision, then some outputs aren’t possible, and the function isn’t making efficient use of its entropy.
This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, then the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input.
For example, XOR with a constant is trivially reversible: XOR the output with the same constant to reverse it. Addition with a constant is reversed by subtraction with the same constant. Since the integer operations are modular arithmetic, modulo 2^n for n-bit integers, multiplication by an odd number is reversible. Odd numbers are coprime with the power-of-two modulus, so there is some modular multiplicative inverse that reverses the operation.
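As an aside (this isn’t part of the prospector), here’s how one might compute that modular inverse for an odd 32-bit multiplier using Newton’s method:
#include <stdint.h>

/* Multiplicative inverse of an odd 32-bit constant modulo 2^32.
 * Starting from x = a (correct to 3 bits), each step doubles the
 * number of correct low bits, so five steps cover all 32. */
static uint32_t
modinv32(uint32_t a)
{
    uint32_t x = a;
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;
    return x;  /* a * modinv32(a) == 1 (mod 2^32) */
}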
Bret Mulvey’s hash function article provides a convenient list of some reversible operations available for constructing integer hash functions. This list was the catalyst for my little project. Here are the ones used by the hash prospector:
x = ~x;
x ^= constant;
x *= constant | 1; // e.g. only odd constants
x += constant;
x ^= x >> constant;
x ^= x << constant;
x += x << constant;
x -= x << constant;
x <<<= constant; // left rotation
I’ve come across a couple more useful operations while studying existing integer hash functions, but I didn’t put these in the prospector.
hash += ~(hash << constant);
hash -= ~(hash << constant);
The prospector picks some operations at random and fills in their constants randomly within their proper constraints. For example, here’s an awful hash function I made it generate as an example:
// do NOT use this!
uint32_t
badhash32(uint32_t x)
{
x *= 0x1eca7d79U;
x ^= x >> 20;
x = (x << 8) | (x >> 24);
x = ~x;
x ^= x << 5;
x += 0x10afe4e7U;
return x;
}
That function is reversible, and it would be relatively straightforward to define its inverse. However, it has awful biases and poor avalanche. How do I know this?
There are two key properties I’m looking for in randomly generated hash functions.
High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.
Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.
Initially I screwed up and only measured the first property. This led to some hash functions that seemed amazing until closer inspection, since, for a 32-bit hash function, it was flipping over 15 output bits on average. However, the particular bits being flipped were heavily biased, resulting in obvious patterns in the output.
For example, when hashing a counter starting from zero, the high bits would follow a regular pattern. 15 to 16 bits were being flipped each time, but it was always the same bits.
Conveniently it’s easy to measure both properties at the same time. For an n-bit integer hash function, create an n by n table initialized to zero. The rows are input bits and the columns are output bits. The ith row and jth column track the correlation between the ith input bit and jth output bit.
Then exhaustively iterate over all 2^n inputs, and flip each bit one at a time. Increment the appropriate element in the table if the output bit flips.
When you’re done, ideally each element in the table is exactly 2^(n-1). That is, each output bit was flipped exactly half the time by each input bit. Therefore the bias of the hash function is the distance (the error) of the computed table from the ideal table.
For example, the ideal bias table for an 8-bit hash function would be:
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
The hash prospector computes the standard deviation in order to turn this into a single, normalized measurement. Lower scores are better.
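In code, filling that table for a 32-bit function might look something like this (a sketch, not the prospector’s exact implementation):
#include <stdint.h>

/* bins[i][j] counts how often flipping input bit i flips output bit j.
 * The caller zero-initializes the table; the ideal count per cell is 2^31. */
static void
bias_table(uint32_t (*hash)(uint32_t), uint64_t bins[32][32])
{
    for (uint64_t x = 0; x < (uint64_t)1 << 32; x++) {
        uint32_t h = hash((uint32_t)x);
        for (int i = 0; i < 32; i++) {
            uint32_t flips = h ^ hash((uint32_t)x ^ ((uint32_t)1 << i));
            for (int j = 0; j < 32; j++)
                bins[i][j] += (flips >> j) & 1;
        }
    }
}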
However, there’s still one problem: the input space for a 32-bit hash function is over 4 billion values. The full test takes my computer about an hour and a half. Evaluating a 64-bit hash function is right out.
Again, Monte Carlo to the rescue! Rather than sample the entire space, just sample a random subset. This provides a good estimate in less than a second, allowing lots of terrible hash functions to be discarded early. The full test can be saved only for the known good 32-bit candidates. 64-bit functions will only ever receive the estimate.
Once I got the bias issue sorted out, and after hours and hours of running, followed up with some manual tweaking on my part, the prospector stumbled across this little gem:
// DO use this one!
uint32_t
prospector32(uint32_t x)
{
x ^= x >> 15;
x *= 0x2c1b3c6dU;
x ^= x >> 12;
x *= 0x297a2d39U;
x ^= x >> 15;
return x;
}
According to a full (i.e. not estimated) bias evaluation, this function beats the snot out of most of the 32-bit hash functions I could find. It even comes out ahead of this well-known hash function that I believe originates from the H2 SQL Database. (Update: Thomas Mueller has confirmed that, indeed, this is his hash function.)
uint32_t
hash32(uint32_t x)
{
x = ((x >> 16) ^ x) * 0x45d9f3bU;
x = ((x >> 16) ^ x) * 0x45d9f3bU;
x = (x >> 16) ^ x;
return x;
}
It’s still an excellent hash function, just slightly more biased than mine.
Very briefly, prospector32()
was the best 32-bit hash function I could
find, and I thought I had a major breakthrough. Then I noticed the
finalizer function for the 32-bit variant of MurmurHash3. It’s
also a 32-bit hash function:
uint32_t
murmurhash32_mix32(uint32_t x)
{
x ^= x >> 16;
x *= 0x85ebca6bU;
x ^= x >> 13;
x *= 0xc2b2ae35U;
x ^= x >> 16;
return x;
}
This one is just barely less biased than mine. So I still haven’t discovered the best 32-bit hash function, only the second best one. :-)
If you’re paying close enough attention, you may have noticed that all three functions above have the same structure. The prospector had stumbled upon it all on its own without knowledge of the existing functions. It may not be so obvious for the second function, but here it is refactored:
uint32_t
hash32(uint32_t x)
{
x ^= x >> 16;
x *= 0x45d9f3bU;
x ^= x >> 16;
x *= 0x45d9f3bU;
x ^= x >> 16;
return x;
}
I hadn’t noticed this until after the prospector had come across it on its own. The pattern for all three is XOR-right-shift, multiply, XOR-right-shift, multiply, XOR-right-shift. There’s something particularly useful about this multiply-xorshift construction (also). The XOR-right-shift diffuses bits rightward and the multiply diffuses bits leftward. I like to think it’s “sloshing” the bits right, left, right, left.
It seems that multiplication is particularly good at diffusion, so it makes perfect sense to exploit it in non-cryptographic hash functions, especially since modern CPUs are so fast at it. Despite this, it’s not used much in cryptography due to issues with completing it in constant time.
I like to think of this construction in terms of a five-tuple. For the three functions it’s the following:
(15, 0x2c1b3c6d, 12, 0x297a2d39, 15) // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16) // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16) // murmurhash32_mix32()
The prospector actually found lots of decent functions following this pattern, especially where the middle shift is smaller than the outer shift. Thinking of it in terms of this tuple, I specifically directed it to try different tuple constants. That’s what I meant by “tweaking.” Eventually my new function popped out with its really low bias.
The prospector has a template option (-p
) if you want to try it
yourself:
$ ./prospector -p xorr,mul,xorr,mul,xorr
If you really have your heart set on certain constants, such as my specific selection of shifts, you can lock those in while randomizing the other constants:
$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15
Or the other way around:
$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr
My function seems a little strange using shifts of 15 bits rather than a nice, round 16 bits. However, changing those constants to 16 increases the bias. Similarly, neither of the two 32-bit constants is a prime number, but nudging those constants to the nearest prime also increases the bias. These parameters really do seem to be a local minimum in the bias, and using prime numbers isn’t important.
So far I haven’t been able to improve on 64-bit hash functions. The main function to beat is SplittableRandom / SplitMix64:
uint64_t
splittable64(uint64_t x)
{
x ^= x >> 30;
x *= 0xbf58476d1ce4e5b9U;
x ^= x >> 27;
x *= 0x94d049bb133111ebU;
x ^= x >> 31;
return x;
}
Here’s its inverse since it’s sometimes useful:
uint64_t
splittable64_r(uint64_t x)
{
x ^= x >> 31 ^ x >> 62;
x *= 0x319642b2d24d8ec3U;
x ^= x >> 27 ^ x >> 54;
x *= 0x96de1b173f119089U;
x ^= x >> 30 ^ x >> 60;
return x;
}
I also came across this function:
uint64_t
hash64(uint64_t x)
{
x ^= x >> 32;
x *= 0xd6e8feb86659fd93U;
x ^= x >> 32;
x *= 0xd6e8feb86659fd93U;
x ^= x >> 32;
return x;
}
Again, these follow the same construction as before. There really is something special about it, and many other people have noticed, too.
Both functions have about the same bias. (Remember, I can only estimate the bias for 64-bit hash functions.) The prospector has found lots of functions with about the same bias, but nothing provably better. Until it does, I have no new 64-bit integer hash functions to offer.
Right now the prospector does a completely random, unstructured search hoping to stumble upon something good by chance. Perhaps it would be worth using a genetic algorithm to breed those 5-tuples towards optimum? Others have had success in this area with simulated annealing.
There’s probably more to exploit from the multiply-xorshift construction that keeps popping up. If anything, the prospector is searching too broadly, looking at constructions that could never really compete no matter what the constants. In addition to everything above, I’ve been looking for good 32-bit hash functions that don’t use any 32-bit constants, but I’m really not finding any with a competitively low bias.
About one week after publishing this article I found an even better hash function. I believe this is the least biased 32-bit integer hash function of this form ever devised. It’s even less biased than the MurmurHash3 finalizer.
// exact bias: 0.17353355999581582
uint32_t
lowbias32(uint32_t x)
{
x ^= x >> 16;
x *= 0x7feb352dU;
x ^= x >> 15;
x *= 0x846ca68bU;
x ^= x >> 16;
return x;
}
// inverse
uint32_t
lowbias32_r(uint32_t x)
{
x ^= x >> 16;
x *= 0x43021123U;
x ^= x >> 15 ^ x >> 30;
x *= 0x1d69e2a5U;
x ^= x >> 16;
return x;
}
If you’re willing to use an additional round of multiply-xorshift, this next function actually reaches the theoretical bias limit (bias = ~0.021) as exhibited by a perfect integer hash function:
// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
x ^= x >> 17;
x *= 0xed5ad4bbU;
x ^= x >> 11;
x *= 0xac4c1b51U;
x ^= x >> 15;
x *= 0x31848babU;
x ^= x >> 14;
return x;
}
It’s statistically indistinguishable from a random permutation of all 32-bit integers.
Some people have been experimenting with using my hash functions in GLSL shaders, and the results are looking good:
Specifying a particular behavior would have put unnecessary burden on implementations — especially in the earlier days of computing — making for inefficient programs on some platforms. For example, if the result of dereferencing a null pointer was defined to trap — to cause the program to halt with an error — then platforms that do not have hardware trapping, such as those without virtual memory, would be required to instrument, in software, each pointer dereference.
In the 21st century, undefined behavior has taken on a somewhat different meaning. Optimizers use it — or abuse it depending on your point of view — to lift constraints that would otherwise inhibit more aggressive optimizations. It’s not so much a fundamentally different application of undefined behavior, but it does take the concept to an extreme.
The reasoning works like this: A program that evaluates a construct whose behavior is undefined cannot, by definition, have any meaningful behavior, and so that program would be useless. As a result, compilers assume programs never invoke undefined behavior and use those assumptions to prove their optimizations.
Under this newer interpretation, mistakes involving undefined behavior are more punishing and surprising than before. Programs that seem to make some sense when run on a particular architecture may actually compile into a binary with a security vulnerability due to conclusions reached from an analysis of its undefined behavior.
This can be frustrating if your programs are intended to run on a very specific platform. In this situation, all behavior really could be locked down and specified in a reasonable, predictable way. Such a language would be like an extended, less portable version of C or C++. But your toolchain still insists on running your program on the abstract machine rather than the hardware you actually care about. However, even in this situation undefined behavior can still be desirable. I will provide a couple of examples in this article.
To start things off, let’s look at one of my all time favorite examples of useful undefined behavior, a situation involving signed integer overflow. The result of a signed integer overflow isn’t just unspecified, it’s undefined behavior. Full stop.
This goes beyond a simple matter of whether or not the underlying machine uses a two’s complement representation. From the perspective of the abstract machine, just the act of a signed integer overflowing is enough to throw everything out the window, even if the overflowed result is never actually used in the program.
On the other hand, unsigned integer overflow is defined — or, more accurately, defined to wrap, not overflow. Both the undefined signed overflow and defined unsigned overflow are useful in different situations.
For example, here’s a fairly common situation, much like what actually happened in bzip2. Consider this function that does substring comparison:
int
cmp_signed(int i1, int i2, unsigned char *buf)
{
for (;;) {
int c1 = buf[i1];
int c2 = buf[i2];
if (c1 != c2)
return c1 - c2;
i1++;
i2++;
}
}
int
cmp_unsigned(unsigned i1, unsigned i2, unsigned char *buf)
{
for (;;) {
int c1 = buf[i1];
int c2 = buf[i2];
if (c1 != c2)
return c1 - c2;
i1++;
i2++;
}
}
In this function, the indices i1
and i2
will always be some small,
non-negative value. Since it’s non-negative, it should be unsigned
,
right? Not necessarily. That puts an extra constraint on code generation
and, at least on x86-64, makes for a less efficient function. Most of
the time you actually don’t want overflow to be defined, and instead
allow the compiler to assume it just doesn’t happen.
The constraint is that the behavior of i1
or i2
overflowing as an
unsigned integer is defined, and the compiler is obligated to implement
that behavior. On x86-64, where int
is 32 bits, the result of the
operation must be truncated to 32 bits one way or another, requiring
extra instructions inside the loop.
In the signed case, incrementing the integers cannot overflow since that would be undefined behavior. This permits the compiler to perform the increment only in 64-bit precision without truncation if it would be more efficient, which, in this case, it is.
Here’s the output of Clang 6.0.0 with -Os
on x86-64. Pay close
attention to the main loop, which I named .loop
:
cmp_signed:
movsxd rdi, edi ; use i1 as a 64-bit integer
mov al, [rdx + rdi]
movsxd rsi, esi ; use i2 as a 64-bit integer
mov cl, [rdx + rsi]
jmp .check
.loop: mov al, [rdx + rdi + 1]
mov cl, [rdx + rsi + 1]
inc rdx ; increment only the base pointer
.check: cmp al, cl
je .loop
movzx eax, al
movzx ecx, cl
sub eax, ecx ; return c1 - c2
ret
cmp_unsigned:
mov eax, edi
mov al, [rdx + rax]
mov ecx, esi
mov cl, [rdx + rcx]
cmp al, cl
jne .ret
inc edi
inc esi
.loop: mov eax, edi ; truncated i1 overflow
mov al, [rdx + rax]
mov ecx, esi ; truncated i2 overflow
mov cl, [rdx + rcx]
inc edi ; increment i1
inc esi ; increment i2
cmp al, cl
je .loop
.ret: movzx eax, al
movzx ecx, cl
sub eax, ecx
ret
As unsigned values, i1
and i2
can overflow independently, so they
have to be handled as independent 32-bit unsigned integers. As signed
values they can’t overflow, so they’re treated as if they were 64-bit
integers and, instead, the pointer, buf
, is incremented without
concern for overflow. The signed loop is much more efficient (5
instructions versus 8).
The signed integer helps to communicate the narrow contract of the
function — the limited range of i1
and i2
— to the compiler. In a
variant of C where signed integer overflow is defined (i.e. -fwrapv
),
this capability is lost. In fact, using -fwrapv
deoptimizes the signed
version of this function.
Side note: Using size_t
(an unsigned integer) is even better on x86-64
for this example since it’s already 64 bits and the function doesn’t
need the initial sign/zero extension. However, this might simply move
the sign extension out to the caller.
Another controversial undefined behavior is strict aliasing. This particular term doesn’t actually appear anywhere in the C specification, but it’s the popular name for C’s aliasing rules. In short, variables with types that aren’t compatible are not allowed to alias through pointers.
Here’s the classic example:
int
foo(int *a, int *b)
{
*b = 0; // store
*a = 1; // store
return *b; // load
}
Naively one might assume the return *b
could be optimized to a simple
return 0
. However, since a
and b
have the same type, the compiler
must consider the possibility that they alias — that they point to the
same place in memory — and must generate code that works correctly under
these conditions.
If foo
has a narrow contract that forbids a
and b
to alias, we
have a couple of options for helping our compiler.
First, we could manually resolve the aliasing issue by returning 0 explicitly. In more complicated functions this might mean making local copies of values, working only with those local copies, then storing the results back before returning. Then aliasing would no longer matter.
int
foo(int *a, int *b)
{
*b = 0;
*a = 1;
return 0;
}
Second, C99 introduced a restrict
qualifier to communicate to the
compiler that pointers passed to functions cannot alias. For example,
the pointers to memcpy()
are qualified with restrict
as of C99.
Passing aliasing pointers through restrict
parameters is undefined
behavior, e.g. this doesn’t ever happen as far as a compiler is
concerned.
int foo(int *restrict a, int *restrict b);
The third option is to design an interface that uses incompatible
types, exploiting strict aliasing. This happens all the time, usually
by accident. For example, int
and long
are never compatible even
when they have the same representation.
int foo(int *a, long *b);
If you use an extended or modified version of C without strict
aliasing (-fno-strict-aliasing
), then the compiler must assume
everything aliases all the time, generating a lot more precautionary
loads than necessary.
What irritates a lot of people is that compilers will still apply the strict aliasing rule even when it’s trivial for the compiler to prove that aliasing is occurring:
/* note: forbidden */
long a;
int *b = (int *)&a;
It’s not just a simple matter of making exceptions for these cases. The language specification would need to define all the rules about when and where incompatible types are permitted to alias, and developers would have to understand all these rules if they wanted to take advantage of the exceptions. It can’t just come down to trusting that the compiler is smart enough to see the aliasing when it’s sufficiently simple. It would need to be carefully defined.
Besides, there are probably conforming, portable solutions that, with contemporary compilers, will safely compile to the efficient code you actually want anyway.
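For example, type punning through memcpy() is the usual conforming approach, and compilers reliably reduce it to a plain register move (my illustration, not code from any particular library):
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer without violating
 * the aliasing rules; compilers compile this to a single move. */
static uint32_t
float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}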
There is one special exception for strict aliasing: char *
is
allowed to alias with anything. This is important to keep in mind both
when you intentionally want aliasing, but also when you want to avoid
it. Writing through a char *
pointer could force the compiler to
generate additional, unnecessary loads.
In fact, there’s a whole dimension to strict aliasing that, even today,
no compiler yet exploits: uint8_t
is not necessarily unsigned char
.
That’s just one possible typedef
definition for it. It could instead
typedef
to, say, some internal __byte
type.
In other words, technically speaking, uint8_t
does not have the strict
aliasing exemption. If you wanted to write bytes to a buffer without
worrying the compiler about aliasing issues with other pointers, this
would be the tool to accomplish it. Unfortunately, so much existing
code violates this part of strict aliasing that no toolchain is
willing to exploit it for optimization purposes.
Some kinds of undefined behavior don’t have performance or portability benefits. They’re only there to make the compiler’s job a little simpler. Today, most of these are caught trivially at compile time as syntax or semantic issues (i.e. a pointer cast to a float).
Some others are obvious about their performance benefits and don’t require much explanation. For example, it’s undefined behavior to index out of bounds (with some special exceptions for one past the end), meaning compilers are not obligated to generate those checks, instead relying on the programmer to arrange, by whatever means, that it doesn’t happen.
Undefined behavior is like nitro, a dangerous, volatile substance that makes things go really, really fast. You could argue that it’s too dangerous to use in practice, but the aggressive use of undefined behavior is not without merit.
The ptrace(2)
(“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing strace — system call trace. With Ptrace, tracers
can pause tracees, inspect and set registers and memory, monitor
system calls, or even intercept system calls.
By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.
The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.
For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. For the article I’ll also be omitting error checks, but the full source code listings will have them.
You can find runnable code for the examples in this article here:
https://github.com/skeeto/ptrace-examples
Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.
Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The ptrace(2)
prototype generally looks something like this, though the specific
types may be different.
long ptrace(int request, pid_t pid, void *addr, void *data);
The pid
is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.
The request
field selects a specific Ptrace function, just like the
ioctl(2)
interface. For strace, only a few are needed:
PTRACE_TRACEME: This process is to be traced by its parent.
PTRACE_SYSCALL: Continue, but stop at the next system call entrance or exit.
PTRACE_GETREGS: Get a copy of the tracee’s registers.
The other two fields, addr
and data
, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.
The strace interface is essentially a prefix to another command.
$ strace [strace options] program [arguments]
My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is fork(2)
and exec(2)
the
tracee process on the tail of argv
. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.
pid_t pid = fork();
switch (pid) {
case -1: /* error */
FATAL("%s", strerror(errno));
case 0: /* child */
ptrace(PTRACE_TRACEME, 0, 0, 0);
execvp(argv[1], argv + 1);
FATAL("%s", strerror(errno));
}
The parent waits for the child’s PTRACE_TRACEME
using wait(2)
. When
wait(2)
returns, the child will be paused.
waitpid(pid, 0, 0);
Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
PTRACE_O_TRACEFORK
.
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);
All that’s left is a simple, endless loop that catches system calls one at a time. The body of the loop has four steps:
The PTRACE_SYSCALL
request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
wait(2)
is needed to wait for the tracee to enter the desired state.
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
When wait(2)
returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, the operating system has not yet serviced this system call.
This detail will be important later.
The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, the system call number is
passed in rax
, and the arguments (up to 6) are passed in
rdi
, rsi
, rdx
, r10
, r8
, and r9
. Reading the registers is
another Ptrace call, though there’s no need to wait(2)
since the
tracee isn’t changing state.
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;
fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
syscall,
(long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
(long)regs.r10, (long)regs.r8, (long)regs.r9);
There’s one caveat. For internal kernel purposes, the system
call number is stored in orig_rax
rather than rax
. All the other
system call arguments are straightforward.
Next it’s another PTRACE_SYSCALL
and wait(2)
, then another
PTRACE_GETREGS
to fetch the result. The result is stored in rax
.
ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);
The output from this simple program is very crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use process_vm_readv(2)
to
read those buffers from the tracee in order to print them appropriately.
However, this does lay the groundwork for system call interception.
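As a hedged sketch of that idea (which register holds the pointer and which the length depends on the particular system call), copying a buffer out of the tracee might look like this:
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy len bytes from the tracee's address space at addr into buf,
 * e.g. to print the buffer passed to write(2). */
static ssize_t
read_tracee(pid_t pid, void *buf, long addr, size_t len)
{
    struct iovec local  = {buf, len};
    struct iovec remote = {(void *)addr, len};
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}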
Suppose we want to use Ptrace to implement something like OpenBSD’s
pledge(2)
, in which a process pledges to use only a
restricted set of system calls. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input
and only a small set of system calls is needed.
Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.
Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call exit(2)
in
the tracer, since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.
The tricky part is that there’s no way to abort a system call once
it’s started. When the tracer returns from wait(2)
on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.
However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” EPERM
error
in errno
via the normal in-band signaling.
for (;;) {
/* Enter next system call */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
/* Is this system call permitted? */
int blocked = 0;
if (is_syscall_blocked(regs.orig_rax)) {
blocked = 1;
regs.orig_rax = -1; // set to invalid syscall
ptrace(PTRACE_SETREGS, pid, 0, &regs);
}
/* Run system call and stop on exit */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
if (blocked) {
/* errno = EPERM */
regs.rax = -EPERM; // Operation not permitted
ptrace(PTRACE_SETREGS, pid, 0, &regs);
}
}
This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (open(2)
) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.
How could the tracee communicate to the tracer? Use an artificial system call!
For my new pledge-like system call — which I call xpledge()
to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.
#define SYS_xpledge 10000
Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s pledge(2)
, which uses a string interface.
Actually designing robust and secure sets of privileges is really
complicated, as the pledge(2)
manpage shows. Here’s the entire
interface and implementation of the system call for the tracee:
#define _GNU_SOURCE
#include <unistd.h>
#define XPLEDGE_RDWR (1 << 0)
#define XPLEDGE_OPEN (1 << 1)
#define xpledge(arg) syscall(SYS_xpledge, arg)
If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. brk(2)
). The
XPLEDGE_RDWR
bit allows various read and write system calls
(read(2)
, readv(2)
, pread(2)
, preadv(2)
, etc.). The
XPLEDGE_OPEN
bit allows open(2)
.
To prevent privileges from being escalated back, xpledge() blocks
blocks
itself — though this also prevents dropping more privileges later down
the line.
In the xpledge tracer, I just need to check for this system call:
/* Handle entrance */
switch (regs.orig_rax) {
case SYS_xpledge:
register_pledge(regs.rdi);
break;
}
The operating system will return ENOSYS
(Function not implemented)
since this isn’t a real system call. So on the way out I overwrite
this with a success (0).
/* Handle exit */
switch (regs.orig_rax) {
case SYS_xpledge:
ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
break;
}
I wrote a little test program that opens /dev/urandom
, makes a read,
tries to pledge, then tries to open /dev/urandom
a second time, then
confirms it can read from the original /dev/urandom
file descriptor.
Running without a pledge tracer, the output looks like this:
$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:
$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
The pledge succeeds but the second fopen(3)
does not since the tracer
blocked it with EPERM
.
This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.
Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.
You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.
The Linux version of Ptrace has had a more efficient operation for
this technique since 2005: PTRACE_SYSEMU
. Ptrace stops only once
per system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.
for (;;) {
ptrace(PTRACE_SYSEMU, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
switch (regs.orig_rax) {
case OS_read:
/* ... */
case OS_write:
/* ... */
case OS_open:
/* ... */
case OS_exit:
/* ... */
/* ... and so on ... */
}
}
To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this PTRACE_SYSEMU
tracer, a loader (to take the place of exec(2)
), and whatever system
libraries the binary needs (or only run static binaries).
In fact, this sounds like a fun weekend project.
In this article I’ll give my definition for minimalist C API, then take you through some of my own recent examples.
A minimalist C library would generally have these properties.
This one’s pretty obvious. More functions means more surface area in the interface. Since these functions typically interact, the relationship between complexity and number of functions will be superlinear.
The library mustn’t call malloc()
internally. It’s up to the caller
to allocate memory for the library. What’s nice about this is that
it’s completely up to the application exactly how memory is allocated.
Maybe it’s using a custom allocator, or it’s not linked against the
standard library.
A common approach is for the application to provide allocation functions
to the library — e.g. function pointers at run time, or define functions
with specific, expected names. The library would call these instead of
malloc()
and free()
. While that’s perfectly reasonable, it’s not
really minimalist, so I’m not including this technique.
Instead a minimalist API is designed such that it’s natural for the application to make the allocations itself. Perhaps the library only needs a single, fixed allocation for all its operations. Or maybe the application specifies its requirements and the library communicates how much memory is needed to meet those requirements. I’ll give specific examples shortly.
One nice result of this property is that it eliminates one of the common failure conditions: the out of memory error. If the library doesn’t allocate memory, then it can’t run out of it!
Another convenient, minor outcome is the lack of casts from void *
to
the appropriate type (e.g. on the return from malloc()
). These casts
are implicit in C but must be made explicit in C++. Often, completely by
accident, my minimalist C libraries can be compiled as C++ without any
changes. This is only a minor benefit since these casts could be made
explicit in C, too, if C++ compatibility was desired. It’s just ugly.
In simple terms, the library mustn’t use functions from stdio.h
—
with the exception of the sprintf()
family. Like with memory
allocation, it leaves input and output to the application, letting it
decide exactly how, where, and when information comes and goes.
Like with memory allocation, maybe the application prefers not to use the C standard library’s buffered IO. Perhaps the application is using cooperative or green threads, and it would be bad for the library to block internally on IO.
Also like avoiding memory allocation, a library that doesn’t perform IO can’t have IO errors. Combined, this means it’s quite possible that a minimalist library may have no error cases at all. Eliminating those error handling paths makes the library a lot simpler. The one major error condition left that’s difficult to eliminate is those pesky integer overflow checks.
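For instance, the sort of check that survives in an otherwise error-free library is a guarded size computation (my illustration):
#include <stddef.h>
#include <stdint.h>

/* Return width * height, or 0 if the multiplication would overflow. */
static size_t
checked_area(size_t width, size_t height)
{
    if (height && width > SIZE_MAX / height)
        return 0;
    return width * height;
}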
Communicating IO preferences to libraries can be a real problem with
C, since the standard library lacks generic input and output. Putting
FILE *
pointers directly into an API mingles it with the C standard
library in potentially bad ways. Passing file names as strings is an
option, but this limits IO to files — versus, say, sockets. On POSIX
systems, at least it could talk about IO in terms of file descriptors,
but even that’s not entirely flexible — e.g. output to a memory
buffer, or anything not sufficiently file-like.
Again, a common way to deal with this is for the application to provide IO function pointers to the library. But a minimalist library’s API would be designed such that not even this is needed, instead operating strictly on buffers. I’ll also have a couple examples of this shortly.
With IO and memory allocation out of the picture, another frequent,
accidental result is no dependency on the C standard library. The only
significant functionality left in the standard library are the
mathematical functions (math.h
), float parsing, and a few of
the string functions (string.h
), like memset()
and memmove()
.
These are valuable since they’re handled specially by the
compiler.
More types means more complexity, perhaps even more so than having lots of functions. Some minimalist libraries can be so straightforward that they can operate solely on simple, homogeneous buffers. I’ll show some examples of this, too.
As I said initially, minimalism is about interface, not implementation. The library is free to define as many structures internally as it needs since the application won’t be concerned with them.
One common way to avoid complicated types in an API is to make them opaque. The structures aren’t defined in the API, and instead the application only touches pointers, making them like handles.
struct foo;
struct foo *foo_create(...);
int foo_method(struct foo *, ...);
void foo_destroy(struct foo *);
However, this is difficult to pull off when the library doesn’t allocate its own memory.
The first example is a library for creating bitmap (BMP) images. As you may already know, I strongly prefer Netpbm, which is so simple that it doesn’t even need a library. But nothing is quite so universally supported as BMP.
24-bit BMP (Bitmap) ANSI C header library
This library is a perfect example of minimalist properties 2, 3, and 4. It also doesn’t use any of the C standard library, though only by accident.
It’s not a general purpose BMP library. It only supports 24-bit true
color, ignoring most BMP features such as palettes. Color is
represented as a 24-bit integer, packed 0xRRGGBB
.
unsigned long bmp_size(long width, long height);
void bmp_init(void *, long width, long height);
void bmp_set(void *, long x, long y, unsigned long color);
unsigned long bmp_get(const void *, long x, long y);
Strictly speaking, even the bmp_get()
function could be tossed since
the library is not intended to load external bitmap images. The
application really shouldn’t need to read back previously set pixels.
There is no allocation, no IO, and no data structures. The application indicates the dimensions of image it wants to create, and the library says how large of a buffer it needs. The remaining functions all operate on this opaque buffer. To write the image out, the application only needs to dump the buffer to a file.
Here’s a complete, strict error checking example of its usage:
#define RED 0xff0000UL
#define BLUE 0x0000ffUL
unsigned long size = bmp_size(width, height);
if (!size || size > SIZE_MAX) die("invalid dimensions");
void *bmp = calloc(size, 1);
if (!bmp) die("out of memory");
bmp_init(bmp, width, height);
/* Checkerboard pattern */
for (long y = 0; y < height; y++)
for (long x = 0; x < width; x++)
bmp_set(bmp, x, y, x % 2 == y % 2 ? RED : BLUE);
if (!fwrite(bmp, size, 1, out))
die("output error");
free(bmp);
The only library function that can fail is bmp_size()
. When the
given image dimensions would overflow one of the BMP header fields, it
returns zero to indicate the error.
In bmp_set()
, how does it know the dimensions of the image so that
it can find the pixel? It reads that from the buffer just like a BMP
reader would — and in an endian-agnostic manner. There are
no bounds checks — that’s the caller’s job — so it only needs to read
the image’s width in order to find the pixel’s location.
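To make that concrete, here is a hypothetical sketch of such a lookup. It is not the library’s actual implementation; it assumes a 54-byte header, a little-endian width field at offset 18, 24-bit BGR pixels, rows padded to 4 bytes, and top-down row order so that only the width is needed:
void
bmp_set(void *buf, long x, long y, unsigned long color)
{
    unsigned char *p = buf;
    /* Read the width back out of the header bmp_init() wrote, one
     * byte at a time, so host endianness never matters. */
    long width = (long)((unsigned long)p[18] <<  0 |
                        (unsigned long)p[19] <<  8 |
                        (unsigned long)p[20] << 16 |
                        (unsigned long)p[21] << 24);
    long pitch = (width * 3 + 3) / 4 * 4;   /* padded row size */
    unsigned char *px = p + 54 + y * pitch + x * 3;
    px[0] = (unsigned char)(color >>  0);   /* blue  */
    px[1] = (unsigned char)(color >>  8);   /* green */
    px[2] = (unsigned char)(color >> 16);   /* red   */
}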
Since IO is under control of the application, it can always choose to load the original buffer contents back from a file, allowing a minimal sort of BMP loading. However, this only works for trusted input as there are no validation checks on the buffer.
The second example is an integer hash set library. It uses closed hashing. I initially wrote this for an r/dailyprogrammer solution and then formalized it into a little reusable library.
C99 32-bit integer hash set header library
Here’s the entire API:
int set32_z(uint32_t max);
void set32_insert(uint32_t *table, int z, uint32_t v);
void set32_remove(uint32_t *table, int z, uint32_t v);
int set32_contains(uint32_t *table, int z, uint32_t v);
Again, it’s a good example of properties 2, 3, and 4. Like the BMP
library, the application indicates the maximum number of integers it
will store in the hash set, and the library returns the power of two
number of uint32_t
it needs to allocate (and zero-initialize).
In this API I’m just barely skirting not defining a data structure. The caller must pass both the table pointer and the power of two size, and these two values would normally be bundled together into a structure.
int z = set32_z(max);
unsigned long long n = 1ULL << z;
if (n > SIZE_MAX) die("table too large");
uint32_t *table = calloc(sizeof(*table), n);
if (!table) die("out of memory");
set32_insert(table, z, value);
if (set32_contains(table, z, value))
/* ... */;
set32_remove(table, z, value);
free(table);
Iteration is straightforward, which is why it’s not in the API: visit each element in the allocated buffer. Zeroes are empty slots.
If a different maximum number of elements is needed, the application initializes a new, separate table, then iterates over the old table inserting each integer in turn.
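For instance, growing a set to a new maximum might look like the following sketch. It builds on the usage example above and is not part of the library; new_max is a hypothetical variable, and the SIZE_MAX check from the earlier example is omitted for brevity:
int newz = set32_z(new_max);
uint32_t *newtable = calloc((size_t)1 << newz, sizeof(*newtable));
if (!newtable) die("out of memory");
for (size_t i = 0; i < ((size_t)1 << z); i++)
    if (table[i])                        /* zero marks an empty slot */
        set32_insert(newtable, newz, table[i]);
free(table);
table = newtable;
z = newz;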
Perhaps the most interesting part of the API is that it has no errors. No function can fail.
Also, like the BMP library, it accidentally doesn’t use the standard
library, except for a typedef
from stdint.h
.
Nearly a decade ago I cloned the RinkWorks Fantasy Name Generator in Perl. That version was slow and terrible, and I’m sometimes tempted to just delete it.
A few years later I rewrote it in JavaScript using an entirely different approach. In order to improve performance, it has a template compilation step. The compiled template is a hierarchical composition of simple generator objects. It’s much faster, and easily enabled some extensions to the syntax.
Germán Méndez Bravo ported the JavaScript version to C++. This C++ implementation was recently adopted into IVAN, a roguelike game.
This recent commotion made me realize something: I hadn’t yet implemented it in C! So I did.
Fantasy name generator ANSI C header library
The entire API is just a single function with four possible return values. It’s a perfect example of minimalist property 1.
#define NAMEGEN_SUCCESS 0
#define NAMEGEN_TRUNCATED 1 /* Output was truncated */
#define NAMEGEN_INVALID 2 /* Pattern is invalid */
#define NAMEGEN_TOO_DEEP 3 /* Exceeds maximum nesting depth */
int namegen(char *dest,
size_t len,
const char *pattern,
unsigned long *seed);
There’s no template compilation step, and it generates names straight from the template.
There are three kinds of errors.
If the output buffer wasn’t large enough, it warns about the name being truncated.
The template could be invalid — e.g. incorrectly paired brackets.
The template could have too much nesting. I decided to hard code the maximum nesting depth to a generous 32 levels. This limitation makes the generator a lot simpler without any practical impact. It also protects against unbounded memory usage — particularly stack overflows — by arbitrarily complex patterns. This means it’s perfectly safe to generate names from untrusted, arbitrarily long input patterns.
Here’s a usage example:
char name[64];
unsigned long seed = 0xb9584b61UL;
namegen(name, sizeof(name), "!sV'i (the |)!id", &seed);
/* name = "Engia'pin the Doltolph" */
The generator supports UTF-8, almost by accident. (I’d have to go out of my way not to support it.)
Despite the lack of a compilation step, which means the template is parsed for each generated name, it’s an order of magnitude faster than the C++ version, which caught me by surprise. The high performance is due to name generation being a single pass over the template using reservoir sampling.
Internally it maintains a stack of “reset” pointers, each pointing
into the output buffer where the current nesting level began its
output. Each time it hits an alternation (|
), it generates a random
number and decides whether or not to use the new option. The first
time it’s a 1/2 chance it chooses the new option. The second time, a
1/3 chance. The third time a 1/4 chance, and so on. When the new
option is selected, the reset pointer is used to “undo” any previous
output for the current nesting level.
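Stripped of the output-buffer handling, the selection rule is ordinary reservoir sampling. The following standalone sketch is not the library’s code; it substitutes rand() for the library’s internal PRNG just to stay self-contained:
#include <stdio.h>
#include <stdlib.h>

/* Pick uniformly among options as they stream by: keep the first,
 * then replace the current pick with the nth option seen with
 * probability 1/n. */
static const char *
choose(const char **options, int count)
{
    const char *pick = 0;
    for (int n = 1; n <= count; n++)
        if (rand() % n == 0)          /* 1/1, then 1/2, then 1/3, ... */
            pick = options[n - 1];
    return pick;
}

int
main(void)
{
    const char *options[] = {"the ", ""};   /* e.g. the "(the |)" group */
    srand(0xb9584b61UL);
    puts(choose(options, 2));
    return 0;
}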
The reservoir sampling means it needs to generate more random numbers
(once per option) than the JavaScript and C++ versions (once per nesting
level). However, it uses its own, fast internal PRNG rather than
rand()
. Generating these random numbers is basically free.
Not using rand()
means that, like the previous libraries, it doesn’t
need anything from the standard library. It also has better quality
results since the typical standard library rand()
is total rubbish,
both in terms of speed and quality (and typically has a PLT
penalty). Finally it means the results are identical across all
platforms for the same template and seed, which is one reason it’s part
of the API.
Another slight performance boost comes from the representation of
pattern substitutions, i.e. i
will select a random “idiot” name from a
fixed selection of strings. The obvious representation is an array of
string pointers, as seen in the C++ version. However, there are a lot of
these little strings, which makes for a lot of pointers cluttering up
the relocation table. Instead, I packed it all into a few small
pointerless tables, which on x86-64 are accessed efficiently via
RIP-relative addressing. It’s efficient, though not friendly to
modification.
I’m very happy with how this library turned out.
The last example is a UTF-7 encoder and decoder. UTF-7 is a method for encoding arbitrary Unicode text within ASCII text, created as a nasty hack to allow Unicode messages to be sent over ASCII-limited email infrastructure. The gist of it is that the Unicode parts of a message are encoded as UTF-16, then base64 encoded, then interpolated into the ASCII stream between delimiters.
Einstein (allegedly) said “If you can’t explain it to a six year old, you don’t understand it yourself.” The analog for programming is to replace the six year old with a computer, and explaining an idea to a computer is done by writing a program. I wanted to understand UTF-7, so I implemented it.
A UTF-7 stream encoder and decoder in ANSI C
Here’s the entire API. It’s modeled a little after the zlib API.
/* utf7_encode() special code points */
#define UTF7_FLUSH -1L
/* return codes */
#define UTF7_OK -1
#define UTF7_FULL -2
#define UTF7_INCOMPLETE -3
#define UTF7_INVALID -4
struct utf7 {
char *buf;
size_t len;
/* then some "private" internal fields */
};
void utf7_init(struct utf7 *, const char *indirect);
int utf7_encode(struct utf7 *, long codepoint);
long utf7_decode(struct utf7 *);
Finally a library that defines a structure! The other fields (not shown)
hold important state information, but the application is only concerned
with buf
and len
: an input or output buffer. The same structure is
used for encoding and decoding, though only for one task at a time.
Following the minimalist library principle, there is no memory
allocation. When encoding a UTF-7 stream, the application’s job is to
point buf
to an output buffer, indicating its length with len
. Then
it feeds code points one at a time into the encoder. When the output is
full, it returns UTF7_FULL
. The application must provide a new buffer
and try again.
This example usage is more complicated than I anticipated it would be. Properly pumping code points through the encoder requires a loop (or at least a second attempt).
char buffer[1024];
struct utf7 ctx;
utf7_init(&ctx, 0);
ctx.buf = buffer;
ctx.len = sizeof(buffer);
/* Assumes "wide character" input is Unicode */
for (;;) {
wint_t c = fgetwc(stdin);
if (c == WEOF)
break;
while (utf7_encode(&ctx, c) != UTF7_OK) {
/* Flush output and reset buffer */
fwrite(buffer, sizeof(buffer), 1, stdout);
ctx.buf = buffer;
ctx.len = sizeof(buffer);
}
}
/* Flush all pending output */
while (utf7_encode(&ctx, UTF7_FLUSH) != UTF7_OK) {
fwrite(buffer, sizeof(buffer), 1, stdout);
ctx.buf = buffer;
ctx.len = sizeof(buffer);
}
/* Write remaining output */
fwrite(buffer, sizeof(buffer) - ctx.len, 1, stdout);
/* Check for errors */
if (fflush(stdout))
die("output error");
if (ferror(stdin))
die("input error");
Flushing (UTF7_FLUSH
) is necessary since, due to base64 encoding,
adjacent Unicode characters usually share a base64 character. Just
because a code point was absorbed into the encoder doesn’t mean it was
written into the output buffer. The encoding for that character may
depend on the next character to come. The special “flush” input forces
this out. It’s valid to flush in the middle of a stream, though this
may penalize encoding efficiency (e.g. the output may be
larger than necessary).
It’s not possible for the encoder to fail, so there are no error conditions to worry about from the library.
Decoding is a different matter. It works almost in reverse from the
encoder: buf
points to the input and the decoder is pumped to
return one code point at a time. It returns one of:
A non-negative value: a valid code point (including ASCII).
UTF7_OK
: Input was exhausted. Stopping here would be valid. This is
what you should get when there’s no more input.
UTF7_INVALID
: The input was invalid. buf
points at the invalid byte.
UTF7_INCOMPLETE
: Input was exhausted, but more is expected. If
there is no more input, then the input must have been truncated,
which is an error.
So there are two possible errors for two kinds of invalid input. Parsing errors are unavoidable when parsing input.
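For completeness, a decode loop under this API might look like the following sketch. It assumes the whole input happens to fit in one buffer; input, input_len, and process() are hypothetical:
struct utf7 ctx;
utf7_init(&ctx, 0);
ctx.buf = input;
ctx.len = input_len;
for (;;) {
    long c = utf7_decode(&ctx);
    if (c == UTF7_OK)
        break;                      /* input exhausted at a valid point */
    if (c == UTF7_INVALID)
        die("invalid input");
    if (c == UTF7_INCOMPLETE)
        die("truncated input");     /* no more input is coming */
    process(c);                     /* c is the next code point */
}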
Again, this library accidentally doesn’t require the standard library. It doesn’t even depend on the compiler’s locale being compatible with ASCII since none of its internal tables use string or character literals. It behaves exactly the same across all conforming platforms.
I had a few more examples in mind, but this article has gone on long enough.
Instead I’ll save these for other articles!
Over on GitHub, David Yu has an interesting performance benchmark for function calls of various Foreign Function Interfaces (FFI):
https://github.com/dyu/ffi-overhead
He created a shared object (.so
) file containing a single, simple C
function. Then for each FFI he wrote a bit of code to call this function
many times, measuring how long it took.
For the C “FFI” he used standard dynamic linking, not dlopen()
. This
distinction is important, since it really makes a difference in the
benchmark. There’s a potential argument about whether or not this is a
fair comparison to an actual FFI, but, regardless, it’s still
interesting to measure.
The most surprising result of the benchmark is that LuaJIT’s FFI is substantially faster than C. It’s about 25% faster than a native C function call to a shared object function. How could a weakly and dynamically typed scripting language come out ahead on a benchmark? Is this accurate?
It’s actually quite reasonable. The benchmark was run on Linux, so the performance penalty we’re seeing comes from the Procedure Linkage Table (PLT). I’ve put together a really simple experiment to demonstrate the same effect in plain old C:
https://github.com/skeeto/dynamic-function-benchmark
Here are the results on an Intel i7-6700 (Skylake):
plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call
These are three different types of function calls: a call routed through the PLT, an indirect call through a function pointer obtained with dlsym(3), and a direct call via JIT-compiled code. As shown, the last one is the fastest. It’s typically not an option for C programs, but it’s natural in the presence of a JIT compiler, including, apparently, LuaJIT.
In my benchmark, the function being called is named empty()
:
void empty(void) { }
And to compile it into a shared object:
$ cc -shared -fPIC -Os -o empty.so empty.c
Just as in my PRNG shootout, the benchmark calls this function repeatedly as many times as possible before an alarm goes off.
When a program or library calls a function in another shared object, the compiler cannot know where that function will be located in memory. That information isn’t known until run time, after the program and its dependencies are loaded into memory. These are usually at randomized locations — e.g. Address Space Layout Randomization (ASLR).
How is this resolved? Well, there are a couple of options.
One option is to make a note about each such call in the binary’s metadata. The run-time dynamic linker can then patch in the correct address at each call site. How exactly this would work depends on the particular code model used when compiling the binary.
The downside to this approach is slower loading, larger binaries, and less sharing of code pages between different processes. It’s slower loading because every dynamic call site needs to be patched before the program can begin execution. The binary is larger because each of these call sites needs an entry in the relocation table. And the lack of sharing is due to the code pages being modified.
On the other hand, the overhead for dynamic function calls would be eliminated, giving JIT-like performance as seen in the benchmark.
The second option is to route all dynamic calls through a table. The original call site calls into a stub in this table, which jumps to the actual dynamic function. With this approach the code does not need to be patched, meaning it’s trivially shared between processes. Only one place needs to be patched per dynamic function: the entries in the table. Even more, these patches can be performed lazily, on the first function call, making the load time even faster.
On systems using ELF binaries, this table is called the Procedure Linkage Table (PLT). The PLT itself doesn’t actually get patched — it’s mapped read-only along with the rest of the code. Instead the Global Offset Table (GOT) gets patched. The PLT stub fetches the dynamic function address from the GOT and indirectly jumps to that address. To lazily load function addresses, these GOT entries are initialized with an address of a function that locates the target symbol, updates the GOT with that address, and then jumps to that function. Subsequent calls use the lazily discovered address.
The downside of a PLT is extra overhead per dynamic function call, which is what shows up in the benchmark. Since the benchmark only measures function calls, this appears to be pretty significant, but in practice it’s usually drowned out in noise.
Here’s the benchmark:
/* Cleared by an alarm signal. */
volatile sig_atomic_t running;
static long
plt_benchmark(void)
{
long count;
for (count = 0; running; count++)
empty();
return count;
}
Since empty()
is in the shared object, that call goes through the PLT.
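The surrounding harness isn’t shown here, but it looks roughly like this sketch, reusing the running flag and plt_benchmark() above; the duration and exact bookkeeping are assumptions:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define SECONDS 5

static void
alarm_handler(int signum)
{
    (void)signum;
    running = 0;
}

int
main(void)
{
    signal(SIGALRM, alarm_handler);
    running = 1;
    alarm(SECONDS);
    long count = plt_benchmark();
    printf("plt: %f ns/call\n", SECONDS * 1e9 / count);
    return 0;
}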
Another way to dynamically call functions is to bypass the PLT and
fetch the target function address within the program, e.g. via
dlsym(3)
.
void *h = dlopen("path/to/lib.so", RTLD_NOW);
void (*f)(void) = dlsym(h, "f");
f();
Once the function address is obtained, the overhead is smaller than
function calls routed through the PLT. There’s no intermediate stub
function and no GOT access. (Caveat: If the program has a PLT entry for
the given function then dlsym(3)
may actually return the address of
the PLT stub.)
However, this is still an indirect function call. On conventional architectures, direct function calls have an immediate relative address. That is, the target of the call is some hard-coded offset from the call site. The CPU can see well ahead of time where the call is going.
An indirect function call has more overhead. First, the address has to be stored somewhere. Even if that somewhere is just a register, it increases register pressure by using up a register. Second, it provokes the CPU’s branch predictor since the call target isn’t static, making for extra bookkeeping in the CPU. In the worst case the function call may even cause a pipeline stall.
Here’s the benchmark:
volatile sig_atomic_t running;
static long
indirect_benchmark(void (*f)(void))
{
long count;
for (count = 0; running; count++)
f();
return count;
}
The function passed to this benchmark is fetched with dlsym(3)
so the
compiler can’t do something tricky like convert that indirect
call back into a direct call.
If the body of the loop was complicated enough that there was register pressure, thereby requiring the address to be spilled onto the stack, this benchmark might not fare as well against the PLT benchmark.
The first two types of dynamic function calls are simple and easy to use. Direct calls to dynamic functions are a trickier business since they require modifying code at run time. In my benchmark I put together a little JIT compiler to generate the direct call.
There’s a gotcha to this: on x86-64 direct jumps are limited to a 2GB
range due to a signed 32-bit immediate. This means the JIT code has to
be placed virtually nearby the target function, empty()
. If the JIT
code needed to call two different dynamic functions separated by more
than 2GB, then it’s not possible for both to be direct.
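A more careful implementation would verify that the displacement fits before emitting the call. Here is a sketch of such a check; my benchmark doesn’t bother with it:
#include <stdint.h>

/* The rel32 operand of a call is measured from the end of the 5-byte
 * instruction, so the target must land within a signed 32-bit offset
 * of that point. */
static int
reachable(const void *call_site, const void *target)
{
    intptr_t diff = (intptr_t)target - ((intptr_t)call_site + 5);
    return diff >= INT32_MIN && diff <= INT32_MAX;
}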
To keep things simple, my benchmark isn’t precise or very careful
about picking the JIT code address. After being given the target
function address, it blindly subtracts 4MB, rounds down to the nearest
page, allocates some memory, and writes code into it. To do this
correctly would mean inspecting the program’s own memory mappings to
find space, and there’s no clean, portable way to do this. On Linux
this requires parsing virtual files under /proc
.
Here’s what my JIT’s memory allocation looks like. It assumes
reasonable behavior for uintptr_t
casts:
static void
jit_compile(struct jit_func *f, void (*empty)(void))
{
uintptr_t addr = (uintptr_t)empty;
void *desired = (void *)((addr - SAFETY_MARGIN) & PAGEMASK);
/* ... */
unsigned char *p = mmap(desired, len, prot, flags, fd, 0);
/* ... */
}
It allocates two pages, one writable and the other containing
non-writable code. Similar to my closure library, the lower
page is writable and holds the running
variable that gets cleared by
the alarm. It needed to be nearby the JIT code in order to be an
efficient RIP-relative access, just like the other two benchmark
functions. The upper page contains this assembly:
jit_benchmark:
push rbx
xor ebx, ebx
.loop: mov eax, [rel running]
test eax, eax
je .done
call empty
inc ebx
jmp .loop
.done: mov eax, ebx
pop rbx
ret
The call empty
is the only instruction that is dynamically generated
— necessary to fill out the relative address appropriately (the minus
5 is because it’s relative to the end of the instruction):
// call empty
uintptr_t rel = (uintptr_t)empty - (uintptr_t)p - 5;
*p++ = 0xe8;
*p++ = rel >> 0;
*p++ = rel >> 8;
*p++ = rel >> 16;
*p++ = rel >> 24;
If empty()
wasn’t in a shared object and instead located in the same
binary, this is essentially the direct call that the compiler would have
generated for plt_benchmark()
, assuming somehow it didn’t inline
empty()
.
Ironically, calling the JIT-compiled code requires an indirect call (e.g. via a function pointer), and there’s no way around this. What are you going to do, JIT compile another function that makes the direct call? Fortunately this doesn’t matter since the part being measured in the loop is only a direct call.
Given these results, it’s really no mystery that LuaJIT can generate more efficient dynamic function calls than a PLT, even if they still end up being indirect calls. In my benchmark, the non-PLT indirect calls were 28% faster than the PLT, and the direct calls 43% faster than the PLT. That’s a small edge that JIT-enabled programs have over plain old native programs, though it comes at the cost of absolutely no code sharing between processes.
So far this year I’ve been bitten three times by compiler edge cases in GCC and Clang, each time catching me totally by surprise. Two were caused by historical artifacts, where an ambiguous specification led to diverging implementations. The third was a compiler optimization being far more clever than I expected, behaving almost like an artificial intelligence.
In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.
The first time I was bit — or, well, narrowly avoided being bit — was when I examined a missed floating point optimization in both Clang and GCC. Consider this function:
double
zero_multiply(double x)
{
return x * 0.0;
}
The function multiplies its argument by zero and returns the result. Any number multiplied by zero is zero, so this should always return zero, right? Unfortunately, no. IEEE 754 floating point arithmetic supports NaN, infinities, and signed zeros. This function can return NaN, positive zero, or negative zero. (In some cases, the operation could also potentially produce a hardware exception.)
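Here is a small illustration of inputs where the result isn’t a plain zero; it reuses zero_multiply() from above:
#include <math.h>
#include <stdio.h>

int
main(void)
{
    printf("%f\n", zero_multiply(NAN));       /* nan  */
    printf("%f\n", zero_multiply(INFINITY));  /* nan  */
    printf("%f\n", zero_multiply(-1.0));      /* -0.0 */
    return 0;
}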
As a result, both GCC and Clang perform the multiply:
zero_multiply:
xorpd xmm1, xmm1
mulsd xmm0, xmm1
ret
The -ffast-math
option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
consistency:
zero_multiply:
xorps xmm0, xmm0
ret
Side note: -ffast-math
doesn’t necessarily mean “less precise.”
Sometimes it will actually improve precision.
Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a short
:
double
zero_multiply_short(short x)
{
return x * 0.0;
}
It’s no longer possible for the argument to be one of those special
values. The short
will be promoted to one of 65,536 possible double
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (-Os
):
zero_multiply_short:
movsx edi, di ; sign-extend 16-bit argument
xorps xmm1, xmm1 ; xmm1 = 0.0
cvtsi2sd xmm0, edi ; convert int to double
mulsd xmm0, xmm1
ret
Clang also misses this optimization:
zero_multiply_short:
cvtsi2sd xmm1, edi
xorpd xmm0, xmm0
mulsd xmm0, xmm1
ret
But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (movsx
)? Clang is treating that
short
argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?
It turns out that the x86-64 ABI didn’t specify what happens with the upper bits in argument registers. Are they garbage? Are they zeroed? GCC takes the conservative position of assuming the upper bits are arbitrary garbage. Clang takes the boldest position of assuming arguments smaller than 32 bits have been promoted to 32 bits by the caller. This is what the ABI specification should have said, but currently it does not.
Fortunately GCC is also conservative when passing arguments. It promotes arguments to 32 bits as necessary, so there are no conflicts when linking against Clang-compiled code. However, this is not true for Intel’s ICC compiler: Clang and ICC are not ABI-compatible on x86-64.
I don’t use ICC, so that particular issue wouldn’t bite me, but if I was ever writing assembly routines that called Clang-compiled code, I’d eventually get bit by this.
Without looking it up or trying it, what does this function return? Think carefully.
int
float_compare(void)
{
float x = 1.3f;
return x == 1.3f;
}
Confident in your answer? This is a trick question, because it can return either 0 or 1 depending on the compiler. Boy was I confused when this comparison returned 0 in my real world code.
$ gcc -std=c99 -m32 cmp.c # float_compare() == 0
$ clang -std=c99 -m32 cmp.c # float_compare() == 1
So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations all did it differently. The C99 specification
cleaned this all up and introduced FLT_EVAL_METHOD
.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.
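A quick way to check is to print the macro alongside the comparison itself. A minimal sketch:
#include <float.h>
#include <stdio.h>

int
main(void)
{
    float x = 1.3f;
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    printf("x == 1.3f       = %d\n", x == 1.3f);
    return 0;
}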
Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in long double
precision and truncated afterward (FLT_EVAL_METHOD == 2
).
In float_compare()
the left-hand side is truncated to a float
by the
assignment, but the right-hand side, despite being a float
literal,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!
The remnants of this high precision trend are still in JavaScript, where all arithmetic is double precision (even if simulated using integers), and great pains have been made to work around the performance consequences of this. Until recently, Mono had similar issues.
The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (FLT_EVAL_METHOD == 0
). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the backwards compatible GCC on the old x86.
I’m a little ashamed that I’m only finding out about this now. However, by the time I was competent enough to notice and understand this issue, I was already doing nearly all my programming on the x86-64.
I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, new_image()
, that allocates a greyscale image
for, say, some multimedia library.
static unsigned char *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, shade, w * h);
}
}
return p;
}
It’s a static function because this would be part of some slick header library (and, secretly, because it’s necessary for illustrating the issue). Being a responsible citizen, the function even checks for integer overflow before allocating anything.
I write a unit test to make sure it detects overflow. This function should return 0.
/* expected return == 0 */
int
test_new_image_overflow(void)
{
void *p = new_image(2, SIZE_MAX, 0);
return !!p;
}
So far my test passes. Good.
I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make malloc()
fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a malloc(SIZE_MAX)
, e.g. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.
/* expected return == 0 */
int
test_new_image_oom(void)
{
void *p = new_image(1, SIZE_MAX, 0xff);
return !!p;
}
I compile with GCC, test passes. I compile with Clang and the test fails. That is, the test somehow managed to allocate 16 exbibytes of memory, and initialize it. Wat?
Disassembling the test reveals what’s going on:
test_new_image_overflow:
xor eax, eax
ret
test_new_image_oom:
mov eax, 1
ret
The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with malloc()
became dead code and
was trivially eliminated.
In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the memset()
, so it eliminated the
allocation altogether and then simulated a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.
I soon realized I can take this further and trick Clang into
performing an invalid optimization, revealing a bug. Consider
this slightly-optimized version that uses calloc()
when the shade is
zero (black). The calloc()
function does its own overflow check, so
new_image()
doesn’t need to do it.
static void *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (shade == 0) { // shortcut
p = calloc(w, h);
} else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, shade, w * h);
}
}
return p;
}
With this change, my overflow unit test is now also failing. The
situation is even worse than before. The calloc()
is being
eliminated despite the overflow, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, this could introduce a vulnerability in a
real program. The OpenBSD folks are so worried about this sort of
thing that they’ve disabled this optimization.
Here’s a slightly-contrived example of this. Imagine a program that maintains a table of unsigned integers, and we want to keep track of how many times the program has accessed each table entry. The “access counter” table is initialized to zero, but the table of values need not be initialized, since they’ll be written before first access (or something like that).
struct table {
unsigned *counter;
unsigned *values;
};
static int
table_init(struct table *t, size_t n)
{
t->counter = calloc(n, sizeof(*t->counter));
if (t->counter) {
/* Overflow already tested above */
t->values = malloc(n * sizeof(*t->values));
if (!t->values) {
free(t->counter);
return 0; // fail
}
return 1; // success
}
return 0; // fail
}
This function relies on the overflow test in calloc()
for the second
malloc()
allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the counter
table, and Clang is able to
statically determine this fact, it may eliminate the calloc()
. This
would also eliminate the overflow test, introducing a
vulnerability. If an attacker can control n
, then they can
overwrite arbitrary memory through that values
pointer.
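To make the failure mode concrete, here is a hypothetical sketch with illustrative numbers for a 64-bit size_t. If the calloc() and its implicit overflow check are optimized away, the multiplication in the malloc() call wraps around and the allocation is far too small:
struct table t;
size_t n = SIZE_MAX / sizeof(unsigned) + 2;   /* n * 4 wraps to 4 */
if (table_init(&t, n)) {
    /* With the check eliminated, malloc() received only 4 bytes, so
     * this write lands far outside the allocation. */
    t.values[1000] = 0xdeadbeef;
}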
Besides this surprising little bug, the main lesson for me is that I should probably isolate unit tests from the code being tested. The easiest solution is to put them in separate translation units and don’t use link-time optimization (LTO). Allowing tested functions to be inlined into the unit tests is probably a bad idea.
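Concretely, that isolation might look like this sketch: the unit test sees only a declaration, and the tested function lives in its own translation unit, compiled without -flto so it cannot be inlined into the test.
/* new_image.h: the only thing test.c gets to see */
#include <stddef.h>
unsigned char *new_image(size_t w, size_t h, int shade);

/* Build sketch:
 *   cc -c new_image.c
 *   cc -c test.c
 *   cc new_image.o test.o -o tests
 */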
The unit test issues in my real program, which was a bit more sophisticated than what was presented here, gave me artificial intelligence vibes. It’s that situation where a computer algorithm did something really clever and I felt it outsmarted me. It’s creepy to consider how far that can go. I’ve gotten that even from observing AI I’ve written myself, and I know for sure no human taught it some particularly clever trick.
My favorite AI story along these lines is about an AI that learned how to play games on the Nintendo Entertainment System. It didn’t understand the games it was playing. Its optimization task was simply to choose controller inputs that maximized memory values, because that’s generally associated with doing well — higher scores, more progress, etc. The most unexpected part came when playing Tetris. Eventually the screen would fill up with blocks, and the AI would face the inevitable situation of losing the game, with all that memory being reinitialized to low values. So what did it do?
Just before the end it would pause the game and wait… forever.
I thought it would be interesting to revisit this software, to reevaluate it from a far more experienced perspective. Keep in mind that C++ wasn’t even standardized yet, and the most recent C standard was from 1989. Given this, what was it like to be a professional software developer using a Borland toolchain on Windows 20 years ago? Was it miserable, made bearable only by ignorance of how much better the tooling could be? Or maybe it actually wasn’t so bad, and these tools are better than I expect?
Ultimately my conclusion is that it’s a little bit of both. There are some significant capability gaps compared to today, but the core toolchain itself is actually quite reasonable, especially for the mid 1990s.
Before getting into the evaluation, let’s discuss how I got it all up and running. While it’s technically possible to run Windows 95 on a modern x86-64 machine thanks to the architecture’s extreme backwards compatibility, it’s more compatible, simpler, and safer to virtualize it. Most importantly, I can emulate older hardware that will have better driver support.
Despite that early start in Windows all those years ago, I’m primarily a Linux user. The premier virtualization solution on Linux these days is KVM, a kernel module that turns Linux into a hypervisor and makes efficient use of hardware virtualization extensions. Unfortunately pre-XP Windows doesn’t work well on KVM, so instead I’m using QEmu (with KVM disabled), a hardware emulator closely associated with KVM. Since it doesn’t take advantage of hardware virtualization extensions, it will be slower. This is fine since my goal is to emulate slow, 20+ year old hardware anyway.
There’s very little practical difference between Windows 95 and Windows 98. Since Windows 98 runs a lot smoother virtualized, I decided to go with that instead. This will be perfectly sufficient for my toolchain evaluation.
To get started, I’ll need an installer for Windows 98. I thought this would be difficult to find, but there’s a copy available on the Internet Archive. I don’t know how “legitimate” this is, but it works. Since it’s running in a virtual machine without network access, I also don’t really care if this copy is somehow infected with malware.
Internet Archive: Windows 98 Second Edition
Also on the Internet Archive is a complete copy of Borland C++ 5.02, with the same caveats of legitimacy. It works, which is good enough for my purposes.
Internet Archive: Borland C++ 5.02
Thank you Internet Archive!
I’ve got my software, now to set up the virtualized hardware. First I create a drive image:
$ qemu-img create -f qcow2 win98.img 8G
I gave it 8GB, which is actually a bit overkill. Giving Windows 98 a virtual hard drive with modern sizes would probably break the installer. This sort of issue is a common theme among old software, where there may be complaints about negative available disk space due to signed integer overflow.
I decided to give the machine 256MB of memory (-m 256
). This is also a
little excessive, but I wanted to be sure memory didn’t limit Borland’s
capabilities. This amount of memory is close to the upper bound, and
going much beyond will likely cause problems with Windows 98.
For the CPU I settled on a Pentium (-cpu pentium
). My original goal
was to go a little simpler with a 486 (-cpu 486
), but the Windows 98
installer kept crashing when I tried this.
I experimented with different configurations for the network card, but
I couldn’t get anything to work. So I’ve disabled networking (-net
none
). The only reason I’d want this is that it would be easier to
move files in and out of the virtual machine.
Finally, here’s how I ran QEmu. The last two lines are only needed when installing.
$ qemu-system-x86_64 \
-localtime \
-cpu pentium \
-no-acpi \
-no-hpet \
-m 256 \
-hda win98.img \
-soundhw sb16 \
-vga cirrus \
-net none \
-cdrom "Windows 98 Second Edition.iso" \
-boot d
Installation is just a matter of following the instructions. You’ll need that product key listed on the Internet Archive site.
That copy of Borland is just a big .zip file. This presents two problems.
Without network access, I’ll need to figure out how to get this inside the virtual machine.
This version of Windows doesn’t come with software to unzip this file. I’d need to find and install an unzip tool first.
Fortunately I can kill two birds with one stone by converting that .zip archive into a .iso and mounting it in the virtual machine.
unzip "BORLAND C++.zip"
genisoimage -R -J -o borland.iso "BORLAND C++"
Then in the QEmu console (C-A-2) I attach it:
change ide1-cd0 borland.iso
This little trick of generating .iso files and mounting them is how I will be moving all the other files into the virtual machine.
The first thing I did was play around with the Borland IDE. This is what I would have been using 20 years ago.
Despite being Borland C++, I’m personally most interested in its ANSI C compiler. As I already pointed out, this software pre-dates C++’s standardization, and a lot has changed over the past two decades. On the other hand, C hasn’t really changed all that much. The 1999 update to the C standard (e.g. “C99”) was big and important, but otherwise little has changed. The biggest drawback is the lack of “declare anywhere” variables, including in for-loop initializers. Otherwise it’s the same as writing C today.
To test drive the IDE, I made a couple of test projects, built and ran them with different options, and poked around with the debugger. The debugger is actually pretty decent, especially for the 1990s. It can be operated via the IDE or standalone, so I could use it without firing up the IDE and making a project.
The toolchain includes an assembler, and I can inspect the compiler’s assembly output. To nobody’s surprise this is Intel-flavored assembly, which is very welcome. Imagining myself as a software developer in the mid 1990s, this means I can see exactly what the compiler’s doing as well as write some of the performance sensitive parts in assembly if necessary.
The built-in editor is the worst part of the IDE, which is unfortunate since it really spoils the whole experience. It’s easy to jump between warnings and errors, it has incremental search, and it has good syntax highlighting. But these are the only positive things I can say about it. If I had to work with this editor full-time, I’d spend my days pretty irritated.
Like with the debugger, the Borland people did a good job modularizing
their development tools. As part of the installation process, all of the
Borland command line tools are added to the system PATH
(reminder:
this is a single-user system). This includes compiler, linker,
assembler, debugger, and even an incomplete implementation of
make
.
With this, I can essentially pretend the IDE doesn’t exist and replace that crummy editor with something better: Vim.
The last version of Vim to support MS-DOS and Windows 95/98 is Vim 7.3, released in 2010. I download those binaries, trim a few things from my .vimrc, and smuggle it all into my virtual machine via a virtual CD. I’ve now got a powerful text editor in Windows 98 and my situation has drastically improved.
Since I hardly use features added since Vim 7.3, this feels right at home to me. I can invoke the build from Vim, and it can populate the quickfix list from Borland’s output, so I could actually be fairly productive in these circumstances! I’m honestly really impressed with how well this all works together.
At this point I only have two significant annoyances:
Borland’s command line tools belong to that category of irritating
programs that print their version banner on every invocation.
There’s not even a command line switch to turn this off. All this
noise is quickly tiresome. The Visual Studio toolchain does
the same thing by default, though it can be turned off (-nologo
).
I dislike that some GNU tools also commit this sin, but at least
GNU limits this to interactive programs.
The Windows/DOS command shell and console is even worse than it is today. I didn’t think that was possible. This is back when it was still genuinely DOS and not just pretending to be (e.g. in NT). The worst part by far is the lack of command history. There’s no using the up-arrow to get previous commands. There’s no tab completion. Forward slash is not a substitute for backslash in paths. If I wanted to improve my productivity, replacing this console and shell would be the first priority.
Update: In an email, Aristotle Pagaltzis informed me that Windows 98 comes with DOSKEY.COM, which provides command history for COMMAND.EXE. Alternatively there’s Enhanced DOSKEY.com, an open source, alternative implementation that also provides tab completion for commands and filenames. This makes the console a lot more usable (and, honestly, in some ways better than the modern defaults).
Last year I wrote a backup encryption tool called Enchive, and I still use it regularly. One of my design goals was high portability since it may be needed to decrypt something important in the distant future. It should be as bit-rot-proof as possible. In software, the best way to future-proof is to past-proof.
If I had a time machine that could send source code back in time, and I sent Enchive to a competent developer 20 years ago, would they be able to compile it and run it? If the answer is yes, then that means Enchive already has 20 years of future-proofing built into it.
To accomplish this, Enchive is 3,300 lines of strict ANSI C, 1989-style, with no dependencies other than the C standard library and a handful of operating system functions — e.g. functionality not in the C standard library. In practice, any ANSI C compiler targeting either POSIX, or Windows 95 or later, should be able to compile it.
My Windows 98 virtual machine includes an ANSI C compiler, and can be
used to simulate this time machine. I generated an “amalgamation” build
(make amalgamation
) — essentially a concatenation of all the source
files — and sent this into the virtual machine. Before Borland was able
to compile it, I needed to make three small changes.
First, Enchive includes stdint.h
to get fixed-width integers needed
for the encryption routines. This header comes from C99, and C89 has
no equivalent. I anticipated this problem from the beginning and made
it easy for the person performing the build to correct it. This header
is included exactly once, in config.h
, and this is placed at the top
of the amalgamation build. The include only needs to be replaced with
a handful of manual typedefs. For Borland that looks like this:
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned long uint32_t;
typedef unsigned __int64 uint64_t;
typedef long int32_t;
typedef __int64 int64_t;
#define INT8_C(n) (n)
#define INT16_C(n) (n)
#define INT32_C(n) (n##U)
Second, in more recent versions of Windows, GetFileAttributes()
can
return the value INVALID_FILE_ATTRIBUTES
. Checking for an error that
cannot happen is harmless, but this value isn’t defined in Borland’s
SDK. I only had to eliminate that check.
Third, the CryptGenRandom()
interface isn’t defined in
Borland’s SDK. This is used by Enchive to generate keys. MSDN reports
this function wasn’t available until Windows XP, but it’s definitely
there in Windows 98, exported by ADVAPI32.dll. I’m able to call it,
though it always reports an error. Perhaps it’s been disabled in this
version due to cryptographic export restrictions?
Regardless of what’s wrong, I ripped this out and replaced it with a fatal error. This version of Enchive can’t generate new keys — unless derived from a passphrase — nor encrypt files, including the use of a protection key to encrypt the secret key. However, it can decrypt files, which is the important part that needs to be future-proofed.
With these three changes — which took me about 10 minutes to sort out — Enchive builds and runs, and it correctly decrypts files I encrypted on Linux. So Enchive has at least 20 years of past-proofing! The screenshot at the top of this article shows it running successfully in an MS-DOS console window.
I mentioned that there were some gaps. The most obvious is the lack of the standard POSIX utilities, especially a decent shell. I don’t know if any had been ported to Windows in the mid 1990s. But that could be solved one way or another without too much trouble, even if it meant doing some of that myself.
No, the biggest capability I’d miss, and which wouldn’t be easily obtained, is Git, or a least a decent source control system. I really don’t want to work without proper source control. Git’s support for Windows is second tier, and the port to modern Windows is already a bit of a hack. Getting it to run in Windows 98 would probably be a challenge, especially if I had to compile it with Borland.
The other major issue is the lack of stability. In this experiment, I’ve been seeing this screen a lot:
I remember Windows crashing a lot back in those days, and it certainly
had a bad reputation for being unstable, but this is far worse than I
remembered. While the hardware emulator may be somewhat at fault here,
keep in mind that I never installed third party drivers. Most of these
crashes are Windows’ fault. I found I can reliably bring the whole
system down with a single GetProcAddress()
call on a system DLL. The
only way I can imagine this instability was so tolerated back then was
general ignorance that computing could be so much better.
I was tempted to write this article in Vim on Windows 98, but all this crashing made me too nervous. I didn’t want some stupid filesystem corruption to wipe out my work. Too risky.
If I was stuck working in Windows 98 — or was at least targeting it as a platform — but had access to a modern tooling ecosystem, could I do better than Borland? Yes! Programs built by Mingw-w64 can be run even as far back as Windows 95.
Now, there’s a catch. I thought it would be this simple:
$ i686-w64-mingw32-gcc -Os hello.c
But when I brought the resulting binary into the virtual machine it
crashed when I ran it: illegal instruction. Turns out it contained a
conditional move (cmov
) which is an instruction not available until
the Pentium Pro (686). The “pentium” emulation is just a 586.
I tried to disable cmov
by picking the specific architecture:
$ i686-w64-mingw32-gcc -march=pentium -Os hello.c
This still didn’t work because the statically-linked part of the CRT
contained the cmov
. I’d have to recompile that as well.
I could have switched the QEmu options to “upgrade” to a Pentium Pro, but remember that my goal was really the 486. Fortunately this was easy to fix: compile my own Mingw-w64 cross-compiler. I’ve done this a number of times before, so I knew it wouldn’t be difficult.
I could go step by step, but it’s all fairly well documented in the Mingw-w64 “howto-build” document. I used GCC 7.3 (the latest version), and for the target I picked “i486-w64-mingw32”. When it was done I could compile binaries on Linux to run in my Windows 98 virtual machine:
$ i486-w64-mingw32-gcc -Os hello.c
This should enable quite a bit of modern software to run inside my virtual machine if I so wanted. I didn’t actually try this (yet?), but, to take this concept all the way, I could use this cross-compiler to cross-compile Mingw-w64 itself to run inside the virtual machine, directly replacing Borland C++.
And the only thing I’d miss about Borland is its debugger.
~/.local/bin, and the package manager itself is just a 110 line Bourne shell script. It’s not intended to replace the system’s package manager but, instead, to complement it in some cases where I need more
flexibility. I use it to run custom versions of specific pieces of
software — newer or older than the system-installed versions, or with my
own patches and modifications — without interfering with the rest of
system, and without a need for root access. It’s worked out really
well so far and I expect to continue making heavy use of it in the
future.
It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):
Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.
I say it’s crude because its “install” (-I
) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or postinst
as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.
In other words, it doesn’t attempt to solve most of the hard problems tackled by package managers… except for three important issues:
It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.
Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.
It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.
The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.
The second caveat is that some builds tailor themselves by default to
the host (e.g. -march=native
). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.
While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install prefix to $HOME/.local
, running the build,
and then letting the “install” target do its thing.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.
A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing it on top usually works. Occasionally it
wouldn’t, and I’d have to blow away ~/.local
and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.
What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: DESTDIR
.
It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the DESTDIR
macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (--prefix
) configured before the build
began, the software is not expected to function properly when run in the
DESTDIR
location.
$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
A different tool will be used to copy these files into place and actually
install it. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
DESTDIR
(at the root of the actual install prefix):
$ tar czf package.tgz -C $DESTDIR$HOME/.local .
So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in ~/.local
.
$ cd $HOME/.local
$ tar xzf ~/package.tgz
But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s t
mode.
cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
rm -f "$file"
done
I’m using grep
to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If the file contains a
space character, it will word split incorrectly in the for
loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets IFS
to just a newline.
If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.
There are five commands, each assigned to a capital letter: -B, -C, -I, -V, and -U. It’s an interface pattern inspired by Ted Unangst’s signify (see signify(1)). I also used this pattern with Blowpipe and, in retrospect, wish I had also used it with Enchive.
Build (-B)
Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by make
with the
appropriate -j
(jobs) option. It automatically sets the --prefix
argument when running the configure script.
If the build uses something other than an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.
Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.
Once qpkg
returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the qpkg
command:
$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
Since the second and third steps — creating the build directory and
moving into it — are so common, there’s an optional switch for it: -d
.
This option’s argument is the build directory. qpkg
creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.
$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
With the software compiled, the next step is creating the package.
Create (-C)
The “create” command creates the DESTDIR
(_destdir
in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its x/
build directory:
$ qpkg -Cdx name
Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to make
in case there needs to be any additional tweaking.
When the “create” command finishes, there will be new package named
name.tgz
in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.
$ rm -rf name-version/
$ rm -rf x/
This package is ready to install, though you may want to verify it first.
-V) The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.
$ qpkg -V name.tgz
The most common conflict I’ve seen is in the info index (info/dir) file, which is safe to ignore since I don’t care about it.
If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.
-I) The “install” command is just the dumb tar xzf explained above. It will clobber anything in its way without warning, which is why, if that matters, “verify” should be used first.
$ qpkg -I name.tgz
When qpkg returns, the package has been installed and is probably ready to go. A lot of packages complain that you need to run libtool to finalize an installation, but I’ve never had a problem skipping it. This dumb unpacking generally works fine.
-U) Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.
$ qpkg -U name.tgz
Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.
I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.
Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages, a couple of which will conflict (one being a subset of another).
Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)
To deal with this, I added a “keep” (-k) option that leaves the DESTDIR around after creating the package. To keep things tidy, the intermediate packages exist and are installed, but the final, big cross-compiler package accumulates into the DESTDIR. The final package at the end is actually the whole cross compiler in one package, a superset of them all.
Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.
The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.
All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.
I mean, when you started, I’m pretty sure the initial solution was using branches, right? Then you applied some techniques to eliminate them.
A bottom-up approach that begins with branches and then proceeds to eliminate them one at a time sounds like a plausible story. However, this story is the inverse of how it actually played out. It began when I noticed a branchless decoder could probably be done, then I put together the pieces one at a time without introducing any branches. But what sparked that initial idea?
The two prior posts reveal my train of thought at the time: a look at the Blowfish cipher and a 64-bit PRNG shootout. My layman’s study of Blowfish was actually part of an examination of a number of different block ciphers. For example, I also read the NSA’s Speck and Simon paper and then implemented the 128/128 variant of Speck — a 128-bit key and 128-bit block. I didn’t take the time to write an article about it, but note how the entire cipher — key schedule, encryption, and decryption — is just 40 lines of code:
struct speck {
uint64_t k[32];
};
void
speck_init(struct speck *ctx, uint64_t x, uint64_t y)
{
ctx->k[0] = y;
for (uint64_t i = 0; i < 31; i++) {
x = (x >> 8) | (x << 56);
x += y;
x ^= i;
y = (y << 3) | (y >> 61);
y ^= x;
ctx->k[i + 1] = y;
}
}
void
speck_encrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
for (int i = 0; i < 32; i++) {
*x = (*x >> 8) | (*x << 56);
*x += *y;
*x ^= ctx->k[i];
*y = (*y << 3) | (*y >> 61);
*y ^= *x;
}
}
static void
speck_decrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
for (int i = 0; i < 32; i++) {
*y ^= *x;
*y = (*y >> 3) | (*y << 61);
*x ^= ctx->k[31 - i];
*x -= *y;
*x = (*x << 8) | (*x >> 56);
}
}
Isn’t that just beautiful? It’s so tiny and fast. Other than the not-very-arbitrary selection of 32 rounds, and the use of 3-bit and 8-bit rotations, there are no magic values. One could fairly reasonably commit this cipher to memory if necessary, similar to the late RC4. Speck is probably my favorite block cipher right now, except that I couldn’t figure out the key schedule for any of the other key/block size variants.
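As a quick sanity check — this snippet is not from the original study, and the key and plaintext here are arbitrary example values rather than official test vectors — a round trip through the functions above should restore the original block when compiled into the same file:
#include <assert.h>
#include <stdint.h>

int
main(void)
{
    struct speck ctx;
    speck_init(&ctx, 0x0f0e0d0c0b0a0908, 0x0706050403020100);

    uint64_t x = 0x6c61766975716520, y = 0x7469206564616d20;
    uint64_t ox = x, oy = y;

    speck_encrypt(&ctx, &x, &y);  // x, y now hold the ciphertext
    speck_decrypt(&ctx, &x, &y);  // decryption must restore the block
    assert(x == ox && y == oy);
    return 0;
}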
Another cipher I studied, though in less depth, was RC5 (1994), a block cipher by (obviously) Ron Rivest. The most novel part of RC5 is its use of data dependent rotations. This was a very deliberate decision, and the paper makes this clear:
RC5 should highlight the use of data-dependent rotations, and encourage the assessment of the cryptographic strength data-dependent rotations can provide.
What’s a data-dependent rotation? In the Speck cipher shown above, notice how the right-hand side of all the rotation operations is a constant (3, 8, 56, and 61). Suppose these operands were not constant, but instead were based on some part of the value of the block:
int r = *y & 0x0f;
*x = (*x >> r) | (*x << (64 - r));
Such “random” rotations “frustrate differential cryptanalysis” according to the paper, increasing the strength of the cipher.
Another algorithm that uses a data-dependent shift is the PCG family of PRNGs. Honestly, the data-dependent “permutation” shift is the defining characteristic of PCG. As a reminder, here’s the simplified PCG from my shootout:
uint32_t
spcg32(uint64_t s[1])
{
uint64_t m = 0x9b60933458e17d7d;
uint64_t a = 0xd737232eeccdf7ed;
*s = *s * m + a;
int shift = 29 - (*s >> 61);
return *s >> shift;
}
Notice how the final shift depends on the high order bits of the PRNG state. (This one weird trick from Melissa O’Neill will significantly improve your PRNG. Xorshift experts hate her.)
I think this raises a really interesting question: Why did it take until 2014 for someone to apply a data-dependent shift to a PRNG? Similarly, why are data-dependent rotations not used in many ciphers?
My own theory is that this is because many older instruction set architectures can’t perform data-dependent shift operations efficiently.
Many instruction sets only have a fixed shift (e.g. 1-bit), or can only shift by an immediate (i.e. a constant). In these cases, a data-dependent shift would require a loop. These loops would be a ripe source of side channel attacks in ciphers due to the difficulty of making them operate in constant time. It would also be relatively slow for video game PRNGs, which often needed to run on constrained hardware with limited instruction sets. For example, the 6502 (Atari, Apple II, NES, Commodore 64) and the Z80 (too many to list) can only shift/rotate one bit at a time.
Even on an architecture with an instruction for data-dependent shifts, such as the x86, those shifts will be slower than constant shifts, at least in part due to the additional data dependency.
It turns out there are also some patent issues (ex. 1, 2). Fortunately most of these patents have now expired, and one in particular is set to expire this June. I still like my theory better.
So I was thinking about data-dependent shifts, and I had also noticed I could trivially check the length of a UTF-8 code point using a small lookup table — the first step in my decoder. What if I combined these: a data-dependent shift based on a table lookup. This would become the last step in my decoder. The idea for a branchless UTF-8 decoder was essentially borne out of connecting these two thoughts, and then filling in the middle.
In a previous article I demonstrated video filtering with C and a unix pipeline. Thanks to the ubiquitous support for the ridiculously simple Netpbm formats — specifically the “Portable PixMap” (.ppm, P6) binary format — it’s trivial to parse and produce image data in any language without image libraries. Video decoders and encoders at the ends of the pipeline do the heavy lifting of processing the complicated video formats actually used to store and transmit video.
Naturally this same technique can be used to produce new video in a simple program. All that’s needed are a few functions to render artifacts — lines, shapes, etc. — to an RGB buffer. With a bit of basic sound synthesis, the same concept can be applied to create audio in a separate audio stream — in this case using the simple (but not as simple as Netpbm) WAV format. Put them together and a small, standalone program can create multimedia.
Here’s the demonstration video I’ll be going through in this article. It animates and visualizes various in-place sorting algorithms (see also). The elements are rendered as colored dots, ordered by hue, with red at 12 o’clock. A dot’s distance from the center is proportional to its corresponding element’s distance from its correct position. Each dot emits a sinusoidal tone with a unique frequency when it swaps places in a particular frame.
Original credit for this visualization concept goes to w0rthy.
All of the source code (less than 600 lines of C), ready to run, can be found here:
On any modern computer, rendering is real-time, even at 60 FPS, so you may be able to pipe the program’s output directly into your media player of choice. (If not, consider getting a better media player!)
$ ./sort | mpv --no-correct-pts --fps=60 -
VLC requires some help from ppmtoy4m:
$ ./sort | ppmtoy4m -F60:1 | vlc -
Or you can just encode it to another format. Recent versions of libavformat can input PPM images directly, which means x264 can read the program’s output directly:
$ ./sort | x264 --fps 60 -o video.mp4 /dev/stdin
By default there is no audio output. I wish there was a nice way to embed audio with the video stream, but this requires a container and that would destroy all the simplicity of this project. So instead, the -a option captures the audio in a separate file. Use ffmpeg to combine the audio and video into a single media file:
$ ./sort -a audio.wav | x264 --fps 60 -o video.mp4 /dev/stdin
$ ffmpeg -i video.mp4 -i audio.wav -vcodec copy -acodec mp3 \
combined.mp4
You might think you’ll be clever by using mkfifo (i.e. a named pipe) to pipe both audio and video into ffmpeg at the same time. This will only result in a deadlock since neither program is prepared for this. One will be blocked writing one stream while the other is blocked reading on the other stream.
Several years ago my intern and I used the exact same pure C rendering technique to produce these raytracer videos:
I also used this technique to illustrate gap buffers.
This program really only has one purpose: rendering a sorting video with a fixed, square resolution. So rather than write generic image rendering functions, some assumptions will be hard coded. For example, the video size will just be hard coded and assumed square, making it simpler and faster. I chose 800x800 as the default:
#define S 800
Rather than define some sort of color struct with red, green, and blue fields, color will be represented by a 24-bit integer (long). I arbitrarily chose red to be the most significant 8 bits. This has nothing to do with the order of the individual channels in Netpbm since these integers are never dumped out. (This would have stupid byte-order issues anyway.) “Color literals” are particularly convenient and familiar in this format. For example, the constant for pink: 0xff7f7fUL.
In practice the color channels will be operated upon separately, so here are a couple of helper functions to convert the channels between this format and normalized floats (0.0–1.0).
static void
rgb_split(unsigned long c, float *r, float *g, float *b)
{
*r = ((c >> 16) / 255.0f);
*g = (((c >> 8) & 0xff) / 255.0f);
*b = ((c & 0xff) / 255.0f);
}
static unsigned long
rgb_join(float r, float g, float b)
{
unsigned long ir = roundf(r * 255.0f);
unsigned long ig = roundf(g * 255.0f);
unsigned long ib = roundf(b * 255.0f);
return (ir << 16) | (ig << 8) | ib;
}
Originally I decided the integer form would be sRGB, and these functions handled the conversion to and from sRGB. Since it had no noticeable effect on the output video, I discarded it. In more sophisticated rendering you may want to take this into account.
The RGB buffer where images are rendered is just a plain old byte buffer with the same pixel format as PPM. The ppm_set() function writes a color to a particular pixel in the buffer, assumed to be S by S pixels. The complement to this function is ppm_get(), which will be needed for blending.
static void
ppm_set(unsigned char *buf, int x, int y, unsigned long color)
{
buf[y * S * 3 + x * 3 + 0] = color >> 16;
buf[y * S * 3 + x * 3 + 1] = color >> 8;
buf[y * S * 3 + x * 3 + 2] = color >> 0;
}
static unsigned long
ppm_get(unsigned char *buf, int x, int y)
{
unsigned long r = buf[y * S * 3 + x * 3 + 0];
unsigned long g = buf[y * S * 3 + x * 3 + 1];
unsigned long b = buf[y * S * 3 + x * 3 + 2];
return (r << 16) | (g << 8) | b;
}
Since the buffer is already in the right format, writing an image is dead simple. I like to flush after each frame so that observers generally see clean, complete frames. It helps in debugging.
static void
ppm_write(const unsigned char *buf, FILE *f)
{
fprintf(f, "P6\n%d %d\n255\n", S, S);
fwrite(buf, S * 3, S, f);
fflush(f);
}
If you zoom into one of those dots, you may notice it has a nice smooth edge. Here’s one rendered at 30x the normal resolution. I did not render, then scale this image in another piece of software. This is straight out of the C program.
In an early version of this program I used a dumb dot rendering routine. It took a color and a hard, integer pixel coordinate. All the pixels within a certain distance of this coordinate were set to the color, everything else was left alone. This had two bad effects:
Dots jittered as they moved around since their positions were rounded to the nearest pixel for rendering. A dot would be centered on one pixel, then suddenly centered on another pixel. This looked bad even when those pixels were adjacent.
There’s no blending between dots when they overlap, making the lack of anti-aliasing even more pronounced.
Instead the dot’s position is computed in floating point and is actually rendered as if it were between pixels. This is done with a shader-like routine that uses smoothstep — just as found in shader languages — to give the dot a smooth edge. That edge is blended into the image, whether that’s the background or a previously-rendered dot. The input to the smoothstep is the distance from the floating point coordinate to the center (or corner?) of the pixel being rendered, maintaining that between-pixel smoothness.
Rather than dump the whole function here, let’s look at it piece by piece. I have two new constants to define the inner dot radius and the outer dot radius. It’s smooth between these radii.
#define R0 (S / 400.0f) // dot inner radius
#define R1 (S / 200.0f) // dot outer radius
The dot-drawing function takes the image buffer, the dot’s coordinates, and its foreground color.
static void
ppm_dot(unsigned char *buf, float x, float y, unsigned long fgc);
The first thing to do is extract the color components.
float fr, fg, fb;
rgb_split(fgc, &fr, &fg, &fb);
Next determine the range of pixels over which the dot will be drawn. These are based on the two radii and will be used for looping.
int miny = floorf(y - R1 - 1);
int maxy = ceilf(y + R1 + 1);
int minx = floorf(x - R1 - 1);
int maxx = ceilf(x + R1 + 1);
Here’s the loop structure. Everything else will be inside the innermost loop. The dx and dy are the floating point distances from the center of the dot.
for (int py = miny; py <= maxy; py++) {
float dy = py - y;
for (int px = minx; px <= maxx; px++) {
float dx = px - x;
/* ... */
}
}
Use the x and y distances to compute the distance and smoothstep value, which will be the alpha. Within the inner radius the color is on 100%. Outside the outer radius it’s 0%. Elsewhere it’s something in between.
float d = sqrtf(dy * dy + dx * dx);
float a = smoothstep(R1, R0, d);
Get the background color, extract its components, and blend the foreground and background according to the computed alpha value. Finally write the pixel back into the buffer.
unsigned long bgc = ppm_get(buf, px, py);
float br, bg, bb;
rgb_split(bgc, &br, &bg, &bb);
float r = a * fr + (1 - a) * br;
float g = a * fg + (1 - a) * bg;
float b = a * fb + (1 - a) * bb;
ppm_set(buf, px, py, rgb_join(r, g, b));
That’s all it takes to render a smooth dot anywhere in the image.
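The smoothstep() helper called in the alpha computation isn’t shown in this excerpt. Here’s a minimal sketch of the conventional GLSL-style definition, which also works with the reversed edges in the smoothstep(R1, R0, d) call so the result is 1 inside the inner radius and 0 outside the outer radius:
// Conventional smoothstep (a sketch; the article's own definition isn't
// shown here). With edge0 > edge1, as in smoothstep(R1, R0, d), the
// result falls from 1 at the inner radius to 0 at the outer radius.
static float
smoothstep(float edge0, float edge1, float x)
{
    float t = (x - edge0) / (edge1 - edge0);
    t = t < 0 ? 0 : t > 1 ? 1 : t;        // clamp to [0, 1]
    return t * t * (3.0f - 2.0f * t);
}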
The array being sorted is just a global variable. This simplifies some of the sorting functions since a few are implemented recursively. They can call for a frame to be rendered without needing to pass the full array. With the dot-drawing routine done, rendering a frame is easy:
#define N 360 // number of dots
static int array[N];
static void
frame(void)
{
static unsigned char buf[S * S * 3];
memset(buf, 0, sizeof(buf));
for (int i = 0; i < N; i++) {
float delta = abs(i - array[i]) / (N / 2.0f);
float x = -sinf(i * 2.0f * PI / N);
float y = -cosf(i * 2.0f * PI / N);
float r = S * 15.0f / 32.0f * (1.0f - delta);
float px = r * x + S / 2.0f;
float py = r * y + S / 2.0f;
ppm_dot(buf, px, py, hue(array[i]));
}
ppm_write(buf, stdout);
}
The buffer is static since it will be rather large, especially if S is cranked up. Otherwise it’s likely to overflow the stack. The memset() fills it with black. If you wanted a different background color, here’s where you change it.
For each element, compute its delta from the proper array position, which becomes its distance from the center of the image. The angle is based on its actual position. The hue() function (not shown in this article) returns the color for the given element.
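Since hue() isn’t listed in the article, here’s a hypothetical sketch of such a mapping — an assumption, not the article’s actual code — turning an element index into a fully saturated color around the wheel, red at index 0, using the rgb_join() helper from earlier:
#include <math.h>

// Hypothetical hue(): map element i (0..N-1) around the color wheel at
// full saturation and brightness, red at i = 0. Illustration only.
static unsigned long
hue(int i)
{
    float h = i * 6.0f / N;                        // hue sector in [0, 6)
    float x = 1.0f - fabsf(fmodf(h, 2.0f) - 1.0f); // ramping channel
    switch ((int)h) {
    case 0:  return rgb_join(1, x, 0);
    case 1:  return rgb_join(x, 1, 0);
    case 2:  return rgb_join(0, 1, x);
    case 3:  return rgb_join(0, x, 1);
    case 4:  return rgb_join(x, 0, 1);
    default: return rgb_join(1, 0, x);
    }
}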
With the frame() function complete, all I need is a sorting function that calls frame() at appropriate times. Here are a couple of examples:
static void
shuffle(int array[N], uint64_t *rng)
{
for (int i = N - 1; i > 0; i--) {
uint32_t r = pcg32(rng) % (i + 1);
swap(array, i, r);
frame();
}
}
static void
sort_bubble(int array[N])
{
int c;
do {
c = 0;
for (int i = 1; i < N; i++) {
if (array[i - 1] > array[i]) {
swap(array, i - 1, i);
c = 1;
}
}
frame();
} while (c);
}
To add audio I need to keep track of which elements were swapped in this frame. When producing a frame I need to generate and mix tones for each element that was swapped.
Notice the swap() function above? That’s not just for convenience. That’s also how things are tracked for the audio.
static int swaps[N];
static void
swap(int a[N], int i, int j)
{
int tmp = a[i];
a[i] = a[j];
a[j] = tmp;
swaps[(a - array) + i]++;
swaps[(a - array) + j]++;
}
Before we get ahead of ourselves I need to write a WAV header. Without getting into the purpose of each field, just note that the header has 13 fields, followed immediately by 16-bit little endian PCM samples. There will be only one channel (mono).
#define HZ 44100 // audio sample rate
static void
wav_init(FILE *f)
{
emit_u32be(0x52494646UL, f); // "RIFF"
emit_u32le(0xffffffffUL, f); // file length
emit_u32be(0x57415645UL, f); // "WAVE"
emit_u32be(0x666d7420UL, f); // "fmt "
emit_u32le(16, f); // struct size
emit_u16le(1, f); // PCM
emit_u16le(1, f); // mono
emit_u32le(HZ, f); // sample rate (i.e. 44.1 kHz)
emit_u32le(HZ * 2, f); // byte rate
emit_u16le(2, f); // block size
emit_u16le(16, f); // bits per sample
emit_u32be(0x64617461UL, f); // "data"
emit_u32le(0xffffffffUL, f); // byte length
}
Rather than tackle the annoying problem of figuring out the total length of the audio ahead of time, I just wave my hands and write the maximum possible number of bytes (0xffffffff). Most software that can read WAV files will understand this to mean the entire rest of the file contains samples.
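The emit_* helpers used in wav_init() aren’t shown in this excerpt. They presumably write integers byte by byte in the indicated endianness; here’s a sketch of what such helpers could look like (an assumption, not the article’s code):
#include <stdio.h>

// Sketches of the endian-explicit output helpers (assumed definitions,
// not the article's). Each writes an integer one byte at a time.
static void
emit_u16le(unsigned v, FILE *f)
{
    fputc(v >> 0 & 0xff, f);
    fputc(v >> 8 & 0xff, f);
}

static void
emit_u32le(unsigned long v, FILE *f)
{
    fputc(v >>  0 & 0xff, f);
    fputc(v >>  8 & 0xff, f);
    fputc(v >> 16 & 0xff, f);
    fputc(v >> 24 & 0xff, f);
}

static void
emit_u32be(unsigned long v, FILE *f)
{
    fputc(v >> 24 & 0xff, f);
    fputc(v >> 16 & 0xff, f);
    fputc(v >>  8 & 0xff, f);
    fputc(v >>  0 & 0xff, f);
}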
With the header out of the way all I have to do is write 1/60th of a second worth of samples to this file each time a frame is produced. That’s 735 samples (1,470 bytes) at 44.1kHz.
The simplest place to do audio synthesis is in frame() right after rendering the image.
#define FPS 60 // output framerate
#define MINHZ 20 // lowest tone
#define MAXHZ 1000 // highest tone
static void
frame(void)
{
/* ... rendering ... */
/* ... synthesis ... */
}
With the largest tone frequency at 1kHz, Nyquist says we only need to sample at 2kHz. 8kHz is a very common sample rate and gives some overhead space, making it a good choice. However, I found that audio encoding software was a lot happier to accept the standard CD sample rate of 44.1kHz, so I stuck with that.
The first thing to do is to allocate and zero a buffer for this frame’s samples.
int nsamples = HZ / FPS;
static float samples[HZ / FPS];
memset(samples, 0, sizeof(samples));
Next determine how many “voices” there are in this frame. This is used to mix the samples by averaging them. If an element was swapped more than once this frame, it’s a little louder than the others — i.e. it’s played twice at the same time, in phase.
int voices = 0;
for (int i = 0; i < N; i++)
voices += swaps[i];
Here’s the most complicated part. I use sinf() to produce the sinusoidal wave based on the element’s frequency. I also use a parabola as an envelope to shape the beginning and ending of this tone so that it fades in and fades out. Otherwise you get the nasty, high-frequency “pop” sound as the wave is given a hard cut off.
for (int i = 0; i < N; i++) {
if (swaps[i]) {
float hz = i * (MAXHZ - MINHZ) / (float)N + MINHZ;
for (int j = 0; j < nsamples; j++) {
float u = 1.0f - j / (float)(nsamples - 1);
float parabola = 1.0f - (u * 2 - 1) * (u * 2 - 1);
float envelope = parabola * parabola * parabola;
float v = sinf(j * 2.0f * PI / HZ * hz) * envelope;
samples[j] += swaps[i] * v / voices;
}
}
}
Finally I write out each sample as a signed 16-bit value. I flush the frame audio just like I flushed the frame image, keeping them somewhat in sync from an outsider’s perspective.
for (int i = 0; i < nsamples; i++) {
int s = samples[i] * 0x7fff;
emit_u16le(s, wav);
}
fflush(wav);
Before returning, reset the swap counter for the next frame.
memset(swaps, 0, sizeof(swaps));
You may have noticed there was text rendered in the corner of the video announcing the sort function. There’s font bitmap data in font.h which gets sampled to render that text. It’s not terribly complicated, but you’ll have to study the code on your own to see how that works.
This simple video rendering technique has served me well for some years now. All it takes is a bit of knowledge about rendering. I learned quite a bit just from watching Handmade Hero, where Casey writes a software renderer from scratch, then implements a nearly identical renderer with OpenGL. The more I learn about rendering, the better this technique works.
Before writing this post I spent some time experimenting with using a media player as an interface to a game. For example, rather than render the game using OpenGL or similar, render it as PPM frames and send it to the media player to be displayed, just as game consoles drive television sets. Unfortunately the latency is horrible — multiple seconds — so that idea just doesn’t work. So while this technique is fast enough for real time rendering, it’s no good for interaction.
I’ve written a branchless UTF-8 decoder: one that decodes a code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. You can find the source code here along with a test suite and benchmark:
In addition to decoding the next code point, it detects any errors and returns a pointer to the next code point. It’s the complete package.
Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.
The usual analogy is laundry. You can have more than one load of laundry in process at a time because laundry is typically a pipelined process. There’s a washing machine stage, dryer stage, and folding stage. One load can be in the washer, a second in the drier, and a third being folded, all at once. This greatly increases throughput because, under ideal circumstances with a full pipeline, an instruction is completed each clock cycle despite any individual instruction taking many clock cycles to complete.
Branches are the enemy of pipelines. The CPU can’t begin work on the next instruction if it doesn’t know which instruction will be executed next. It must finish computing the branch condition before it can know. To deal with this, pipelined CPUs are also equipped with branch predictors. The predictor makes a guess at which branch will be taken and begins executing instructions on that branch. The prediction is initially made using static heuristics, and later those predictions are improved by learning from previous behavior. This even includes predicting the number of iterations of a loop so that the final iteration isn’t mispredicted.
A mispredicted branch has two dire consequences. First, all the progress on the incorrect branch will need to be discarded. Second, the pipeline will be flushed, and the CPU will be inefficient until the pipeline fills back up with instructions on the correct branch. With a sufficiently deep pipeline, it can easily be more efficient to compute and discard an unneeded result than to avoid computing it in the first place. Eliminating branches means eliminating the hazards of misprediction.
Another hazard for pipelines is dependencies. If an instruction depends on the result of a previous instruction, it may have to wait for the previous instruction to make sufficient progress before it can complete one of its stages. This is known as a pipeline stall, and it is an important consideration in instruction set architecture (ISA) design.
For example, on the x86-64 architecture, storing a 32-bit result in a 64-bit register will automatically clear the upper 32 bits of that register. Any further use of that destination register cannot depend on prior instructions since all bits have been set. This particular optimization was missed in the design of the i386: Writing a 16-bit result to a 32-bit register leaves the upper 16 bits intact, creating false dependencies.
Dependency hazards are mitigated using out-of-order execution. Rather than execute two dependent instructions back to back, which would result in a stall, the CPU may instead execute an independent instruction from further away in between. A good compiler will also try to spread out dependent instructions in its own instruction scheduling.
The effects of out-of-order execution are typically not visible to a single thread, where everything will appear to have executed in order. However, when multiple processes or threads can access the same memory out-of-order execution can be observed. It’s one of the many challenges of writing multi-threaded software.
The focus of my UTF-8 decoder was to be branchless, but there was one interesting dependency hazard that neither GCC nor Clang were able to resolve themselves. More on that later.
Without getting into the history of it, you can generally think of UTF-8 as a method for encoding a series of 21-bit integers (code points) into a stream of bytes.
Shorter integers encode to fewer bytes than larger integers. The shortest available encoding must be chosen, meaning there is one canonical encoding for a given sequence of code points.
Certain code points are off limits: surrogate halves. These are code points U+D800 through U+DFFF. Surrogates are used in UTF-16 to represent code points above U+FFFF and serve no purpose in UTF-8. This has interesting consequences for pseudo-Unicode strings, such as “wide” strings in the Win32 API, where surrogates may appear unpaired. Such sequences cannot legally be represented in UTF-8.
Keeping in mind these two rules, the entire format is summarized by this table:
length byte[0] byte[1] byte[2] byte[3]
1 0xxxxxxx
2 110xxxxx 10xxxxxx
3 1110xxxx 10xxxxxx 10xxxxxx
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The x
placeholders are the bits of the encoded code point.
UTF-8 has some really useful properties:
It’s backwards compatible with ASCII, which never used the highest bit.
Sort order is preserved. Sorting a set of code point sequences has the same result as sorting their UTF-8 encoding.
No additional zero bytes are introduced. In C we can continue using null terminated char buffers, often without even realizing they hold UTF-8 data.
It’s self-synchronizing. A leading byte will never be mistaken for a continuation byte. This allows for byte-wise substring searches, meaning UTF-8 unaware functions like strstr(3) continue to work without modification (except for normalization issues). It also allows for unambiguous recovery of a damaged stream.
A straightforward approach to decoding might look something like this:
unsigned char *
utf8_simple(unsigned char *s, long *c)
{
unsigned char *next;
if (s[0] < 0x80) {
*c = s[0];
next = s + 1;
} else if ((s[0] & 0xe0) == 0xc0) {
*c = ((long)(s[0] & 0x1f) << 6) |
((long)(s[1] & 0x3f) << 0);
next = s + 2;
} else if ((s[0] & 0xf0) == 0xe0) {
*c = ((long)(s[0] & 0x0f) << 12) |
((long)(s[1] & 0x3f) << 6) |
((long)(s[2] & 0x3f) << 0);
next = s + 3;
} else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4)) {
*c = ((long)(s[0] & 0x07) << 18) |
((long)(s[1] & 0x3f) << 12) |
((long)(s[2] & 0x3f) << 6) |
((long)(s[3] & 0x3f) << 0);
next = s + 4;
} else {
*c = -1; // invalid
next = s + 1; // skip this byte
}
if (*c >= 0xd800 && *c <= 0xdfff)
*c = -1; // surrogate half
return next;
}
It branches off on the highest bits of the leading byte, extracts all of those x bits from each byte, concatenates those bits, checks if it’s a surrogate half, and returns a pointer to the next character. (This implementation does not check that the highest two bits of each continuation byte are correct.)
The CPU must correctly predict the length of the code point or else it will suffer a hazard. An incorrect guess will stall the pipeline and slow down decoding.
In real world text this is probably not a serious issue. For the English language, the encoded length is nearly always a single byte. However, even for non-English languages, text is usually accompanied by markup from the ASCII range of characters, and, overall, the encoded lengths will still have consistency. As I said, the CPU predicts branches based on the program’s previous behavior, so this means it will temporarily learn some of the statistical properties of the language being actively decoded. Pretty cool, eh?
Eliminating branches from the decoder side-steps any issues with mispredicting encoded lengths. Only errors in the stream will cause stalls. Since that’s probably the unusual case, the branch predictor will be very successful by continually predicting success. That’s one optimistic CPU.
Here’s the interface to my branchless decoder:
void *utf8_decode(void *buf, uint32_t *c, int *e);
I chose void * for the buffer so that it doesn’t care what type was actually chosen to represent the buffer. It could be a uint8_t, char, unsigned char, etc. Doesn’t matter. The decoder accesses it only as bytes.
On the other hand, with this interface you’re forced to use uint32_t to represent code points. You could always change the function to suit your own needs, though.
Errors are returned in e. It’s zero for success and non-zero when an error was detected, without any particular meaning for different values. Error conditions are mixed into this integer, so a zero simply means the absence of error.
This is where you could accuse me of “cheating” a little bit. The caller probably wants to check for errors, and so they will have to branch on e. It seems I’ve just smuggled the branches outside of the decoder.
However, as I pointed out, unless you’re expecting lots of errors, the real cost is branching on encoded lengths. Furthermore, the caller could instead accumulate the errors: count them, or make the error “sticky” by ORing all e values together. Neither of these requires a branch. The caller could decode a huge stream and only check for errors at the very end. The only branch would be the main loop (“are we done yet?”), which is trivial to predict with high accuracy.
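To make that pattern concrete, here’s a minimal sketch of such a caller — buf, len, and consume() are hypothetical, and the buffer must be zero-padded as described below:
uint32_t c;
int e, err = 0;
unsigned char *p = buf;            // buf: zero-padded UTF-8 input (hypothetical)
unsigned char *end = buf + len;    // len: number of meaningful input bytes
while (p < end) {
    p = utf8_decode(p, &c, &e);
    err |= e;                      // accumulate errors, no branch
    consume(c);                    // hypothetical per-code-point work
}
if (err) {
    // handle invalid input once, after the bulk of the work
}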
The first thing the function does is extract the encoded length of the next code point:
static const char lengths[] = {
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
};
unsigned char *s = buf;
int len = lengths[s[0] >> 3];
Looking back to the UTF-8 table above, only the highest 5 bits determine the length. That’s 32 possible values. The zeros are for invalid prefixes. This will later cause a bit to be set in e.
With the length in hand, it can compute the position of the next code point in the buffer.
unsigned char *next = s + len + !len;
Originally this expression was the return value, computed at the very end of the function. However, after inspecting the compiler’s assembly output, I decided to move it up, and the result was a solid performance boost. That’s because it spreads out dependent instructions. With the address of the next code point known so early, the instructions that decode the next code point can get started early.
The reason for the !len is so that the pointer is advanced one byte even in the face of an error (length of zero). Adding that !len is actually somewhat costly, though I couldn’t figure out why.
static const unsigned char masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07}; // leading byte payload bits
static const int shiftc[] = {0, 18, 12, 6, 0};
*c = (uint32_t)(s[0] & masks[len]) << 18;
*c |= (uint32_t)(s[1] & 0x3f) << 12;
*c |= (uint32_t)(s[2] & 0x3f) << 6;
*c |= (uint32_t)(s[3] & 0x3f) << 0;
*c >>= shiftc[len];
This reads four bytes regardless of the actual length. Avoiding the unneeded reads would itself require a branch, so this can’t be helped. The unneeded bits are shifted out based on the length. That’s all it takes to decode UTF-8 without branching.
One important consequence of always reading four bytes is that the caller must zero-pad the buffer to at least four bytes. In practice, this means padding the entire buffer with three bytes in case the last character is a single byte.
The padding must be zero in order to detect errors. Otherwise the padding might look like legal continuation bytes.
static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
static const int shifte[] = {0, 6, 4, 2, 0};
*e = (*c < mins[len]) << 6;
*e |= ((*c >> 11) == 0x1b) << 7; // surrogate half?
*e |= (s[1] & 0xc0) >> 2;
*e |= (s[2] & 0xc0) >> 4;
*e |= (s[3] ) >> 6;
*e ^= 0x2a;
*e >>= shifte[len];
The first line checks if the shortest encoding was used, setting a bit in e if it wasn’t. For a length of 0, this always fails.
The second line checks for a surrogate half by checking for a certain prefix.
The next three lines accumulate the highest two bits of each continuation byte into e. Each should be the bits 10. These bits are “compared” to 101010 (0x2a) using XOR. The XOR clears these bits as long as they exactly match.
Finally the continuation prefix bits that don’t matter are shifted out.
My primary — and totally arbitrary — goal was to beat the performance of Björn Höhrmann’s DFA-based decoder. Under favorable (and artificial) benchmark conditions I had moderate success. You can try it out on your own system by cloning the repository and running make bench.
With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.
Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.
The benchmark operates very similarly to my PRNG shootout (e.g. alarm(2)). First a buffer is filled with random UTF-8 data, then the decoder decodes it again and again until the alarm fires. The measurement is the number of bytes decoded.
The number of errors is printed at the end (always 0) in order to force errors to actually get checked for each code point. Otherwise the sneaky compiler omits the error checking from the branchless decoder, making it appear much faster than it really is — a serious letdown once I noticed my error. Since the other decoder is a DFA and error checking is built into its graph, the compiler can’t really omit its error checking.
I called this “favorable” because the buffer being decoded isn’t anything natural. Each time a code point is generated, first a length is chosen uniformly: 1, 2, 3, or 4. Then a code point that encodes to that length is generated. The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction. Real text has a far more favorable distribution.
uint32_t
randchar(uint64_t *s)
{
uint32_t r = rand32(s);
int len = 1 + (r & 0x3);
r >>= 2;
switch (len) {
case 1:
return r % 128;
case 2:
return 128 + r % (2048 - 128);
case 3:
return 2048 + r % (65536 - 2048);
case 4:
return 65536 + r % (131072 - 65536);
}
abort();
}
Given the odd input zero-padding requirement and the artificial parameters of the benchmark, despite the supposed 20% speed boost under GCC, my branchless decoder is not really any better than the DFA decoder in practice. It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.
Update: Bryan Donlan has followed up with a SIMD UTF-8 decoder.
Update 2024: NRK has followed up with parallel extract decoder.
I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes.
Monte Carlo simulations, where PRNGs are used to compute numeric estimates for problems that are difficult or impossible to solve analytically.
Monte Carlo tree search AI, where massive numbers of games are played out randomly in search of an optimal move. This is a specific application of the last item.
Genetic algorithms, where a PRNG creates the initial population, and then later guides in mutation and breeding of selected solutions.
Cryptography, where cryptographically-secure PRNGs (CSPRNGs) produce output that is predictable for recipients who know a particular secret, but not for anyone else. This article is only concerned with plain PRNGs.
For the first three “simulation” uses, there are two primary factors that drive the selection of a PRNG. These factors can be at odds with each other:
The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.
PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Just as important is how you use it, but this article will focus only on generating bits.
In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.
Depending on the programming language, the PRNGs found in various standard libraries may be of dubious quality. They’re slower than they need to be, or have poorer quality than required. In some cases, such as rand() in C, the algorithm isn’t specified, and you can’t rely on it for anything outside of trivial examples. In other cases the algorithm and behavior is specified, but you could easily do better yourself.
My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.
A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”
For the past few years my main go to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:
Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.
Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.
Speed (MB/s)
PRNG FAIL WEAK gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline X X 15000 13100
blowfishcbc16 0 1 169 157
blowfishcbc4 0 5 725 676
blowfishctr16 1 3 187 184
blowfishctr4 1 5 890 1000
mt64 1 7 1700 1970
pcg64 0 4 4150 3290
rc4 0 5 366 185
spcg64 0 8 5140 4960
xoroshiro128+ 0 6 8100 7720
xorshift128+ 0 2 7660 6530
xorshift64* 0 3 4990 5060
And the actual dieharder outputs:
The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s clearly the fastest, and the output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.
These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.
In the speed benchmark, the PRNG is initialized, a 1-second alarm(1) is set, then the PRNG fills a large volatile buffer of 64-bit unsigned integers again and again as quickly as possible until the alarm fires. The amount of memory written is measured as the PRNG’s speed.
The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.
The purpose of making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.
On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.
The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect in my tests.
Due to the PRNGs being inlined, this unrolling makes the benchmark less realistic, and it shows in the results. Using volatile for the buffer helped to counter this effect and reground the results. This is a fuzzy problem, and there’s not really any way to avoid it, but I will also discuss this below.
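To make the harness concrete, here’s a minimal sketch of the approach described above — not the shootout’s actual code — using xoroshiro128+ (shown later in this article) as the generator under test:
#include <signal.h>
#include <stdint.h>
#include <unistd.h>

#define BUFLEN (256 * 1024)              // arbitrary buffer size
static volatile sig_atomic_t running = 1;
static volatile uint64_t buffer[BUFLEN]; // volatile "consumes" the output

static void alarm_handler(int sig) { (void)sig; running = 0; }

static long long
bench(uint64_t state[2])
{
    long long n = 0;
    signal(SIGALRM, alarm_handler);
    alarm(1);                            // run for one second
    for (long i = 0; running; i = (i + 8) % BUFLEN) {
        // unrolled 8x so the stop flag is checked only once per 8 outputs
        buffer[i + 0] = xoroshiro128plus(state);
        buffer[i + 1] = xoroshiro128plus(state);
        buffer[i + 2] = xoroshiro128plus(state);
        buffer[i + 3] = xoroshiro128plus(state);
        buffer[i + 4] = xoroshiro128plus(state);
        buffer[i + 5] = xoroshiro128plus(state);
        buffer[i + 6] = xoroshiro128plus(state);
        buffer[i + 7] = xoroshiro128plus(state);
        n += 8 * sizeof(uint64_t);
    }
    return n;                            // bytes written before the alarm
}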
To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:
prng | dieharder -g200 -a -m4
This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.
There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in the actual shootout) got 7 WEAK results in my tests.
The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.
Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.
Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.
One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.
Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.
The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.
In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.
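The difference between the two constructions is small. Here’s a sketch — assuming a hypothetical blowfish_encrypt() that encrypts one 64-bit block held as two 32-bit halves, which is not the shootout’s actual API — of how each mode turns the cipher into a generator:
// CBC-as-PRNG: the state starts as zeros and each output is the
// encryption of the previous output. (struct blowfish and
// blowfish_encrypt() are hypothetical stand-ins.)
static uint64_t
blowfish_cbc_next(struct blowfish *ctx, uint32_t state[2])
{
    blowfish_encrypt(ctx, &state[0], &state[1]);
    return (uint64_t)state[0] << 32 | state[1];
}

// CTR-as-PRNG: encrypt an incrementing counter, so every block is
// independent of the previous one.
static uint64_t
blowfish_ctr_next(struct blowfish *ctx, uint64_t *counter)
{
    uint32_t xl = *counter >> 32;
    uint32_t xr = *counter & 0xffffffff;
    (*counter)++;
    blowfish_encrypt(ctx, &xl, &xr);
    return (uint64_t)xl << 32 | xr;
}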
In CBC mode, each output depends on the previous, but in CTR mode all blocks are independent. Work can begin on the next output before the previous output is complete. The x86 architecture uses out-of-order execution to achieve many of its performance gains: Instructions may be executed in a different order than they appear in the program, though their observable effects must generally be ordered correctly. Breaking dependencies between instructions allows out-of-order execution to be fully exercised. It also gives the compiler more freedom in instruction scheduling, though the volatile accesses cannot be reordered with respect to each other (hence it helping to reground the benchmark).
Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.
On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?
In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.
Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:
It’s between 1/4 and 1/5 the speed of xoroshiro128+.
It’s got a large state: 2,500 bytes. Versus xoroshiro128+’s 16 bytes.
Its implementation is three times bigger than xoroshiro128+, and much more complicated.
It had one statistical failure (dab_filltree2).
Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.
Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.
The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.
PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).
Here’s essentially what the simplified version looks like:
uint32_t
spcg32(uint64_t s[1])
{
uint64_t m = 0x9b60933458e17d7d;
uint64_t a = 0xd737232eeccdf7ed;
*s = *s * m + a;
int shift = 29 - (*s >> 61);
return *s >> shift;
}
The third line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.
The two constants are entirely my own devising. It’s two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.
Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.
However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.
Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.
Here’s my doubled-up version:
uint64_t
spcg64(uint64_t s[2])
{
uint64_t m = 0x9b60933458e17d7d;
uint64_t a0 = 0xd737232eeccdf7ed;
uint64_t a1 = 0x8b260b70b8e98891;
uint64_t p0 = s[0];
uint64_t p1 = s[1];
s[0] = p0 * m + a0;
s[1] = p1 * m + a1;
int r0 = 29 - (p0 >> 61);
int r1 = 29 - (p1 >> 61);
uint64_t high = p0 >> r0;
uint32_t low = p1 >> r1;
return (high << 32) | low;
}
The “full” PCG has some extra shifts that makes it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.
In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.
RC4 is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.
xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:
uint64_t
xoroshiro128plus(uint64_t s[2])
{
uint64_t s0 = s[0];
uint64_t s1 = s[1];
uint64_t result = s0 + s1;
s1 ^= s0;
s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
s[1] = (s1 << 36) | (s1 >> 28);
return result;
}
There’s one important caveat: that 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.
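For example, here’s a minimal take on that advice. The two constants are arbitrary stand-ins for “16 static random bytes”, not values from this article:
#include <stdint.h>
#include <time.h>

// Seed xoroshiro128+ from the unix epoch, XORed with fixed random
// constants so the state is never mostly zeros. The constants here are
// arbitrary examples.
static void
seed_from_epoch(uint64_t s[2])
{
    uint64_t now = (uint64_t)time(NULL);
    s[0] = now ^ 0x9e3779b97f4a7c15;
    s[1] = now ^ 0xd1b54a32d192ed03;
}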
These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.
uint64_t
xorshift64star(uint64_t s[1])
{
uint64_t x = s[0];
x ^= x >> 12;
x ^= x << 25;
x ^= x >> 27;
s[0] = x;
return x * UINT64_C(0x2545f4914f6cdd1d);
}
However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.
If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.
uint64_t
xorshift128plus(uint64_t s[2])
{
uint64_t x = s[0];
uint64_t y = s[1];
s[0] = y;
x ^= x << 23;
s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
return s[1] + y;
}
It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.
My own takeaway (until I re-evaluate some years in the future): xoroshiro128+ is now my go-to drop-in PRNG.
Things can change significantly between platforms, though. Here’s the shootout on an ARM Cortex-A53:
Speed (MB/s)
PRNG gcc-5.4.0 clang-3.8.0
------------------------------------
baseline 2560 2400
blowfishcbc16 36.5 45.4
blowfishcbc4 135 173
blowfishctr16 36.4 45.2
blowfishctr4 133 168
mt64 207 254
pcg64 980 712
rc4 96.6 44.0
spcg64 1021 948
xoroshiro128+ 2560 1570
xorshift128+ 2560 1520
xorshift64* 1360 1080
LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.
So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.
Most importantly, since Blowpipe is intended to be used as a pipe (duh), it will never output decrypted plaintext that hasn’t been authenticated. That is, it will detect tampering of the encrypted stream and truncate its output, reporting an error, without producing the manipulated data. Some very similar tools that aren’t considered toys lack this important feature, such as aespipe.
Blowpipe came about because I wanted to study Blowfish, a 64-bit block cipher designed by Bruce Schneier in 1993. It’s played an important role in the history of cryptography and has withstood cryptanalysis for 24 years. Its major weakness is its small block size, leaving it vulnerable to birthday attacks regardless of any other property of the cipher. Even in 1993 the 64-bit block size was a bit on the small side, but Blowfish was intended as a drop-in replacement for the Data Encryption Standard (DES) and the International Data Encryption Algorithm (IDEA), other 64-bit block ciphers.
The main reason I’m calling this program a toy is that, outside of legacy interfaces, it’s simply not appropriate to deploy a 64-bit block cipher in 2017. Blowpipe shouldn’t be used to encrypt more than a few tens of GBs of data at a time. Otherwise I’m fairly confident in both my message construction and my implementation. One detail is a little uncertain, and I’ll discuss it later when describing message format.
A tool that I am confident about is Enchive, though since it’s intended for file encryption, it’s not appropriate for use as a pipe. It doesn’t authenticate until after it has produced most of its output. Enchive does try its best to delete files containing unauthenticated output when authentication fails, but this doesn’t prevent you from consuming this output before it can be deleted, particularly if you pipe the output into another program.
As you might expect, there are two modes of operation: encryption (-E) and decryption (-D). The simplest usage is encrypting and decrypting a file:
$ blowpipe -E < data.gz > data.gz.enc
$ blowpipe -D < data.gz.enc | gunzip > data.txt
In both cases you will be prompted for a passphrase which can be up to 72 bytes in length. The only verification for the key is the first Message Authentication Code (MAC) in the datastream, so Blowpipe cannot tell the difference between damaged ciphertext and an incorrect key.
In a script it would be smart to check Blowpipe’s exit code after decrypting. The output will be truncated should authentication fail somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t clean up for you.
Another use case is securely transmitting files over a network with netcat. In this example I’ll use a pre-shared key file, keyfile. Rather than prompt for a key, Blowpipe will use the raw bytes of a given file. Here’s how I would create a key file:
$ head -c 32 /dev/urandom > keyfile
First the receiver listens on a socket (bind(2)):
$ nc -lp 2000 | blowpipe -D -k keyfile > data.zip
Then the sender connects (connect(2)) and pipes Blowpipe through:
$ blowpipe -E -k keyfile < data.zip | nc -N hostname 2000
If all went well, Blowpipe will exit with 0 on the receiver side.
Blowpipe doesn’t buffer its output (but see -w). It performs one read(2), encrypts whatever it got, prepends a MAC, and calls write(2) on the result. This means it can comfortably transmit live sensitive data across the network:
$ nc -lp 2000 | blowpipe -D
# dmesg -w | blowpipe -E | nc -N hostname 2000
Kernel messages will appear on the other end as they’re produced by dmesg. Keep in mind, though, that the size of each line will be known to eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to disguise the length, and those lengths may leak useful information.
This whole project started when I wanted to play with Blowfish as a small drop-in library. I wasn’t satisfied with the selection, so I figured it would be a good exercise to write my own. Besides, the specification is both an enjoyable and easy read (and recommended). It justifies the need for a new cipher and explains the various design decisions.
I coded from the specification, including writing a script to generate the subkey initialization tables. Subkeys are initialized to the binary representation of pi (the first ~10,000 decimal digits). After a couple hours of work I hooked up the official test vectors to see how I did, and all the tests passed on the first run. This didn’t seem reasonable, so I spent a while longer figuring out how I had screwed up my tests. Turns out I absolutely nailed it on my first shot. It’s a really great sign for Blowfish that it’s so easy to implement correctly.
Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of storage. The key schedule is unusually complex: Subkeys are repeatedly encrypted with themselves as they are being computed. This complexity inspired the bcrypt password hashing scheme, which essentially works by iterating the key schedule many times in a loop, then encrypting a constant 24-byte string. My bcrypt implementation wasn’t nearly as successful on my first attempt, and it took hours of debugging in order to match OpenBSD’s outputs.
The encryption and decryption algorithms are nearly identical, as is typical for, and a feature of, Feistel ciphers. There are no branches (preventing some side-channel attacks), and the only operations are 32-bit XOR and 32-bit addition. This makes it ideal for implementation on 32-bit computers.
One tricky point is that encryption and decryption operate on a pair of 32-bit integers (another giveaway that it’s a Feistel cipher). To put the cipher to practical use, these integers have to be serialized into a byte stream. The specification doesn’t choose a byte order, even for mixing the key into the subkeys. The official test vectors are also 32-bit integers, not byte arrays. An implementer could choose little endian, big endian, or even something else.
However, there’s one place in which this decision is formally made: the official test vectors mix the key into the first subkey in big endian byte order. By luck I happened to choose big endian as well, which is why my tests passed on the first try. OpenBSD’s version of bcrypt also uses big endian for all integer encoding steps, further cementing big endian as the standard way to encode Blowfish integers.
The Blowpipe repository contains a ready-to-use, public domain Blowfish library written in strictly conforming C99. The interface is just three functions:
void blowfish_init(struct blowfish *, const void *key, int len);
void blowfish_encrypt(struct blowfish *, uint32_t *, uint32_t *);
void blowfish_decrypt(struct blowfish *, uint32_t *, uint32_t *);
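For instance, tying this interface to the byte order discussion above, a caller might serialize an 8-byte block like this. This is my own sketch (the function name is invented), not part of the library:
/* Encrypt one 8-byte block in place, big endian, as the test vectors imply. */
void
encrypt_block(struct blowfish *ctx, unsigned char b[8])
{
    uint32_t xl = (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
                  (uint32_t)b[2] <<  8 | (uint32_t)b[3];
    uint32_t xr = (uint32_t)b[4] << 24 | (uint32_t)b[5] << 16 |
                  (uint32_t)b[6] <<  8 | (uint32_t)b[7];
    blowfish_encrypt(ctx, &xl, &xr);
    b[0] = xl >> 24; b[1] = xl >> 16; b[2] = xl >> 8; b[3] = xl;
    b[4] = xr >> 24; b[5] = xr >> 16; b[6] = xr >> 8; b[7] = xr;
}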
Technically the key can be up to 72 bytes long, but the last 16 bytes have an incomplete effect on the subkeys, so only the first 56 bytes should matter. Since bcrypt runs the key schedule multiple times, all 72 bytes have full effect.
The library also includes a bcrypt implementation, though it will only produce the raw password hash, not the base-64 encoded form. The main reason for including bcrypt is to support Blowpipe.
The main goal of Blowpipe was to build a robust, authenticated encryption tool using only Blowfish as a cryptographic primitive.
It uses bcrypt with a moderately-high cost as a key derivation function (KDF). Not terrible, but this is not a memory hard KDF, which is important for protecting against cheap hardware brute force attacks.
Encryption is Blowfish in counter (CTR) mode. A 64-bit counter is incremented and encrypted, producing a keystream. The plaintext is XORed with this keystream like a stream cipher. This allows the last block to be truncated in the output and eliminates some padding issues. Since CTR mode is trivially malleable, the MAC becomes even more important. In CTR mode, blowfish_decrypt() is never called. In fact, Blowpipe never uses it.
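To make the CTR construction concrete, here’s a rough sketch of how the keystream might be generated and applied using the library interface above. The function name, the counter encoding, and the big endian serialization are my assumptions from the description, not Blowpipe’s actual code (it uses the <stdint.h> and <stddef.h> types):
/* Sketch: XOR a buffer with a Blowfish-CTR keystream derived from *counter. */
void
ctr_xor(struct blowfish *ctx, uint64_t *counter, unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        /* Encrypt the counter value to produce 8 keystream bytes. */
        uint32_t hi = *counter >> 32;
        uint32_t lo = *counter & 0xffffffff;
        blowfish_encrypt(ctx, &hi, &lo);
        unsigned char ks[8] = {
            hi >> 24, hi >> 16 & 0xff, hi >> 8 & 0xff, hi & 0xff,
            lo >> 24, lo >> 16 & 0xff, lo >> 8 & 0xff, lo & 0xff,
        };
        /* XOR into the data; the final block may be short (truncated). */
        for (size_t j = 0; j < 8 && i + j < len; j++)
            buf[i + j] ^= ks[j];
        ++*counter;
    }
}
Decryption applies the exact same function to the ciphertext.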
The authentication scheme is Blowfish-CBC-MAC with a unique key and encrypt-then-authenticate (something I harmlessly got wrong with Enchive). It essentially encrypts the ciphertext a second time with a different key in Cipher Block Chaining (CBC) mode, but keeps only the final block. That final block is prepended to the ciphertext as the MAC. On decryption the same block is computed again to ensure that it matches. Only someone who knows the MAC key can compute it.
Of all three Blowfish uses, I’m least confident about authentication. CBC-MAC is tricky to get right, though I am following the rules: fixed length messages using a different key than encryption.
Wait a minute. Blowpipe is pipe-oriented and can output data without buffering the entire pipe. How can there be fixed-length messages?
The pipe datastream is broken into 64kB chunks. Each chunk is authenticated with its own MAC. Both the MAC and chunk length are written in the chunk header, and the length is authenticated by the MAC. Furthermore, just like the keystream, the MAC is continued from the previous chunk, preventing chunks from being reordered. Blowpipe can output the content of a chunk and discard it once it’s been authenticated. If any chunk fails to authenticate, it aborts.
This also leads to another useful trick: The pipe is terminated with a zero length chunk, preventing an attacker from appending to the datastream. Everything after the zero-length chunk is discarded. Since the length is authenticated by the MAC, the attacker also cannot truncate the pipe since that would require knowledge of the MAC key.
The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1 byte for the bcrypt cost. The salt is like an initialization vector (IV) that allows keys to be safely reused in different Blowpipe instances. The cost byte is the only distinguishing byte in the stream. Since even the chunk lengths are encrypted, everything else in the datastream should be indistinguishable from random data.
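Pulling those details together, the datastream looks roughly like this. The MAC is one 8-byte Blowfish block; the exact width and placement of the encrypted length field are my reading of the description, not a formal specification:
16-byte bcrypt salt | 1-byte cost               <- pipe header (the only distinguishing byte)
MAC | encrypted length | encrypted chunk data   <- repeated, chunks of up to 64kB
MAC | encrypted zero length                     <- terminator; anything after is discarded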
Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I initially wrote it for POSIX (on Linux) of course, but I took an unusual approach when it came time to port it to Windows. Normally I’d invent a generic OS interface that makes the appropriate host system calls. This time I kept the POSIX interface (read(2), write(2), open(2), etc.) and implemented the tiny subset of POSIX that I needed in terms of Win32. That implementation can be found under w32-compat/. I even dropped in a copy of my own getopt().
One really cool feature of this technique is that, on Windows, Blowpipe will still “open” /dev/urandom. That request is intercepted by my own open(2), which, in response to that filename, actually calls CryptAcquireContext() and pretends like it’s a file. It’s all hidden behind the file descriptor. That’s the unix way.
I’m considering giving Enchive the same treatment since it would simplify and reduce much of the interface code. In fact, this project has taught me a number of ways that Enchive could be improved. That’s the value of writing “toys” such as Blowpipe.
Gap buffers are very easy to implement. A bare minimum implementation is about 60 lines of C.
Gap buffers are especially efficient for the majority of typical editing commands, which tend to be clustered in a small area.
Except for the gap, the content of the buffer is contiguous, making the search and display implementations simpler and more efficient. There’s also the potential for most of the gap buffer to be memory-mapped to the original file, though typical encoding and decoding operations prevent this from being realized.
Due to having contiguous content, saving a gap buffer is basically just two write(2) system calls (plus fsync(2), etc.).
A gap buffer is really a pair of buffers where one buffer holds all of the content before the cursor (or point for Emacs), and the other buffer holds the content after the cursor. When the cursor is moved through the buffer, characters are copied from one buffer to the other. Inserts and deletes close to the gap are very efficient.
Typically it’s implemented as a single large buffer, with the pre-cursor content at the beginning, the post-cursor content at the end, and the gap spanning the middle. Here’s an illustration:
The top of the animation is the display of the text content and cursor as the user would see it. The bottom is the gap buffer state, where each character is represented as a gray block, and a literal gap for the cursor.
Ignoring for a moment more complicated concerns such as undo and Unicode, a gap buffer could be represented by something as simple as the following:
struct gapbuf {
char *buf;
size_t total; /* total size of buf */
size_t front; /* size of content before cursor */
size_t gap; /* size of the gap */
};
This is close to how Emacs represents it. In the structure above, the size of the content after the cursor isn’t tracked directly, but can be computed on the fly from the other three quantities. That is to say, this data structure is normalized.
As an optimization, the cursor could be tracked separately from the gap such that non-destructive cursor movement is essentially free. The difference between cursor and gap would only need to be reconciled for a destructive change — an insert or delete.
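To make the mechanics concrete, here’s a minimal sketch of the two core operations on the structure above: moving the gap to a new cursor position and inserting a character. The function names are my own, and the sketch ignores growing the buffer when the gap is exhausted:
#include <string.h>

/* Move the gap so that it begins at pos. */
void
gapbuf_move(struct gapbuf *b, size_t pos)
{
    if (pos < b->front) {
        /* Shift the characters between pos and the gap up to the gap's far end. */
        memmove(b->buf + pos + b->gap, b->buf + pos, b->front - pos);
    } else if (pos > b->front) {
        /* Pull characters from beyond the gap down to just before it. */
        memmove(b->buf + b->front, b->buf + b->front + b->gap, pos - b->front);
    }
    b->front = pos;
}

/* Insert one character at the cursor, shrinking the gap by one. */
void
gapbuf_insert(struct gapbuf *b, char c)
{
    b->buf[b->front++] = c;
    b->gap--;
}
An edit far from the gap costs one big memmove(3) to relocate the gap, while repeated edits near the gap cost almost nothing. That asymmetry is exactly what the multiple cursors discussion below runs into.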
A gap buffer certainly isn’t the only way to do it. For example, the original vi used an array of lines, which sort of explains some of its quirky line-oriented idioms. The BSD clone of vi, nvi, uses an entire database to represent buffers. Vim uses a fairly complex rope-like data structure with page-oriented blocks, which may be stored out-of-order in its swap file.
Multiple cursors is a fairly recent text editor invention that has gained a lot of popularity in recent years. It seems every major editor either has the feature built in or has a readily-available extension. I myself used Magnar Sveen’s well-polished package for several years. Though obviously the concept didn’t originate in Emacs, or else it would have been called multiple points, which doesn’t roll off the tongue quite the same way.
The concept is simple: If the same operation needs to be done in many different places in a buffer, you place a cursor at each position, then drive them all in parallel using the same commands. It’s super flashy and great for impressing all your friends.
However, as a result of improving my typing skills, I’ve come to the conclusion that multiple cursors is all hat and no cattle. It doesn’t compose well with other editing commands, it doesn’t scale up to large operations, and it’s got all sorts of flaky edge cases (off-screen cursors). Nearly anything you can do with multiple cursors, you can do better with old, well-established editing paradigms.
Somewhere around 99% of my multiple cursors usage was adding a common prefix to a contiguous series of lines. As similar brute force options, Emacs already has rectangular editing, and Vim already has visual block mode.
The most sophisticated, flexible, and robust alternative is a good old macro. You can play it back anywhere it’s needed. You can zip it across a huge buffer. The only downside is that it’s less flashy and so you’ll get invited to a slightly smaller number of parties.
But if you don’t buy my arguments about multiple cursors being tasteless, there’s still a good technical argument: Gap buffers are not designed to work well in the face of multiple cursors!
For example, suppose we have a series of function calls and we’d like to add the same set of arguments to each. It’s a classic situation for a macro or for multiple cursors. Here’s the original code:
foo();
bar();
baz();
The example is tiny so that it will fit in the animations to come. Here’s the desired code:
foo(x, y);
bar(x, y);
baz(x, y);
With multiple cursors you would place a cursor inside each set of parentheses, then type x, y. Visually it looks something like this:
Text is magically inserted in parallel in multiple places at a time.
However, if this is a text editor that uses a gap buffer, the situation underneath isn’t quite so magical. The entire edit doesn’t happen at once. First the x is inserted in each location, then the comma, and so on. The edits are not clustered so nicely.
From the gap buffer’s point of view, here’s what it looks like:
For every individual character insertion the buffer has to visit each cursor in turn, performing lots of copying back and forth. The more cursors there are, the worse it gets. For an edit of length n with m cursors, that’s O(n * m) calls to memmove(3). Multiple cursors scales badly.
Compare that to the old school hacker who can’t be bothered with something as tacky and modern (eww!) as multiple cursors, instead choosing to record a macro, then play it back:
The entire edit is done locally before moving on to the next location. It’s perfectly in tune with the gap buffer’s expectations, only needing O(m) calls to memmove(3). Most of the work flows neatly into the gap.
So, don’t waste your time with multiple cursors, especially if you’re using a gap buffer text editor. Instead get more comfortable with your editor’s macro feature. If your editor doesn’t have a good macro feature, get a new editor.
If you want to make your own gap buffer animations, here’s the source code. It includes a tiny gap buffer implementation:
gmake) instead of the system’s make.
I’ve since become familiar and comfortable with make’s official specification, and I’ve spent the last year writing strictly portable Makefiles. Not only are my builds now portable across all unix-like systems, my Makefiles are cleaner and more robust. Many of the common make extensions — conditionals in particular — lead to fragile, complicated Makefiles and are best avoided anyway. It’s important to be able to trust your build system to do its job correctly.
This tutorial should be suitable for make beginners who have never written their own Makefiles before, as well as experienced developers who want to learn how to write portable Makefiles. Regardless, in order to understand the examples you must be familiar with the usual steps for building programs on the command line (compiler, linker, object files, etc.). I’m not going to suggest any fancy tricks nor provide any sort of standard starting template. Makefiles should be dead simple when the project is small, and grow in a predictable, clean fashion alongside the project.
I’m not going to cover every feature. You’ll need to read the specification for yourself to learn it all. This tutorial will go over the important features as well as the common conventions. It’s important to follow established conventions so that people using your Makefiles will know what to expect and how to accomplish the basic tasks.
If you’re running Debian, or a Debian derivative such as Ubuntu, the bmake and freebsd-buildutils packages will provide the bmake and fmake programs respectively. These alternative make implementations are very useful for testing your Makefiles’ portability, should you accidentally make use of a GNU Make feature. It’s not perfect since each implements some of the same extensions as GNU Make, but it will catch some common mistakes.
I am free, no matter what rules surround me. If I find them tolerable, I tolerate them; if I find them too obnoxious, I break them. I am free because I know that I alone am morally responsible for everything I do. ―Robert A. Heinlein
At make’s core are one or more dependency trees, constructed from rules. Each vertex in the tree is called a target. The final products of the build (executable, document, etc.) are the tree roots. A Makefile specifies the dependency trees and supplies the shell commands to produce a target from its prerequisites.
In this illustration, the “.c” files are source files that are written by hand, not generated by commands, so they have no prerequisites. The syntax for specifying one or more edges in this dependency tree is simple:
target [target...]: [prerequisite...]
While technically multiple targets can be specified in a single rule, this is unusual. Typically each target is specified in its own rule. To specify the tree in the illustration above:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
The order of these rules doesn’t matter. The entire Makefile is parsed before any actions are taken, so the tree’s vertices and edges can be specified in any order. There’s one exception: the first non-special target in a Makefile is the default target. This target is selected implicitly when make is invoked without choosing a target. It should be something sensible, so that a user can blindly run make and get a useful result.
A target can be specified more than once. Any new prerequisites are appended to the previously-given prerequisites. For example, this Makefile is identical to the previous, though it’s typically not written this way:
game: graphics.o
game: physics.o
game: input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
There are a handful of special targets that are used to change the behavior of make itself. All have uppercase names and start with a period. Names fitting this pattern are reserved for use by make. According to the standard, in order to get reliable POSIX behavior, the first non-comment line of the Makefile must be .POSIX. Since this is a special target, it’s not a candidate for the default target, so game will remain the default target:
.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c
physics.o: physics.c
input.o: input.c
In practice, even a simple program will have header files, and sources that include a header file should also have an edge on the dependency tree for it. If the header file changes, targets that include it should also be rebuilt.
.POSIX:
game: graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
We’ve constructed a dependency tree, but we still haven’t told make how to actually build any targets from its prerequisites. The rules also need to specify the shell commands that produce a target from its prerequisites.
If you were to create the source files in the example and invoke make, you would find that it actually does know how to build the object files. This is because make is initially configured with certain inference rules, a topic which will be covered later. For now, we’ll add the .SUFFIXES special target to the top, erasing all the built-in inference rules.
Commands immediately follow the target/prerequisite line in a rule. Each command line must start with a tab character. This can be awkward if your text editor isn’t configured for it, and it will be awkward if you try to copy the examples from this page.
Each line is run in its own shell, so be mindful of using commands like cd, which won’t affect later lines.
The simplest thing to do is literally specify the same commands you’d type at the shell:
.POSIX:
.SUFFIXES:
game: graphics.o physics.o input.o
	cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
	cc -c graphics.c
physics.o: physics.c physics.h
	cc -c physics.c
input.o: input.c input.h graphics.h physics.h
	cc -c input.c
I tried to walk into Target, but I missed. ―Mitch Hedberg
When invoking make, it accepts zero or more targets from the dependency tree, and it will build these targets — i.e. run the commands in the target’s rule — if the target is out-of-date. A target is out-of-date if it is older than any of its prerequisites.
# build the "game" binary (default target)
$ make
# build just the object files
$ make graphics.o physics.o input.o
This effect cascades up the dependency tree and causes further targets to be rebuilt until all of the requested targets are up-to-date. There’s a lot of room for parallelism since different branches of the tree can be updated independently. It’s common for make implementations to support parallel builds with the -j option. This is non-standard, but it’s a fantastic feature that doesn’t require anything special in the Makefile to work correctly.
Similar to parallel builds is make’s -k (“keep going”) option, which is standard. This tells make not to stop on the first error, and to continue updating targets that are unaffected by the error. This is nice for fully populating Vim’s quickfix list or Emacs’ compilation buffer.
It’s common to have multiple targets that should be built by default. If the first rule selects the default target, how do we solve the problem of needing multiple default targets? The convention is to use phony targets. These are called “phony” because there is no corresponding file, and so phony targets are never up-to-date. It’s convention for a phony “all” target to be the default target.
I’ll make game a prerequisite of a new “all” target. More real targets could be added as necessary to turn them into defaults. Users of this Makefile will also expect make all to build the entire project.
Another common phony target is “clean”, which removes all of the built files. Users will expect make clean to delete all generated files.
.POSIX:
.SUFFIXES:
all: game
game: graphics.o physics.o input.o
	cc -o game graphics.o physics.o input.o
graphics.o: graphics.c graphics.h
	cc -c graphics.c
physics.o: physics.c physics.h
	cc -c physics.c
input.o: input.c input.h graphics.h physics.h
	cc -c input.c
clean:
	rm -f game graphics.o physics.o input.o
So far the Makefile hardcodes cc as the compiler and doesn’t use any compiler flags (warnings, optimization, hardening, etc.). The user should be able to easily control all of these things, but right now they’d have to edit the entire Makefile to do so. Perhaps the user has both gcc and clang installed, and wants to choose one or the other without changing which is installed as cc.
To solve this, make has macros that expand into strings when referenced. The convention is to use the macro named CC when talking about the C compiler, CFLAGS for flags passed to the C compiler, LDFLAGS for flags passed to the C compiler when linking, and LDLIBS for flags about libraries when linking. The Makefile should supply defaults as needed.
A macro is expanded with $(...). It’s valid (and normal) to reference a macro that hasn’t been defined, which will be an empty string. This will be the case with LDFLAGS below.
Macro values can contain other macros, which will be expanded recursively each time the macro is expanded. Some make implementations allow the name of the macro being expanded to itself be a macro, which is Turing-complete, but this behavior is non-standard.
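For example (macro names invented for illustration), a macro defined in terms of another macro picks up its value at each expansion:
WARNINGS = -Wall -Wextra
CFLAGS = $(WARNINGS) -O
# $(CFLAGS) expands to "-Wall -Wextra -O" wherever it's referenced.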
.POSIX:
.SUFFIXES:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
all: game
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
	$(CC) -c $(CFLAGS) graphics.c
physics.o: physics.c physics.h
	$(CC) -c $(CFLAGS) physics.c
input.o: input.c input.h graphics.h physics.h
	$(CC) -c $(CFLAGS) input.c
clean:
	rm -f game graphics.o physics.o input.o
Macros are overridden by macro definitions given as command line arguments in the form name=value. This allows the user to select their own build configuration. This is one of make’s most powerful and under-appreciated features.
$ make CC=clang CFLAGS='-O3 -march=native'
If the user doesn’t want to specify these macros on every invocation, they can (cautiously) use make’s -e flag to set overriding macro definitions from the environment.
$ export CC=clang
$ export CFLAGS=-O3
$ make -e all
Some make implementations have other special kinds of macro assignment operators beyond simple assignment (=). These are unnecessary, so don’t worry about them.
The road itself tells us far more than signs do. ―Tom Vanderbilt, Traffic: Why We Drive the Way We Do
There’s repetition across the three different object files. Wouldn’t it be nice if there was a way to communicate this pattern? Fortunately there is, in the form of inference rules: a target with a certain extension is built from a prerequisite with another certain extension in a certain way. This will make more sense with an example.
In an inference rule, the target indicates the extensions. The $< macro expands to the prerequisite, which is essential to making inference rules work generically. Unfortunately this macro is not available in target rules, as much as that would be useful.
For example, here’s an inference rule that teaches make how to build an object file from a C source file. This particular rule is one that is pre-defined by make, so you’ll never need to write this one yourself. I’ll include it for completeness.
.c.o:
	$(CC) $(CFLAGS) -c $<
These extensions must be added to .SUFFIXES before they will work. With that, the commands for the rules about object files can be omitted.
.POSIX:
.SUFFIXES:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
all: game
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
	rm -f game graphics.o physics.o input.o
.SUFFIXES: .c .o
.c.o:
	$(CC) $(CFLAGS) -c $<
The first, empty .SUFFIXES clears the suffix list. The second one adds .c and .o to the now-empty suffix list.
Conventions are, indeed, all that shield us from the shivering void, though often they do so but poorly and desperately. ―Robert Aickman
Users usually expect an “install” target that installs the built program, libraries, man pages, etc. By convention this target should use the PREFIX and DESTDIR macros.
The PREFIX macro should default to /usr/local, and since it’s a macro the user can override it to install elsewhere, such as in their home directory. The user should override it for both building and installing, since the prefix may need to be built into the binary (e.g. -DPREFIX=$(PREFIX)).
The DESTDIR macro is used for staged installs, so that everything gets installed under a fake root directory for the sake of packaging. Unlike PREFIX, the program will not actually be run from this directory.
.POSIX:
CC = cc
CFLAGS = -W -O
LDLIBS = -lm
PREFIX = /usr/local
all: game
install: game
	mkdir -p $(DESTDIR)$(PREFIX)/bin
	mkdir -p $(DESTDIR)$(PREFIX)/share/man/man1
	cp -f game $(DESTDIR)$(PREFIX)/bin
	gzip < game.1 > $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
game: graphics.o physics.o input.o
	$(CC) $(LDFLAGS) -o game graphics.o physics.o input.o $(LDLIBS)
graphics.o: graphics.c graphics.h
physics.o: physics.c physics.h
input.o: input.c input.h graphics.h physics.h
clean:
	rm -f game graphics.o physics.o input.o
You may also want to provide an “uninstall” phony target that does the opposite.
$ make PREFIX=$HOME/.local install
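As a sketch of what that might look like (not from the tutorial itself, and assuming the same installation layout as the install rule above; remember that the command lines start with a tab):
uninstall:
	rm -f $(DESTDIR)$(PREFIX)/bin/game
	rm -f $(DESTDIR)$(PREFIX)/share/man/man1/game.1.gz
Then make uninstall removes exactly what make install created.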
Other common targets are “mostlyclean” (like “clean” but don’t delete some slow-to-build targets), “distclean” (delete even more than “clean”), “test” or “check” (run the test suite), and “dist” (create a package).
One of make’s big weak points is scaling up as a project grows in size.
As your growing project is broken into subdirectories, you may be tempted to put a Makefile in each subdirectory and invoke them recursively.
Don’t use recursive Makefiles. It breaks the dependency tree across separate instances of make and typically results in a fragile build. There’s nothing good about it. Have one Makefile at the root of your project and invoke make there. You may have to teach your text editor how to do this.
When talking about files in subdirectories, just include the subdirectory in the name. Everything will work the same as far as make is concerned, including inference rules.
src/graphics.o: src/graphics.c
src/physics.o: src/physics.c
src/input.o: src/input.c
Keeping your object files separate from your source files is a nice idea. When it comes to make, there’s good news and bad news.
The good news is that make can do this. You can pick whatever file names you like for targets and prerequisites.
obj/input.o: src/input.c
The bad news is that inference rules are not compatible with out-of-source builds. You’ll need to repeat the same commands for each rule as if inference rules didn’t exist. This is tedious for large projects, so you may want to have some sort of “configure” script, even if hand-written, to generate all this for you. This is essentially what CMake is all about. That, plus dependency management.
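For instance, the rule above would need its command spelled out explicitly, something like this (the paths, flags, and header prerequisite are illustrative, not from a real project):
obj/input.o: src/input.c src/input.h
	$(CC) $(CFLAGS) -o obj/input.o -c src/input.c
Note the explicit -o to place the object file; in practice compilers accept it together with -c.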
Another problem with scaling up is tracking the project’s ever-changing dependencies across all the source files. Missing a dependency means the build may not be correct unless you make clean first.
If you go the route of using a script to generate the tedious parts of the Makefile, both GCC and Clang have a nice feature for generating all the Makefile dependencies for you (-MM, -MT), at least for C and C++. There are lots of tutorials for doing this dependency generation on the fly as part of the build, but it’s fragile and slow. Much better to do it all up front and “bake” the dependencies into the Makefile so that make can do its job properly. If the dependencies change, rebuild your Makefile.
For example, here’s what it looks like invoking gcc’s dependency generator against the imaginary input.c for an out-of-source build:
$ gcc $CFLAGS -MM -MT '$(BUILD)/input.o' input.c
$(BUILD)/input.o: input.c input.h graphics.h physics.h
Notice the output is in Makefile’s rule format.
Unfortunately this feature strips the leading paths from the target, so, in practice, using it is always more complicated than it should be (e.g. it requires the use of -MT).
Microsoft has an implementation of make called Nmake, which comes with Visual Studio. It’s nearly a POSIX-compatible make, but necessarily breaks from the standard in some places. Their cl.exe compiler uses .obj as the object file extension and .exe for binaries, both of which differ from the unix world, so it has different built-in inference rules. Windows also lacks a Bourne shell and the standard unix tools, so all of the commands will necessarily be different.
There’s no equivalent of rm -f on Windows, so good luck writing a proper “clean” target. No, del /f isn’t the same.
So while it’s close to POSIX make, it’s not practical to write a Makefile that will simultaneously work properly with both POSIX make and Nmake. These need to be separate Makefiles.
It’s nice to have reliable, portable Makefiles that just work anywhere. Code to the standards and you don’t need feature tests or other sorts of special treatment.
This small C program converts a vector image from a custom format (described below) into a Netpbm image, a conveniently simple format. The program defensively and carefully parses its input, but still makes a subtle, fatal mistake. This mistake not only leads to sensitive information disclosure, but, with a more sophisticated attack, could be used to execute arbitrary code.
After getting the hang of the interface for the program, I encourage you to take some time to work out an exploit yourself. Regardless, I’ll reveal a functioning exploit and explain how it works.
The input format is line-oriented and very similar to Netpbm itself. The first line is the header, starting with the magic number V2 (ASCII) followed by the image dimensions. The target output format is Netpbm’s “P2” (text gray scale) format, so the “V2” parallels it. The file must end with a newline.
V2 <width> <height>
What follows is drawing commands, one per line. For example, the s command sets the value of a particular pixel.
s <x> <y> <00–ff>
Since it’s not important for the demonstration, this is the only command I implemented. It’s easy to imagine additional commands to draw lines, circles, Bezier curves, etc.
Here’s an example (example.txt) that draws a single white point in the middle of the image:
V2 256 256
s 127 127 ff
The rendering tool reads from standard input and writes to standard output:
$ render < example.txt > example.pgm
Here’s what it looks like rendered:
However, you will notice that when you run the rendering tool, it prompts you for a username and password. This is silly, of course, but it’s an excuse to get “sensitive” information into memory. It will accept any username/password combination where the username and password don’t match each other. The key is this: It’s possible to craft a valid image that leaks the entered password.
Without spoiling anything yet, let’s look at how this program works. The first thing to notice is that I’m using a custom “obstack” allocator instead of malloc() and free(). Real-world allocators have some defenses against this particular vulnerability. Plus a specific exploit would have to target a specific libc. By using my own allocator, the exploit will mostly be portable, making for a better and easier demonstration.
The allocator interface should be pretty self-explanatory, except for two details. This is an obstack allocator, so freeing an object also frees every object allocated after it. Also, it doesn’t call malloc() in the background. At initialization you give it a buffer from which to allocate all memory.
struct mstack {
char *top;
char *max;
char buf[];
};
struct mstack *mstack_init(void *, size_t);
void *mstack_alloc(struct mstack *, size_t);
void mstack_free(struct mstack *, void *);
There are no vulnerabilities in these functions (I hope!). It’s just here for predictability.
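To make the described semantics concrete, a minimal implementation might look something like the following. This is my own sketch (it ignores alignment), not necessarily the code in the repository:
struct mstack *
mstack_init(void *buf, size_t size)
{
    if (size < sizeof(struct mstack))
        return 0;
    struct mstack *m = buf;
    m->top = m->buf;
    m->max = (char *)buf + size;
    return m;
}

void *
mstack_alloc(struct mstack *m, size_t size)
{
    if (size > (size_t)(m->max - m->top))
        return 0;                /* not enough room left in the buffer */
    void *p = m->top;
    m->top += size;
    return p;
}

void
mstack_free(struct mstack *m, void *p)
{
    m->top = p;                  /* frees p and everything allocated after it */
}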
Next here’s the “authentication” function. It reads a username and password combination from /dev/tty. It’s only an excuse to get a flag in memory for this capture-the-flag game. The username and password must be less than 32 characters each.
int
authenticate(struct mstack *m)
{
FILE *tty = fopen("/dev/tty", "r+");
if (!tty) {
perror("/dev/tty");
return 0;
}
char *user = mstack_alloc(m, 32);
if (!user) {
fclose(tty);
return 0;
}
fputs("User: ", tty);
fflush(tty);
if (!fgets(user, 32, tty))
user[0] = 0;
char *pass = mstack_alloc(m, 32);
int result = 0;
if (pass) {
fputs("Password: ", tty);
fflush(tty);
if (fgets(pass, 32, tty))
result = strcmp(user, pass) != 0;
}
fclose(tty);
mstack_free(m, user);
return result;
}
Next here’s a little version of calloc() for the custom allocator. Hmm, I wonder why this is called “naive”…
void *
naive_calloc(struct mstack *m, unsigned long nmemb, unsigned long size)
{
void *p = mstack_alloc(m, nmemb * size);
if (p)
memset(p, 0, nmemb * size);
return p;
}
Next up is a paranoid wrapper for strtoul() that defensively checks its inputs. If the value is out of range of an unsigned long, it bails out. If there’s trailing garbage, it bails out. If there’s no number at all, it bails out. If you make prolonged eye contact, it bails out.
unsigned long
safe_strtoul(char *nptr, char **endptr, int base)
{
errno = 0;
unsigned long n = strtoul(nptr, endptr, base);
if (errno) {
perror(nptr);
exit(EXIT_FAILURE);
} else if (nptr == *endptr) {
fprintf(stderr, "Expected an integer\n");
exit(EXIT_FAILURE);
} else if (!isspace(**endptr)) {
fprintf(stderr, "Invalid character '%c'\n", **endptr);
exit(EXIT_FAILURE);
}
return n;
}
The main() function parses the header using this wrapper and allocates some zeroed memory:
unsigned long width = safe_strtoul(p, &p, 10);
unsigned long height = safe_strtoul(p, &p, 10);
unsigned char *pixels = naive_calloc(m, width, height);
if (!pixels) {
fputs("Not enough memory\n", stderr);
exit(EXIT_FAILURE);
}
Then there’s a command processing loop, also using safe_strtoul(). It carefully checks bounds against width and height. Finally it writes out the image in Netpbm P2 (.pgm) format:
printf("P2\n%ld %ld 255\n", width, height);
for (unsigned long y = 0; y < height; y++) {
for (unsigned long x = 0; x < width; x++)
printf("%d ", pixels[y * width + x]);
putchar('\n');
}
The vulnerability is in something I’ve shown above. Can you find it?
Did you find it? If you’re on a platform with a 64-bit long, here’s your exploit:
V2 16 1152921504606846977
And here’s an exploit for a 32-bit long:
V2 16 268435457
Here’s how it looks in action. The most obvious result is that the program crashes:
$ echo V2 16 1152921504606846977 | ./mstack > capture.txt
User: coolguy
Password: mysecret
Segmentation fault
Here are the initial contents of capture.txt:
P2
16 1152921504606846977 255
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
109 121 115 101 99 114 101 116 10 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Where did those junk numbers come from in the image data? Plug them into an ASCII table and you’ll get “mysecret”. Despite allocating the image with naive_calloc(), the password has found its way into the image! How could this be?
What happened is that width * height overflows an unsigned long. (Well, technically speaking, unsigned integers are defined not to overflow in C, wrapping around instead, but it’s really the same thing.) With the 64-bit exploit, 16 × 1152921504606846977 is 16 × (2^60 + 1) = 2^64 + 16, which wraps around to 16. So in naive_calloc(), the overflow results in a value of 16, and it only allocates and clears 16 bytes. The requested allocation “succeeds” despite far exceeding the available memory. The caller has been given a lot less memory than expected, and the memory believed to have been allocated contains a password.
The final part that writes the output doesn’t multiply the integers and doesn’t need to test for overflow. It uses a nested loop instead, continuing along with the original, impossible image size.
How do we fix this? Add an overflow check at the beginning of the naive_calloc() function (making it no longer naive). This is what the real calloc() does:
if (nmemb && size > -1UL / nmemb)
return 0;
The frightening takeaway is that this check is very easy to forget. It’s a subtle bug with potentially disastrous consequences.
In practice, this sort of program wouldn’t have sensitive data resident in memory. Instead an attacker would target the program’s stack with those s commands — specifically the return pointers — and perform a ROP attack against the application. With the exploit header above and a platform where long is the same size as size_t, the program will behave as if all available memory has been allocated to the image, so the s command could be used to poke custom values anywhere in memory. This is a much more complicated exploit, and it has to contend with ASLR and a random stack gap, but it’s feasible.
In the Smarter Every Day video, Destin illustrates the effect by simulating rolling shutter using a short video clip. In each frame of the video, a few additional rows are locked in place, showing the effect in slow motion, making it easier to understand.
At the end of the video he thanks a friend for figuring out how to get After Effects to simulate rolling shutter. After thinking about this for a moment, I figured I could easily accomplish this myself with just a bit of C, without any libraries. The video above this paragraph is the result.
I previously described a technique to edit and manipulate video without any formal video editing tools. A unix pipeline is sufficient for doing minor video editing, especially without sound. The program at the front of the pipe decodes the video into a raw, uncompressed format, such as YUV4MPEG or PPM. The tools in the middle losslessly manipulate this data to achieve the desired effect (watermark, scaling, etc.). Finally, the tool at the end encodes the video into a standard format.
$ decode video.mp4 | xform-a | xform-b | encode out.mp4
For the “decode” program I’ll be using ffmpeg now that it’s back in the Debian repositories. You can throw a video in virtually any format at it and it will write PPM frames to standard output. For the encoder I’ll be using the x264 command line program, though ffmpeg could handle this part as well. Without any filters in the middle, this example will just re-encode a video:
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
x264 -o output.mp4 /dev/stdin
The filter tools in the middle only need to read and write in the raw image format. They’re a little bit like shaders, and they’re easy to write. In this case, I’ll write a C program that simulates rolling shutter. The filter could be written in any language that can read and write binary data from standard input to standard output.
Update: It appears that input PPM streams are a rather recent feature of libavformat (a.k.a. lavf, used by x264). Support for PPM input first appeared in libavformat 3.1 (released June 26th, 2016). If you’re using an older version of libavformat, you’ll need to stick ppmtoy4m in front of x264 in the processing pipeline.
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
ppmtoy4m | \
x264 -o output.mp4 /dev/stdin
In the past, my go-to for raw video data has been loose PPM frames and YUV4MPEG streams (via ppmtoy4m). Fortunately, over the years a lot of tools have gained the ability to manipulate streams of PPM images, which is a much more convenient format. Despite being raw video data, YUV4MPEG is still a fairly complex format with lots of options and annoying colorspace concerns. PPM is simple RGB without complications. The header is just text:
P6
<width> <height>
<maxdepth>
<width * height * 3 binary RGB data>
The maximum depth is virtually always 255. A smaller value reduces the image’s dynamic range without reducing the size. A larger value involves byte-order issues (endian). For video frame data, the file will typically look like:
P6
1920 1080
255
<frame RGB>
Unfortunately the format is actually a little more flexible than this. Except for the newline (LF, 0x0A) after the maximum depth, the whitespace is arbitrary, and comments starting with # are permitted. Since the tools I’m using won’t produce comments, I’m going to ignore that detail. I’ll also assume the maximum depth is always 255.
Here’s the structure I used to represent a PPM image, just one frame of video. I’m using a flexible array member to pack the data at the end of the structure.
struct frame {
size_t width;
size_t height;
unsigned char data[];
};
Next a function to allocate a frame:
static struct frame *
frame_create(size_t width, size_t height)
{
struct frame *f = malloc(sizeof(*f) + width * height * 3);
f->width = width;
f->height = height;
return f;
}
We’ll need a way to write the frames we’ve created.
static void
frame_write(struct frame *f)
{
printf("P6\n%zu %zu\n255\n", f->width, f->height);
fwrite(f->data, f->width * f->height, 3, stdout);
}
Finally, a function to read a frame, reusing an existing buffer if possible. The most complex part of the whole program is just parsing the PPM header. The %*c in the scanf() format string specifically consumes the line feed immediately following the maximum depth.
static struct frame *
frame_read(struct frame *f)
{
size_t width, height;
if (scanf("P6 %zu%zu%*d%*c", &width, &height) < 2) {
free(f);
return 0;
}
if (!f || f->width != width || f->height != height) {
free(f);
f = frame_create(width, height);
}
fread(f->data, width * height, 3, stdin);
return f;
}
Since this program will only be part of a pipeline, I’m not worried about checking the results of fwrite() and fread(). The process will be killed by the shell if something goes wrong with the pipes. However, if we’re out of video data and get an EOF, scanf() will fail, indicating the EOF, which is normal and can be handled cleanly.
That’s all the infrastructure we need to build an identity filter that passes frames through unchanged:
int main(void)
{
struct frame *frame = 0;
while ((frame = frame_read(frame)))
frame_write(frame);
}
Processing a frame is just a matter of adding some stuff to the body of the while loop.
For the rolling shutter filter, in addition to the input frame we need an image to hold the result of the rolling shutter. Each input frame will be copied into the rolling shutter frame, but a little less will be copied from each frame, locking a little bit more of the image in place.
int
main(void)
{
int shutter_step = 3;
size_t shutter = 0;
struct frame *f = frame_read(0);
struct frame *out = frame_create(f->width, f->height);
while (shutter < f->height && (f = frame_read(f))) {
size_t offset = shutter * f->width * 3;
size_t length = f->height * f->width * 3 - offset;
memcpy(out->data + offset, f->data + offset, length);
frame_write(out);
shutter += shutter_step;
}
free(out);
free(f);
}
The shutter_step controls how many rows are captured per frame of video. Generally capturing one row per frame is too slow for the simulation. For a 1080p video, that’s 1,080 frames for the entire simulation: 18 seconds at 60 FPS or 36 seconds at 30 FPS. If this program were to accept command line arguments, controlling the shutter rate would be one of the options.
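If it did, the top of a main(int argc, char **argv) might gain something like this hypothetical -s option (getopt() comes from <unistd.h>, atoi() and exit() from <stdlib.h>); the real program takes no arguments:
int option;
while ((option = getopt(argc, argv, "s:")) != -1) {
    switch (option) {
    case 's':
        shutter_step = atoi(optarg);   /* rows locked in place per frame */
        break;
    default:
        exit(EXIT_FAILURE);
    }
}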
Putting it all together:
$ ffmpeg -i input.mp4 -f image2pipe -vcodec ppm pipe:1 | \
./rolling-shutter | \
x264 -o output.mp4 /dev/stdin
Here are some of the results for different shutter rates: 1, 3, 5, 8, 10, and 15 rows per frame. Feel free to right-click and “View Video” to see the full resolution video.
This post contains the full source in parts, but here it is all together:
Here’s the original video, filmed by my wife using her Nikon D5500, in case you want to try it for yourself:
It took much longer to figure out the string-pulling contraption to slowly spin the fan at a constant rate than it took to write the C filter program.
On Hacker News, morecoffee shared a video of the second order effect (direct link), where the rolling shutter speed changes over time.
A deeper analysis of rolling shutter: Playing detective with rolling shutter photos.
You can find the complete code for this article here, ready to run:
But first, what is a stack clash? Here’s a rough picture of the typical way process memory is laid out. The stack starts at a high memory address and grows downwards. Code and static data sit at low memory, with a brk pointer growing upward to make small allocations. In the middle is the heap, where large allocations and memory mappings take place.
Below the stack is a slim guard page that divides the stack and the region of memory reserved for the heap. Reading or writing to that memory will trap, causing the program to crash or some special action to be taken. The goal is to prevent the stack from growing into the heap, which could cause all sorts of trouble, like security issues.
The problem is that this thin guard page isn’t enough. It’s possible to put a large allocation on the stack, never read or write to it, and completely skip over the guard page, such that the heap and stack overlap without detection.
Once this happens, writes into the heap will change memory on the stack and vice versa. If an attacker can cause the program to make such a large allocation on the stack, then legitimate writes into memory on the heap can manipulate local variables or return pointers, changing the program’s control flow. This can bypass buffer overflow protections, such as stack canaries.
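As a contrived sketch of the hazard (not a working exploit, and compilers with stack probing defenses may prevent it), a single large, untouched stack allocation can move the stack pointer past the one-page guard without ever faulting on it:
void
vulnerable(void)
{
    char skip[8 * 1024 * 1024];   /* large enough to jump over the guard page */
    char local;                    /* may now sit on top of heap memory */
    (void)skip;                    /* never read or written, so no trap */
    (void)local;
}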
Now, I’m going to abruptly change topics to discuss binary search trees. We’ll get back to stack clash in a bit. Suppose we have a binary tree which we would like to iterate depth-first. For this demonstration, here’s the C interface to the binary tree.
struct tree {
struct tree *left;
struct tree *right;
char *key;
char *value;
};
void tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void tree_visit(struct tree *, void (*f)(char *, char *));
void tree_destroy(struct tree *);
An empty tree is the NULL pointer, hence the double-pointer for insert. In the demonstration it’s an unbalanced search tree, but this could very well be a balanced search tree with the addition of another field on the structure.
For the traversal, first visit the root node, then traverse its left tree, and finally traverse its right tree. It makes for a simple, recursive definition — the sort of thing you’d teach a beginner. Here’s a definition that accepts a callback, which the caller will use to visit each key/value in the tree. This really is as simple as it gets.
void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
if (t) {
f(t->key, t->value);
tree_visit(t->left, f);
tree_visit(t->right, f);
}
}
Unfortunately this isn’t so convenient for the caller, who has to split off a callback function that lacks context, then hand over control to the traversal function.
void
printer(char *k, char *v)
{
printf("%s = %s\n", k, v);
}
void
print_tree(struct tree *tree)
{
tree_visit(tree, printer);
}
Usually it’s much nicer for the caller if instead it’s provided an iterator, which the caller can invoke at will. Here’s an interface for it, just two functions.
struct tree_it *tree_iterator(struct tree *);
int tree_next(struct tree_it *, char **k, char **v);
The first constructs an iterator object, and the second one visits a key/value pair each time it’s called. It returns 0 when traversal is complete, automatically freeing any resources associated with the iterator.
The caller now looks like this:
char *k, *v;
struct tree_it *it = tree_iterator(tree);
while (tree_next(it, &k, &v))
printf("%s = %s\n", k, v);
Notice I haven’t defined struct tree_it. That’s because I’ve got four different implementations, each taking a different approach. The last one will use stack clashing.
With just the standard facilities provided by C, there’s some manual bookkeeping that has to take place in order to convert the recursive definition into an iterator. Depth-first traversal is a stack-oriented process, and with recursion the stack is implicit in the call stack. As an iterator, the traversal stack needs to be managed explicitly. The iterator needs to keep track of the path it took so that it can backtrack, which means keeping track of parent nodes as well as which branch was taken.
Here’s my little implementation, which, to keep things simple, has a hard depth limit of 32. Its structure definition includes a stack of node pointers, and 2 bits of information per visited node, stored across a 64-bit integer.
struct tree_it {
struct tree *stack[32];
unsigned long long state;
int nstack;
};
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack[0] = t;
it->state = 0;
it->nstack = 1;
return it;
}
The 2 bits track three different states for each visited node: freshly pushed (emit its key/value and descend left), left subtree done (descend right), and both subtrees done (pop it off the stack).
It works out to the following. Don’t worry too much about trying to understand how this works. My point is to demonstrate that converting the recursive definition into an iterator complicates the implementation.
int
tree_next(struct tree_it *it, char **k, char **v)
{
while (it->nstack) {
int shift = (it->nstack - 1) * 2;
int state = 3u & (it->state >> shift);
struct tree *t = it->stack[it->nstack - 1];
it->state += 1ull << shift;
switch (state) {
case 0:
*k = t->key;
*v = t->value;
if (t->left) {
it->stack[it->nstack++] = t->left;
it->state &= ~(3ull << (shift + 2));
}
return 1;
case 1:
if (t->right) {
it->stack[it->nstack++] = t->right;
it->state &= ~(3ull << (shift + 2));
}
break;
case 2:
it->nstack--;
break;
}
}
free(it);
return 0;
}
Wouldn’t it be nice to keep both the recursive definition while also getting an iterator? There’s an exact solution to that: coroutines.
C doesn’t come with coroutines, but there are a number of libraries available. We can also build our own coroutines. One way to do that is with user contexts (<ucontext.h>) provided by the X/Open System Interfaces Extension (XSI), an extension to POSIX. This set of functions allows programs to create their own call stacks and switch between them. That’s the key ingredient for coroutines. Caveat: These functions aren’t widely available, and probably shouldn’t be used in new code.
Here’s my iterator structure definition.
#define _XOPEN_SOURCE 600
#include <ucontext.h>
struct tree_it {
char *k;
char *v;
ucontext_t coroutine;
ucontext_t yield;
};
It needs one context for the original stack and one context for the iterator’s stack. Each time the iterator is invoked, the program will switch to the other stack, find the next value, then switch back. This process is called yielding. Values are passed between contexts using the k (key) and v (value) fields on the iterator.
Before I get into initialization, here’s the actual traversal coroutine. It’s nearly the same as the original recursive definition except for the swapcontext(). This is the yield, pausing execution and sending control back to the caller. The current context is saved in the first argument, and the second argument becomes the current context.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
swapcontext(&it->coroutine, &it->yield);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
While the actual traversal is simple again, initialization is more complicated. The first problem is that there’s no way to pass pointer arguments to the coroutine. Technically only int arguments are permitted. (All the online tutorials get this wrong.) To work around this problem, I smuggle the arguments in as global variables. This would cause problems should two different threads try to create iterators at the same time, even on different trees.
static struct tree *tree_arg;
static struct tree_it *tree_it_arg;
static void
coroutine_init(void)
{
coroutine(tree_arg, tree_it_arg);
}
The stack has to be allocated manually, which I do with a call to malloc(). Nothing fancy is needed, though this means the new stack won’t have a guard page. For the stack size, I use the suggested value of SIGSTKSZ. The makecontext() function is what creates the new context from scratch, but the new context must first be initialized with getcontext(), even though that particular snapshot won’t actually be used.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
it->coroutine.uc_stack.ss_size = SIGSTKSZ;
it->coroutine.uc_link = &it->yield;
getcontext(&it->coroutine);
makecontext(&it->coroutine, coroutine_init, 0);
tree_arg = t;
tree_it_arg = it;
return it;
}
Notice I gave it a function pointer, a lot like I’m starting a new thread. This is no coincidence. There’s a lot of similarity between coroutines and multiple threads, as you’ll soon see.
Finally the iterator function itself. Since NULL isn’t a valid key, it initializes the key to NULL before yielding to the iterator context. If the iterator has no more nodes to visit, it doesn’t set the key, which can be detected when control returns.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
swapcontext(&it->yield, &it->coroutine);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->coroutine.uc_stack.ss_sp);
free(it);
return 0;
}
}
That’s all it takes to create and operate a coroutine in C, provided you’re on a system with these XSI extensions.
Instead of a coroutine, we could just use actual threads and a couple of semaphores to synchronize them. This is a heavy implementation and also probably shouldn’t be used in practice, but at least it’s fully portable.
Here’s the structure definition:
struct tree_it {
struct tree *t;
char *k;
char *v;
sem_t visitor;
sem_t main;
pthread_t thread;
};
The main thread will wait on one semaphore and the iterator thread will wait on the other. This should sound very familiar.
The actual traversal function looks the same, but with sem_post()
and sem_wait()
as the yield.
static void
visit(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
sem_post(&it->main);
sem_wait(&it->visitor);
visit(t->left, it);
visit(t->right, it);
}
}
There’s a separate function to initialize the iterator context again.
static void *
thread_entrance(void *arg)
{
struct tree_it *it = arg;
sem_wait(&it->visitor);
visit(it->t, it);
sem_post(&it->main);
return 0;
}
Creating the iterator only requires initializing the semaphores and creating the thread:
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->t = t;
sem_init(&it->visitor, 0, 0);
sem_init(&it->main, 0, 0);
pthread_create(&it->thread, 0, thread_entrance, it);
return it;
}
The iterator function looks just like the coroutine version.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
sem_post(&it->visitor);
sem_wait(&it->main);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
pthread_join(it->thread, 0);
sem_destroy(&it->main);
sem_destroy(&it->visitor);
free(it);
return 0;
}
}
Overall, this is almost identical to the coroutine version.
Finally I can tie this back into the topic at hand. Without either XSI
extensions or Pthreads, we can (usually) create coroutines by abusing
setjmp()
and longjmp()
. Technically this violates two of C’s
rules and relies on undefined behavior, but it generally works. This
is not my own invention, and it dates back to at least 2010.
From the very beginning, C has provided a crude “exception” mechanism
that allows the stack to be abruptly unwound back to a previous state.
It’s a sort of non-local goto. Call setjmp()
to capture an opaque
jmp_buf
object to be used in the future. This function returns 0
the first time. Hand that jmp_buf to longjmp()
later, even in a
different function, and setjmp()
will return again, this time with a
non-zero value.
It’s technically unsuitable for coroutines because the jump is a
one-way trip. The unwound stack invalidates any jmp_buf
that was
created after the target of the jump. In practice, though, you can
still use these jumps, which is one rule being broken.
That’s where stack clashing comes into play. In order for it to be a
proper coroutine, it needs to have its own stack. But how can we do
that with these primitive C utilities? Extend the stack to overlap
the heap, call setjmp()
to capture a coroutine on it, then return.
Generally we can get away with using longjmp()
to return to this
heap-allocated stack.
Here’s my iterator definition for this one. Like the XSI context
struct, this has two jmp_buf
“contexts.” The stack
holds the
iterator’s stack buffer so that it can be freed, and the gap
field
will be used to prevent the optimizer from spoiling our plans.
struct tree_it {
char *k;
char *v;
char *stack;
volatile char *gap;
jmp_buf coroutine;
jmp_buf yield;
};
The coroutine looks familiar again. This time the yield is performed
with setjmp()
and longjmp()
, just like swapcontext()
. Remember
that setjmp()
returns twice, hence the branch. The longjmp()
never
returns.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
if (!setjmp(it->coroutine))
longjmp(it->yield, 1);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
Next is the tricky part to cause the stack clash. First, allocate the
new stack with malloc()
so that we can get its address. Then use a
local variable on the stack to determine how much the stack needs to
grow in order to overlap with the allocation. Taking the difference
between these pointers is illegal as far as the language is concerned,
making this the second rule I’m breaking. I can imagine an
implementation where the stack and heap are in two separate
kinds of memory, and it would be meaningless to take the difference. I
don’t actually have to imagine very hard, because this is actually how
it used to work on the 8086 with its segmented memory
architecture.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack = malloc(STACK_SIZE);
char marker;
char gap[&marker - it->stack - STACK_SIZE];
it->gap = gap; // prevent optimization
if (!setjmp(it->yield))
coroutine(t, it);
return it;
}
I’m using a variable-length array (VLA) named gap
to indirectly
control the stack pointer, moving it over the heap. I’m assuming the
stack grows downward, since otherwise the sign would be wrong.
The compiler is smart and will notice I’m not actually using gap
,
and it’s happy to throw it away. In fact, it’s vitally important that
I don’t touch it since the guard page, along with a bunch of
unmapped memory, is actually somewhere in the middle of that array. I
only want the array for its side effect, but that side effect isn’t
officially supported, which means the optimizer doesn’t need to
consider it in its decisions. To inhibit the optimizer, I store the
array’s address where someone might potentially look at it, meaning
the array has to exist.
Finally, the iterator function looks just like the others, again.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
if (!setjmp(it->yield))
longjmp(it->coroutine, 1);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->stack);
free(it);
return 0;
}
}
And that’s it: a nasty hack using a stack clash to create a context
for a setjmp()
+longjmp()
coroutine.
Within $HOME/.local/ are the standard
/usr
directories, such as bin/
, include/
, lib/
, etc.,
containing my own software, libraries, and man pages. These are
first-class citizens, indistinguishable from the system-installed
programs and libraries. With one exception (setuid programs), none of
this requires root privileges.
Installing software in $HOME serves two important purposes, both of which are indispensable to me on a regular basis.
On many machines I don’t have root access, which prevents me from installing packaged software through the system’s package manager. Building and installing the software myself in my home directory, without involvement from the system administrator, neatly works around this limitation. As a software developer, it’s already perfectly normal for me to build and run custom software, and this is just an extension of that behavior.
In the most desperate situation, all I need from the sysadmin is a decent C compiler and at least a minimal POSIX environment. I can bootstrap anything I might need, both libraries and programs, including a better C compiler along the way. This is one major strength of open source software.
I have noticed one alarming trend: Both GCC (since 4.8) and Clang are written in C++, so it’s becoming less and less reasonable to bootstrap a C++ compiler from a C compiler, or even from a C++ compiler that’s more than a few years old. So you may also need your sysadmin to supply a fairly recent C++ compiler if you want to bootstrap an environment that includes C++. I’ve had to avoid some C++ software (such as CMake) for this reason.
In theory this is what /usr/local
is all about. It’s typically the
location for software not managed by the system’s package manager.
However, I think it’s cleaner to put this in $HOME/.local
, so long
as other system users don’t need it.
For example, I have an installation of each version of Emacs from
24.3 (the oldest version worth supporting) through the latest stable
release, each suffixed with its version number, under $HOME/.local
.
This is useful for quickly running a test suite under different
releases.
$ git clone https://github.com/skeeto/elfeed
$ cd elfeed/
$ make EMACS=emacs24.3 clean test
...
$ make EMACS=emacs25.2 clean test
...
Another example is NetHack, which I prefer to play with a couple of
custom patches (Menucolors, wchar). The install to
$HOME/.local
is also captured as a patch.
$ tar xzf nethack-343-src.tar.gz
$ cd nethack-3.4.3/
$ patch -p1 < ~/nh343-menucolor.diff
$ patch -p1 < ~/nh343-wchar.diff
$ patch -p1 < ~/nh343-home-install.diff
$ sh sys/unix/setup.sh
$ make -j$(nproc) install
Normally NetHack wants to be setuid (e.g. run as the “games” user) in order to restrict access to high scores, saves, and bones — saved levels where a player died, to be inserted randomly into other players’ games. This prevents cheating, but requires root to set up. Fortunately, when I install NetHack in my home directory, this isn’t a feature I actually care about, so I can ignore it.
Mutt is in a similar situation, since it wants to install a
special setgid program (mutt_dotlock
) that synchronizes mailbox
access. All MUAs need something like this.
Everything described below is relevant to basically any modern unix-like system: Linux, BSD, etc. I personally install software in $HOME across a variety of systems and, fortunately, it mostly works the same way everywhere. This is probably in large part due to everyone standardizing around the GCC and GNU binutils interfaces, even if the system compiler is actually LLVM/Clang.
Out of the box, installing things in $HOME/.local
won’t do anything
useful. You need to set up some environment variables in your shell
configuration (i.e. .profile
, .bashrc
, etc.) to tell various
programs, such as your shell, about it. The most obvious variable is
$PATH:
export PATH=$HOME/.local/bin:$PATH
Notice I put it in the front of the list. This is because I want my home directory programs to override system programs with the same name. For what other reason would I install a program with the same name if not to override the system program?
In the simplest situation this is good enough, but in practice you’ll probably need to set a few more things. If you install libraries in your home directory and expect to use them just as if they were installed on the system, you’ll need to tell the compiler where else to look for those headers and libraries, both for C and C++.
export C_INCLUDE_PATH=$HOME/.local/include
export CPLUS_INCLUDE_PATH=$HOME/.local/include
export LIBRARY_PATH=$HOME/.local/lib
The first two are like the -I
compiler option and the third is like the
-L
linker option, except you usually won’t need to use them
explicitly. Unfortunately LIBRARY_PATH
doesn’t override the system
library paths, so in some cases, you will need to explicitly set
-L
. Otherwise you will still end up linking against the system library
rather than the custom packaged version. I really wish GCC and Clang
didn’t behave this way.
Some software uses pkg-config
to determine its compiler and linker
flags, and your home directory will contain some of the needed
information. So set that up too:
export PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig
Finally, when you install libraries in your home directory, the run-time dynamic linker will need to know where to find them. There are three ways to deal with this:
The crude way is LD_LIBRARY_PATH. Point the run-time linker at your lib/
and you’re done:
export LD_LIBRARY_PATH=$HOME/.local/lib
However, this is like using a shotgun to kill a fly. If you install a library in your home directory that is also installed on the system, and then run a system program, it may be linked against your library rather than the library installed on the system as was originally intended. This could have detrimental effects.
The precision method is to set the ELF “runpath” value. It’s like a
per-binary LD_LIBRARY_PATH
. The run-time linker uses this path first
in its search for libraries, and it will only have an effect on that
particular program/library. This also applies to dlopen()
.
Some software will configure the runpath by default in their build
system, but often you need to configure this yourself. The simplest way
is to set the LD_RUN_PATH
environment variable when building software.
Another option is to manually pass -rpath
options to the linker via
LDFLAGS
. It’s used directly like this:
$ gcc -Wl,-rpath=$HOME/.local/lib -o foo bar.o baz.o -lquux
Verify with readelf
:
$ readelf -d foo | grep runpath
Library runpath: [/home/username/.local/lib]
ELF supports a special $ORIGIN
“variable” set to the binary’s
location. This allows the program and associated libraries to be
installed anywhere without changes, so long as they have the same
relative position to each other. (Note the quotes to prevent shell
interpolation.)
$ gcc -Wl,-rpath='$ORIGIN/../lib' -o foo bar.o baz.o -lquux
There is one situation where runpath
won’t work: when you want a
system-installed program to find a home directory library with
dlopen()
— e.g. as an extension to that program. You either need to
ensure it uses a relative or absolute path (i.e. the argument to
dlopen()
contains a slash) or you must use LD_LIBRARY_PATH
.
Personally, I always use the Worse is Better LD_LIBRARY_PATH
shotgun. Occasionally it’s caused some annoying issues, but the vast
majority of the time it gets the job done with little fuss. This is
just my personal development environment, after all, not a production
server.
Another potentially tricky issue is man pages. When a program or
library installs a man page in your home directory, it would certainly
be nice to access it with man <topic>
just like it was installed on
the system. Fortunately, Debian and Debian-derived systems, using a
mechanism I haven’t yet figured out, discover home directory man pages
automatically without any assistance. No configuration needed.
It’s more complicated on other systems, such as the BSDs. You’ll need to
set the MANPATH
variable to include $HOME/.local/share/man
. It’s
unset by default and it overrides the system settings, which means you
need to manually include the system paths. The manpath
program can
help with this … if it’s available.
export MANPATH=$HOME/.local/share/man:$(manpath)
I haven’t figured out a portable way to deal with this issue, so I mostly ignore it.
While I’ve poo-pooed autoconf in the past, the standard
configure
script usually makes it trivial to build and install
software in $HOME. The key ingredient is the --prefix
option:
$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
Most of the time it’s that simple! If you’re linking against your own
libraries and want to use runpath
, it’s a little more complicated:
$ ./configure --prefix=$HOME/.local \
LDFLAGS="-Wl,-rpath=$HOME/.local/lib"
For CMake, there’s CMAKE_INSTALL_PREFIX
:
$ cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
The CMake builds I’ve seen use ELF runpath by default, and no further configuration may be required to make that work. I’m sure that’s not always the case, though.
Some software is just a single, static, standalone binary with everything baked in. It doesn’t need to be given a prefix, and installation is as simple as copying the binary into place. For example, Enchive works like this:
$ git clone https://github.com/skeeto/enchive
$ cd enchive/
$ make
$ cp enchive ~/.local/bin
Some software uses its own unique configuration interface. I can respect
that, but it does add some friction for users who now have something
additional and non-transferable to learn. I demonstrated a NetHack build
above, which has a configuration much more involved than it really
should be. Another example is LuaJIT, which uses make
variables that
must be provided consistently on every invocation:
$ tar xzf LuaJIT-2.0.5.tar.gz
$ cd LuaJIT-2.0.5/
$ make -j$(nproc) PREFIX=$HOME/.local
$ make PREFIX=$HOME/.local install
(You can use the “install” target to both build and install, but I
wanted to illustrate the repetition of PREFIX
.)
Some libraries aren’t so smart about pkg-config
and need some
handholding — for example, ncurses. I mention it because
it’s required for both Vim and Emacs, among many others, so I’m often
building it myself. It ignores --prefix
and needs to be told a
second time where to install things:
$ ./configure --prefix=$HOME/.local \
--enable-pc-files \
--with-pkg-config-libdir=$PKG_CONFIG_PATH
Another issue is that a whole lot of software has been hardcoded for
ncurses 5.x (i.e. ncurses5-config
), and it requires hacks/patching
to make it behave properly with ncurses 6.x. I’ve avoided ncurses 6.x
for this reason.
I could go on and on like this, discussing the quirks for the various libraries and programs that I use. Over the years I’ve gotten used to many of these issues, committing the solutions to memory. Unfortunately, even within the same version of a piece of software, the quirks can change between major operating system releases, so I’m continuously learning my way around new issues. It’s really given me an appreciation for all the hard work that package maintainers put into customizing and maintaining software builds to fit properly into a larger ecosystem.
When I’m reasoning about whether or not something is allowed, I like to imagine an adversarial implementation. If the standard allows some freedom, this implementation takes an imaginative or unique approach. It chooses non-obvious interpretations with possibly unexpected, but valid, results. This is nearly the opposite of djb’s hypothetical boringcc, though some of the ideas are similar.
Many argue that this is already the case with modern C and C++ optimizing compilers. Compiler writers are already creative with the standard in order to squeeze out more performance, even if it’s at odds with the programmer’s actual intentions. The most prominent example in C and C++ is strict aliasing, where the optimizer is deliberately blinded to certain kinds of aliasing because the standard allows it to be, eliminating some (possibly important) loads. This happens despite the compiler’s ability to trivially prove that two particular objects really do alias.
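The canonical illustration — my example, not one taken from any compiler — is a function given an int pointer and a float pointer. Because the types differ, a strict-aliasing optimizer may assume they never point at the same object:
int
observe(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;  /* if i and f alias, this overwrites *i ... */
    return *i;  /* ... but the compiler may still fold this to 1 */
}
Call it as observe(&x, (float *)&x) and an optimized build can return 1 even though the object’s bits were just rewritten.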
I want to be clear that I’m not talking about the nasal daemon kind of creativity. That’s not a helpful thought experiment. What I mean is this: Can I imagine a conforming implementation that breaks any assumptions made by the code?
In practice, a compiler typically has to bridge multiple specifications: the language standard, the platform ABI, and the operating system interface (process startup, syscalls, etc.). This really ties its hands on how creative it can be with any one of the specifications. Depending on the situation, the imaginary adversarial implementation isn’t necessarily running on any particular platform. If our program is expected to have a long life, useful for many years to come, we should avoid making too many assumptions about future computers and imagine an adversarial compiler with few limitations.
Take this bit of C:
printf("%d", sizeof(foo));
The printf
function is variadic, and it relies entirely on the format
string in order to correctly handle all its arguments. The %d
specifier means that its matching argument is of type int
. The result
of the sizeof
operator is an integer of type size_t
, which has a
different sign and may even be a different size.
Typically this code will work just fine. An int
and size_t
are
generally passed the same way, the actual value probably fits in an
int
, and two’s complement means the signedness isn’t an issue due to
the value being positive. From the printf
point of view, it
typically can’t detect that the type is wrong, so everything works by
chance. In fact, it’s hard to imagine a real situation where this
wouldn’t work fine.
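For reference, the portable spelling of the call uses C99’s %zu conversion, which matches size_t exactly:
printf("%zu", sizeof(foo));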
However, the %d version is still undefined behavior — a scenario where a creative adversarial implementation can break things. In this case there are a few options for an adversarial implementation:
1. int and size_t are passed differently, so printf will load the argument from the wrong place.
2. foo is given crazy padding, for arbitrary reasons, that makes it so large it doesn’t fit in an int.
What’s interesting about #1 is that this has actually happened. For example, here’s a C source file.
float foo(float x, int y);
float
bar(int y)
{
return foo(0.0f, y);
}
And in another source file:
float
foo(int x, int y)
{
(void)x; // ignore x
return y * 2.0f;
}
The type of argument x
differs between the prototype and the
definition, which is undefined behavior. However, since this argument
is ignored, this code will still work correctly on many different
real-world computers, particularly where float
and int
arguments
are passed the same way (i.e. on the stack).
However, in 2003 the x86-64 CPU arrived with its new System V ABI. Floating point and integer arguments were now passed differently, and the types of preceding arguments mattered when deciding which register to use. Some constructs that worked fine, by chance, prior to 2003 would soon stop working due to what may have seemed like an adversarial implementation years before.
Let’s look at some Python. This snippet opens a file a million times without closing any handles.
for i in range(1, 1000000):
f = open("/dev/null", "r")
Assuming you have a /dev/null
, this code will work fine without
throwing any exceptions on CPython, the most widely used Python
implementation. CPython uses a deterministic reference counting scheme,
and the handle is automatically closed as soon as its variable falls out
of scope. It’s like having an invisible f.close()
at the end of the
block.
However, this code is incorrect. The deterministic handle closing is an implementation behavior, not part of the specification. The operating system limits the number of files a process can have open at once, and there’s a risk that this resource will run out even though none of those handles are reachable. Imagine an adversarial Python implementation trying to break this code. It could sufficiently delay garbage collection, or even have infinite memory, omitting garbage collection altogether.
Like before, such an implementation eventually did come about: PyPy, a Python implementation written in Python with a JIT compiler. It uses (by default) something closer to mark-and-sweep, not reference counting, and those handles are left open until the next collection.
>>>> for i in range(1, 1000000):
.... f = open("/dev/null", "r")
....
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IOError: [Errno 24] Too many open files: '/dev/null'
This fits right in with a broader method of self-improvement: Occasionally put yourself in the implementor’s shoes. Think about what it would take to correctly implement the code that you write, either as a language or the APIs that you call. On reflection, you may find that some of those things that seem cheap may not be. Your assumptions may be reasonable, but not guaranteed. (Though it may be that “reasonable” is perfectly sufficient for your situation.)
An adversarial implementation is one that challenges an assumption you’ve taken for granted by turning it on its head.
Monte Carlo tree search (MCTS) is the most impressive game artificial intelligence I’ve ever used. At its core it simulates a large number of games (playouts), starting from the current game state, using random moves for each player. Then it simply picks the move where it won most often. This description is sufficient to spot one of its most valuable features: MCTS requires no knowledge of strategy or effective play. The game’s rules — enough to simulate the game — are all that’s needed to allow the AI to make decent moves. Expert knowledge still makes for a stronger AI, but, for many games, it’s unnecessary for constructing a decent opponent.
A second valuable feature is that it’s easy to parallelize. Unlike alpha-beta pruning, which doesn’t mix well with parallel searches of a Minimax tree, Monte Carlo simulations are practically independent and can be run in parallel.
Finally, the third valuable feature is that the search can be stopped at any time. The completion of any single simulation is as good a stopping point as any. It could be due to a time limit, a memory limit, or both. In general, the algorithm converges to a best move rather than suddenly discovering it. The good moves are identified quickly, and further simulations work to choose among them. More simulations make for better moves, with exponentially diminishing returns. Contrasted with Minimax, stopping early has the risk that the good moves were never explored at all.
To try out MCTS myself, I wrote two games employing it: Connect Four and Yavalath.
They’re both written in C, for both unix-like and Windows, and should be easy to build. I challenge you to beat them both. The Yavalath AI is easier to beat due to having blind spots, which I’ll discuss below. The Connect Four AI is more difficult and will likely take a number of tries.
MCTS works very well with Connect Four, and only requires modest resources: 32MB of memory to store the results of random playouts, and 500,000 game simulations. With a few tweaks, it can even be run in DOSBox. It stops when it hits either of those limits. In theory, increasing both would make for stronger moves, but in practice I can’t detect any difference. It’s like computing pi with Monte Carlo, where eventually it just runs out of precision to make any more progress.
Based on my simplified description above, you might wonder why it needs all that memory. Not only does MCTS need to track its win/loss ratio for each available move from the current state, it tracks the win/loss ratio for moves in the states behind those moves. A large chunk of the game tree is kept in memory to track all of the playout results. This is why MCTS needs a lot more memory than Minimax, which can discard branches that have been searched.
A convenient property of this tree is that the branch taken in the actual game can be re-used in a future search. The root of the tree becomes the node representing the taken game state, which has already seen a number of playouts. Even better, MCTS is weighted towards exploring good moves over bad moves, and good moves are more likely to be taken in the real game. In general, a significant portion of the tree gets to be reused in a future search.
I’m going to skip most of the details of the algorithm itself and focus on my implementation. Other articles do a better job at detailing the algorithm than I could.
My Connect Four engine doesn’t use dynamic allocation for this tree (or at all). Instead it manages a static buffer — an array of tree nodes, each representing a game state. All nodes are initially chained together into a linked list of free nodes. As the tree is built, nodes are pulled off the free list and linked together into a tree. When the game advances to the next state, nodes on unreachable branches are added back to the free list.
If at any point the free list is empty when a new node is needed, the current search aborts. This is the out-of-memory condition, and no more searching can be performed.
/* Connect Four is normally a 7 by 6 grid. */
#define CONNECT4_WIDTH 7
#define CONNECT4_HEIGHT 6
struct connect4_node {
uint32_t next[CONNECT4_WIDTH]; // "pointer" to next node
uint32_t playouts[CONNECT4_WIDTH]; // number of playouts
float score[CONNECT4_WIDTH]; // pseudo win/loss ratio
};
Rather than native C pointers, the structure uses 32-bit indexes into
the master array. This saves a lot of memory on 64-bit systems, and the
structure is the same size no matter the pointer size of the host. The
next
field points to the next state for the nth move. Since 0 is a
valid index, -1 represents null (CONNECT4_NULL
).
Each column is a potential move, so there are CONNECT4_WIDTH
possible moves at any given state. Each move has a floating point
score and a total number of playouts through that move. In my
implementation, the search can also halt due to an overflow in a
playout counter. The search can no longer be tracked in this
representation, so it has to stop. This generally only happens when
the game is nearly over and it’s grinding away on a small number of
possibilities.
Note that the actual game state (piece positions) is not tracked in the node structure. That’s because it’s implicit. We know the state of the game at the root, and simulating the moves while descending the tree will keep track of the board state at the current node. That’s more memory savings.
The state itself is a pair of bitboards, one for each player. Each position on the grid gets a bit on each bitboard. The bitboard is very fast to manipulate, and win states are checked with just a handful of bit operations. My intention was to make playouts as fast as possible.
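As an illustration of that handful of bit operations — a sketch of the standard bitboard technique, not necessarily this engine’s exact layout — suppose each column occupies CONNECT4_HEIGHT + 1 bits, the extra bit keeping vertical runs from wrapping into the next column:
#include <stdint.h>

static int
connect4_is_win(uint64_t board)
{
    /* shift distances: 1 = vertical, 7 = horizontal, 6 and 8 = diagonals */
    static const int dirs[] = {1, 7, 6, 8};
    for (int i = 0; i < 4; i++) {
        int d = dirs[i];
        uint64_t pairs = board & (board >> d);  /* two in a row */
        if (pairs & (pairs >> (2 * d)))         /* two pairs: four in a row */
            return 1;
    }
    return 0;
}
The search state itself lives in a second structure: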
struct connect4_ai {
uint64_t state[2]; // game state at root (bitboard)
uint64_t rng[2]; // random number generator state
uint32_t nodes_available; // total number of nodes available
uint32_t nodes_allocated; // number of nodes in the tree
uint32_t root; // "pointer" to root node
uint32_t free; // "pointer" to free list
int turn; // whose turn (0 or 1) at the root?
};
The nodes_available
and nodes_allocated
are not necessary for
correctness nor speed. They’re useful for diagnostics and debugging.
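To make the free-list recycling from earlier concrete, here’s a rough sketch of returning an unreachable branch to the free list when the game advances. The nodes array, the choice to thread the free list through next[0], and the function names are my assumptions, not the engine’s actual code:
static struct connect4_node *nodes;  /* assumed: the static node buffer */

/* Return an entire subtree to the free list. */
static void
node_release(struct connect4_ai *ai, uint32_t i)
{
    if (i == CONNECT4_NULL)
        return;
    for (int m = 0; m < CONNECT4_WIDTH; m++)
        node_release(ai, nodes[i].next[m]);
    nodes[i].next[0] = ai->free;  /* thread it back onto the free list */
    ai->free = i;
    ai->nodes_allocated--;
}

/* Advance the root to the child for the column actually played. */
static void
advance_root(struct connect4_ai *ai, int move)
{
    uint32_t old = ai->root;
    ai->root = nodes[old].next[move];  /* may be null if never explored */
    nodes[old].next[move] = CONNECT4_NULL;
    node_release(ai, old);
}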
All the functions that operate on these two structures are
straightforward, except for connect4_playout
, a recursive function
which implements the bulk of MCTS. Depending on the state of the node
it’s at, it does one of two things:
If there are unexplored moves (playouts == 0
), it randomly chooses
an unplayed move, allocates exactly one node for the state behind that
move, and simulates the rest of the game in a loop, without recursion
or allocating any more nodes.
If all moves have been explored at least once, it uses an upper confidence bound (UCB1) to randomly choose a move, weighed towards both moves that are under-explored and moves which are strongest. Striking that balance is one of the challenges. It recurses into that next state, then updates the node with the result as it propagates back to the root.
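A textbook UCB1 selection over a node’s moves might look like the sketch below. The exploration constant and the choice of a strict maximum (rather than the weighted random choice described above) are my simplifications, and score[m] is assumed to accumulate playout rewards so that score/playouts is a mean:
#include <math.h>

static int
select_move(const struct connect4_node *n, uint32_t total_playouts)
{
    int best = 0;
    float best_value = -1.0f;
    for (int m = 0; m < CONNECT4_WIDTH; m++) {
        /* on this branch every move has at least one playout */
        float mean = n->score[m] / n->playouts[m];
        float explore = sqrtf(2.0f * logf((float)total_playouts)
                              / n->playouts[m]);
        float value = mean + explore;
        if (value > best_value) {
            best_value = value;
            best = m;
        }
    }
    return best;
}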
That’s pretty much all there is to it.
Yavalath is a board game invented by a computer program. It’s a pretty fascinating story. Its strategic depth is disproportionate to its dead simple rules: get four in a row without first getting three in a row. The game revolves around forced moves.
The engine is structured almost identically to the Connect Four engine. It uses 32-bit indexes instead of pointers. The game state is a pair of bitboards, with end-game masks computed at compile time via metaprogramming. The AI allocates the tree from a single, massive buffer — multiple GBs in this case, dynamically scaled to the available physical memory. And the core MCTS function is nearly identical.
One important difference is that identical game states — states where the pieces on the board are the same, but the node was reached through a different series of moves — are coalesced into a single state in the tree. This state deduplication is done through a hash table. This saves on memory and allows multiple different paths through the game tree to share playouts. It comes at a cost of including the game state in the node (so it can be identified in the hash table) and reference counting the nodes (since they might have more than one parent).
Unfortunately the AI has blind spots, and once you learn to spot them it becomes easy to beat consistently. It can’t spot certain kinds of forced moves, so it always falls for the same tricks. The official Yavalath AI is slightly stronger than mine, but has a similar blindness. I think MCTS just isn’t quite a good fit for Yavalath.
The AI’s blindness is caused by shallow traps, a common problem for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is a branch in the game tree where the game will abruptly end in a small number of turns. If the random tree search doesn’t luckily stumble upon a trap during its random traversal, it can’t take it into account in its final decision. A skilled player will lead the game towards one of these traps, and the AI will blunder along, not realizing what’s happened until it’s too late.
I almost feel bad for it when this happens. If you watch the memory usage and number of playouts, once it falls into a trap, you’ll see it using almost no memory while performing a ton of playouts. It’s desperately, frantically searching for a way out of the trap. But it’s too late, little AI.
I’m really happy to have sunk a couple weekends into playing with MCTS. It’s not always a great fit, as seen with Yavalath, but it’s a really neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready to use it should I run into an appropriate problem in the future.
With some up-front attention to detail, this is actually not terribly difficult. Unix-like systems are probably the least diverse and least buggy they’ve ever been. Writing portable code is really just a matter of coding to the standards and ignoring extensions unless absolutely necessary. Knowing what’s standard and what’s extension is the tricky part, but I’ll explain how to find this information.
You might be tempted to reach for an overly complicated solution
such as GNU Autoconf. Sure, it creates a configure script with the
familiar, conventional interface. This has real value. But do you
really need to run a single-threaded gauntlet of hundreds of
feature/bug tests for things that sometimes worked incorrectly in some
weird unix variant back in the 1990s? On a machine with many cores
(parallel build, -j
), this may very well be the slowest part of the
whole build process.
For example, the configure script for Emacs checks that the compiler
supplies stdlib.h
, string.h
, and getenv
— things that were
standardized nearly 30 years ago. It also checks for a slew of POSIX
functions that have been standard since 2001.
There’s a much easier solution: Document that the application requires, say, C99 and POSIX.1-2001. It’s the responsibility of the person building the application to supply these implementations, so there’s no reason to waste time testing for it.
Suppose there’s some function you want to use, but you’re not sure if it’s standard or an extension. Or maybe you don’t know what standard it comes from. Luckily the man pages document this stuff very well, especially on Linux. Check the friendly “CONFORMING TO” section. For example, look at getenv(3). Here’s what that section has to say:
CONFORMING TO
getenv(): SVr4, POSIX.1-2001, 4.3BSD, C89, C99.
secure_getenv() is a GNU extension.
This says this function comes from the original C standard. It’s always
available on anything that claims to be a C implementation. The man page
also documents secure_getenv()
, which is a GNU extension: to be avoided
in anything intended to be portable.
What about sleep(3)?
CONFORMING TO
POSIX.1-2001.
This function isn’t part of standard C, but it’s available on any system
claiming to implement POSIX.1-2001 (the POSIX standard from 2001). If the
program needs to run on an operating system not implementing this POSIX
standard (i.e. Windows), you’ll need to call an alternative function,
probably inside a different #if .. #endif
branch. More on this in a
moment.
If you’re coding to POSIX, you must define the _POSIX_C_SOURCE
feature test macro to the standard you intend to use prior to
any system header includes:
A POSIX-conforming application should ensure that the feature test macro
_POSIX_C_SOURCE
is defined before inclusion of any header.
For example, to properly access POSIX.1-2001 functions in your
application, define _POSIX_C_SOURCE
to 200112L
. With this defined,
it’s safe to assume access to all of C and everything from that standard
of POSIX. You can do this at the top of your sources, but I personally
like the tidiness of a global config.h
that gets included before
everything.
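A minimal sketch of what that looks like (config.h per the text above; the example source file is mine):
/* config.h */
#define _POSIX_C_SOURCE 200112L

/* at the top of every source file */
#include "config.h"  /* must precede any system header */
#include <unistd.h>  /* sleep() from POSIX.1-2001 is now declared */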
So you’ve written clean, portable C to the standards. How do you build
this application? The natural choice is make
. It’s available
everywhere and it’s part of POSIX.
Again, the tricky part is teasing apart the standard from the extension. I’m a long-time sinner in this regard, having far too often written Makefiles that depend on GNU Make extensions. This is a real pain when building programs on systems without the GNU utilities. I’ve been making amends (and finding some bugs as a result).
No implementation makes the division clear in its documentation, and
especially don’t bother looking at the GNU Make manual. Your best
resource is the standard itself. If you’re already familiar with
make
, coding to the standard is largely a matter of unlearning the
various extensions you know.
Outside of some hacks, this means you don’t get conditionals
(if
, else
, etc.). With some practice, both with sticking to portable
code and writing portable Makefiles, you’ll find that you don’t really
need them. Following the macro conventions will cover most situations.
For example:
CC: the C compiler program
CFLAGS: flags to pass to the C compiler
LDFLAGS: flags to pass to the linker (via the C compiler)
LDLIBS: libraries to pass to the linker
You don’t need to do anything weird with the assignments. The user
invoking make
can override them easily. For example, here’s part of a
Makefile:
CC = c99
CFLAGS = -Wall -Wextra -Os
But the user wants to use clang
, and their system needs to explicitly
link -lsocket
(e.g. Solaris). The user can override the macro
definitions on the command line:
$ make CC=clang LDLIBS=-lsocket
The same rules apply to the programs you invoke from the Makefile. Read
the standards documents and ignore your system’s man pages so as to avoid
accidentally using an extension. It’s especially valuable to learn the
Bourne shell language and avoid any accidental bashisms in your
Makefiles and scripts. The dash
shell is good for testing your scripts.
Makefiles conforming to the standard will, unfortunately, be more verbose
than those taking advantage of a particular implementation. If you know
how to code Bourne shell — which is not terribly difficult to learn —
then you might even consider hand-writing a configure
script to
generate the Makefile (a la metaprogramming). This gives you a more
flexible language with conditionals, and, being generated, redundancy in
the Makefile no longer matters.
As someone who frequently dabbles with BSD systems, my life has gotten a lot easier since learning to write portable Makefiles and scripts.
It’s the elephant in the room and I’ve avoided talking about it so far.
If you want to build with Visual Studio’s command line tools —
something I do on occasion — build portability goes out the window.
Visual Studio has nmake.exe
, which nearly conforms to POSIX make
.
However, without the standard unix utilities and with the completely
foreign compiler interface for cl.exe
, there’s absolutely no hope of
writing a Makefile portable to this situation.
The nice alternative is MinGW(-w64) with MSYS or Cygwin supplying the
unix utilities, though it has the problem of linking against
msvcrt.dll
. Another option is a separate Makefile dedicated to
nmake.exe
and the Visual Studio toolchain. Good luck defining a
correctly working “clean” target with del.exe
.
My preferred approach lately is an amalgamation build (as seen in
Enchive): Carefully concatenate all the application’s sources
into one giant source file. First concatenate all the headers in the
right order, followed by all the C files. Use sed
to remove any local
includes. You can do this all on a unix system with the nice utilities,
then point cl.exe
at the amalgamation for the Visual Studio build.
It’s not very useful for actual development (i.e. you don’t want to edit
the amalgamation), but that’s what MinGW-w64 resolves.
What about all those POSIX functions? You’ll need to find Win32
replacements on MSDN. I prefer to do this is by abstracting those
operating system calls. For example, compare POSIX sleep(3)
and Win32
Sleep()
.
#if defined(_WIN32)
#include <windows.h>
void
my_sleep(int s)
{
Sleep(s * 1000); // TODO: handle overflow, maybe
}
#else /* __unix__ */
#include <unistd.h>
void
my_sleep(int s)
{
sleep(s); // TODO: fix signal interruption
}
#endif
Then the rest of the program calls my_sleep()
. There’s another example
in the OpenMP article with pwrite(2)
and WriteFile()
. This
demonstrates that supporting a bunch of different unix-like systems is
really easy, but introducing Windows portability adds a disproportionate
amount of complexity.
There’s one major complication with filenames for applications portable to Windows. In the unix world, filenames are null-terminated bytestrings. Typically these are Unicode strings encoded as UTF-8, but it’s not necessarily so. The kernel just sees bytestrings. A bytestring doesn’t necessarily have a formal Unicode representation, which can be a problem for languages that want filenames to be Unicode strings (also).
On Windows, filenames are somewhere between UCS-2 and UTF-16, but end up being neither. They’re really null-terminated unsigned 16-bit integer arrays. It’s almost UTF-16 except that Windows allows unpaired surrogates. This means Windows filenames also don’t have a formal Unicode representation, but in a completely different way than unix. Some heroic efforts have gone into working around this issue.
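A quick Win32 demonstration of that point — my example, not from the article — creates a file whose name contains a lone high surrogate, which is not valid UTF-16 yet is accepted by the system:
#include <windows.h>

int
main(void)
{
    /* 0xD800 is an unpaired surrogate: not representable as Unicode text */
    wchar_t name[] = {0xD800, L'x', 0};
    HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, 0, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, 0);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}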
As a result, it’s highly non-trivial to correctly support all possible filenames on both systems in the same program, especially when they’re passed as command line arguments.
The key points are:
This was all a discussion of non-GUI applications, and I didn’t really
touch on libraries. Many libraries are simple to access in the build
(just add it to LDLIBS
), but some libraries — GUIs in particular — are
particularly complicated to manage portably and will require a more
complex solution (pkg-config, CMake, Autoconf, etc.).
Here’s an example that computes the frames of a video in parallel. Despite being computed out of order, each frame is written in order to a large buffer, then written to standard output all at once at the end.
size_t size = sizeof(struct frame) * num_frames;
struct frame *output = malloc(size);
float beta = DEFAULT_BETA;
/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
float theta = compute_theta(i);
compute_frame(&output[i], theta, beta);
}
write(STDOUT_FILENO, output, size);
free(output);
Adding OpenMP to this program is much simpler than introducing
low-level threading semantics with, say, Pthreads. With care, there’s
often no need for explicit thread synchronization. It’s also fairly
well supported by many vendors, even Microsoft (up to OpenMP 2.0), so
a multi-threaded OpenMP program is quite portable without #ifdef
.
There’s real value in this pragma API: The above example would still compile and run correctly even when OpenMP isn’t available. The pragma is ignored and the program just uses a single core like it normally would. It’s a slick fallback.
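If a program ever does need to know at compile time whether OpenMP is active — say, to call one of its library functions — the standard _OPENMP macro preserves the same graceful fallback. A small sketch:
#ifdef _OPENMP
#include <omp.h>
#endif

static int
worker_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();  /* threads OpenMP is prepared to use */
#else
    return 1;                      /* no OpenMP: single-threaded fallback */
#endif
}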
When a program really does require synchronization there’s
omp_lock_t
(mutex lock) and the expected set of functions to operate
on them. This doesn’t have the nice fallback, so I don’t like to use
it. Instead, I prefer #pragma omp critical
. It nicely maintains the
OpenMP-unsupported fallback.
/* schedule(dynamic, 1): treat the loop like a work queue */
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
struct frame *frame = malloc(sizeof(*frame));
float theta = compute_theta(i);
compute_frame(frame, theta, beta);
#pragma omp critical
{
write(STDOUT_FILENO, frame, sizeof(*frame));
}
free(frame);
}
This would append the output to some output file in an arbitrary order. The critical section prevents interleaving of outputs.
There are a couple of problems with this example:
Only one thread can write at a time. If the write takes too long, other threads will queue up behind the critical section and wait.
The output frames will be out of order, which is probably
inconvenient for consumers. If the output is seekable this can be
solved with lseek()
, but that only makes the critical section
even more important.
There’s an easy fix for both, and eliminates the need for a critical
section: POSIX pwrite()
.
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
It’s like write()
but has an offset parameter. Unlike lseek()
followed by a write()
, multiple threads and processes can, in
parallel, safely write to the same file descriptor at different file
offsets. The catch is that the output must be a file, not a pipe.
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < num_frames; i++) {
size_t size = sizeof(struct frame);
struct frame *frame = malloc(size);
float theta = compute_theta(i);
compute_frame(frame, theta, beta);
pwrite(STDOUT_FILENO, frame, size, size * i);
free(frame);
}
There’s no critical section, the writes can interleave, and the output is in order.
If you’re concerned about standard output not being seekable (it often isn’t), keep in mind that it will work just fine when invoked like so:
$ ./compute_frames > frames.ppm
I talked about OpenMP being really portable, then used POSIX
functions. Fortunately the Win32 WriteFile()
function has an
“overlapped” parameter that works just like pwrite()
. Typically
rather than call either directly, I’d wrap the write like so:
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
static int
write_frame(struct frame *f, int i)
{
HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD written;
OVERLAPPED offset = {.Offset = sizeof(*f) * i};
return WriteFile(out, f, sizeof(*f), &written, &offset);
}
#else /* POSIX */
#include <unistd.h>
static int
write_frame(struct frame *f, int i)
{
size_t count = sizeof(*f);
size_t offset = sizeof(*f) * i;
return pwrite(STDOUT_FILENO, f, count, offset) == count;
}
#endif
Except for switching to write_frame()
, the OpenMP part remains
untouched.
Here’s an example in a real program:
Notice that because of pwrite()
there’s no piping directly into
ppmtoy4m
:
$ ./julia > output.ppm
$ ppmtoy4m -F 60:1 < output.ppm > output.y4m
$ x264 -o output.mp4 output.y4m
So that’s what he did: Elfuse. It’s an old joke that Emacs is an operating system, and here it is handling system calls.
However, there’s a tricky problem to solve, an issue also present in my joystick module. Both modules handle asynchronous events — filesystem requests or joystick events — but Emacs runs the event loop and owns the main thread. The external events somehow need to feed into the main event loop. It’s even more difficult with FUSE because FUSE also wants control of its own thread for its own event loop. This requires Elfuse to spawn a dedicated FUSE thread and negotiate a request/response hand-off.
When a filesystem request or joystick event arrives, how does Emacs know to handle it? The simple and obvious solution is to poll the module from a timer.
struct queue requests;
emacs_value
Frequest_next(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
emacs_value next = Qnil;
queue_lock(requests);
if (queue_length(requests) > 0) {
void *request = queue_pop(requests, env);
next = env->make_user_ptr(env, fin_empty, request);
}
queue_unlock(requests);
return next;
}
And then ask Emacs to check the module every, say, 10ms:
(defun request--poll ()
(let ((next (request-next)))
(when next
(request-handle next))))
(run-at-time 0 0.01 #'request--poll)
Blocking directly on the module’s event pump with Emacs’ thread would prevent Emacs from doing important things like, you know, being a text editor. The timer allows it to handle its own events uninterrupted. It gets the job done, but it’s far from perfect:
It imposes an arbitrary latency to handling requests. Up to the poll period could pass before a request is handled.
Polling the module 100 times per second is inefficient. Unless you really enjoy recharging your laptop, that’s no good.
The poll period is a sliding trade-off between latency and battery life. If only there was some mechanism to, ahem, signal the Emacs thread, informing it that a request is waiting…
Emacs Lisp programs can handle the POSIX SIGUSR1 and SIGUSR2 signals,
which is exactly the mechanism we need. The interface is a “key”
binding on special-event-map
, the keymap that handles these kinds of
events. When the signal arrives, Emacs queues it up for the main event
loop.
(define-key special-event-map [sigusr1]
(lambda ()
(interactive)
(request-handle (request-next))))
The module blocks on its own thread on its own event pump. When a
request arrives, it queues the request, rings the bell for Emacs to
come handle it (raise()
), and waits on a semaphore. For illustration
purposes, assume the module reads requests from and writes responses
to a file descriptor, like a socket.
int event_fd = /* ... */;
struct request request;
sem_init(&request.sem, 0, 0);
for (;;) {
/* Blocking read for request event */
read(event_fd, &request.event, sizeof(request.event));
/* Put request on the queue */
queue_lock(requests);
queue_push(requests, &request);
queue_unlock(requests);
raise(SIGUSR1); // TODO: Should raise() go inside the lock?
/* Wait for Emacs */
while (sem_wait(&request.sem))
;
/* Reply with Emacs' response */
write(event_fd, &request.response, sizeof(request.response));
}
The sem_wait()
is in a loop because signals will wake it up
prematurely. In fact, it may even wake up due to its own signal on the
line before. This is the only way this particular use of sem_wait()
might fail, so there’s no need to check errno
.
If there are multiple module threads making requests to the same global queue, the lock is necessary to protect the queue. The semaphore is only for blocking the thread until Emacs has finished writing its particular response. Each thread has its own semaphore.
When Emacs is done writing the response, it releases the module thread by incrementing the semaphore. It might look something like this:
emacs_value
Frequest_complete(emacs_env *env, ptrdiff_t n, emacs_value *args, void *p)
{
struct request *request = env->get_user_ptr(env, args[0]);
if (request)
sem_post(&request->sem);
return Qnil;
}
The top-level handler dispatches to the specific request handler,
calling request-complete
above when it’s done.
(defun request-handle (next)
(condition-case e
(cl-ecase (request-type next)
(:open (request-handle-open next))
(:close (request-handle-close next))
(:read (request-handle-read next)))
(error (request-respond-as-error next e)))
(request-complete next))
This SIGUSR1+semaphore mechanism is roughly how Elfuse currently processes requests.
Windows doesn’t have signals. This isn’t a problem for Elfuse since Windows doesn’t have FUSE either. Nor does it matter for Joymacs since XInput isn’t event-driven and always requires polling. But someday someone will need this mechanism for a dynamic module on Windows.
Fortunately there’s a solution: input language change events,
WM_INPUTLANGCHANGE
. It’s also on special-event-map
:
(define-key special-event-map [language-change]
(lambda ()
(interactive)
(request-process (request-next))))
Instead of raise()
(or pthread_kill()
), broadcast the window event
with PostMessage()
. Outside of invoking the language-change
key
binding, Emacs will ignore the event because WPARAM is 0 — it doesn’t
belong to any particular window. We don’t really want to change the
input language, after all.
PostMessageA(HWND_BROADCAST, WM_INPUTLANGCHANGE, 0, 0);
Naturally you’ll also need to replace the POSIX threading primitives
with the Windows versions (CreateThread()
, CreateSemaphore()
,
etc.). With a bit of abstraction in the right places, it should be
pretty easy to support both POSIX and Windows in these asynchronous
dynamic module events.
If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.
The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.
For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.
CFG works similarly — performing a check prior to passing control to the address in a pointer — except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed on all valid call targets, whether from the main program or from shared libraries.
While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.
Consider this trivial C program, demo.c
:
int
main(void)
{
char name[8];
gets(name);
printf("Hello, %s.\n", name);
return 0;
}
It reads a name into a buffer and prints it back out with a greeting.
While trivial, it’s far from innocent. That naive call to gets()
doesn’t check the bounds of the buffer, introducing an exploitable
buffer overflow. It’s so obvious that both the compiler and linker
will yell about it.
For simplicity, suppose the program also contains a dangerous function.
void
self_destruct(void)
{
puts("**** GO BOOM! ****");
}
The attacker can use the buffer overflow to call this dangerous function.
To make this attack simpler for the sake of the article, assume the
program isn’t using ASLR (e.g. without -fpie
/-pie
, or with
-fno-pie
/-no-pie
). For this particular example, I’ll also
explicitly disable buffer overflow protections (e.g. _FORTIFY_SOURCE
and stack protectors).
$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
-o demo demo.c
First, find the address of self_destruct()
.
$ readelf -a demo | grep self_destruct
46: 00000000004005c5 10 FUNC GLOBAL DEFAULT 13 self_destruct
This is on x86-64, so it’s a 64-bit address. The size of the name
buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes
allocated above, so there’s 16 bytes to fill, then 8 bytes to
overwrite the return pointer with the address of self_destruct
.
$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault
With this input I’ve successfully exploited the buffer overflow to
divert control to self_destruct()
. When main
tries to return into
libc, it instead jumps to the dangerous function, and then crashes
when that function tries to return — though, presumably, the system
would have self-destructed already. Turning on the stack protector
stops this exploit.
$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
-o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxaaaaaaaa?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...
The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.
The stack protector transformed the program into something that looks like the following:
int
main(void)
{
long __canary = __get_thread_canary();
char name[8];
gets(name);
printf("Hello, %s.\n", name);
if (__canary != __get_thread_canary())
abort();
return 0;
}
However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, and the canary can only be modified by a buffer overflow, so the compiler would be free to assume the canary never changes and optimize the check away.
After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:
void
self_destruct(char *password)
{
if (strcmp(password, "12345") == 0)
puts("**** GO BOOM! ****");
}
The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.
Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.
struct greeter {
char name[8];
void (*greet)(struct greeter *);
};
void
greet_hello(struct greeter *g)
{
printf("Hello, %s.\n", g->name);
}
void
greet_aloha(struct greeter *g)
{
printf("Aloha, %s.\n", g->name);
}
There’s now a greeter object and the function pointer makes its
behavior polymorphic. Think of it as a hand-coded virtual function for
C. Here’s the new (contrived) main
:
int
main(void)
{
struct greeter greeter = {.greet = greet_hello};
gets(greeter.name);
greeter.greet(&greeter);
return 0;
}
(In a real program, something else provides greeter
and picks its
own function pointer for greet
.)
Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.
$ readelf -a demo | grep self_destruct
54: 00000000004006a5 10 FUNC GLOBAL DEFAULT 13 self_destruct
We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).
$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****
Neither the stack protector nor the password were of any help. The stack protector only protects the return pointer, not the function pointer on the struct.
This is where the Control Flow Guard comes into play. With CFG
enabled, the compiler inserts a check before calling the greet()
function pointer. It must point to the beginning of a known function,
otherwise it will abort just like the stack protector. Since the
middle of self_destruct()
isn’t the beginning of a function, it
would abort if this exploit is attempted.
However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.
As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contain the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.
For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit bit index means the bitmap will only be 2MB.
These 24 bits are perfectly sufficient for 32-bit systems, but it means on 64-bit systems there may be false positives: some addresses will not represent the start of a function, but will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the bitmap, reducing the false positive rate.
Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.
Here are the CFG parameters. I’ve made them macros so that they can
easily be tuned at compile-time. The cfg_bits
is the integer type
backing the bitmap array. The CFG_RESOLUTION
is the number of bits
dropped, so “3” is a granularity of 8 bytes.
typedef unsigned long cfg_bits;
#define CFG_RESOLUTION 3
#define CFG_BITS 24
Given a function pointer f
, this macro extracts the bitmap index.
#define CFG_INDEX(f) \
(((uintptr_t)f >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))
The CFG bitmap is just an array of integers. Zero it to initialize.
struct cfg {
cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};
Functions are manually registered in the bitmap using
cfg_register()
.
void
cfg_register(struct cfg *cfg, void *f)
{
unsigned long i = CFG_INDEX(f);
size_t z = sizeof(cfg_bits) * CHAR_BIT;
cfg->bitmap[i / z] |= 1UL << (i % z);
}
Because functions are registered at run-time, it’s fully compatible
with ASLR. If ASLR is enabled, the bitmap will be a little different
each run. On the same note, it may be worth XORing each bitmap element
with a random, run-time value — along the same lines as the stack
canary value — to make it harder for an attacker to manipulate the
bitmap should he gain the ability to overwrite it through a vulnerability.
Alternatively the bitmap could be switched to read-only (e.g.
mprotect()
) once everything is registered.
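As a concrete illustration of that last idea, a sealing function might look like this minimal sketch (my own, not from the original code), assuming the struct cfg object is page-aligned, say allocated with mmap(), and that <sys/mman.h> is included:
void
cfg_seal(struct cfg *cfg)
{
    /* Freeze the bitmap; any later cfg_register() will fault. */
    if (mprotect(cfg, sizeof(*cfg), PROT_READ))
        abort();
}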
And finally, the check function, used immediately before indirect
calls. It ensures f
was previously passed to cfg_register()
(except for false positives, as discussed). Since it will be invoked
often, it needs to be fast and simple.
void
cfg_check(struct cfg *cfg, void *f)
{
unsigned long i = CFG_INDEX(f);
size_t z = sizeof(cfg_bits) * CHAR_BIT;
if (!((cfg->bitmap[i / z] >> (i % z)) & 1))
abort();
}
And that’s it! Now augment main
to make use of it:
struct cfg cfg;
int
main(void)
{
cfg_register(&cfg, self_destruct); // to prove this works
cfg_register(&cfg, greet_hello);
cfg_register(&cfg, greet_aloha);
struct greeter greeter = {.greet = greet_hello};
gets(greeter.name);
cfg_check(&cfg, greeter.greet);
greeter.greet(&greeter);
return 0;
}
And now attempting the exploit:
$ ./demo < boom
Aborted
Normally self_destruct()
wouldn’t be registered since it’s not a
legitimate target of an indirect call, but the exploit still didn’t
work because it called into the middle of self_destruct()
, which
isn’t a valid address in the bitmap. The check aborts the program
before it can be exploited.
In a real application I would have a global cfg
bitmap for
the whole program, and define cfg_check()
in a header as an inline
function.
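That header might contain something like the following sketch (my own; cfg_global is a hypothetical name for the program-wide bitmap):
/* cfg.h */
extern struct cfg cfg_global;

static inline void
cfg_check_global(void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg_global.bitmap[i / z] >> (i % z)) & 1))
        abort();
}
Call sites would then run cfg_check_global(greeter.greet) immediately before each indirect call.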
Despite it being possible to implement in straight C without help from the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.
Update: Ted Unangst pointed out OpenBSD performing a similar check in its mbuf library. Instead of a bitmap, the function pointer is replaced with an index into an array of registered function pointers. That approach is cleaner, more efficient, completely portable, and has no false positives.
The C standard library includes two generic functions, qsort() and bsearch(), each requiring a comparator function in order to operate on arbitrary types.
void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
void *bsearch(const void *key, const void *base,
size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
A problem with these functions is that there’s no way to pass context to the callback. The callback may need information beyond the two element pointers when making its decision, or to update a result. For example, suppose I have a structure representing a two-dimensional coordinate, and a coordinate distance function.
struct coord {
float x;
float y;
};
static inline float
distance(const struct coord *a, const struct coord *b)
{
float dx = a->x - b->x;
float dy = a->y - b->y;
return sqrtf(dx * dx + dy * dy);
}
If I have an array of coordinates and I want to sort them based on
their distance from some target, the comparator needs to know the
target. However, the qsort()
interface has no way to directly pass
this information. Instead it has to be passed by another means, such
as a global variable.
struct coord *target;
int
coord_cmp(const void *a, const void *b)
{
float dist_a = distance(a, target);
float dist_b = distance(b, target);
if (dist_a < dist_b)
return -1;
else if (dist_a > dist_b)
return 1;
else
return 0;
}
And its usage:
size_t ncoords = /* ... */;
struct coord *coords = /* ... */;
struct coord current_target = { /* ... */ };
// ...
target = &current_target;
qsort(coords, ncoords, sizeof(coords[0]), coord_cmp);
Potential problems are that it’s neither thread-safe nor re-entrant. Two different threads cannot use this comparator at the same time. Also, on some platforms and configurations, repeatedly accessing a global variable in a comparator may have a significant cost. A common workaround for thread safety is to make the global variable thread-local by allocating it in thread-local storage (TLS):
_Thread_local struct coord *target; // C11
__thread struct coord *target; // GCC and Clang
__declspec(thread) struct coord *target; // Visual Studio
This makes the comparator thread-safe. However, it’s still not re-entrant (usually unimportant) and accessing thread-local variables on some platforms is even more expensive — which is the situation for Pthreads TLS, though not a problem for native x86-64 TLS.
Modern libraries usually provide some sort of “user data” pointer — a
generic pointer that is passed to the callback function as an
additional argument. For example, the GNU C Library has long had
qsort_r()
: re-entrant qsort.
void qsort_r(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *, void *),
void *arg);
The new comparator looks like this:
int
coord_cmp_r(const void *a, const void *b, void *target)
{
float dist_a = distance(a, target);
float dist_b = distance(b, target);
if (dist_a < dist_b)
return -1;
else if (dist_a > dist_b)
return 1;
else
return 0;
}
And its usage:
void *arg = &current_target;
qsort_r(coords, ncoords, sizeof(coords[0]), coord_cmp_r, arg);
User data arguments are thread-safe, re-entrant, performant, and perfectly portable. They completely and cleanly solve the entire problem with virtually no drawbacks. If every library did this, there would be nothing left to discuss and this article would be boring.
In order to make things more interesting, suppose you’re stuck calling a function in some old library that takes a callback but doesn’t support a user data argument. A global variable is insufficient, and the thread-local storage solution isn’t viable for one reason or another. What do you do?
The core problem is that a function pointer is just an address, and it’s the same address no matter the context for any particular callback. On any particular call, the callback has three ways to distinguish this call from other calls, and they align with the three solutions above: it can consult global state, consult its own thread’s state, or consult an explicit argument.
A wholly different approach is to use a unique function pointer for
each callback. The callback could then inspect its own address to
differentiate itself from other callbacks. Imagine defining multiple
instances of coord_cmp
each getting their context from a different
global variable. Using a unique copy of coord_cmp
on each thread for
each usage would be both re-entrant and thread-safe, and wouldn’t
require TLS.
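Before automating it, here is the idea spelled out by hand (my own illustration, clearly not scalable): two copies of the comparator, each hard-wired to a different global target.
static struct coord target_a;
static struct coord target_b;

static int
coord_cmp_a(const void *a, const void *b)
{
    float dist_a = distance(a, &target_a);
    float dist_b = distance(b, &target_a);
    return (dist_a > dist_b) - (dist_a < dist_b);
}

static int
coord_cmp_b(const void *a, const void *b)
{
    float dist_a = distance(a, &target_b);
    float dist_b = distance(b, &target_b);
    return (dist_a > dist_b) - (dist_a < dist_b);
}
Two threads, or two nested calls, can each claim one copy and its target, with no TLS involved.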
Taking this idea further, I’d like to generate these new functions on demand at run time akin to a JIT compiler. This can be done as a library, mostly agnostic to the implementation of the callback. Here’s an example of what its usage will be like:
void *closure_create(void *f, int nargs, void *userdata);
void closure_destroy(void *);
The callback to be converted into a closure is f
and the number of
arguments it takes is nargs
. A new closure is allocated and returned
as a function pointer. This closure takes nargs - 1
arguments, and
it will call the original callback with the additional argument
userdata
.
So, for example, this code uses a closure to convert coord_cmp_r
into a function suitable for qsort()
:
int (*closure)(const void *, const void *);
closure = closure_create(coord_cmp_r, 3, &current_target);
qsort(coords, ncoords, sizeof(coords[0]), closure);
closure_destroy(closure);
Caveat: This API is utterly insufficient for any sort of portability. The number of arguments isn’t nearly enough information for the library to generate a closure. For practically every architecture and ABI, it’s going to depend on the types of each of those arguments. On x86-64 with the System V ABI — where I’ll be implementing this — this argument will only count integer/pointer arguments. To find out what it takes to do this properly, see the libjit documentation.
This implementation will be for x86-64 Linux, though the high level details will be the same for any program running in virtual memory. My closures will span exactly two consecutive pages (typically 8kB), though it’s possible to use exactly one page depending on the desired trade-offs. The reason I need two pages is that each page will have different protections.
Native code — the thunk — lives in the upper page. The user data pointer and callback function pointer live at the high end of the lower page. The two pointers could really be anywhere in the lower page, and they’re only at the end for aesthetic reasons. The thunk code will be identical for all closures of the same number of arguments.
The upper page will be executable and the lower page will be writable. This allows new pointers to be set without writing to executable thunk memory. In the future I expect operating systems to enforce W^X (“write xor execute”), and this code will already be compliant. Alternatively, the pointers could be “baked in” with the thunk page and immutable, but since creating a closure requires two system calls, I figure it’s better that the pointers be mutable and the closure object reusable.
The address for the closure itself will be the upper page, being what other functions will call. The thunk will load the user data pointer from the lower page as an additional argument, then jump to the actual callback function also given by the lower page.
The x86-64 thunk assembly for a 2-argument closure calling a 3-argument callback looks like this:
user: dq 0
func: dq 0
;; --- page boundary here ---
thunk2:
mov rdx, [rel user]
jmp [rel func]
As a reminder, the integer/pointer argument register order for the
System V ABI calling convention is: rdi
, rsi
, rdx
, rcx
, r8
,
r9
. The third argument is passed through rdx
, so the user pointer
is loaded into this register. Then it jumps to the callback address
with the original arguments still in place, plus the new argument. The
user
and func
values are loaded RIP-relative (rel
) to the
address of the code. The thunk is using the callback address (its own
address) to determine the context.
The assembled machine code for the thunk is just 13 bytes:
unsigned char thunk2[13] = {
    // mov rdx, [rel user]  ; disp32 = -23 (user is 16 bytes below the page)
    0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
    // jmp [rel func]       ; disp32 = -21 (func is 8 bytes below the page)
    0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
};
All closure_create()
has to do is allocate two pages, copy this
buffer into the upper page, adjust the protections, and return the
address of the thunk. Since closure_create()
must work for any supported number of
arguments, there will actually be 6 slightly different
thunks, one for each of the possible register arguments (rdi
through
r9
).
static unsigned char thunk[6][13] = {
{
0x48, 0x8b, 0x3d, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
}, {
0x48, 0x8b, 0x35, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
}, {
0x48, 0x8b, 0x15, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
}, {
0x48, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
}, {
0x4C, 0x8b, 0x05, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
}, {
0x4C, 0x8b, 0x0d, 0xe9, 0xff, 0xff, 0xff,
0xff, 0x25, 0xeb, 0xff, 0xff, 0xff
},
};
Given a closure pointer returned from closure_create()
, here are the
setter functions for setting the closure’s two pointers.
void
closure_set_data(void *closure, void *data)
{
void **p = closure;
p[-2] = data;
}
void
closure_set_function(void *closure, void *f)
{
void **p = closure;
p[-1] = f;
}
In closure_create()
, allocation is done with an anonymous mmap()
,
just like in my JIT compiler. It’s initially mapped writable in
order to copy the thunk, then the thunk page is set to executable.
void *
closure_create(void *f, int nargs, void *userdata)
{
long page_size = sysconf(_SC_PAGESIZE);
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
char *p = mmap(0, page_size * 2, prot, flags, -1, 0);
if (p == MAP_FAILED)
return 0;
void *closure = p + page_size;
memcpy(closure, thunk[nargs - 1], sizeof(thunk[0]));
mprotect(closure, page_size, PROT_READ | PROT_EXEC);
closure_set_function(closure, f);
closure_set_data(closure, userdata);
return closure;
}
Destroying a closure is done by computing the lower page address and
calling munmap()
on it:
void
closure_destroy(void *closure)
{
long page_size = sysconf(_SC_PAGESIZE);
munmap((char *)closure - page_size, page_size * 2);
}
And that’s it! You can see the entire demo here:
It’s a lot simpler for x86-64 than it is for x86, where there’s no RIP-relative addressing and arguments are passed on the stack. The arguments must all be copied back onto the stack, above the new argument, and it cannot be a tail call since the stack has to be fixed before returning. Here’s what the thunk looks like for a 2-argument closure:
data: dd 0
func: dd 0
;; --- page boundary here ---
thunk2:
call .rip2eax
.rip2eax:
pop eax
push dword [eax - 13]
push dword [esp + 12]
push dword [esp + 12]
call [eax - 9]
add esp, 12
ret
Exercise for the reader: Port the closure demo to a different architecture or to the Windows x64 ABI.
Consider this simple C code sample.
static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};
float get_value(unsigned x)
{
return x < 4 ? values[x] : 0.0f;
}
This function needs the base address of values
in order to
dereference it for values[x]
. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).
$ gcc -c -Os -fPIC get_value.c
I optimized for size (-Os
) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with objdump
. Alternatively I
could have asked for the compiler’s assembly output with -S
, but
this will be good reverse engineering practice.
$ objdump -d -Mintel get_value.o
0000000000000000 <get_value>:
0: 83 ff 03 cmp edi,0x3
3: 0f 57 c0 xorps xmm0,xmm0
6: 77 0e ja 16 <get_value+0x16>
8: 48 8d 05 00 00 00 00 lea rax,[rip+0x0]
f: 89 ff mov edi,edi
11: f3 0f 10 04 b8 movss xmm0,DWORD PTR [rax+rdi*4]
16: c3 ret
There are a couple of interesting things going on, but let’s start from the beginning.
The ABI specifies that the first integer/pointer argument
(the 32-bit integer x
) is passed through the edi
register. The
function compares x
to 3, to satisfy x < 4
.
The ABI specifies that floating point values are returned through
the SSE2 SIMD register xmm0
. It’s cleared by XORing the
register with itself — the conventional way to clear registers on
x86 — setting up for a return value of 0.0f
.
It then uses the result of the previous comparison to perform a
jump, ja
(“jump if after”). That is, jump to the relative address
specified by the jump’s operand if the first operand to cmp
(edi
) comes after the second operand (0x3
) as unsigned values.
Its cousin, jg
(“jump if greater”), is for signed values. If x
is outside the array bounds, it jumps straight to ret
, returning
0.0f
.
If x
was in bounds, it uses a lea
(“load effective address”) to
load something into the 64-bit rax
register. This is the
complicated bit, and I’ll start by giving the answer: The value
loaded into rax
is the address of the values
array. More on
this in a moment.
Finally it uses x
as an index into address in rax
. The movss
(“move scalar single-precision”) instruction loads a 32-bit float
into the first lane of xmm0
, where the caller expects to find the
return value. This is all preceded by a mov edi, edi
which
looks like a hotpatch nop, but it isn’t. x86-64 always uses
64-bit registers for addressing, meaning it uses rdi
not edi
.
All 32-bit register assignments clear the upper 32 bits, and so
this mov
zero-extends edi
into rdi
. This guards against the
unlikely event that the caller left garbage in those upper bits.
xmm0
The first interesting part: xmm0
is cleared even when its first lane
is loaded with a value. There are two reasons to do this.
The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra ret
or an conditional jmp
over the “else” branch.
The less obvious reason is that it breaks a data dependency. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called register renaming. Architectural
registers (rax
, edi
, etc.) are just temporary names for
underlying physical registers. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.
For example, take this assembly sample. It assembles to 9 bytes of machine code.
mov edi, [rcx]
mov ecx, 7
shl eax, cl
This reads a 32-bit value from the address stored in rcx
, then
assigns ecx
and uses cl
(the lowest byte of rcx
) in a shift
operation. Without register renaming, the shift couldn’t be performed
until the load in the first instruction completed. However, the second
instruction is a 32-bit assignment, which, as I mentioned before, also
clears the upper 32 bits of rcx
, wiping the unused parts of
the register.
So after the second instruction, it’s guaranteed that the value in
rcx
has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, before the load. Ingenious!
Compare it to this example, where the second instruction assigns to
cl
instead of ecx
. This assembles to just 6 bytes.
mov edi, [rcx]
mov cl, 7
shl eax, cl
The result is 3 bytes smaller, but since it’s not a 32-bit assignment,
the upper bits of rcx
still hold the original register contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.
By clearing xmm0
, instructions in get_value
involving xmm0
have
the opportunity to be executed prior to instructions in the caller
that use xmm0
.
Going back to the instruction that computes the address of values
.
8: 48 8d 05 00 00 00 00 lea rax,[rip+0x0]
Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
RIP-relative addressing that’s still the case, but the register with
the absolute address is rip
, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.
That means this instruction copies the instruction pointer (pointing
to the next instruction) into rax
, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you want a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a relocation entry to the object file so that the linker knows how
to do this.
On platforms that use ELF we can inspect these relocations with
readelf
.
$ readelf -r get_value.o
Relocation section '.rela.text' at offset 0x270 contains 1 entries:
Offset Info Type Sym. Value
00000000000b 000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
The relocation type is R_X86_64_PC32
. In the AMD64 Architecture
Processor Supplement, this is defined as “S + A - P”.
S: Represents the value of the symbol whose index resides in the relocation entry.
A: Represents the addend used to compute the value of the relocatable field.
P: Represents the place of the storage unit being relocated.
The symbol, S, is .rodata
— the final address for this object file’s
portion of .rodata
(where values
resides). The addend, A, is -4
since the instruction pointer points at the next instruction. That
is, this will be relative to four bytes after the relocation offset.
Finally, the address of the relocation, P, is the address of last four
bytes of the lea
instruction. These values are all known at
link-time, so no run-time support is necessary.
Being “S - P” (overall), this will be the displacement between these two addresses: the 32-bit value is relative. It’s relocatable so long as these two parts of the binary (code and data) maintain a fixed distance from each other. The binary is relocated as a whole, so this assumption holds.
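As a hypothetical worked example (addresses invented for illustration): suppose the linker places values at 0x400660 and the lea's 4-byte displacement field at 0x40063b. Then S + A - P = 0x400660 - 4 - 0x40063b = 0x21 is written into the instruction, and at run time rip holds 0x40063f (the next instruction), so 0x40063f + 0x21 lands back on 0x400660.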
Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the -m32
flag for a 32-bit target, and -fomit-frame-pointer
to
make it simpler for explanatory purposes.
$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 <get_value>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: d9 ee fldz
6: e8 fc ff ff ff call 7 <get_value+0x7>
b: 81 c1 02 00 00 00 add ecx,0x2
11: 83 f8 03 cmp eax,0x3
14: 77 09 ja 1f <get_value+0x1f>
16: dd d8 fstp st(0)
18: d9 84 81 00 00 00 00 fld DWORD PTR [ecx+eax*4+0x0]
1f: c3 ret
Disassembly of section .text.__x86.get_pc_thunk.cx:
00000000 <__x86.get_pc_thunk.cx>:
0: 8b 0c 24 mov ecx,DWORD PTR [esp]
3: c3 ret
Hmm, this one includes an extra function.
In this calling convention, arguments are passed on the stack. The
first instruction loads the argument, x
, into eax
.
The fldz
instruction clears the x87 floating point return
register, just like clearing xmm0
in the x86-64 version.
Next it calls __x86.get_pc_thunk.cx
. The call pushes the
instruction pointer, eip
, onto the stack. This function reads
that value off the stack into ecx
and returns. In other words,
calling this function copies eip
into ecx
. It’s setting up to
load data at an address relative to the code. Notice the function
name starts with two underscores — a name which is reserved
exactly for these sorts of implementation purposes.
Next a 32-bit displacement is added to ecx
. In this case it’s
2
, but, like before, this is actually going to be filled in later by
the linker.
Then it’s just like before: a branch to optionally load a value.
The floating point load (fld
) is another relocation.
Let’s look at the relocations. There are three this time:
$ readelf -r get_value.o
Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
Offset Info Type Sym.Value Sym. Name
00000007 00000e02 R_386_PC32 00000000 __x86.get_pc_thunk.cx
0000000d 00000f0a R_386_GOTPC 00000000 _GLOBAL_OFFSET_TABLE_
0000001b 00000709 R_386_GOTOFF 00000000 .rodata
The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is R_386_PC32
: a code-relative
displacement just like on x86-64.
The next is of type R_386_GOTPC
and sets the second operand in that
add ecx
. It’s defined as “GOT + A - P” where “GOT” is the address of
the Global Offset Table — a table of addresses of the binary’s
relocated objects. Since values
is static, the GOT won’t actually
hold an address for it, but the relative address of the GOT itself
will be useful.
The final relocation is of type R_386_GOTOFF
. This is defined as
“S + A - GOT”. Another displacement between two addresses. This is the
displacement in the load, fld
. Ultimately the load adds these last
two relocations together, canceling the GOT:
(GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
So the GOT isn’t relevant in this case. It’s just a mechanism for constructing a custom relocation type.
Notice in the x86 version the thunk is called before checking the
argument. What if it’s most likely that x will
be out of bounds of
the array, and the function usually returns zero? That means it’s
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won’t know this.
The typical way to provide such a compiler hint is with a pair of
macros, likely()
and unlikely()
. With GCC and Clang, these would
be defined to use __builtin_expect
. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};
float get_value(unsigned x)
{
return unlikely(x < 4) ? values[x] : 0.0f;
}
Unfortunately this makes no difference even in the latest version of GCC. In Clang it changes branch fall-through (for static branch prediction), but still always calls the thunk. It seems compilers have difficulty with optimizing relocatable code on x86.
It’s commonly understood that the advantage of 64-bit versus 32-bit systems is processes having access to more than 4GB of memory. But as this shows, there’s more to it than that. Even programs that don’t need that much memory can really benefit from newer features like RIP-relative addressing.
Consider a binary file format laid out as a sequence of variable-sized event records, each beginning with this header:
struct event {
uint64_t time; // unix epoch (microseconds)
uint32_t size; // including this header (bytes)
uint16_t source;
uint16_t type;
};
The size
member is used to find the offset of the next structure in
the file without knowing anything else about the current structure.
Just add size
to the offset of the current structure.
The type
member indicates what kind of data follows this structure.
The program is likely to switch
on this value.
The actual structures might look something like this (in the spirit of
X-COM). Note how each structure begins with struct event
as
header. All angles are expressed using binary scaling.
#define EVENT_TYPE_OBSERVER 10
#define EVENT_TYPE_UFO_SIGHTING 20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL 30
struct observer {
struct event event;
uint32_t latitude; // binary scaled angle
uint32_t longitude; //
uint16_t source_id; // later used for event source
uint16_t name_size; // not including null terminator
char name[];
};
struct ufo_sighting {
struct event event;
uint32_t azimuth; // binary scaled angle
uint32_t elevation; //
};
struct suspicious_signal {
struct event event;
uint16_t num_channels;
uint16_t sample_rate; // Hz
uint32_t num_samples; // per channel
int16_t samples[];
};
If all integers are stored in little endian byte order (least significant byte first), there’s a strong temptation to lay the structures directly over the data. After all, this will work correctly on most computers.
struct event header;
fread(&header, sizeof(header), 1, file);
switch (header.type) {
// ...
}
This code will not work correctly when:
The host machine doesn’t use little endian byte order, though this is now uncommon. Sometimes developers will attempt to detect the byte order at compile time and use the preprocessor to byte-swap if needed. This is a mistake.
The host machine has different alignment requirements and so
introduces additional padding to the structure. Sometimes this can
be resolved with a non-standard #pragma pack
.
Fortunately it’s easy to write fast, correct, portable code for this
situation. First, define some functions to extract little endian
integers from an octet buffer (uint8_t
). These will work correctly
regardless of the host’s alignment and byte order.
static inline uint16_t
extract_u16le(const uint8_t *buf)
{
return (uint16_t)buf[1] << 8 |
(uint16_t)buf[0] << 0;
}
static inline uint32_t
extract_u32le(const uint8_t *buf)
{
return (uint32_t)buf[3] << 24 |
(uint32_t)buf[2] << 16 |
(uint32_t)buf[1] << 8 |
(uint32_t)buf[0] << 0;
}
static inline uint64_t
extract_u64le(const uint8_t *buf)
{
return (uint64_t)buf[7] << 56 |
(uint64_t)buf[6] << 48 |
(uint64_t)buf[5] << 40 |
(uint64_t)buf[4] << 32 |
(uint64_t)buf[3] << 24 |
(uint64_t)buf[2] << 16 |
(uint64_t)buf[1] << 8 |
(uint64_t)buf[0] << 0;
}
The big endian version is identical, but with shifts in reverse order.
A common concern is that these functions are a lot less efficient than
they could be. On x86 where alignment is very relaxed, each could be
implemented as a single load instruction. However, on GCC 4.x and
earlier, extract_u32le
compiles to something like this:
extract_u32le:
movzx eax, [rdi+3]
sal eax, 24
mov edx, eax
movzx eax, [rdi+2]
sal eax, 16
or eax, edx
movzx edx, [rdi]
or eax, edx
movzx edx, [rdi+1]
sal edx, 8
or eax, edx
ret
It’s tempting to fix the problem with the following definition:
// Note: Don't do this.
static inline uint32_t
extract_u32le(const uint8_t *buf)
{
return *(uint32_t *)buf;
}
It’s unportable, it’s undefined behavior, and worst of all, it might not work correctly even on x86. Fortunately I have some great news. On GCC 5.x and above, the correct definition compiles to the desired, fast version. It’s the best of both worlds.
extract_u32le:
mov eax, [rdi]
ret
It’s even smart about the big endian version:
static inline uint32_t
extract_u32be(const uint8_t *buf)
{
return (uint32_t)buf[0] << 24 |
(uint32_t)buf[1] << 16 |
(uint32_t)buf[2] << 8 |
(uint32_t)buf[3] << 0;
}
Is compiled to exactly what you’d want:
extract_u32be:
mov eax, [rdi]
bswap eax
ret
Or, even better, if your system supports movbe
(gcc -mmovbe
):
extract_u32be:
movbe eax, [rdi]
ret
Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m betting it will eventually learn how to do this, too.
For this next technique, that struct event
from above need not
actually be in the source. It’s purely documentation. Instead, let’s
define the structure in terms of member offset constants — a term I
just made up for this article. I’ve included the integer types as part
of the name to aid in their correct use.
#define EVENT_U64LE_TIME 0
#define EVENT_U32LE_SIZE 8
#define EVENT_U16LE_SOURCE 12
#define EVENT_U16LE_TYPE 14
Given a buffer, the integer extraction functions, and these offsets, structure members can be plucked out on demand.
uint8_t *buf;
// ...
uint64_t time = extract_u64le(buf + EVENT_U64LE_TIME);
uint32_t size = extract_u32le(buf + EVENT_U32LE_SIZE);
uint16_t source = extract_u16le(buf + EVENT_U16LE_SOURCE);
uint16_t type = extract_u16le(buf + EVENT_U16LE_TYPE);
On x86 with GCC 5.x, each member access will be inlined and compiled to a one-instruction extraction. As far as performance is concerned, it’s identical to using a structure overlay, but this time the C code is clean and portable. A slight downside is the lack of type checking on member access: it’s easy to mismatch the types and accidentally read garbage.
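One way to soften that downside (my own sketch, not part of the scheme above) is to pair each offset with its extractor in a tiny inline accessor, so a call site can't mismatch them:
static inline uint64_t
event_time(const uint8_t *buf)
{
    return extract_u64le(buf + EVENT_U64LE_TIME);
}

static inline uint32_t
event_size(const uint8_t *buf)
{
    return extract_u32le(buf + EVENT_U32LE_SIZE);
}

static inline uint16_t
event_type(const uint8_t *buf)
{
    return extract_u16le(buf + EVENT_U16LE_TYPE);
}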
There’s a real advantage to memory mapping the input file and using its contents directly. On a system with a huge virtual address space, such as x86-64 or AArch64, this memory is almost “free.” Already being backed by a file, paging out this memory costs nothing (i.e. it’s discarded). The input file can comfortably be much larger than physical memory without straining the system.
Unportable structure overlay can take advantage of memory mapping this way, but has the previously-described issues. An approach with member offset constants will take advantage of it just as well, all while remaining clean and portable.
I like to wrap the memory mapping code into a simple interface, which makes porting to non-POSIX platforms, such as Windows, easier. Caveat: This won’t work with files whose size exceeds the available contiguous virtual memory of the system — a real problem for 32-bit systems.
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
uint8_t *
map_file(const char *path, size_t *length)
{
int fd = open(path, O_RDONLY);
if (fd == -1)
return 0;
struct stat stat;
if (fstat(fd, &stat) == -1) {
close(fd);
return 0;
}
*length = stat.st_size; // TODO: possible overflow
uint8_t *p = mmap(0, *length, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
return p != MAP_FAILED ? p : 0;
}
void
unmap_file(uint8_t *p, size_t length)
{
munmap(p, length);
}
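For reference, the Windows side of that same interface could look roughly like this untested sketch of mine, built on CreateFileMapping() and MapViewOfFile() with minimal error handling:
#ifdef _WIN32
#include <windows.h>

uint8_t *
map_file(const char *path, size_t *length)
{
    HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (f == INVALID_HANDLE_VALUE)
        return 0;
    LARGE_INTEGER size;
    if (!GetFileSizeEx(f, &size)) {
        CloseHandle(f);
        return 0;
    }
    *length = (size_t)size.QuadPart; // TODO: possible overflow
    HANDLE m = CreateFileMappingA(f, 0, PAGE_READONLY, 0, 0, 0);
    CloseHandle(f); // the mapping object keeps the file open
    if (!m)
        return 0;
    uint8_t *p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(m); // the view keeps the mapping alive
    return p; // NULL on failure, matching the POSIX version
}

void
unmap_file(uint8_t *p, size_t length)
{
    (void)length; // not needed on Windows
    UnmapViewOfFile(p);
}
#endif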
Next, here’s an example that iterates over all the structures in
input_file
, in this case counting each. The size
member is
extracted in order to stride to the next structure.
size_t length;
uint8_t *data = map_file(input_file, &length);
if (!data)
FATAL();
size_t event_count = 0;
uint8_t *p = data;
while (p < data + length) {
event_count++;
uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
if (size > length - (p - data))
FATAL(); // invalid size
p += size;
}
printf("I see %zu events.\n", event_count);
unmap_file(data, length);
This is the basic structure for navigating this kind of data. A deeper
dive would involve a switch
inside the loop, extracting the relevant
members for whatever use is needed.
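That deeper dive might look roughly like the following sketch of mine, where the UFO_* offsets are hypothetical constants laid out to match struct ufo_sighting (whose members begin right after the 16-byte header):
#define UFO_U32LE_AZIMUTH 16
#define UFO_U32LE_ELEVATION 20

switch (extract_u16le(p + EVENT_U16LE_TYPE)) {
case EVENT_TYPE_UFO_SIGHTING: {
    uint32_t azimuth = extract_u32le(p + UFO_U32LE_AZIMUTH);
    uint32_t elevation = extract_u32le(p + UFO_U32LE_ELEVATION);
    /* ... log the sighting ... */
} break;
default:
    break; /* unknown event types are skipped via the size field */
}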
Fast, correct, simple. Pick three.
The simpler, less portable option for embedding raw data in a program is to have the linker do it. Both
the GNU linker and the gold linker (ELF only) can create
object files from arbitrary files using the --format
(-b
) option
set to binary
(raw data). It’s combined with --relocatable
(-r
)
to make it linkable with the rest of the program. MinGW supports all
of this, too, so it’s fairly portable so long as you stick to GNU
Binutils.
For example, to create an object file, my_msg.o
with the
contents of the text file my_msg.txt
:
$ ld -r -b binary -o my_msg.o my_msg.txt
(Update: You probably also want to use -z noexecstack
.)
The object file will have three symbols, each named after the input file. Unfortunately there’s no control over the symbol names, section (.data), alignment, or protections (e.g. read-only). You’re completely at the whim of the linker, short of objcopy tricks.
$ nm my_msg.o
000000000000000e D _binary_my_msg_txt_end
000000000000000e A _binary_my_msg_txt_size
0000000000000000 D _binary_my_msg_txt_start
To access these in C, declare them as global variables like so:
extern char _binary_my_msg_txt_start[];
extern char _binary_my_msg_txt_end[];
extern char _binary_my_msg_txt_size;
The size symbol, _binary_my_msg_txt_size
, is misleading. The “A”
from nm means it’s an absolute symbol, not relocated. It doesn’t refer
to an integer that holds the size of the raw data. The value of the
symbol itself is the size of the data. That is, take the address of it
and cast it to an integer.
size_t size = (size_t)&_binary_my_msg_txt_size;
Alternatively — and this is my own preference — just subtract the other two symbols. It’s cleaner and easier to understand.
size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;
Here’s the “Hello, world” for this approach (hello.c
).
#include <stdio.h>
extern char _binary_my_msg_txt_start[];
extern char _binary_my_msg_txt_end[];
extern char _binary_my_msg_txt_size;
int
main(void)
{
size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;
fwrite(_binary_my_msg_txt_start, size, 1, stdout);
return 0;
}
The program has to use fwrite()
rather than fputs()
because the
data won’t necessarily be null-terminated. That is, unless a null is
intentionally put at the end of the text file itself.
And for the build:
$ cat my_msg.txt
Hello, world!
$ ld -r -b binary -o my_msg.o my_msg.txt
$ gcc -o hello hello.c my_msg.o
$ ./hello
Hello, world!
If this was binary data, such as an image file, the program would instead read the array as if it were a memory mapped file. In fact, that’s what it really is: the raw data memory mapped by the loader before the program started.
This could be taken further to dump out some kinds of data structures.
For example, this program (table_gen.c
) fills out a table of the
first 90 Fibonacci numbers and dumps it to standard output.
#include <stdio.h>
#define TABLE_SIZE 90
long long table[TABLE_SIZE] = {1, 1};
int
main(void)
{
for (int i = 2; i < TABLE_SIZE; i++)
table[i] = table[i - 1] + table[i - 2];
fwrite(table, sizeof(table), 1, stdout);
return 0;
}
Build and run this intermediate helper program as part of the overall build.
$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen > table.bin
$ ld -r -b binary -o table.o table.bin
And then the main program (print_fib.c
) might look like:
#include <stdio.h>
extern long long _binary_table_bin_start[];
extern long long _binary_table_bin_end[];
int
main(void)
{
long long *start = _binary_table_bin_start;
long long *end = _binary_table_bin_end;
for (long long *x = start; x < end; x++)
printf("%lld\n", *x);
return 0;
}
However, there are some good reasons not to use this feature in this way:
The format of table.bin
is specific to the host architecture
(byte order, size, padding, etc.). If the host is the same as the
target then this isn’t a problem, but it will prohibit
cross-compilation.
The linker has no information about the alignment requirements of
the data. To the linker it’s just a byte buffer. In the final
program the long long
array will not necessarily be aligned properly
for its type, meaning the above program might crash. The Right Way
is to never dereference the data directly but rather memcpy()
it
into a properly-aligned variable, just as if the data was an
unaligned buffer (see the sketch after this list).
The data structure cannot use any pointers. Pointer values are meaningless to other processes and will be no different than garbage.
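Here is a sketch of that memcpy() workaround applied to the Fibonacci table (my own illustration; it assumes <string.h> and <stddef.h> are included):
extern char _binary_table_bin_start[];

long long
table_get(size_t i)
{
    long long x;
    memcpy(&x, _binary_table_bin_start + i * sizeof(x), sizeof(x));
    return x;
}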
There’s an easy way to address all three of these problems and eliminate the reliance on GNU linkers: serialize the data into C code. It’s metaprogramming, baby.
In the Fibonacci example, change the fwrite()
in table_gen.c
to
this:
printf("int table_size = %d;\n", TABLE_SIZE);
printf("long long table[] = {\n");
for (int i = 0; i < TABLE_SIZE; i++)
printf(" %lldLL,\n", table[i]);
printf("};\n");
The output of the program becomes text:
int table_size = 90;
long long table[] = {
1LL,
1LL,
2LL,
3LL,
/* ... */
1779979416004714189LL,
2880067194370816120LL,
};
And print_fib.c
is changed to:
#include <stdio.h>
extern int table_size;
extern long long table[];
int
main(void)
{
for (int i = 0; i < table_size; i++)
printf("%lld\n", table[i]);
return 0;
}
Putting it all together:
$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen > table.c
$ gcc -std=c99 -o print_fib print_fib.c table.c
Any C compiler and linker could do all of this, no problem, making it
more portable. The intermediate metaprogram isn’t a barrier to cross
compilation. It would be compiled for the host (typically identified
through HOST_CC
) and the rest is compiled for the target (e.g.
CC
).
The output of table_gen.c
isn’t dependent on any architecture,
making it cross-compiler friendly. There are also no alignment
problems because it’s all visible to the compiler. The type system isn’t
being undermined.
The Fibonacci example doesn’t address the pointer problem — it has no pointers to speak of. So let’s step it up to a trie using the trie from the previous article. As a reminder, here it is:
#define TRIE_ALPHABET_SIZE 4
#define TRIE_TERMINAL_FLAG (1U << 0)
struct trie {
struct trie *next[TRIE_ALPHABET_SIZE];
struct trie *p;
int i;
unsigned flags;
};
Dumping these structures out raw would definitely be useless since they’re almost entirely pointer data. So instead, fill out an array of these structures, referencing the array itself to build up the pointers (later filled in by either the linker or the loader). This code uses the in-place breadth-first traversal technique from the previous article.
void
trie_serialize(struct trie *t, const char *name)
{
printf("struct trie %s[] = {\n", name);
struct trie *head = t;
struct trie *tail = t;
t->p = NULL;
size_t count = 0;
while (head) {
printf(" {{");
for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
struct trie *next = head->next[i];
const char *comma = i ? ", " : "";
if (next) {
/* Add child to the queue. */
tail->p = next;
tail = next;
next->p = NULL;
/* Print the pointer to the child. */
printf("%s%s + %zu", comma, name, ++count);
} else {
printf("%s0", comma);
}
}
printf("}, 0, 0, %u},\n", head->flags & TRIE_TERMINAL_FLAG);
head = head->p;
}
printf("};\n");
}
Remember that list of strings from before?
AAAAA
ABCD
CAA
CAD
CDBD
Which looks like this?
That serializes to this C code:
struct trie root[] = {
{{root + 1, 0, root + 2, 0}, 0, 0, 0},
{{root + 3, root + 4, 0, 0}, 0, 0, 0},
{{root + 5, 0, 0, root + 6}, 0, 0, 0},
{{root + 7, 0, 0, 0}, 0, 0, 0},
{{0, 0, root + 8, 0}, 0, 0, 0},
{{root + 9, 0, 0, root + 10}, 0, 0, 0},
{{0, root + 11, 0, 0}, 0, 0, 0},
{{root + 12, 0, 0, 0}, 0, 0, 0},
{{0, 0, 0, root + 13}, 0, 0, 0},
{{0, 0, 0, 0}, 0, 0, 1},
{{0, 0, 0, 0}, 0, 0, 1},
{{0, 0, 0, root + 14}, 0, 0, 0},
{{root + 15, 0, 0, 0}, 0, 0, 0},
{{0, 0, 0, 0}, 0, 0, 1},
{{0, 0, 0, 0}, 0, 0, 1},
{{0, 0, 0, 0}, 0, 0, 1},
};
This trie can be immediately used at program startup without initialization, and it can even have new nodes inserted into it. It’s not without its downsides, particularly because it’s a trie:
It’s really going to blow up the size of the binary, especially when it holds lots of strings. These nodes are anything but compact.
If the code is compiled to be position-independent (-fPIC
), each
of those nodes is going to hold multiple dynamic relocations,
further exploding the size of the binary and preventing the trie
from being shared between processes. It’s 24 bytes per
relocation on x86-64. This will also slow down program start up
time. With just a few thousand strings, the simple test program was
taking 5x longer to start (25ms instead of 5ms) than with an empty
trie.
Even without being position-independent, the linker will have to resolve all the compile-time relocations. I was able to overwhelm the linker and run it out of memory with just some tens of thousands of strings. This would make for a decent linker stress test.
This technique obviously doesn’t scale well with trie data. You’re better off baking in the flat string list and building the trie at run time — though you could compute the exact number of needed nodes at compile time and statically allocate them (in .bss). I’ve personally had much better luck with other sorts of lookup tables. It’s a useful tool for the C programmer’s toolbelt.
This wasn’t my first time writing a trie. The curse of programming in C is rewriting the same data structures and algorithms over and over. It’s the problem C++ templates are intended to solve. This rewriting isn’t always bad since each implementation is typically customized for its specific use, often resulting in greater performance and a smaller resource footprint.
Every time I’ve rewritten a trie, my implementation is a little bit better than the last. This time around I discovered an approach for traversing, both depth-first and breadth-first, an arbitrarily-sized trie without memory allocation. I’m definitely not the first to discover something like this. There’s Deutsch-Schorr-Waite pointer reversal for binary graphs (1965) — which I originally learned from reading the Scheme 9 from Outer Space garbage collector source — and Morris in-order traversal (1979) for binary trees. The former requires two extra tag bits per node and the latter requires no modifications at all.
But before I go further, some background. A trie can come in many shapes and sizes, but in the simple case each node of a trie has as many pointers as its alphabet. For illustration purposes, imagine a trie for strings of only four characters: A, B, C, and D. Each node is essentially four pointers.
#define TRIE_ALPHABET_SIZE 4
#define TRIE_STATIC_INIT {.flags = 0}
#define TRIE_TERMINAL_FLAG (1U << 0)
struct trie {
struct trie *next[TRIE_ALPHABET_SIZE];
unsigned flags;
};
It includes a flags
field, where a single bit tracks whether or not
a node is terminal — that is, a key terminates at this node. Terminal
nodes are not necessarily leaf nodes, which is the case when one key
is a prefix of another key. I could instead have used a 1-bit
bit-field (e.g. int is_terminal : 1;
) but I don’t like bit-fields.
A trie with the following keys, inserted in any order:
AAAAA
ABCD
CAA
CAD
CDBD
Looks like this (terminal nodes illustrated as small black squares):
The root of the trie is the empty string, and each child represents a trie prefixed with one of the symbols from the alphabet. This is a nice recursive definition, and it’s tempting to write recursive functions to process it. For example, here’s a recursive insertion function.
int
trie_insert_recursive(struct trie *t, const char *s)
{
if (!*s) {
t->flags |= TRIE_TERMINAL_FLAG;
return 1;
}
int i = *s - 'A';
if (!t->next[i]) {
t->next[i] = malloc(sizeof(*t->next[i]));
if (!t->next[i])
return 0;
*t->next[i] = (struct trie)TRIE_STATIC_INIT;
}
return trie_insert_recursive(t->next[i], s + 1);
}
If the string is empty (!*s
), mark the current node as terminal.
Otherwise recursively insert the substring under the appropriate
child. That’s a tail call, and any optimizing compiler would optimize
this call into a jump back to the beginning of the function
(tail-call optimization), reusing the stack frame as if it were a
simple loop.
If that’s not good enough, such as when optimization is disabled for debugging and the recursive definition is blowing the stack, this is trivial to convert to a safe, iterative function. I prefer this version anyway.
int
trie_insert(struct trie *t, const char *s)
{
for (; *s; s++) {
int i = *s - 'A';
if (!t->next[i]) {
t->next[i] = malloc(sizeof(*t->next[i]));
if (!t->next[i])
return 0;
*t->next[i] = (struct trie)TRIE_STATIC_INIT;
}
t = t->next[i];
}
t->flags |= TRIE_TERMINAL_FLAG;
return 1;
}
Finding a particular prefix in the trie iteratively is also easy. This would be used to narrow the trie to a chosen prefix before iterating over the keys (e.g. find all strings matching a prefix).
struct trie *
trie_find(struct trie *t, const char *s)
{
for (; *s; s++) {
int i = *s - 'A';
if (!t->next[i])
return NULL;
t = t->next[i];
}
return t;
}
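For instance, a whole-key membership test falls straight out of trie_find() and the terminal flag (a small helper of my own, not part of the original set):
int
trie_contains(struct trie *t, const char *s)
{
    struct trie *node = trie_find(t, s);
    return node && (node->flags & TRIE_TERMINAL_FLAG);
}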
Depth-first traversal is stack-oriented. The stack represents the path through the graph, and each new vertex is pushed into this stack as it’s visited. A recursive traversal function can implicitly use the call stack for storing this information, so no additional data structure is needed.
The downside is that the call is no longer tail-recursive, so a large trie will blow the stack. Also, the caller needs to provide a callback function because the stack cannot unwind to return a value: The stack has important state on it. Here’s a typedef for the callback.
typedef void (*trie_visitor)(const char *key, void *arg);
And here’s the recursive depth-first traversal function. The top-level
caller passes the same buffer for buf
and bufend
, which must be at
least as large as the largest key. The visited key will be written to
this buffer and passed to the visitor.
void
trie_dfs_recursive(struct trie *t,
char *buf,
char *bufend,
trie_visitor v,
void *arg)
{
if (t->flags & TRIE_TERMINAL_FLAG) {
*bufend = 0;
v(buf, arg);
}
for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
if (t->next[i]) {
*bufend = 'A' + i;
trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
}
}
}
Moving the traversal stack to the heap would eliminate the stack overflow problem and it would allow control to return to the caller. This is going to be a lot of code for an article, but bear with me.
First define an iterator object. The stack will need two pieces of
information: which node did we come from (p
) and through which
pointer (i
). When a node has been exhausted, this will allow return
to the parent. The root
field tracks when traversal is complete.
struct trie_iter {
struct trie *root;
char *buf;
char *bufend;
struct {
struct trie *p;
int i;
} *stack;
};
A special value of -1 in i
means it’s the first visit for this node
and it should be visited by the callback if it’s terminal.
The iterator is initialized with trie_iter_init
. The max
indicates
the maximum length of any key. A more elaborate implementation could
automatically grow the stack to accommodate (e.g. realloc()), but I’m
keeping it as simple as possible.
int
trie_iter_init(struct trie_iter *it, struct trie *t, size_t max)
{
it->root = t;
it->stack = malloc(sizeof(*it->stack) * max);
if (!it->stack)
return 0;
it->buf = it->bufend = malloc(max);
if (!it->buf) {
free(it->stack);
return 0;
}
it->stack->p = t;
it->stack->i = -1;
return 1;
}
void
trie_iter_destroy(struct trie_iter *it)
{
free(it->stack);
it->stack = NULL;
free(it->buf);
it->buf = NULL;
}
And finally the complicated part. This uses the allocated stack to explore the trie in a loop until it hits a terminal, at which point it returns. A further call continues the traversal from where it left off. It’s like a hand-coded generator. With the way it’s written, the caller is obligated to follow through with the entire iteration before destroying the iterator, but this would be easy to correct.
int
trie_iter_next(struct trie_iter *it)
{
for (;;) {
struct trie *current = it->stack->p;
int i = it->stack->i++;
if (i == -1) {
/* Return result if terminal node. */
if (current->flags & TRIE_TERMINAL_FLAG) {
*it->bufend = 0;
return 1;
}
continue;
}
if (i == TRIE_ALPHABET_SIZE) {
/* End of current node. */
if (current == it->root)
return 0; // back at root, done
it->stack--;
it->bufend--;
continue;
}
if (current->next[i]) {
/* Push on next child node. */
*it->bufend = 'A' + i;
it->stack++;
it->bufend++;
it->stack->p = current->next[i];
it->stack->i = -1;
}
}
}
This is much nicer for the caller since there’s no inversion of control.
struct trie_iter it;
trie_iter_init(&it, &trie_root, KEY_MAX);
while (trie_iter_next(&it)) {
// ... do something with it.buf ...
}
trie_iter_destroy(&it);
There are a few downsides to this:
Initialization could fail (not checked in the example) since it allocates memory.
Either the caller has to keep track of the maximum key length, or the iterator grows the stack automatically, which would mean iteration could fail at any point in the middle.
In order to destroy the trie, it needs to be traversed: Freeing memory first requires allocating memory. If the program is out of memory, it cannot destroy the trie to clean up before handling the situation, nor to make more memory available. It’s not good for resilience.
Wouldn’t it be nice to traverse the trie without memory allocation?
Rather than allocate a separate stack, the stack can be allocated
across the individual nodes of the trie. Remember those p
and i
fields from before? Put them on the trie.
struct trie_v2 {
struct trie_v2 *next[TRIE_ALPHABET_SIZE];
struct trie_v2 *p;
int i;
unsigned flags;
};
This automatically scales with the size of the trie, so there will always be enough of this stack. With the stack “pre-allocated” like this, traversal requires no additional memory allocation.
The iterator itself becomes a little simpler. It cannot fail and it doesn’t need a destructor.
struct trie_v2_iter {
struct trie_v2 *current;
char *buf;
};
void
trie_v2_iter_init(struct trie_v2_iter *it, struct trie_v2 *t, char *buf)
{
t->p = NULL;
t->i = -1;
it->current = t;
it->buf = buf;
}
The iteration function itself is almost identical to before. Rather
than increment a stack pointer, it uses p
to chain the nodes as a
linked list.
int
trie_v2_iter_next(struct trie_v2_iter *it)
{
for (;;) {
struct trie_v2 *current = it->current;
int i = it->current->i++;
if (i == -1) {
/* Return result if terminal node. */
if (current->flags & TRIE_TERMINAL_FLAG) {
*it->buf = 0;
return 1;
}
continue;
}
if (i == TRIE_ALPHABET_SIZE) {
/* End of current node. */
if (!current->p)
return 0;
it->current = current->p;
it->buf--;
continue;
}
if (current->next[i]) {
/* Push on next child node. */
*it->buf = 'A' + i;
it->buf++;
it->current = current->next[i];
it->current->p = current;
it->current->i = -1;
}
}
}
During traversal the iteration pointers look something like this:
This is not without its downsides:
Traversal is not re-entrant nor thread-safe. It’s not possible to run multiple in-place iterators side by side on the same trie since they’ll clobber each other.
It uses more memory — O(n) rather than O(max-key-length) — and sits on this extra memory for its entire lifetime.
The same technique can be used for breadth-first search, which is
queue-oriented rather than stack-oriented. The p
pointers are
instead chained into a queue, with a head
and tail
pointer
variable for each end. As each node is visited, its children are
pushed into the queue linked list.
This isn’t good for visiting keys by name. buf
was itself a stack
and played nicely with depth-first traversal, but there’s no easy way
to build up a key in a buffer breadth-first. So instead here’s a
function to destroy a trie breadth-first.
void
trie_v2_destroy(struct trie_v2 *t)
{
struct trie_v2 *head = t;
struct trie_v2 *tail = t;
while (head) {
for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
struct trie_v2 *next = head->next[i];
if (next) {
next->p = NULL;
tail->p = next;
tail = next;
}
}
struct trie_v2 *dead = head;
head = head->p;
free(dead);
}
}
During its traversal the p
pointers link up like so:
In my real code there’s also a flag to indicate the node’s allocation type: static or heap. This allows a trie to be composed of nodes from both kinds of allocations while still safe to destroy. It might also be useful to pack a reference counter into this space so that a node could be shared by more than one trie.
For a production implementation it may be worth packing i
into the
flags
field since it only needs a few bits, even with larger
alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the p
field
could be eliminated and instead one of the child pointers is
temporarily reversed. With these changes, this technique would fit
into the original struct trie
without changes, eliminating the extra
memory usage.
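As a rough sketch of that packing, assuming a 26-letter alphabet, the index could live in the low bits of flags behind a couple of hypothetical helpers (the mask value and names are my own):
#define TRIE_INDEX_MASK 0x3f  /* low 6 bits hold i + 1; other flags move above them */

static int
trie_get_i(struct trie_v2 *t)
{
    return (int)(t->flags & TRIE_INDEX_MASK) - 1;
}

static void
trie_set_i(struct trie_v2 *t, int i)
{
    t->flags = (t->flags & ~(unsigned)TRIE_INDEX_MASK) | (unsigned)(i + 1);
}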
Update: Over on Hacker News, psi-squared has interesting suggestions such as leaving the traversal pointers intact, particularly in the case of a breadth-first search, which, until the next trie modification, allows for concurrent follow-up traversals.
As a demonstration, in this article I’ll build an Emacs joystick interface (Linux only) using a dynamic module. It will allow Emacs to read events from any joystick on the system. All the source code is here:

It includes a calibration interface (M-x joydemo
) within Emacs:
Currently, Emacs’ emacs-module.h header is the entirety of the module
documentation. It’s a bit thin and leaves ambiguities that require
some reading of the Emacs source code. Even reading the source, it’s
not clear which behaviors are a reliable part of the interface. For
example, if there’s a pending non-local exit, it’s safe for a function
to return NULL
since the return value is never inspected (Emacs
25.1), but will this always be the case? While mistakes are
unforgiving (a hard crash), the API is mostly intuitive and it’s been
pretty easy to feel my way around it.
Update: Philipp Stephani has written thorough, reliable module documentation.
All Emacs values — integers, floats, cons cells, vectors, strings,
etc. — are represented as the polymorphic, pointer-valued type,
emacs_value
. Despite being a pointer, NULL
is not a valid value,
as convenient as that would be. The API includes functions for
creating and extracting the fundamental types: integers, floats,
strings. Almost all other object types can only be accessed by making
Lisp function calls to regular Emacs functions from the module.
Modules also introduce a brand new Emacs object type: a user pointer. These are non-readable, opaque pointer values returned by modules, typically representing a handle to some resource, be it a memory block, database connection, or a joystick. These objects include a finalizer function pointer — which, surprisingly, is not permitted to be NULL — and their lifetime is managed by Emacs’ garbage collector.
User pointers are a somewhat dangerous feature since there’s little to stop Emacs Lisp code from misusing them. A Lisp program can take a user pointer from one module and pass it to a function in a different module. Since it’s just a pointer, there’s no way to type check it. At best, a module could maintain a table of all its live pointers, checking all user pointer arguments against the table before dereferencing. But I don’t expect this to be normal practice.
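A minimal sketch of that defensive table, with hypothetical names and a fixed capacity, might look like this:
#define MAX_LIVE 64
static void *live_ptrs[MAX_LIVE];  /* recorded on creation, cleared in the finalizer */

static int
is_live(void *p)
{
    for (int i = 0; i < MAX_LIVE; i++)
        if (live_ptrs[i] == p)
            return 1;
    return 0;
}
Each function accepting a user pointer would run the extracted pointer through is_live() before trusting it.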
After loading the module through the platform’s mechanism, the first
thing Emacs does is check for the symbol plugin_is_GPL_compatible
.
While tacky, this is not surprising given the culture around Emacs.
Next it calls emacs_module_init()
, passing it the first function
pointer. From this, the module can get a Lisp environment and start
doing Emacs things, such as binding module functions to Lisp symbols.
Here’s a complete “Hello, world!” example:
#include "emacs-module.h"
int plugin_is_GPL_compatible;
int
emacs_module_init(struct emacs_runtime *ert)
{
emacs_env *env = ert->get_environment(ert);
emacs_value message = env->intern(env, "message");
const char hi[] = "Hello, world!";
emacs_value string = env->make_string(env, hi, sizeof(hi) - 1);
env->funcall(env, message, 1, &string);
return 0;
}
In a real module, it’s common to create function objects for native
functions, then fetch the fset
symbol and make a Lisp call on it to
bind the newly-created function object to a name. You’ll see this in
action later.
The joystick API will closely resemble Linux’s own joystick API,
making for a fairly thin wrapper. It’s so thin that Emacs almost
doesn’t even need a dynamic module. This is because, on Linux,
joysticks are just files under /dev/input/
. Want to see the input
events on the first joystick? Just read /dev/input/js0
. So Plan 9.
Emacs already knows how to read files, but these virtual files are a
little too special for that. The header linux/joystick.h
defines a
struct js_event
:
struct js_event {
uint32_t time; /* event timestamp in milliseconds */
int16_t value;
uint8_t type;
uint8_t number; /* axis/button number */
};
The idea is to read from the joystick device into this structure. The first several reads are initialization that define the axes and buttons of the joystick and their initial state. Further events are queued up for the file descriptor. This all means that the file can’t just be opened each time joystick input is needed. It has to be held open for the duration, and is typically configured non-blocking.
The Emacs package will be called joymacs
and there will be three
functions:
(joymacs-open N)
(joymacs-close JOYSTICK)
(joymacs-read JOYSTICK EVENT-VECTOR)
The joymacs-open
function will take an integer, opening the Nth
joystick (/dev/input/jsN
). It will create a file descriptor for the
joystick device, returning it as a user pointer. Think of it as a sort
of “joystick handle.” Now, it could instead return the file
descriptor as an integer, but the user pointer has two significant
benefits:
The resource will be garbage collected. If the caller loses track of a file descriptor returned as an integer, the joystick device will be held open until Emacs shuts down, using up one of Emacs’ file descriptors. By putting it in a user pointer, the garbage collector will have the module release the file descriptor if the user loses track of it.
It should be difficult for the user to make a dangerous call.
Emacs Lisp can’t create user pointers — they only come from modules
— and so the module is less likely to get passed the wrong thing.
In the case of joymacs-close
, the module will be calling
close(2)
on the argument. We definitely don’t want to make that
system call on file descriptors owned by Emacs. Further, since user
pointers are mutable, the module can ensure it doesn’t call
close(2)
twice.
Here’s the implementation for joymacs-open
. I’ll go over each part
in detail.
static emacs_value
joymacs_open(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
(void)ptr;
(void)n;
int id = env->extract_integer(env, args[0]);
if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
return nil;
char buf[64];
int buflen = sprintf(buf, "/dev/input/js%d", id);
int fd = open(buf, O_RDONLY | O_NONBLOCK);
if (fd == -1) {
emacs_value signal = env->intern(env, "file-error");
emacs_value message = env->make_string(env, buf, buflen);
env->non_local_exit_signal(env, signal, message);
return nil;
}
return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);
}
The C function name doesn’t matter to Emacs. It’s static because the
function doesn’t even need to be visible outside the module: Emacs
gets the function pointer later as part of initialization.
This is the prototype for all functions callable by Emacs Lisp, regardless of its arity. It has four arguments:
It gets an environment, env
, through which to call back into
Emacs.
It gets n
, the number of arguments. This is guaranteed to be the
correct number of arguments, as specified later when creating the
function object, so only variadic functions need to inspect this
argument.
The Lisp arguments are passed as an array of values, args
.
There’s no type declaration when declaring a function object, so
these may be of the wrong type. I’ll go over how to deal with this.
Finally, it gets an arbitrary pointer, supplied at function object creation time. This allows the module to create closures, but will usually be ignored.
The first thing the function does is extract its integer argument.
This is actually an intmax_t
, but I don’t think anyone has that many
USB ports. An int
will suffice.
int id = env->extract_integer(env, args[0]);
if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
return nil;
As for not underestimating fools, what if the user passed a value that
isn’t an integer? Will the world come crashing down? Fortunately Emacs
checks that in extract_integer
and, if there’s a mismatch, sets a
pending error signal in the environment. This is really great because
checking types directly in the module is a real pain in the ass. So,
before committing to anything further, such as opening a file, I check
for this signal and bail out early if necessary. In Emacs 25.1 it’s
safe to return NULL since the return value will be completely ignored,
but I’d rather hedge my bets.
By the way, the nil
here is a global variable set in initialization.
You don’t just get that for free!
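One way to set that up, sketched here, is to grab a permanent reference during initialization. The make_global_ref() function exists for values that must outlive a single module call, which sidesteps the object-reuse problem mentioned in the 2021 update below; the simpler alternative is re-interning nil at the top of every function.
static emacs_value nil;

/* Inside emacs_module_init(): */
nil = env->make_global_ref(env, env->intern(env, "nil"));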
The next step is opening the joystick device, read-only and non-blocking. The non-blocking is vital because the module would otherwise hang Emacs later if there are no events (well, except for the read being quickly interrupted by a POSIX signal).
char buf[64];
int buflen = sprintf(buf, "/dev/input/js%d", id);
int fd = open(buf, O_RDONLY | O_NONBLOCK);
If the joystick fails to open (e.g. it doesn’t exist, or the user
lacks permission), manually set an error signal for a non-local exit.
I chose the file-error
signal and I’m just using the filename as the
signal data.
if (fd == -1) {
emacs_value signal = env->intern(env, "file-error");
emacs_value message = env->make_string(env, buf, buflen);
env->non_local_exit_signal(env, signal, message);
return nil;
}
Otherwise create the user pointer. No need to allocate any memory; just stuff the file descriptor in the pointer itself. If the user mistakenly passes it to another module, that module will surely be in for a surprise when it tries to dereference it.
return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);
The fin_close()
function is defined as:
static void
fin_close(void *fdptr)
{
int fd = (intptr_t)fdptr;
if (fd != -1)
close(fd);
}
The garbage collector will call this function when the user pointer is
lost. If the user closes it early with joymacs-close
, that function
will set the user pointer to -1, an invalid file descriptor, so that
it doesn’t get closed a second time here.
Here’s joymacs-close
, which is a bit simpler.
static emacs_value
joymacs_close(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
(void)ptr;
(void)n;
int fd = (intptr_t)env->get_user_ptr(env, args[0]);
if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
return nil;
if (fd != -1) {
close(fd);
env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
}
return nil;
}
Again, it starts by extracting its argument, relying on Emacs to do the check:
int fd = (intptr_t)env->get_user_ptr(env, args[0]);
if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
return nil;
If the user pointer hasn’t been closed yet, then close it and strip out the file descriptor to prevent further closes.
if (fd != -1) {
close(fd);
env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
}
The joymacs-read
function is doing something a little unusual for an
Emacs Lisp function. It takes two arguments: the joystick handle and a
5-element vector. Instead of returning the event in some
representation, it fills the vector with the event details. There are
two reasons for this:
The API has no function for creating vectors … though the module
could get the make-vector symbol and call it to create a
vector (see the sketch after this list).
The idiom for event pumps is for the caller to supply a buffer to the pump. This has better performance by avoiding lots of unnecessary allocations, especially since events tend to be message-like objects with a short, well-defined extent.
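For completeness, here’s a sketch of that workaround, calling the Lisp make-vector function from the module (joymacs doesn’t actually do this):
emacs_value make_vector = env->intern(env, "make-vector");
emacs_value vargs[] = {
    env->make_integer(env, 5),   /* length */
    env->intern(env, "nil"),     /* initial element */
};
emacs_value vec = env->funcall(env, make_vector, 2, vargs);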
Here’s the full definition:
static emacs_value
joymacs_read(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
(void)n;
(void)ptr;
int fd = (intptr_t)env->get_user_ptr(env, args[0]);
if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
return nil;
struct js_event e;
int r = read(fd, &e, sizeof(e));
if (r == -1 && errno == EAGAIN) {
/* No more events. */
return nil;
} else if (r == -1) {
/* An actual read error (joystick unplugged, etc.). */
emacs_value signal = env->intern(env, "file-error");
const char *error = strerror(errno);
size_t len = strlen(error);
emacs_value message = env->make_string(env, error, len);
env->non_local_exit_signal(env, signal, message);
return nil;
} else {
/* Fill out event vector. */
emacs_value v = args[1];
emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
emacs_value value;
if (type == button)
value = e.value ? t : nil;
else
value = env->make_float(env, e.value / (double)INT16_MAX);
env->vec_set(env, v, 0, env->make_integer(env, e.time));
env->vec_set(env, v, 1, type);
env->vec_set(env, v, 2, value);
env->vec_set(env, v, 3, env->make_integer(env, e.number));
env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
return args[1];
}
}
As before, extract the first argument and check for a signal. Then
call read(2)
to get an event. If the read fails with EAGAIN
, it’s
not a real failure. There are just no more events, so return nil.
struct js_event e;
int r = read(fd, &e, sizeof(e));
if (r == -1 && errno == EAGAIN) {
/* No more events. */
return nil;
}
If the read failed with something else — perhaps the joystick was
unplugged — signal an error. The strerror(3)
string is used for the
signal data.
if (r == -1) {
/* An actual read error (joystick unplugged, etc.). */
emacs_value signal = env->intern(env, "file-error");
const char *error = strerror(errno);
emacs_value message = env->make_string(env, error, strlen(error));
env->non_local_exit_signal(env, signal, message);
return nil;
}
Otherwise fill out the event vector. If the second argument isn’t a
vector, or if it’s too short, the signal will automatically get raised
by Emacs. The module can keep plowing through the vec_set()
calls
safely since it’s not committing to anything.
/* Fill out event vector. */
emacs_value v = args[1];
emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
emacs_value value;
if (type == button)
value = e.value ? t : nil;
else
value = env->make_float(env, e.value / (double)INT16_MAX);
env->vec_set(env, v, 0, env->make_integer(env, e.time));
env->vec_set(env, v, 1, type);
env->vec_set(env, v, 2, value);
env->vec_set(env, v, 3, env->make_integer(env, e.number));
env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
return args[1];
The Linux event struct has four fields and the function fills out five
values of the vector. This is because the type
field has a bit flag
indicating initialization events. This is split out into an extra
t/nil value. It also normalizes axis values and converts button values
into t/nil, which makes more sense for Emacs Lisp. The event itself is
returned since it’s a truthy value and it’s convenient for the caller.
The astute programmer might notice that the negative side of the axis
could go just below -1.0, since INT16_MIN
has one extra value over
INT16_MAX
(two’s complement). It doesn’t seem to be documented, but
the joystick drivers I’ve seen never exactly return INT16_MIN
, so
this is in fact the correct way to normalize it.
Update 2021: In a previous version of this article, I talked about interning symbols during initialization so that they do not need to be re-interned each time the module is called. This no longer works, and it was probably never intended to work in the first place. The lesson is simple: Do not reuse Emacs objects between module calls.
First grab the fset
symbol since this function will be needed to bind
names to the module’s functions.
emacs_value fset = env->intern(env, "fset");
Using fset
, bind the functions. The second and third arguments to
make_function
are the minimum and maximum number of arguments, which
may look familiar. The last argument is that closure pointer
I mentioned at the beginning.
emacs_value args[2];
args[0] = env->intern(env, "joymacs-open");
args[1] = env->make_function(env, 1, 1, joymacs_open, doc, 0);
env->funcall(env, fset, 2, args);
If the module is to be loaded with require
like any other package,
it needs to provide: (provide 'joymacs)
.
emacs_value provide = env->intern(env, "provide");
emacs_value joymacs = env->intern(env, "joymacs");
env->funcall(env, provide, 1, &joymacs);
And that’s it!
The source repository now includes a port to Windows (XInput). If
you’re on Linux or Windows, have Emacs 25 with modules enabled, and a
joystick is plugged in, then make run
in the repository should bring
up Emacs running a joystick calibration demonstration. The module
can’t poke at Emacs when events are ready, so instead there’s a timer
that polls the module for events.
I’d like to someday see an Emacs Lisp game well-suited for a joystick.
One is an array of char *.
char *colors_ptr[] = {
"red",
"orange",
"yellow",
"green",
"blue",
"violet"
};
The other is a two-dimensional char
array.
char colors_2d[][7] = {
"red",
"orange",
"yellow",
"green",
"blue",
"violet"
};
The initializers are identical, and the syntax by which these tables are used is the same, but the underlying data structures are very different. For example, suppose I had a lookup() function that searches the table for a particular color.
int
lookup(const char *color)
{
int ncolors = sizeof(colors) / sizeof(colors[0]);
for (int i = 0; i < ncolors; i++)
if (strcmp(colors[i], color) == 0)
return i;
return -1;
}
Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is char
colors[][7]
or char *colors[]
. It’s a little bit misleading because
the compiler generates different code depending on the type.
Here’s what colors_ptr
, a jagged array, typically looks like in
memory.
The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression colors_ptr[n]
is char *
.
On x86-64, suppose the base of the table is in rax
, the index of the
string I want to retrieve is rcx
, and I want to put the string’s
address back into rax
. It’s one load instruction.
mov rax, [rax + rcx*8]
Contrast this with colors_2d
: six 7-byte elements in a row. No
pointers or addresses. Only strings.
The strings are in their defined order, packed together. The type of
the expression colors_2d[n]
is char [7]
, an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.
In the same scenario before with x86-64, it takes two instructions to
put the string’s address in rax
, but neither is a load.
imul rcx, rcx, 7
add rax, rcx
In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. char colors_2d[][8]
). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.
lea rax, [rax + rcx*8]
There’s another factor to consider: relocation. Nearly every process running on a modern system takes advantage of a security feature called Address Space Layout Randomization (ASLR). The virtual address of code and data is randomized at process load time. For shared libraries, it’s not just a security feature, it’s essential to their basic operation. Libraries cannot possibly coordinate their preferred load addresses with every other library on the system, and so must be relocatable.
If the program is compiled with GCC or Clang configured for position
independent code — -fPIC
(for libraries) or -fpie
+ -pie
(for
programs) — extra work has to be done to support colors_ptr
. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.
However, colors_2d
doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.
On x86-64, in both cases the table itself typically doesn’t need a relocation entry because it will be RIP-relative (in the small code model). That is, code that uses the table will be at a fixed offset from the table no matter where the program is loaded. It won’t need to be looked up using the Global Offset Table (GOT).
In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in rax
looks
like so:
;; NASM:
lea rax, [rel address]
;; Some others:
lea rax, [rip + address]
Or in AT&T syntax:
lea address(%rip), %rax
Besides (trivially) more work for the loader, there’s another consequence to relocations: Pages containing relocations are not shared between processes (except after fork()). When loading a program, the loader doesn’t copy programs and libraries to memory so much as it memory maps their binaries with copy-on-write semantics. If another process is running with the same binaries loaded (e.g. libc.so), they’ll share the same physical memory so long as those pages haven’t been modified by either process. Modifying the page creates a unique copy for that process.
Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means colors_2d
has the possibility of being shared
between processes, but colors_ptr
(and its entire page) definitely
does not. Shucks.
This is one of the reasons why the Procedure Linkage Table (PLT) exists. The PLT is an array of function stubs for shared library functions, such as those in the C standard library. Sure, the loader could go through the program and fill out the address of every library function call, but this would modify lots and lots of code pages, creating a unique copy of large parts of the program. Instead, the dynamic linker lazily supplies jump addresses for PLT function stubs, one per accessed library function.
However, as I’ve written it above, it’s unlikely that even colors_2d
will be shared. It’s still missing an important ingredient: const.
They say const isn’t for optimization but, darnit, this
situation keeps coming up. Since colors_ptr
and colors_2d
are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
colors_2d
to being a local copy.
Fortunately it’s trivial to fix by adding a const:
const char colors_2d[][7] = { /* ... */ };
Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (.rodata
) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.
Note, the equivalent for colors_ptr
requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)
const char *const colors_ptr[] = { /* ... */ };
String literals are already effectively const, though the C specification (unlike C++) doesn’t actually define them to be this way. But, like setting your relationship status on Facebook, declaring it makes it official.
These little details are all deep down the path of micro-optimization and will rarely ever matter in practice, but perhaps you learned something broader from all this. This stuff fascinates me.
A widespread extension to C is the alloca() pseudo-function. It’s like malloc(), but allocates memory on the stack, just like an automatic variable. The allocation is automatically freed when the function (not its scope!) exits, even with a longjmp() or other non-local exit.
void *alloca(size_t size);
Besides its portability issues, the most dangerous property is the
complete lack of error detection. If size
is too large, the
program simply crashes, or worse.
For example, suppose I have an intern() function that finds or creates the canonical representation/storage for a particular string. My program needs to intern a string composed of multiple values, and will construct a temporary string to do so.
const char *intern(const char *);
const char *
intern_identifier(const char *prefix, long id)
{
size_t size = strlen(prefix) + 32;
char *buffer = alloca(size);
sprintf(buffer, "%s%ld", prefix, id);
return intern(buffer);
}
I expect the vast majority of these prefix
strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
prefix
, perhaps by a malicious actor, the program will misbehave
without warning.
A portable alternative to alloca() is variable-length arrays (VLA), introduced in C99. Arrays with automatic storage duration need not have a fixed, compile-time size. It’s just like alloca(), having exactly the same dangers, but at least it’s properly scoped. It was rejected for inclusion in C++11 due to this danger.
const char *
intern_identifier(const char *prefix, long id)
{
char buffer[strlen(prefix) + 32];
sprintf(buffer, "%s%ld", prefix, id);
return intern(buffer);
}
There’s a middle-ground to this, using neither VLAs nor alloca(). Suppose the function always allocates a small, fixed size buffer — essentially a free operation — but only uses this buffer if it’s large enough for the job. If it’s not, a normal heap allocation is made with malloc().
const char *
intern_identifier(const char *prefix, long id)
{
char temp[256];
char *buffer = temp;
size_t size = strlen(prefix) + 32;
if (size > sizeof(temp))
if (!(buffer = malloc(size)))
return NULL;
sprintf(buffer, "%s%ld", prefix, id);
const char *result = intern(buffer);
if (buffer != temp)
free(buffer);
return result;
}
Since the function can now detect allocation errors, this version has an error condition. Though, intern() itself would presumably return NULL for its own allocation errors, so this is probably transparent to the caller.
We’ve now entered the realm of small-size optimization. The vast majority of cases are small and will therefore be very fast, but we haven’t given up on the odd large case either. In fact, it’s been made a little bit worse (via the unnecessary small allocation), selling it out to make the common case fast. That’s sound engineering.
Visual Studio has a pair of functions that nearly automate this solution: _malloca() and _freea(). It’s like alloca(), but allocations beyond a certain threshold go on the heap. This allocation is freed with _freea(), which does nothing in the case of a stack allocation.
void *_malloca(size_t);
void _freea(void *);
I said “nearly” because Microsoft screwed it up: instead of returning NULL on failure, it generates a stack overflow structured exception (for a heap allocation failure).
I haven’t tried it yet, but I bet something similar to malloca() / freea() could be implemented using a couple of macros.
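Here’s a rough sketch of that idea. It is not Microsoft’s implementation: the names are made up, the caller has to pass the same size to both macros, the size expression is evaluated more than once, and alloca() remains non-standard (with glibc it’s declared in <alloca.h>).
#define MALLOCA_LIMIT 1024
#define MALLOCA(size) \
    ((size) <= MALLOCA_LIMIT ? alloca(size) : malloc(size))
#define FREEA(ptr, size) \
    do { if ((size) > MALLOCA_LIMIT) free(ptr); } while (0)
A caller would write char *p = MALLOCA(n); use the buffer, then FREEA(p, n); the stack allocation persists until the calling function returns, just as with alloca().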
CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the talks. I don’t like developing in C++, but I always learn new, interesting concepts from this conference, many of which apply directly to C. I look forward to Chandler Carruth’s talks the most, having learned so much from his past talks. I recommend these all:
After writing this article, I saw Nicholas Ormrod’s talk, The strange details of std::string at Facebook, which is also highly relevant.
Chandler’s talk this year was the one on hybrid data structures. I’d already been mulling over small-size optimization for months, and the first 5–10 minutes of his talk showed me I was on the right track. In his talk he describes LLVM’s SmallVector class (among others), which is basically a small-size-optimized version of std::vector, which, due to constraints on iterators under std::move() semantics, can’t itself be small-size optimized.
I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of long
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”
struct vec_simple {
size_t size;
size_t count;
long *values;
};
Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, hint
, to be a power of two. It returns 1 on success and 0 on
error.
int
vec_simple_init(struct vec_simple *v, size_t hint)
{
assert(hint && (hint & (hint - 1)) == 0); // power of 2
v->size = hint;
v->count = 0;
v->values = malloc(sizeof(v->values[0]) * v->size);
return !!v->values;
}
Pushing is straightforward, using realloc() when the buffer fills, returning 0 for integer overflow or allocation failure.
int
vec_simple_push(struct vec_simple *v, long x)
{
if (v->count == v->size) {
size_t value_size = sizeof(v->values[0]);
size_t new_size = v->size * 2;
if (!new_size || value_size > (size_t)-1 / new_size)
return 0; // overflow
void *new_values = realloc(v->values, new_size * value_size);
if (!new_values)
return 0; // out of memory
v->size = new_size;
v->values = new_values;
}
v->values[v->count++] = x;
return 1;
}
And finally, cleaning up. I hadn’t thought about this before, but if the compiler manages to inline vec_simple_free(), that NULL pointer assignment will probably get optimized out, possibly even in the face of a use-after-free bug.
void
vec_simple_free(struct vec_simple *v)
{
free(v->values);
v->values = 0; // trap use-after-free bugs
}
And finally an example of its use (without checking for errors).
long
example(long (*f)(void *), void *arg)
{
struct vec_simple v;
vec_simple_init(&v, 16);
long n;
while ((n = f(arg)) > 0)
vec_simple_push(&v, n);
// ... process vector ...
vec_simple_free(&v);
return result;
}
If the common case is only a handful of long
values, and this
function is called frequently, we’re doing a lot of heap allocation
that could be avoided. Wouldn’t it be nice to put all that on the
stack?
Modify the struct to add this temp
field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.
struct vec_small {
size_t size;
size_t count;
long *values;
long temp[16];
};
The values
field is initially pointed at the small buffer. Notice
that unlike the “simple” vector above, this initialization function
cannot fail. It’s one less thing for the caller to check. It also
doesn’t take a hint
since the buffer size is fixed.
void
vec_small_init(struct vec_small *v)
{
v->size = sizeof(v->temp) / sizeof(v->temp[0]);
v->count = 0;
v->values = v->temp;
}
Pushing gets a little more complicated. If it’s the first time the buffer has grown, the realloc() has to be done “manually” with malloc() and memcpy().
int
vec_small_push(struct vec_small *v, long x)
{
if (v->count == v->size) {
size_t value_size = sizeof(v->values[0]);
size_t new_size = v->size * 2;
if (!new_size || value_size > (size_t)-1 / new_size)
return 0; // overflow
void *new_values;
if (v->temp == v->values) {
/* First time heap allocation. */
new_values = malloc(new_size * value_size);
if (new_values)
memcpy(new_values, v->temp, sizeof(v->temp));
} else {
new_values = realloc(v->values, new_size * value_size);
}
if (!new_values)
return 0; // out of memory
v->size = new_size;
v->values = new_values;
}
v->values[v->count++] = x;
return 1;
}
Finally, only call free() if the buffer was actually allocated on the heap.
void
vec_small_free(struct vec_small *v)
{
if (v->values != v->temp)
free(v->values);
v->values = 0;
}
If 99% of these vectors never exceed 16 elements, then 99% of the time the heap isn’t touched. That’s much better than before. The 1% case is still covered, too, at what is probably an insignificant cost.
An important difference to SmallVector is that they parameterize the small buffer’s size through the template. In C we’re stuck with fixed sizes or macro hacks. Or are we?
This time remove the temporary buffer, making it look like the simple vector from before.
struct vec_flex {
size_t size;
size_t count;
long *values;
};
The user will provide the initial buffer, which will presumably be an adjacent, stack-allocated array, but whose size is under the user’s control.
void
vec_flex_init(struct vec_flex *v, long *init, size_t nmemb)
{
assert(nmemb > 1); // we need that low bit!
assert(nmemb && (nmemb & (nmemb - 1)) == 0); // power of 2
v->size = nmemb | 1;
v->count = 0;
v->values = init;
}
The power of two size, greater than one, means the size will always be
an even number. Why is this important? There’s one piece of
information missing from the struct: Is the buffer currently heap
allocated or not? That’s just one bit of information, but adding just
one more bit to the struct will typically pad it out another 31 or 63
more bits. What a waste! Since I’m not using the lowest bit of the
size (always being an even number), I can smuggle it in there. Hence
the nmemb | 1
, the 1 indicating that it’s not heap allocated.
When pushing, the actual_size
is extracted by clearing the bottom
bit (size & ~1
) and the indicator bit is extracted with a 1 bit mask
(size & 1
). The bit is cleared by virtue of not intentionally
setting it again.
int
vec_flex_push(struct vec_flex *v, long x)
{
size_t actual_size = v->size & ~(size_t)1; // clear bottom bit
if (v->count == actual_size) {
size_t value_size = sizeof(v->values[0]);
size_t new_size = actual_size * 2;
if (!new_size || value_size > (size_t)-1 / new_size)
return 0; /* overflow */
void *new_values;
if (v->size & 1) {
/* First time heap allocation. */
new_values = malloc(new_size * value_size);
if (new_values)
memcpy(new_values, v->values, actual_size * value_size);
} else {
new_values = realloc(v->values, new_size * value_size);
}
if (!new_values)
return 0; /* out of memory */
v->size = new_size;
v->values = new_values;
}
v->values[v->count++] = x;
return 1;
}
Only free() when it’s been allocated, like before.
void
vec_flex_free(struct vec_flex *v)
{
if (!(v->size & 1))
free(v->values);
v->values = 0;
}
And here’s what it looks like in action.
long
example(long (*f)(void *), void *arg)
{
struct vec_flex v;
long buffer[16];
vec_flex_init(&v, buffer, sizeof(buffer) / sizeof(buffer[0]));
long n;
while ((n = f(arg)) > 0)
vec_flex_push(&v, n);
// ... process vector ...
vec_flex_free(&v);
return result;
}
If you were to log all vector sizes as part of profiling, and the assumption about their typical small number of elements was correct, you could easily tune the array size in each case to remove the vast majority of vector heap allocations.
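A crude way to collect those numbers, as a sketch, is a logging line in vec_flex_free(); this needs <stdio.h> and isn’t part of the code above.
void
vec_flex_free(struct vec_flex *v)
{
    fprintf(stderr, "vec_flex: count=%zu heap=%d\n",
            v->count, !(v->size & 1));
    if (!(v->size & 1))
        free(v->values);
    v->values = 0;
}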
Now that I’ve learned this optimization trick, I’ll be looking out for good places to apply it. It’s also a good reason for me to stop abusing VLAs.
The auto keyword has been a part of C and C++ since the very
beginning, originally as one of the four storage class specifiers:
auto
, register
, static
, and extern
. An auto
variable has
“automatic storage duration,” meaning it is automatically allocated at
the beginning of its scope and deallocated at the end. It’s the
default storage class for any variable without external linkage or
without static
storage, so the vast majority of variables in a
typical C program are automatic.
In C and C++ prior to C++11, the following definitions are
equivalent because the auto
is implied.
int
square(int x)
{
int x2 = x * x;
return x2;
}
int
square(int x)
{
auto int x2 = x * x;
return x2;
}
As a holdover from really old school C, unspecified types in C are
implicitly int
, and even today you can get away with weird stuff
like this:
/* C only */
square(x)
{
auto x2 = x * x;
return x2;
}
By “get away with” I mean in terms of the compiler accepting this as valid input. Your co-workers, on the other hand, may become violent.
Like register
, as a storage class auto
is an historical artifact
without direct practical use in modern code. However, as a concept
it’s indispensable for the specification. In practice, automatic
storage means the variables lives on “the” stack (or one of the
stacks), but the specifications make no mention of a
stack. In fact, the word “stack” doesn’t appear even once. Instead
it’s all described in terms of “automatic storage,” rightfully leaving
the details to the implementations. A stack is the most sensible
approach the vast majority of the time, particularly because it’s both
thread-safe and re-entrant.
One of the major changes in C++11 was repurposing the auto
keyword,
moving it from a storage class specifier to a type specifier. In
C++11, the compiler infers the type of an auto
variable from its
initializer. In C++14, it’s also permitted for a function’s return
type, inferred from the return
statement.
This new specifier is very useful in idiomatic C++ with its ridiculously complex types. Transient variables, such as variables bound to iterators in a loop, don’t need a redundant type specification. It keeps code DRY (“Don’t Repeat Yourself”). It also makes templates easier to write, since the compiler does more of the work. The necessary type information is already semantically present, and the compiler is a lot better at dealing with it.
With this change, the following is valid in both C and C++11, and, by sheer coincidence, has the same meaning, but for entirely different reasons.
int
square(int x)
{
auto x2 = x * x;
return x2;
}
In C the type is implied as int
, and in C++11 the type is inferred
from the type of x * x
, which, in this case, is int
. The prior
example with auto int x2
, valid in C++98 and C++03, is no longer
valid in C++11 since auto
and int
are redundant type specifiers.
Occasionally I wish I had something like auto
in C. If I’m writing a
for
loop from 0 to n
, I’d like the loop variable to be the same
type as n
, even if I decide to change the type of n
in the future.
For example,
struct foo *foo = foo_create();
for (int i = 0; i < foo->n; i++)
/* ... */;
The loop variable i
should be the same type as foo->n
. If I decide
to change the type of foo->n
in the struct definition, I’d have to
find and update every loop. The idiomatic C solution is to typedef
the integer, using the new type both in the struct and in loops, but I
don’t think that’s much better.
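That idiom would look something like this (foo_count is a hypothetical name):
typedef int foo_count;

struct foo {
    foo_count n;
    /* ... */
};

/* Loops use the same type, so changing the typedef updates both. */
for (foo_count i = 0; i < foo->n; i++)
    /* ... */;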
Why is all this important? Well, I was recently reviewing some C++ and
came across this odd specimen. I’d never seen anything like it before.
Notice the use of auto
for the parameter types.
void
set_odd(auto first, auto last, const auto &x)
{
bool toggle = false;
for (; first != last; first++, toggle = !toggle)
if (toggle)
*first = x;
}
Given the other uses of auto
as a type specifier, this kind of makes
sense, right? The compiler infers the type from the input argument.
But, as you should often do, put yourself in the compiler’s shoes for
a moment. Given this function definition in isolation, can you
generate any code? Nope. The compiler needs to see the call site
before it can infer the type. Even more, different call sites may use
different types. That sounds an awful lot like a template, eh?
template<typename T, typename V>
void
set_odd(T first, T last, const V &x)
{
bool toggle = false;
for (; first != last; first++, toggle = !toggle)
if (toggle)
*first = x;
}
This is a proposed feature called abbreviated function templates, part of C++ Extensions for Concepts. It’s intended to be shorthand for the template version of the function. GCC 4.9 implements it as an extension, which is why the author was unaware of its unofficial status. In March 2016 it was established that abbreviated function templates would not be part of C++17, but may still appear in a future revision.
Personally, I find this use of auto
to be vulgar. It overloads the
keyword with a third definition. This isn’t unheard of — static
also
serves a number of unrelated purposes — but while similar to the
second form of auto
(type inference), this proposed third form is
very different in its semantics (far more complex) and overhead
(potentially very costly). I’m glad it’s been rejected so far.
Templates better reflect the nature of this sort of code.
The libc wrapper for mmap(2) returns MAP_FAILED (-1) on error and sets
errno. But how do you check errno without libc?
As a reminder here’s what the (unoptimized) assembly looks like.
stack_create:
mov rdi, 0
mov rsi, STACK_SIZE
mov rdx, PROT_WRITE | PROT_READ
mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
mov rax, SYS_mmap
syscall
ret
As usual, the system call return value is in rax
, which becomes the
return value for stack_create()
. Again, its C prototype would look
like this:
void *stack_create(void);
If you were to, say, intentionally botch the arguments to force an error, you might notice that the system call isn’t returning -1, but other negative values. What gives?
The trick is that errno is a C concept. That’s why it’s documented as errno(3) — the 3 means it belongs to C. Just think about how messy this thing is: it’s a thread-local value living in the application’s address space. The kernel rightfully has nothing to do with it. Instead, the mmap(2) wrapper in libc assigns errno (if needed) after the system call returns. This is how all system calls through libc work, even with the syscall(2) wrapper.
So how does the kernel report the error? It’s an old-fashioned return value. If you have any doubts, take it straight from the horse’s mouth: mm/mmap.c:do_mmap(). Here’s a sample of return statements.
if (!len)
return -EINVAL;
/* Careful about overflows.. */
len = PAGE_ALIGN(len);
if (!len)
return -ENOMEM;
/* offset overflow? */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
return -EOVERFLOW;
/* Too many mappings? */
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
It’s returning the negated error number. Simple enough.
If you think about it a moment, you might notice a complication: This is a form of in-band signaling. On success, mmap(2) returns a memory address. All those negative error numbers are potentially addresses that a caller might want to map. How can we tell the difference?
1) None of the possible error numbers align on a page boundary, so they’re not actually valid return values. NULL does lie on a page boundary, which is one reason why it’s not used as an error return value for mmap(2). The other is that you might actually want to map NULL, for better or worse.
2) Those low negative values lie in a region of virtual memory reserved exclusively for the kernel (sometimes called “low memory”). On x86-64, any address with the most significant bit set (i.e. the sign bit of a signed integer) is one of these addresses. Processes aren’t allowed to map these addresses, and so mmap(2) will never return such a value on success.
So what’s a clean, safe way to go about checking for error values? It’s a lot easier to read musl than glibc, so let’s take a peek at how musl does it in its own mmap: src/mman/mmap.c.
if (off & OFF_MASK) {
errno = EINVAL;
return MAP_FAILED;
}
if (len >= PTRDIFF_MAX) {
errno = ENOMEM;
return MAP_FAILED;
}
if (flags & MAP_FIXED) {
__vm_wait();
}
return (void *)syscall(SYS_mmap, start, len, prot, flags, fd, off);
Hmm, it looks like it’s returning the result directly. What happened to setting errno? Well, syscall() is actually a macro that runs the result through __syscall_ret().
#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))
Looking a little deeper: src/internal/syscall_ret.c.
long __syscall_ret(unsigned long r)
{
if (r > -4096UL) {
errno = -r;
return -1;
}
return r;
}
Bingo. As documented, if the value falls within that “high” (unsigned) range of negative values for any system call, it’s an error number.
Getting back to the original question, we could employ this same check in the assembly code. However, since this is an anonymous memory map with a kernel-selected address, there’s only one possible error: ENOMEM (12). This error happens if the maximum number of memory maps has been reached, or if there’s no contiguous region available for the 4MB stack. The check will only need to test the result against -12.
My original solution was not to compress anything until it had gathered the entirety of the data. The input would get concatenated into a huge buffer, then finally compressed and written out. It’s not ideal, because the program uses a lot more memory than it theoretically needs, especially if the data is highly compressible. It would be far better to compress the data as it arrives and somehow update the header later.
My first thought was to ask zlib to leave the header uncompressed,
then enable compression (deflateParams()
) for the data. I’d work out
the magic offset and overwrite the uncompressed header bytes once I’m
done. There are two major issues with this, and I’ll address each:
zlib includes a checksum (adler32) at the end of the data, and editing the stream would cause a mismatch. This is fairly easy to correct thanks to adler32’s properties.
zlib is an LZ77-family compressor and compression comes from back-references into past (and sometimes future) bytes of decompressed output. Up to 32kB of data following the header could reference bytes in the header as a dictionary. I would need to ask zlib not to reference these bytes. Fortunately the zlib API is intentionally designed for this, though for different purposes.
Ignoring the second problem for a moment, I could fix the checksum by computing it myself. When I overwrite my uncompressed header bytes, I could also overwrite the checksum at the end of the compressed stream. For illustration, here’s a simple example implementation of adler32 (from Wikipedia):
#define MOD_ADLER 65521
uint32_t
example_adler32(uint8_t *data, size_t len)
{
uint32_t a = 1;
uint32_t b = 0;
for (size_t i = 0; i < len; i++) {
a = (a + data[i]) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
}
return (b << 16) | a;
}
If you think about this for a moment, you may notice this puts me back at square one. If I don’t know the header, then I don’t know the checksum value at the end of the header, going into the data buffer. I’d need to buffer all the data to compute the checksum. Fortunately adler32 has the nice property that two checksums can be concatenated as if they were one long stream. In a malicious context this is known as a length extension attack, but it’s a real benefit here.
It’s like the zlib authors anticipated my needs, because the zlib library has a function exactly for this:
uint32_t adler32_combine(uint32_t adler1, uint32_t adler2, size_t len2);
I just have to keep track of the data checksum adler2
and I can
compute the proper checksum later.
uint64_t total = 0;
uint32_t data_adler = adler32(0, 0, 0); // initial value
while (processing_input) {
// ...
data_adler = adler32(data_adler, data, size);
total += size;
}
// ...
uint32_t header_adler = adler32(0, 0, 0);
header_adler = adler32(header_adler, header, header_size);
uint32_t adler = adler32_combine(header_adler, data_adler, total);
This part is more complicated and it helps to have some familiarity
with zlib. Every time zlib is asked to compress data, it’s given a
flush parameter. Under normal operation, this value is always
Z_NO_FLUSH
until the end of the stream, in which case it’s finalized
with Z_FINISH
. Other flushing options force it to emit data sooner
at the cost of reduced compression ratio. This would primarily be used
to eliminate output latency on interactive streams (e.g. compressed
SSH sessions).
The necessary flush option for this situation is Z_FULL_FLUSH
. It
forces out all output data and resets the dictionary: a fence.
Future inputs cannot reference anything before a full flush. Since
the header is uncompressed, it will not reference itself either.
Ignoring the checksum problem, I can safely modify these bytes.
To fully demonstrate all of this, I’ve put together an example using one of my favorite image formats, Netpbm P6.
In the P6 format, the image header is an ASCII description of the image’s dimensions followed immediately by raw pixel data.
P6
width height
depth
[RGB bytes]
It’s a bit contrived, but it’s the project I used to work it all out. The demo reads arbitrary raw byte data on standard input and uses it to produce a zlib-compressed PPM file on standard output. It doesn’t know the size of the input ahead of time, nor does it naively buffer it all. There’s no dynamic allocation (except for what zlib does internally), but the program can process arbitrarily large input. The only requirement is that standard output is seekable. Using the technique described above, it patches the header within the zlib stream with the final image dimensions after the input has been exhausted.
If you’re on a Debian system, you can use zlib-flate
to decompress
raw zlib streams (gzip wraps zlib, but can’t handle raw zlib streams). Alternatively
your system’s openssl
program may have zlib support. Here’s running
it on itself as input. Remember, you can’t pipe it into zlib-flate
because the output needs to be seekable in order to write the header.
$ ./zppm < zppm > out.ppmz
$ zlib-flate -uncompress < out.ppmz > out.ppm
Unfortunately due to the efficiency-mindedness of zlib, its use requires careful bookkeeping that’s easy to get wrong. It’s a little machine that at each step needs to be either fed more input or its output buffer cleared. Even with all the error checking stripped away, it’s still too much to go over in full here, but I’ll summarize the parts.
First I process an empty buffer with compression disabled. The output
buffer will be discarded, so the input buffer could be left uninitialized,
but I don’t want to upset anyone. All I need is the output
size, which I use to seek over the to-be-written header. I use
Z_FULL_FLUSH
as described, and there’s no loop because I presume my
output buffer is easily big enough for this.
char bufin[4096];
char bufout[4096];
z_stream z = {
.next_in = (void *)bufin,
.avail_in = HEADER_SIZE,
.next_out = (void *)bufout,
.avail_out = sizeof(bufout),
};
deflateInit(&z, Z_NO_COMPRESSION);
memset(bufin, 0, HEADER_SIZE);
deflate(&z, Z_FULL_FLUSH);
fseek(stdout, sizeof(bufout) - z.avail_out, SEEK_SET);
Next I enable compression and reset the checksum. This makes zlib track the checksum for all of the non-header input. Otherwise I’d be throwing away all its checksum work and repeating it myself.
deflateParams(&z, Z_BEST_COMPRESSION, Z_DEFAULT_STRATEGY);
z.adler = adler32(0, 0, 0);
I won’t include it in this article, but what follows is a standard
zlib compression loop, consuming all the input data. There’s one key
difference compared to a normal zlib compression loop: when the input
is exhausted, instead of Z_FINISH
I use Z_SYNC_FLUSH
to force
everything out. The problem with Z_FINISH
is that it will write the
checksum, but we’re not ready for that.
With all the input processed, it’s time to go back to rewrite the
header. Rather than mess around with magic byte offsets, I start a
second, temporary zlib stream and do the Z_FULL_FLUSH
like before,
but this time with the real header. In deciding the header size, I
reserved 6 characters for the width and 10 characters for the height.
sprintf(bufin, "P6\n%-6lu\n%-10lu\n255\n", width, height);
uint32_t adler = adler32(0, 0, 0);
adler = adler32(adler, (void *)bufin, HEADER_SIZE);
z_stream zh = {
.next_in = (void *)bufin,
.avail_in = HEADER_SIZE,
.next_out = (void *)bufout,
.avail_out = sizeof(bufout),
};
deflateInit(&zh, Z_NO_COMPRESSION);
deflate(&zh, Z_FULL_FLUSH);
fseek(stdout, 0, SEEK_SET);
fwrite(bufout, 1, sizeof(bufout) - zh.avail_out, stdout);
fseek(stdout, 0, SEEK_END);
deflateEnd(&zh);
The header is now complete, so I can go back to finish the original compression stream. Again, I assume the output buffer is big enough for these final bytes.
z.adler = adler32_combine(adler, z.adler, z.total_in - HEADER_SIZE);
z.next_out = (void *)bufout;
z.avail_out = sizeof(bufout);
deflate(&z, Z_FINISH);
fwrite(bufout, 1, sizeof(bufout) - z.avail_out, stdout);
deflateEnd(&z);
It’s a lot more code than I expected, but it wasn’t too hard to work out. If you want to get into the nitty gritty and really hack a zlib stream, check out RFC 1950 and RFC 1951.
I added a drawing routine to a comparison function to see what the sort function was doing for different C libraries. Every time it’s called for a comparison, it writes out a snapshot of the array as a Netpbm PPM image. It’s easy to turn concatenated PPMs into a GIF or video. Here’s my code if you want to try it yourself:
Adjust the parameters at the top to taste. Rather than call rand() in the standard library, I included xorshift64star() with a hard-coded seed so that the array will be shuffled exactly the same across all platforms. This makes for a better comparison.
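For reference, the textbook xorshift64* recurrence looks like the following sketch (it needs <stdint.h>); check the demo’s source for the exact copy and seed it uses.
uint64_t
xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}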
To get an optimized GIF on unix-like systems, run it like so. (Microsoft’s UCRT currently has serious bugs with pipes, so it was run differently in that case.)
./a.out | convert -delay 10 ppm:- gif:- | gifsicle -O3 > sort.gif
The number of animation frames reflects the efficiency of the sort, but this isn’t really a benchmark. The input array is fully shuffled, and real data often isn’t. For a benchmark, have a look at a libc qsort() shootout of sorts instead.
To help you follow along, clicking on any animation will restart it.
Sorted in 307 frames. glibc prefers to use mergesort, which, unlike quicksort, isn’t an in-place algorithm, so it has to allocate memory. That allocation could fail for huge arrays, and, since qsort() can’t fail, it uses quicksort as a backup. You can really see the mergesort in action: changes are made that we cannot see until later, when it’s copied back into the original array.
Sorted in 503 frames. dietlibc is an alternative C standard library for Linux. It’s optimized for size, which shows through its slower performance. It looks like a quicksort that always chooses the last element as the pivot.
Update: Felix von Leitner, the primary author of dietlibc, has alerted me that, as of version 0.33, it now chooses a random pivot. This comment from the source describes it:
We chose the rightmost element in the array to be sorted as pivot, which is OK if the data is random, but which is horrible if the data is already sorted. Try to improve by exchanging it with a random other pivot.
Sorted in 637 frames. musl libc is another alternative C standard library for Linux. It’s my personal preference when I statically link Linux binaries. Its qsort() looks a lot like a heapsort, and with some research I see it’s actually smoothsort, a heapsort variant.
Sorted in 354 frames. I ran it on both OpenBSD and FreeBSD with identical results, so, unsurprisingly, they share an implementation. It’s quicksort, and what’s neat about it is at the beginning you can see it searching for a median for use as the pivot. This helps avoid the O(n^2) worst case.
BSD also includes a mergesort() with the same prototype, except with
an int
return for reporting failures. This one sorted in 247
frames. Like glibc before, there’s some behind-the-scenes that isn’t
captured. But even more, notice how the markers disappear during the
merge? It’s running the comparator against copies, stored outside the
original array. Sneaky!
Again, BSD also includes heapsort(), so I ran that too. It sorted in 418 frames. It definitely looks like a heapsort, and its worse performance is similar to musl’s. It seems heapsort is a poor fit for this data.
It turns out Cygwin borrowed its qsort() from BSD. It’s pixel identical to the above. I hadn’t noticed until I looked at the frame counts.
MinGW builds against MSVCRT.DLL, found on every Windows system despite its unofficial status. Until recently Microsoft didn’t include a C standard library as part of the OS, but that changed with their Universal CRT (UCRT) announcement. I thought I’d try them both.
Turns out they borrowed their old qsort() for the UCRT, and the result is the same: sorted in 417 frames. It chooses a pivot from the median of the ends and the middle, swaps the pivot to the middle, then partitions. Looking to the middle for the pivot makes sorting pre-sorted arrays much more efficient.
Finally I ran it against Pelles C, a C compiler for Windows. It sorted in 463 frames. I can’t find any information about it, but it looks like some sort of hybrid between quicksort and insertion sort. Like BSD qsort(), it finds a good median for the pivot, partitions the elements, and if a partition is small enough, it switches to insertion sort. This should behave well on mostly-sorted arrays, but poorly on well-shuffled arrays (like this one).
That’s everything that was readily accessible to me. If you can run it against something new, I’m certainly interested in seeing more implementations.
I’ve been using tools like this going back 20 years, but I never tried to write one myself until now. There are many memory cheat tools to pick from these days, the most prominent being Cheat Engine. These tools use the platform’s debugging API, so of course any good debugger could do the same thing, though a debugger won’t be specialized appropriately (e.g. locating the particular address and locking its value).
My motivation was bypassing an in-app purchase in a single player Windows game. I wanted to convince the game I had made the purchase when, in fact, I hadn’t. Once I had it working successfully, I ported MemDig to Linux since I thought it would be interesting to compare. I’ll start with Windows for this article.
Only three Win32 functions are needed, and you could almost guess at how it works.
It’s very straightforward and, for this purpose, is probably the
simplest API for any platform (see update).
As you probably guessed, you first need to open the process, given its process ID (an integer). You’ll need to select the desired access rights as a bit set. To read memory, you need the PROCESS_VM_READ and PROCESS_QUERY_INFORMATION rights. To write memory, you need the PROCESS_VM_WRITE and PROCESS_VM_OPERATION rights. Alternatively you could just ask for all rights with PROCESS_ALL_ACCESS, but I prefer to be precise.
DWORD access = PROCESS_VM_READ |
PROCESS_QUERY_INFORMATION |
PROCESS_VM_WRITE |
PROCESS_VM_OPERATION;
HANDLE proc = OpenProcess(access, FALSE, pid);
And then to read or write:
int value;   // buffer to read into, or value to write
void *addr;  // target process address
SIZE_T written;
ReadProcessMemory(proc, addr, &value, sizeof(value), &written);
// or
WriteProcessMemory(proc, addr, &value, sizeof(value), &written);
Don’t forget to check the return value and verify written. Finally, don’t forget to close it when you’re done.
CloseHandle(proc);
That’s all there is to it. For the full cheat tool you’d need to find the mapped regions of memory, via VirtualQueryEx. It’s not as simple, but I’ll leave that for another article.
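As a rough illustration of what that would involve — this sketch is mine, not MemDig’s actual code — you’d walk the target’s address space region by region and keep only committed, writable memory:

MEMORY_BASIC_INFORMATION info;
for (char *p = 0;
     VirtualQueryEx(proc, p, &info, sizeof(info));
     p += info.RegionSize) {
    if (info.State == MEM_COMMIT && (info.Protect & PAGE_READWRITE)) {
        // scan [info.BaseAddress, info.BaseAddress + info.RegionSize)
    }
}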
Unfortunately there’s no standard, cross-platform debugging API for unix-like systems. Most have a ptrace() system call, though each works a little differently. Note that ptrace() is not part of POSIX, but appeared in System V Release 4 (SVr4) and BSD, then copied elsewhere. The following will all be specific to Linux, though the procedure is similar on other unix-likes.
In typical Linux fashion, if it involves other processes, you use the standard file API on the /proc filesystem. Each process has a directory under /proc named as its process ID. In this directory is a virtual file called “mem”, which is a file view of that process’ entire address space, including unmapped regions.
char file[64];
sprintf(file, "/proc/%ld/mem", (long)pid);
int fd = open(file, O_RDWR);
The catch is that while you can open this file, you can’t actually read or write on that file without attaching to the process as a debugger. You’ll just get EIO errors. To attach, use ptrace() with PTRACE_ATTACH. This asynchronously delivers a SIGSTOP signal to the target, which has to be waited on with waitpid().
You could select the target address with lseek(), but it’s cleaner and more efficient just to do it all in one system call with pread() and pwrite(). I’ve left out the error checking, but the return value of each function should be checked:
ptrace(PTRACE_ATTACH, pid, 0, 0);
waitpid(pid, NULL, 0);

int value;        // buffer to read into, or value to write
off_t addr = ...; // target process address
pread(fd, &value, sizeof(value), addr);
// or
pwrite(fd, &value, sizeof(value), addr);

ptrace(PTRACE_DETACH, pid, 0, 0);
The process will (and must) be stopped during this procedure, so do your reads/writes quickly and get out. The kernel will deliver the writes to the other process’ virtual memory.
Like before, don’t forget to close.
close(fd);
To find the mapped regions in the real cheat tool, you would read and parse the virtual text file /proc/pid/maps. I don’t know if I’d call this stringly-typed method elegant — the kernel converts the data into string form and the caller immediately converts it right back — but that’s the official API.
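A sketch of that parsing, with hypothetical names and no error handling, might look something like this:

char path[64];
sprintf(path, "/proc/%ld/maps", (long)pid);
FILE *maps = fopen(path, "r");
char line[256];
while (fgets(line, sizeof(line), maps)) {
    unsigned long beg, end;
    char perms[8];
    // each line looks like: 00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/foo
    if (sscanf(line, "%lx-%lx %7s", &beg, &end, perms) == 3 && perms[1] == 'w') {
        // candidate region [beg, end): scan it with pread() on the mem file
    }
}
fclose(maps);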
Update: Konstantin Khlebnikov has pointed out the process_vm_readv() and process_vm_writev() system calls, available since Linux 3.2 (January 2012) and glibc 2.15 (March 2012). These system calls do not require ptrace(), nor does the remote process need to be stopped. They’re equivalent to ReadProcessMemory() and WriteProcessMemory(), except there’s no requirement to first “open” the process.
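For completeness, here’s roughly what the process_vm_readv() version looks like, a sketch under the same assumptions as the snippets above:

#include <sys/uio.h>

int value;
struct iovec local  = { .iov_base = &value,       .iov_len = sizeof(value) };
struct iovec remote = { .iov_base = (void *)addr, .iov_len = sizeof(value) };
process_vm_readv(pid, &local, 1, &remote, 1, 0);  // or process_vm_writev()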
For example, compression programs such as gzip, bzip2, and xz when given a compressed file as an argument will create a new file with the compression extension removed. They write to this file as the compressed input is being processed. If the compressed stream contains an error in the middle, the partially-completed output is removed.
There are exceptions of course, such as programs that download files over a network. The partial result has value, especially if the transfer can be continued from where it left off. The convention is to append another extension, such as “.part”, to indicate a partial output.
The straightforward solution is to always delete the file as part of error handling. A non-interactive program would report the error on standard error, delete the file, and exit with an error code. However, there are at least two situations where error handling would be unable to operate: unhandled signals (usually including a segmentation fault) and power failures. A partial or corrupted output file will be left behind, possibly looking like a valid file.
A common, more complex approach is to name the file differently from its final name while being written. If written successfully, the completed file is renamed into place. This is already required for durable replacement, so it’s basically free for many applications. In the worst case, where the program is unable to clean up, the obviously incomplete file is left behind only wasting space.
Looking to be more robust, I had the following misguided idea: Rely completely on the operating system to perform cleanup in the case of a failure. Initially the file would be configured to be automatically deleted when the final handle is closed. This takes care of all abnormal exits, and possibly even power failures. The program can just exit on error without deleting the file. Once written successfully, the automatic-delete indicator is cleared so that the file survives.
The target application for this technique supports both Linux and Windows, so I would need to figure it out for both systems. On Windows, there’s the flag FILE_FLAG_DELETE_ON_CLOSE; I’d just need to find a way to clear it. On POSIX, the file would be unlinked while being written, and linked into the filesystem on success. The latter turns out to be a lot harder than I expected.
I’ll start with Windows since the technique actually works fairly well here — ignoring the usual, dumb Win32 filesystem caveats. This is a little surprising, since it’s usually Win32 that makes these things far more difficult than they should be.
The primary Win32 function for opening and creating files is CreateFile. There are many options, but the key is FILE_FLAG_DELETE_ON_CLOSE. Here’s how an application might typically open a file for output.
DWORD access = GENERIC_WRITE;
DWORD create = CREATE_ALWAYS;
DWORD flags = FILE_FLAG_DELETE_ON_CLOSE;
HANDLE f = CreateFile("out.tmp", access, 0, 0, create, flags, 0);
This special flag asks Windows to delete the file as soon as the last handle to the file object is closed. Notice I said file object, not file, since these are different things. The catch: This flag is a property of the file object, not the file, and cannot be removed.
However, the solution is simple. Create a new link to the file so that it survives deletion. This even works for files residing on network shares.
CreateHardLink("out", "out.tmp", 0);
CloseHandle(f); // deletes out.tmp file
The gotcha is that the underlying filesystem must be NTFS. FAT32 doesn’t support hard links. Unfortunately, since FAT32 remains the least common denominator and is still widely used for removable media, depending on the application, your users may expect support for saving files to FAT32. A workaround is probably required.
This is where things really fall apart. It’s just barely possible on Linux, it’s messy, and it’s not portable anywhere else. There’s no way to do this for POSIX in general.
My initial thought was to create a file then unlink it. Unlike the situation on Windows, files can be unlinked while they’re currently open by a process. These files are finally deleted when the last file descriptor (the last reference) is closed. Unfortunately, using unlink(2) to remove the last link to a file prevents that file from being linked again.
Instead, the solution is to use the relatively new (since Linux 3.11),
Linux-specific O_TMPFILE
flag when creating the file. Instead of a
filename, this variation of open(2) takes a directory and creates an
unnamed, temporary file in it. These files are special in that they’re
permitted to be given a name in the filesystem at some future point.
For this example, I’ll assume the output is relative to the current working directory. If it’s not, you’ll need to open an additional file descriptor for the parent directory, and also use openat(2) to avoid possible race conditions (since paths can change from under you). The number of ways this can fail is already rapidly multiplying.
int fd = open(".", O_TMPFILE|O_WRONLY, 0600);
The catch is that only a handful of filesystems support O_TMPFILE. It’s like the FAT32 problem above, but worse. You could easily end up in a situation where it’s not supported, and will almost certainly require a workaround.
Linking a file from a file descriptor is where things get messier. The file descriptor must be linked with linkat(2) from its name on the /proc virtual filesystem, constructed as a string. The following snippet comes straight from the Linux open(2) manpage.
char buf[64];
sprintf(buf, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, buf, AT_FDCWD, "out", AT_SYMLINK_FOLLOW);
Even on Linux, /proc isn’t always available, such as within a chroot
or a container, so this part can fail as well. In theory there’s a way
to do this with the Linux-specific AT_EMPTY_PATH
and avoid /proc,
but I couldn’t get it to work.
// Note: this doesn't actually work for me.
linkat(fd, "", AT_FDCWD, "out", AT_EMPTY_PATH);
Given the poor portability (even within Linux), the number of ways this can go wrong, and that a workaround is definitely needed anyway, I’d say this technique is worthless. I’m going to stick with the tried-and-true approach for this one.
1) The append must be atomic such that it doesn’t clobber previous appends by other threads and processes. For example, suppose a write requires two separate operations: first moving the file pointer to the end of the file, then performing the write. There would be a race condition should another process or thread intervene in between with its own write.
2) The output will be interleaved. The primary solution is to design the data format as atomic records, where the ordering of records is unimportant — like rows in a relational database. This could be as simple as a text file with each line as a record. The concern is then ensuring records are written atomically.
This article discusses processes, but the same applies to threads when directly dealing with file descriptors.
The first concern is solved by the operating system, with one caveat. On POSIX systems, opening a file with the O_APPEND flag will guarantee that writes always safely append:

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
However, this says nothing about interleaving. Two processes successfully appending to the same file will result in all their bytes in the file in order, but not necessarily contiguously.
The caveat is that not all filesystems are POSIX-compatible. Two famous examples are NFS and the Hadoop Distributed File System (HDFS). On these networked filesystems, appends are simulated and subject to race conditions.
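Concretely, the guarantee above is what you get from a plain open(2) with O_APPEND. This is a minimal sketch; the filename and the record buffer are placeholders, not from any particular program:

int fd = open("records.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
// each write() lands at the current end of file, atomically
write(fd, record, record_len);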
On POSIX systems, fopen(3) with the a flag will use O_APPEND, so you don’t necessarily need to use open(2). On Linux this can be verified for any language’s standard library with strace.
#include <stdio.h>

int main(void)
{
    fopen("/dev/null", "a");
    return 0;
}
And the result of the trace:
$ strace -e open ./a.out
open("/dev/null", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
For Win32, the equivalent is the FILE_APPEND_DATA
access right, and
similarly only applies to “local files.”
The interleaving problem has two layers, and gets more complicated the more correct you want to be. Let’s start with pipes.
On POSIX, a pipe is unseekable and doesn’t have a file position, so appends are the only kind of write possible. When writing to a pipe (or FIFO), writes less than the system-defined PIPE_BUF are guaranteed to be atomic and non-interleaving:

Write requests of PIPE_BUF bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than PIPE_BUF bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, […]

The minimum value for PIPE_BUF for POSIX systems is 512 bytes. On Linux it’s 4kB, and on other systems it’s as high as 32kB.
As long as each record is less than 512 bytes, a simple write(2) will do. None of this depends on a filesystem since no files are involved.
If PIPE_BUF bytes isn’t enough, the POSIX writev(2) can be used to atomically write up to IOV_MAX buffers of PIPE_BUF bytes each. The minimum value for IOV_MAX is 16, but it’s typically 1024. This means the maximum safe atomic write size for pipes — and therefore the largest record size — for a perfectly portable program is 8kB (16✕512). On Linux it’s 4MB.
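Here’s the shape of that writev(2) call, with made-up buffer names. The record is gathered from several pieces but still submitted in one system call, provided the sizes respect the limits above:

#include <sys/uio.h>

struct iovec iov[2] = {
    { .iov_base = header,  .iov_len = header_len  },
    { .iov_base = payload, .iov_len = payload_len },
};
ssize_t n = writev(fd, iov, 2);  // check n against the total length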
That’s all at the system call level. There’s another layer to contend with: buffered I/O in your language’s standard library. Your program may pass data in appropriately-sized pieces for atomic writes to the I/O library, but it may be undoing your hard work, concatenating all these writes into a buffer, splitting apart your records. For this part of the article, I’ll focus on single-threaded C programs.
Suppose you’re writing a simple space-separated format with one line per record.
int foo, bar;
float baz;

while (condition) {
    // ...
    printf("%d %d %f\n", foo, bar, baz);
}
Whether or not this works depends on how stdout is buffered. C standard library streams (FILE *) have three buffering modes: unbuffered, line buffered, and fully buffered. Buffering is configured through setbuf(3) and setvbuf(3), and the initial buffering state of a stream depends on various factors. For buffered streams, the default buffer is at least BUFSIZ bytes, itself at least 256 (C99 §7.19.2¶7). Note: threads share this buffer.
Since each record in the above program easily fits inside 256 bytes, if stdout is a line buffered pipe then this program will interleave correctly on any POSIX system without further changes.
If instead your output is comma-separated values (CSV) and your
records may contain new line characters, there are two
approaches. In each, the record must still be no larger than
PIPE_BUF
bytes.
Unbuffered pipe: construct the record in a buffer (i.e. sprintf(3)) and output the entire buffer in a single fwrite(3). While I believe this will always work in practice, it’s not guaranteed by the C specification, which defines fwrite(3) as a series of fputc(3) calls (C99 §7.19.8.2¶2).
Fully buffered pipe: set a sufficiently large stream buffer and follow each record with a fflush(3). Unlike fwrite(3) on an unbuffered stream, the specification says the buffer will be “transmitted to the host environment as a block” (C99 §7.19.3¶3), so this should be perfectly correct on any POSIX system.
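For the fully buffered case, the setup might look like this sketch (the buffer size and names are mine):

static char buf[PIPE_BUF];  // PIPE_BUF from <limits.h>
setvbuf(stdout, buf, _IOFBF, sizeof(buf));
// ... build one complete record on stdout with printf()/fputs() ...
fflush(stdout);  // the buffer is transmitted to the host environment as a block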
If your situation is more complicated than this, you’ll probably have to bypass your standard library buffered I/O and call write(2) or writev(2) yourself.
If interleaved writes to a shared stdout pipe sound contrived, here’s the real-life scenario: GNU xargs with its --max-procs (-P) option to process inputs in parallel.
xargs -n1 -P$(nproc) myprogram < inputs.txt | cat > outputs.csv
The | cat
ensures the output of each myprogram
process is
connected to the same pipe rather than to the same file.
A non-portable alternative to | cat
, especially if you’re
dispatching processes and threads yourself, is the splice(2) system
call on Linux. It efficiently moves the output from the pipe to the
output file without an intermediate copy to userspace. GNU Coreutils’
cat doesn’t use this.
On Win32, anonymous pipes have no semantics regarding interleaving. Named pipes have per-client buffers that prevent interleaving. However, the pipe buffer size is unspecified, and requesting a particular size is only advisory, so it comes down to trial and error, though the unstated limits should be comparatively generous.
Suppose instead of a pipe we have an O_APPEND
file on POSIX. Common
wisdom states that the same PIPE_BUF
atomic write rule applies.
While this often works, especially on Linux, this is not correct. The
POSIX specification doesn’t require it and there are systems where it
doesn’t work.
If you know the particular limits of your operating system and filesystem, and you don’t care much about portability, then maybe you can get away with interleaving appends. For full portability, pipes are required.
On Win32, writes on local files up to the underlying drive’s sector size (typically 512 bytes to 4kB) are atomic. Otherwise the only options are deprecated Transactional NTFS (TxF), or manually synchronizing your writes. All in all, it’s going to take more work to get correct.
My true use case for mucking around with clean, atomic appends is to compute giant CSV tables in parallel, with the intention of later loading into a SQL database (i.e. SQLite) for analysis. A more robust and traditional approach would be to write results directly into the database as they’re computed. But I like the platform-neutral intermediate CSV files — good for archival and sharing — and the simplicity of programs generating the data — concerned only with atomic write semantics rather than calling into a particular SQL database API.
This is about the effect of const on optimization. Variations of this question have been asked many times over the past two decades. Personally, I blame the naming of const.
Given this program:
void foo(const int *);

int
bar(void)
{
    int x = 0;
    int y = 0;
    for (int i = 0; i < 10; i++) {
        foo(&x);
        y += x;  // this load not optimized out
    }
    return y;
}
The function foo takes a pointer to const, which is a promise from the author of foo that it won’t modify the value of x. Given this information, it would seem the compiler may assume x is always zero, and therefore y is always zero.
However, inspecting the assembly output of several different compilers shows that x is loaded each time around the loop. Here’s gcc 4.9.2 at -O3, with annotations, for x86-64:
bar:
push rbp
push rbx
xor ebp, ebp ; y = 0
mov ebx, 0xa ; loop variable i
sub rsp, 0x18 ; allocate x
mov dword [rsp+0xc], 0 ; x = 0
.L0: lea rdi, [rsp+0xc] ; compute &x
call foo
add ebp, dword [rsp+0xc] ; y += x (not optimized?)
sub ebx, 1
jne .L0
add rsp, 0x18 ; deallocate x
mov eax, ebp ; return y
pop rbx
pop rbp
ret
The output of clang 3.5 (with -fno-unroll-loops) is the same, except ebp and ebx are swapped, and the computation of &x is hoisted out of the loop, into r14.
Are both compilers failing to take advantage of this useful information? Wouldn’t it be undefined behavior for foo to modify x? Surprisingly, the answer is no. In this situation, this would be a perfectly legal definition of foo.
void
foo(const int *readonly_x)
{
    int *x = (int *)readonly_x;  // cast away const
    (*x)++;
}
The key thing to remember is that const
doesn’t mean
constant. Chalk it up as a misnomer. It’s not an
optimization tool. It’s there to inform programmers — not the compiler
— as a tool to catch a certain class of mistakes at compile time. I
like it in APIs because it communicates how a function will use
certain arguments, or how the caller is expected to handle returned
pointers. It’s usually not strong enough for the compiler to change
its behavior.
Despite what I just said, occasionally the compiler can take
advantage of const
for optimization. The C99 specification, in
§6.7.3¶5, has one sentence just for this:
If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined.
The original x wasn’t const-qualified, so this rule didn’t apply. And there aren’t any rules against casting away const to modify an object that isn’t itself const. This means the above (mis)behavior of foo isn’t undefined behavior for this call. Notice how the undefined-ness of foo depends on how it was called.
With one tiny tweak to bar, I can make this rule apply, allowing the optimizer to do some work on it.
const int x = 0;
The compiler may now assume that foo modifying x is undefined behavior, therefore it never happens. For better or worse, this is a major part of how a C optimizer reasons about your programs. The compiler is free to assume x never changes, allowing it to optimize out both the per-iteration load and y.
bar:
push rbx
mov ebx, 0xa ; loop variable i
sub rsp, 0x10 ; allocate x
mov dword [rsp+0xc], 0 ; x = 0
.L0: lea rdi, [rsp+0xc] ; compute &x
call foo
sub ebx, 1
jne .L0
add rsp, 0x10 ; deallocate x
xor eax, eax ; return 0
pop rbx
ret
The load disappears, y
is gone, and the function always returns
zero.
Curiously, the specification almost allows the compiler to go
further. Consider what would happen if x
were allocated somewhere
off the stack in read-only memory. That transformation would look like
this:
static const int __x = 0;

int
bar(void)
{
    for (int i = 0; i < 10; i++)
        foo(&__x);
    return 0;
}
We would see a few more instructions shaved off (-fPIC, small code model):
section .rodata
x: dd 0
section .text
bar:
push rbx
mov ebx, 0xa ; loop variable i
.L0: lea rdi, [rel x] ; compute &x
call foo
sub ebx, 1
jne .L0
xor eax, eax ; return 0
pop rbx
ret
Because the address of x is taken and “leaked,” this last transform is not permitted. If bar is called recursively such that a second address is taken for x, that second pointer would compare equally (==) with the first pointer despite being semantically distinct objects, which is forbidden (§6.5.9¶6).
Even with this special const
rule, stick to using const
for
yourself and for your fellow human programmers. Let the optimizer
reason for itself about what is constant and what is not.
Travis Downs nicely summed up this article in the comments:
In general, const declarations can’t help the optimizer, but const definitions can.
I primarily work on and develop for unix-like operating systems — Linux in particular. However, when it comes to desktop applications, most potential users are on Windows. Rather than develop on Windows, which I’d rather avoid, I’ll continue developing, testing, and debugging on Linux while keeping portability in mind. Unfortunately every option I’ve found for building Windows C programs has some significant limitations. These limitations advise my approach to portability and restrict the C language features used by the program for all platforms.
As of this writing I’ve identified four different practical ways to build C applications for Windows. This information will definitely become further and further out of date as this article ages, so if you’re visiting from the future take a moment to look at the date. Except for LLVM shaking things up recently, development tooling on unix-like systems has had the same basic form for the past 15 years (i.e. dominated by GCC). While Visual C++ has been around for more than two decades, the tooling on Windows has seen more churn by comparison.
Before I get into the specifics, let me point out a glaring problem
common to all four: Unicode arguments and filenames. Microsoft jumped
the gun and adopted UTF-16 early. UTF-16 is a kludge, a worst of all
worlds, being a variable length encoding (surrogate pairs), backwards
incompatible (unlike UTF-8), and having byte-order issues (BOM).
Most Win32 functions that accept strings generally come in two flavors,
ANSI and UTF-16. The standard, portable C library functions wrap the
ANSI-flavored functions. This means portable C programs can’t interact
with Unicode filenames. (Update 2021: Now they can.) They must
call the non-portable, Windows-specific versions. This includes main
itself, which is only handed ANSI-truncated arguments.
Compare this to unix-like systems, which generally adopted UTF-8, though as a convention rather than a hard rule. The operating system doesn’t know or care about Unicode. Program arguments and filenames are just zero-terminated bytestrings. Implicitly decoding these as UTF-8 would be a mistake anyway. What happens when the encoding isn’t valid?
This doesn’t have to be a problem on Windows. A Windows standard C library could connect to Windows’ Unicode-flavored functions and encode to/from UTF-8 as needed, allowing portable programs to maintain the bytestring illusion. It’s only that none of the existing standard C libraries do it this way.
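To make the idea concrete, here’s a sketch of such a shim — entirely hypothetical, not how any existing runtime actually does it: fetch the UTF-16 arguments and re-encode them as UTF-8 before handing them to the program.

#include <windows.h>
#include <shellapi.h>
#include <stdlib.h>

int wargc;
wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
char **argv = calloc(wargc + 1, sizeof(*argv));
for (int i = 0; i < wargc; i++) {
    int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, 0, 0, 0, 0);
    argv[i] = malloc(len);
    WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], len, 0, 0);
}
// argv[] now holds UTF-8 copies of the real command line arguments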
Of course my first natural choice is MinGW, specifically the Mingw-w64 fork. It’s GCC ported to Windows. You can continue relying on GCC-specific features when you need them. It’s got all the core language features up through C11, plus the common extensions. It’s probably packaged by your Linux distribution of choice, making it trivial to cross-compile programs and libraries from Linux — and with Wine you can even execute them on x86. Like regular GCC, it outputs GDB-friendly DWARF debugging information, so you can debug applications with GDB.
If I’m using Mingw-w64 on Windows, I prefer to do so from inside Cygwin. Since it provides a complete POSIX environment, it maximizes portability for the whole tool chain. This isn’t strictly required.

However, it has one big flaw. Unlike unix-like systems, Windows doesn’t supply a system standard C library. That’s the compiler’s job. But Mingw-w64 doesn’t have one. Instead it links against msvcrt.dll, which isn’t officially supported by Microsoft. It just happens to exist on modern Windows installations. Since it’s not supported, it’s way out of date and doesn’t support much of C99. A lot of these problems are patched over by the compiler, but if you’re relying on Mingw-w64, you still have to stick to some C89 library features, such as limiting yourself to the C89 printf specifiers.
Update: Mārtiņš Možeiko has pointed out __USE_MINGW_ANSI_STDIO, an undocumented feature that fixes the printf family. (Update 2021: Mingw-w64 now does the right thing out of the box.) I now use this by default in all of my Mingw-w64 builds. It fixes most of the formatted output issues, except that it’s incompatible with the format function attribute.
Another problem is that position-independent code generation is
broken, and so ASLR is not an option. This means binaries produced
by Mingw-w64 are less secure than they should be. There are also a
number of subtle code generation bugs that might arise if you’re
doing something unusual. (Update 2021: Mingw-w64 makes PIE mandatory.)
The behemoth usually considered in this situation is Visual Studio and the Visual C++ build tools. I strongly prefer open source development tools, and Visual Studio is obviously the least open source option, but at least it’s cost-free these days. Now, I have absolutely no interest in Visual Studio, but fortunately the Visual C++ compiler and associated build tools can be used standalone, supporting both C and C++.
Included is a “vcvars” batch file — vcvars64.bat for x64. Execute that
batch file in a cmd.exe console and the Visual C++ command line build
tools will be made available in that console and in any programs
executed from it (your editor). It includes the compiler (cl.exe),
linker (link.exe), assembler (ml64.exe), disassembler (dumpbin.exe),
and more. It also includes a mostly POSIX-complete make called
nmake.exe. All these tools are noisy and print a copyright banner on
every invocation, so get used to passing -nologo
every time, which
suppresses some of it.
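Once the environment is set up, a minimal build from that console looks something like this (the flags here are just a plausible example, not a recommendation from the article):

cl -nologo -W4 -O2 example.c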
When I said behemoth, I meant it. In my experience it literally takes
hours (unattended) to install Visual Studio 2015. The good news is you
don’t actually need it all anymore. The build tools are available
standalone. While it’s still a larger and slower installation
process than it really should be, it’s much more reasonable to
install. It’s good enough that I’d even say I’m comfortable relying on
it for Windows builds. (Update: The build tools are unfortunately no
longer standalone.)
That being said, it’s not without its flaws. Microsoft has never announced any plans to support C99. They only care about C++, with C as a second class citizen. Since C++11 incorporated most of C99 and Microsoft supports C++11, Visual Studio 2015 supports most of C99. The only things missing as far as I can tell are variable length arrays (VLAs), complex numbers, and C99’s array parameter declarators, since none of these were adopted by C++. Some C99 features are considered extensions (as they would be for C89), so you’ll also get warnings about them, which can be disabled.
The command line interface (option flags, intermediates, etc.) isn’t quite reconcilable with the unix-like ecosystem (i.e. GCC, Clang), so you’ll need separate Makefiles, or you’ll need to use a build system that generates Visual C++ Makefiles.
Debugging is a major problem. (Update 2022: It’s actually quite good
once you know how to do it.) Visual C++ outputs separate .pdb
program database files, which aren’t usable from GDB. Visual
Studio has a built-in debugger, though it’s not included in the
standalone Visual C++ build tools. I’m still searching for a decent
debugging solution for this scenario. I tried WinDbg, but I can’t stand
it. (Update 2022: RemedyBG is amazing.)
In general the output code performance is on par with GCC and Clang, so you’re not really gaining or losing performance with Visual C++.
Unsurprisingly, Clang has been ported to Windows. It’s like Mingw-w64 in that you get the same features and interface across platforms.
Unlike Mingw-w64, it doesn’t link against msvcrt.dll. Instead it relies directly on the official Windows SDK. You’ll basically need to install the Visual C++ build tools as if you were going to build with Visual C++. This means no practical cross-platform builds and you’re still relying on the proprietary Microsoft toolchain. In the past you even had to use Microsoft’s linker, but LLVM now provides its own.
It generates GDB-friendly DWARF debug information (in addition to CodeView) so in theory you can debug with GDB again. I haven’t given this a thorough evaluation yet.
Finally there’s Pelles C. It’s cost-free but not open source. It’s a reasonable, small install that includes a full IDE with an integrated debugger and command line tools. It has its own C library and Win32 SDK with the most complete C11 support around. It also supports OpenMP 3.1. All in all it’s pretty nice and is something I wouldn’t be afraid to rely upon for Windows builds.
Like Visual C++, it has a couple of “povars” batch files to set up the right environment, which includes a C compiler, linker, assembler, etc. The compiler interface mostly mimics cl.exe, though there are far fewer code generation options. The make program, pomake.exe, mimics nmake.exe, but is even less POSIX-complete. The compiler’s output code performance is also noticeably poorer than GCC, Clang, and Visual C++. It’s definitely a less mature compiler.
It outputs CodeView debugging information, so GDB is of no use. The best solution is to simply use the debugger built into the IDE, which can be invoked directly from the command line. You don’t normally need to code from within the IDE just to use the debugger.
Like Visual C++, it’s Windows only, so cross-compilation isn’t really in the picture.
If performance isn’t of high importance, and you don’t require specific code generation options, then Pelles C is a nice choice for Windows builds.
I’m sure there are a few other options out there, and I’d like to hear about them so I can try them out. I focused on these since they’re all cost free and easy to download. If I have to register or pay, then it’s not going to beat these options.
Interned strings can be compared quickly by their addresses (str_a == str_b) rather than, more slowly, by their contents (strcmp(str_a, str_b) == 0). The intern table ensures that these expressions both have the same result.
As a key in a hash table, or other efficient map/dictionary data structure, I’ll need to turn pointers into numerical values. However, C pointers aren’t integers. Following certain rules it’s permitted to cast pointers to integers and back, but doing so will reduce the program’s portability. The most important consideration is that the integer form isn’t guaranteed to have any meaningful or stable value. In other words, even in a conforming implementation, the same pointer might cast to two different integer values. This would break any algorithm that isn’t keenly aware of the implementation details.
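For illustration only — this is my sketch, not part of the standard or any particular intern table — the kind of pointer hash in question relies on uintptr_t yielding a stable, meaningful value:

#include <stdint.h>

uint64_t
hash_pointer(const void *p)
{
    uint64_t x = (uintptr_t)p;     // implementation-defined value
    x ^= x >> 32;
    x *= 0xd6e8feb86659fd93U;      // arbitrary odd multiplier
    x ^= x >> 32;
    return x;
}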
To show why this is, I’m going to be citing the relevant parts of the C99 standard (ISO/IEC 9899:1999). The draft for C99 is freely available (and what I use myself since I’m a cheapass). My purpose is not to discourage you from casting pointers to integers and using the result. The vast majority of the time this works fine and as you would expect. I just think it’s an interesting part of the language, and C/C++ programmers should be aware of the potential trade-offs.
What does the standard have to say about casting integers to pointers? §6.3.2.3¶5:
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
It also includes a footnote:
The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
Casting an integer to a pointer depends entirely on the implementation. This is intended for things like memory mapped hardware. The programmer may need to access memory as a specific physical address, which would be encoded in the source as an integer constant and cast to a pointer of the appropriate type.
int
read_sensor_voltage(void)
{
    return *(int *)0x1ffc;
}
It may also be used by a loader and dynamic linker to compute the virtual address of various functions and variables, then cast to a pointer before use.
Both cases are already dependent on implementation defined behavior, so there’s nothing lost in relying on these casts.
An integer constant expression of 0 is a special case. It casts to a
NULL pointer in all implementations (§6.3.2.3¶3). However, a NULL
pointer doesn’t necessarily point to address zero, nor is it
necessarily a zero bit pattern (i.e. beware memset
and calloc
on
memory with pointers). It’s just guaranteed never to compare equally
with a valid object, and it is undefined behavior to dereference.
What about the other way around? §6.3.2.3¶6:
Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
Like before, it’s implementation defined. However, the negatives are a little stronger: the cast itself may be undefined behavior. I speculate this is tied to integer overflow. The last part makes pointer to integer casts optional for an implementation. This is one way that the hash table above would be less portable.
When the cast is always possible, an implementation can provide an integer type wide enough to hold any pointer value. §7.18.1.4¶1:
The following type designates a signed integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:
intptr_t
The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer:
uintptr_t
These types are optional.
The take-away is that the integer has no meaningful value. The only guarantee is that the integer can be cast back into a void pointer that will compare equally. It would be perfectly legal for an implementation to pass these assertions (and still sometimes fail).
void
example(void *ptr_a, void *ptr_b)
{
    if (ptr_a == ptr_b) {
        uintptr_t int_a = (uintptr_t)ptr_a;
        uintptr_t int_b = (uintptr_t)ptr_b;
        assert(int_a != int_b);
        assert((void *)int_a == (void *)int_b);
    }
}
Since the bits don’t have any particular meaning, arithmetic
operations involving them will also have no meaning. When a pointer
might map to two different integers, the hash values might not match
up, breaking hash tables that rely on them. Even with uintptr_t
provided, casting pointers to integers isn’t useful without also
relying on implementation defined properties of the result.
What purpose could such strange pointer-to-integer casts serve?
A security-conscious implementation may choose to annotate pointers with additional information by setting unused bits. It might be for baggy bounds checks or, someday, in an undefined behavior sanitizer. Before dereferencing annotated pointers, the metadata bits would be checked for validity, and cleared/set before use as an address. Or it may map the same object at multiple virtual addresses to avoid setting/clearing the metadata bits, providing interoperability with code unaware of the annotations. When pointers are compared, these bits would be ignored.
When these annotated pointers are cast to integers, the metadata bits will be present, but a program using the integer wouldn’t know their meaning without tying itself closely to that implementation. Completely unused bits may even be filled with random garbage when cast. It’s allowed.
You may have been thinking before about using a union or char *
to
bypass the cast and access the raw pointer bytes, but you’d run into
the same problems on the same implementations.
The standard makes a distinction between strictly conforming
programs (§4¶5) and conforming programs (§4¶7). A strictly
conforming program must not produce output depending on implementation
defined behavior nor exceed minimum implementation limits. Very few
programs fit in this category, including any program using uintptr_t
since it’s optional. Here are more examples of code that isn’t
strictly conforming:
printf("%zu", sizeof(int)); // §6.5.3.4
printf("%d", -1 >> 1); // §6.5¶4
printf("%d", MAX_INT); // §5.2.4.2.1
On the other hand, a conforming program is allowed to depend on implementation defined behavior. Relying on meaningful, stable values for pointers cast to uintptr_t/intptr_t is conforming even if your program may exhibit bugs on some implementations.
This functionality allows for a neat hack: A physical memory address can be mapped to multiple virtual memory addresses at the same time. A process running with such a mapping will see these regions of memory as aliased — views of the same physical memory. A store to one of these addresses will simultaneously appear across all of them.
This feature has a number of useful applications.
Both POSIX and Win32 allow user space applications to create these aliased mappings. The original purpose for these APIs is for shared memory between processes, where the same physical memory is mapped into two different processes’ virtual memory. But the OS doesn’t stop us from mapping the shared memory to a different address within the same process.
On POSIX systems (Linux, *BSD, OS X, etc.), the three key functions are shm_open(3), ftruncate(2), and mmap(2).

First, create a file descriptor to shared memory using shm_open. It has very similar semantics to open(2).
int shm_open(const char *name, int oflag, mode_t mode);
The name works much like a filesystem path, but is actually a different namespace (though on Linux it is a tmpfs mounted at /dev/shm). Resources created here (O_CREAT) will persist until explicitly deleted (shm_unlink(3)) or until the system reboots. It’s an oversight in POSIX that a name is required even if we never intend to access it by name. File descriptors can be shared with other processes via fork(2) or through UNIX domain sockets, so a name isn’t strictly required.
OpenBSD introduced shm_mkstemp(3)
to solve this problem,
but it’s not widely available. On Linux, as of this writing, the
O_TMPFILE
flag may or may not provide a fix (it’s
undocumented).
The portable workaround is to attempt to choose a unique name, open the file with O_CREAT | O_EXCL (either atomically create the file or fail), shm_unlink the shared memory object as soon as possible, then cross our fingers. The shared memory object will still exist (the file descriptor keeps it alive) but will no longer be accessible by name.
int fd = shm_open("/example", O_RDWR | O_CREAT | O_EXCL, 0600);
if (fd == -1)
    handle_error(); // non-local exit
shm_unlink("/example");
The shared memory object is brand new (O_EXCL) and is therefore of zero size. ftruncate sets it to the desired size. This does not need to be a multiple of the page size. Failing to allocate memory will result in a bus error on access.
size_t size = sizeof(uint32_t);
ftruncate(fd, size);
Finally mmap the shared memory into place just as if it were a file. We can choose an address (aligned to a page) or let the operating system choose one for us (NULL). If we don’t plan on making any more mappings, we can also close the file descriptor. The shared memory object will be freed as soon as it is completely unmapped (munmap(2)).
int prot = PROT_READ | PROT_WRITE;
uint32_t *a = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
uint32_t *b = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
close(fd);
At this point both a
and b
have different addresses but point (via
the page table) to the same physical memory. Changes to one are
reflected in the other. So this:
*a = 0xdeafbeef;
printf("%p %p 0x%x\n", a, b, *b);
Will print out something like:
0x6ffffff0000 0x6fffffe0000 0xdeafbeef
It’s also possible to do all this only with open(2)
and mmap(2)
by
mapping the same file twice, but you’d need to worry about where to
put the file, where it’s going to be backed, and the operating system
will have certain obligations about syncing it to storage somewhere.
Using POSIX shared memory is simpler and faster.
Windows is very similar, but directly supports anonymous shared memory. The key functions are CreateFileMapping and MapViewOfFileEx.
First create a file mapping object from an invalid handle value. Like POSIX, the word “file” is used without actually involving files.
size_t size = sizeof(uint32_t);
HANDLE h = CreateFileMapping(INVALID_HANDLE_VALUE,
NULL,
PAGE_READWRITE,
0, size,
NULL);
There’s no truncate step because the space is allocated at creation time via the two-part size argument.
Then, just like mmap:
uint32_t *a = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
uint32_t *b = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
CloseHandle(h);
If I wanted to choose the target address myself, I’d call MapViewOfFileEx instead, which takes the address as an additional argument.
From here on it’s the same as above.
Having some fun with this, I came up with a general API to allocate an aliased mapping at an arbitrary number of addresses.
int memory_alias_map(size_t size, size_t naddr, void **addrs);
void memory_alias_unmap(size_t size, size_t naddr, void **addrs);
Values in the address array must either be page-aligned or NULL to allow the operating system to choose, in which case the map address is written to the array.
It returns 0 on success. It may fail if the size is too small (0), too large, too many file descriptors, etc.
Pass the same pointers back to memory_alias_unmap
to free the
mappings. When called correctly it cannot fail, so there’s no return
value.
The full source is here: memalias.c
Starting with the simpler of the two functions, the POSIX implementation looks like so:
void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    for (size_t i = 0; i < naddr; i++)
        munmap(addrs[i], size);
}
The complex part is creating the mapping:
int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    char path[128];
    snprintf(path, sizeof(path), "/%s(%lu,%p)",
             __FUNCTION__, (long)getpid(), addrs);
    int fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(path);
    ftruncate(fd, size);
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = mmap(addrs[i], size,
                        PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, 0);
        if (addrs[i] == MAP_FAILED) {
            memory_alias_unmap(size, i, addrs);
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}
The shared object name includes the process ID and pointer array address, so there really shouldn’t be any non-malicious name collisions, even if called from multiple threads in the same process.
Otherwise it just walks the array setting up the mappings.
The Windows version is very similar.
void
memory_alias_unmap(size_t size, size_t naddr, void **addrs)
{
    (void)size;
    for (size_t i = 0; i < naddr; i++)
        UnmapViewOfFile(addrs[i]);
}
Since Windows tracks the size internally, it’s unneeded and ignored.
int
memory_alias_map(size_t size, size_t naddr, void **addrs)
{
    HANDLE m = CreateFileMapping(INVALID_HANDLE_VALUE,
                                 NULL,
                                 PAGE_READWRITE,
                                 0, size,
                                 NULL);
    if (m == NULL)
        return -1;
    DWORD access = FILE_MAP_ALL_ACCESS;
    for (size_t i = 0; i < naddr; i++) {
        addrs[i] = MapViewOfFileEx(m, access, 0, 0, size, addrs[i]);
        if (addrs[i] == NULL) {
            memory_alias_unmap(size, i, addrs);
            CloseHandle(m);
            return -1;
        }
    }
    CloseHandle(m);
    return 0;
}
In the future I’d like to find some unique applications of these multiple memory views.
If you want to see it all up front, here’s the full source: hotpatch.c
Here’s the function that I’m going to change:
void
hello(void)
{
    puts("hello");
}
It’s dead simple, but that’s just for demonstration purposes. This will work with any function of arbitrary complexity. The definition will be changed to this:
void
hello(void)
{
    static int x;
    printf("goodbye %d\n", x++);
}
I was only going to change the string, but I figured I should make it a little more interesting.
Here’s how it’s going to work: I’m going to overwrite the beginning of the function with an unconditional jump that immediately moves control to the new definition of the function. It’s vital that the function prototype does not change, since that would be a far more complex problem.
But first there’s some preparation to be done. The target needs to be augmented with some GCC function attributes to prepare it for its redefinition. As is, there are three possible problems that need to be dealt with:
The solution is the ms_hook_prologue
function attribute. This tells
GCC to put a hotpatch prologue on the function: a big, fat, 8-byte NOP
that I can safely clobber. This idea originated in Microsoft’s Win32
API, hence the “ms” in the name.
The solution is the aligned
function attribute, ensuring the
hotpatch prologue is properly aligned.
As you might have guessed, this is primarily fixed with the noinline function attribute. GCC may also clone the function and call that instead, so it also needs the noclone attribute.
Even further, if GCC determines there are no side effects, it may cache the return value and only ever call the function once. To convince GCC that there’s a side effect, I added an empty inline assembly string (__asm("")). Since puts() has a side effect (output), this isn’t truly necessary for this particular example, but I’m being thorough.
What does the function look like now?
__attribute__ ((ms_hook_prologue))
__attribute__ ((aligned(8)))
__attribute__ ((noinline))
__attribute__ ((noclone))
void
hello(void)
{
    __asm("");
    puts("hello");
}
And what does the assembly look like?
$ objdump -Mintel -d hotpatch
0000000000400848 <hello>:
400848: 48 8d a4 24 00 00 00 lea rsp,[rsp+0x0]
40084f: 00
400850: bf d4 09 40 00 mov edi,0x4009d4
400855: e9 06 fe ff ff jmp 400660 <puts@plt>
It’s 8-byte aligned and it has the 8-byte NOP: that lea
instruction
does nothing. It copies rsp
into itself and changes no flags. Why
not 8 1-byte NOPs? I need to replace exactly one instruction with
exactly one other instruction. I can’t have another thread in between
those NOPs.
Next, let’s take a look at the function that will perform the hotpatch. I’ve written a generic patching function for this purpose. This part is entirely specific to x86.
void
hotpatch(void *target, void *replacement)
{
    assert(((uintptr_t)target & 0x07) == 0); // 8-byte aligned?
    void *page = (void *)((uintptr_t)target & ~0xfff);
    mprotect(page, 4096, PROT_WRITE | PROT_EXEC);
    uint32_t rel = (char *)replacement - (char *)target - 5;
    union {
        uint8_t bytes[8];
        uint64_t value;
    } instruction = { {0xe9, rel >> 0, rel >> 8, rel >> 16, rel >> 24} };
    *(uint64_t *)target = instruction.value;
    mprotect(page, 4096, PROT_EXEC);
}
It takes the address of the function to be patched and the address of the function to replace it. As mentioned, the target must be 8-byte aligned (enforced by the assert). It’s also important this function is only called by one thread at a time, even on different targets. If that was a concern, I’d wrap it in a mutex to create a critical section.
There are a number of things going on here, so let’s go through them one at a time:
The .text segment will not be writeable by default. This is for both security and safety. Before I can hotpatch the function I need to make the function writeable. To make the function writeable, I need to make its page writeable. To make its page writeable I need to call mprotect(). If there was another thread monkeying with the page attributes of this page at the same time (another thread calling hotpatch()) I’d be in trouble.
It finds the page by rounding the target address down to the nearest 4096, the assumed page size (sorry hugepages). Warning: I’m being a bad programmer and not checking the result of mprotect(). If it fails, the program will crash and burn. It will always fail on systems with W^X enforcement, which will likely become the standard in the future. Under W^X (“write XOR execute”), memory can either be writeable or executable, but never both at the same time.
What if the function straddles pages? Well, I’m only patching the first 8 bytes, which, thanks to alignment, will sit entirely inside the page I just found. It’s not an issue.
At the end of the function, I mprotect()
the page back to
non-writeable.
I’m assuming the replacement function is within 2GB of the original in virtual memory, so I’ll use a 32-bit relative jmp instruction. There’s no 64-bit relative jump, and I only have 8 bytes to work with anyway. So I looked up the encoding in the Intel manual.
Fortunately it’s a really simple instruction. It’s opcode 0xE9 and it’s followed immediately by the 32-bit displacement. The instruction is 5 bytes wide.
To compute the relative jump, I take the difference between the functions, minus 5. Why the 5? The jump address is computed from the position after the jump instruction and, as I said, it’s 5 bytes wide.
I put 0xE9 in a byte array, followed by the little endian displacement. The astute may notice that the displacement is signed (it can go “up” or “down”) and I used an unsigned integer. That’s because it will overflow nicely to the right value and make those shifts clean.
Finally, the instruction byte array I just computed is written over the hotpatch NOP as a single, atomic, 64-bit store.
*(uint64_t *)target = instruction.value;
Other threads will see either the NOP or the jump, nothing in between. There’s no synchronization, so other threads may continue to execute the NOP for a brief moment even though I’ve clobbered it, but that’s fine.
Here’s what my test program looks like:
void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        hello();
        usleep(100000);
    }
    return NULL;
}

int
main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    getchar();
    hotpatch(hello, new_hello);
    pthread_join(thread, NULL);
    return 0;
}
I fire off the other thread to keep it pinging at hello(). The main thread waits until I hit enter to give the program input, after which it calls hotpatch() and changes the function called by the “worker” thread. I’ve now changed the behavior of the worker thread without its knowledge. In a more practical situation, this could be used to update parts of a running program without restarting or even synchronizing.
These related articles have been shared with me since publishing this article:
The Native API is a low-level API, a foundation for the implementation of the Windows API and various components that don’t use the Windows API (drivers, etc.). It includes a runtime library (RTL) suitable for replacing important parts of the C standard library, unavailable to freestanding programs. Very useful for a minimal program.
Unfortunately, using the Native API is a bit of a minefield. Not all of the documented Native API functions are actually exported by ntdll.dll, making them inaccessible both for linking and GetProcAddress(). Some are exported, but not documented as such. Others are documented as exported, but not documented as to when (which release of Windows). If a particular function wasn’t exported until Windows 8, I don’t want to use it when supporting Windows 7.
This is further complicated by the Microsoft Windows SDK, where many
of these functions are just macros that alias C runtime functions.
Naturally, MinGW closely follows suit. For example, in both cases,
here is how the Native API function RtlCopyMemory
is “declared.”
#define RtlCopyMemory(dest,src,n) memcpy((dest),(src),(n))
This is certainly not useful for freestanding programs, though it has
a significant benefit for hosted programs: The C compiler knows the
semantics of memcpy()
and can properly optimize around it. Any C
compiler worth its salt will replace a small or aligned, fixed-sized
memcpy()
or memmove()
with the equivalent inlined code. For
example:
char buffer0[16];
char buffer1[16];
// ...
memcpy(buffer0, buffer1, 16);
// ...
On x86_64 (GCC 4.9.3, -Os), this memcpy() call is replaced with two instructions. This isn’t possible when calling an opaque function in a non-standard dynamic library. The side effects could be anything.
movaps xmm0, [rsp + 48]
movaps [rsp + 32], xmm0
These Native API macro aliases are what have allowed certain Wine issues to slip by unnoticed for years. Very few user space applications actually call Native API functions, even when addressed directly by name in the source. The development suite is pulling a bait and switch.
Like last time I danced at the edge of the compiler, this has caused headaches in my recent experimentation with freestanding executables. The MinGW headers assume that the programs including them will link against a C runtime. Dirty hack warning: To work around it, I have to undo the definition in the MinGW headers and make my own. For example, to use the real RtlMoveMemory():
#include <windows.h>
#undef RtlMoveMemory
__declspec(dllimport)
void RtlMoveMemory(void *, const void *, size_t);
Anywhere where I might have previously used memmove() I can instead use RtlMoveMemory(). Or I could trivially supply my own wrapper:
void *
memmove(void *d, const void *s, size_t n)
{
    RtlMoveMemory(d, s, n);
    return d;
}
As of this writing, the same approach is not reliable with RtlCopyMemory(), the cousin to memcpy(). As far as I can tell, it was only exported starting in Windows 7 SP1 and Wine 1.7.46 (June 2015). Use RtlMoveMemory() instead. The overlap-handling overhead is negligible compared to the function call overhead anyway.
As a side note: one reason besides minimalism for not implementing your own memmove() is that it can’t be implemented efficiently in a conforming C program. According to the language specification, your implementation of memmove() would not be permitted to compare its pointer arguments with <, >, <=, or >=. That would lead to undefined behavior when pointing to unrelated objects (ISO/IEC 9899:2011 §6.5.8¶5). The simplest legal approach is to allocate a temporary buffer, copy the source buffer into it, then copy it into the destination buffer. However, buffer allocation may fail — i.e. NULL return from malloc() — introducing a failure case to memmove(), which isn’t supposed to fail.
Update July 2016: Alex Elsayed pointed out a solution to the memmove() problem in the comments. In short: iterate over the buffers bytewise (char *) using equality (==) tests to check for an overlap. In theory, a compiler could optimize away the loop and make it efficient.
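Here’s my reading of that suggestion as code — a sketch, not Alex’s exact proposal: probe for overlap using only equality comparisons, then pick a copy direction.

void *
memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t overlap = n;  // offset at which dest aliases src, if any
    for (size_t i = 1; i < n; i++) {
        if (s + i == (const unsigned char *)d) {
            overlap = i;  // dest starts inside src: must copy backward
            break;
        }
    }
    if (overlap == n) {
        for (size_t i = 0; i < n; i++)   // no harmful overlap: copy forward
            d[i] = s[i];
    } else {
        for (size_t i = n; i-- > 0; )    // copy backward to avoid clobbering
            d[i] = s[i];
    }
    return dest;
}

Whether a compiler actually turns that probe loop into something cheap is another question, which is the point about optimization above.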
I keep mentioning Wine because I’ve been careful to ensure my applications run correctly with it. So far it’s worked perfectly with both Windows API and Native API functions. Thanks to the hard work behind the Wine project, despite being written sharply against the Windows API, these tiny programs remain relatively portable (x86 and ARM). It’s a good fit for graphical applications (games), but I would never write a command line application like this. The command line has always been a second class citizen on Windows.
Mostly for my own future reference, here are export lists for two different versions of kernel32.dll and ntdll.dll:
As I collect more of these export lists, I’ll be able to paint a full
picture of when particular functions first appeared as exports. These
lists were generated with objdump -p <path_to_dll>
.
Now that I’ve got these Native API issues sorted out, I’ve
significantly expanded the capabilities of my tiny, freestanding
programs without adding anything to their size. Functions like
RtlUnicodeToUTF8N()
and RtlUTF8ToUnicodeN()
will surely be handy.
Recently I’ve been experimenting with freestanding C programs on
Windows. Freestanding refers to programs that don’t link, either
statically or dynamically, against a standard library (i.e. libc).
This is typical for operating systems and similar, bare metal
situations. Normally a C compiler can make assumptions about the
semantics of functions provided by the C standard library. For
example, the compiler will likely replace a call to a small,
fixed-size memmove()
with move instructions. Since a freestanding
program would supply its own, it may have different semantics.
My usual go-to for C/C++ on Windows is Mingw-w64, which has suited my needs well for the past couple of years. It’s packaged on Debian, and, when combined with Wine, allows me to fully develop Windows applications on Linux. Being GCC, it’s also great for cross-platform development since it’s essentially the same compiler as the other platforms. The primary difference is the interface to the operating system (POSIX vs. Win32).
However, it has one glaring flaw inherited from MinGW: it links against msvcrt.dll, an ancient version of the Microsoft C runtime library that currently ships with Windows. Besides being dated and quirky, it’s not an official part of Windows and never has been, despite its inclusion with every release since Windows 95. Mingw-w64 doesn’t have a C library of its own, instead patching over some of the flaws of msvcrt.dll and linking against it.
Since so much depends on msvcrt.dll despite its unofficial nature, it’s unlikely Microsoft will ever drop it from future releases of Windows. However, if strict correctness is a concern, we must ask Mingw-w64 not to link against it. An alternative would be PlibC, though the LGPL licensing is unfortunate. Another is Cygwin, which is a very complete POSIX environment, but is heavy and GPL-encumbered.
Sometimes I’d prefer to be more direct: skip the C standard library altogether and talk directly to the operating system. On Windows that’s the Win32 API. Ultimately I want a tiny, standalone .exe that only links against system DLLs.
The most important benefit of a standard library like libc is a portable, uniform interface to the host system. So long as the standard library suits its needs, the same program can run anywhere. Without it, the program needs an implementation of each host-specific interface.
On Linux, operating system requests at the lowest level are made
directly via system calls. This requires a bit of assembly language
for each supported architecture (int 0x80
on x86, syscall
on
x86-64, swi
on ARM, etc.). The POSIX functions of the various Linux
libc implementations are built on top of this mechanism.
For example, here’s a function for a 1-argument system call on x86-64.
long
syscall1(long n, long arg)
{
long result;
__asm__ volatile (
"syscall"
: "=a"(result)
: "a"(n), "D"(arg)
);
return result;
}
Then exit()
is implemented on top. Note: A real libc would do
cleanup before exiting, like calling registered atexit()
functions.
#include <syscall.h> // defines SYS_exit
void
exit(int code)
{
syscall1(SYS_exit, code);
}
The situation is simpler on Windows. Its low level system calls are
undocumented and unstable, changing across even minor updates. The
formal, stable interface is through the exported functions in
kernel32.dll. In fact, kernel32.dll is essentially a standard library
on its own (making the term “freestanding” in this case dubious). It
includes functions usually found only in user-space, like string
manipulation, formatted output, font handling, and heap management
(similar to malloc()
). It’s not POSIX, but it has analogs to much of
the same functionality.
The standard entry for a C program is main()
. However, this is not
the application’s true entry. The entry is in the C library, which
does some initialization before calling your main()
. When main()
returns, it performs cleanup and exits. Without a C library, programs
don’t start at main()
.
On Linux the default entry is the symbol _start
. Its prototype
would look like so:
void _start(void);
Returning from this function leads to a segmentation fault, so it’s up to your application to perform the exit system call rather than return.
On Windows, the entry depends on the type of application. The two
relevant subsystems today are the console and windows subsystems.
The former is for console applications (duh). These programs may still
create windows and such, but must always have a controlling console.
The latter is primarily for programs that don’t run in a console,
though they can still create an associated console if they like. In
Mingw-w64, give -mconsole
(default) or -mwindows
to the linker to
choose the subsystem.
The default entry for each is slightly different.
int WINAPI mainCRTStartup(void);
int WINAPI WinMainCRTStartup(void);
Unlike Linux’s _start
, Windows programs can safely return from these
functions, similar to main()
, hence the int
return. The WINAPI
macro means the function may have a special calling convention,
depending on the platform.
On any system, you can choose a different entry symbol or address
using the --entry
option to the GNU linker.
One problem I’ve run into is Mingw-w64 generating code that calls
__chkstk_ms()
from libgcc. I believe this is a long-standing bug,
since -ffreestanding
should prevent these sorts of helper functions
from being used. The workaround I’ve found is to disable the stack
probe and pre-commit the whole stack.
-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000
Alternatively you could link against libgcc (statically) with -lgcc
,
but, again, I’m going for a tiny executable.
Here’s an example of a Windows “Hello, World” that doesn’t use a C library.
#include <windows.h>
int WINAPI
mainCRTStartup(void)
{
char msg[] = "Hello, world!\n";
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
WriteFile(stdout, msg, sizeof(msg), (DWORD[]){0}, NULL);
return 0;
}
To build it:
x86_64-w64-mingw32-gcc -std=c99 -Wall -Wextra \
-nostdlib -ffreestanding -mconsole -Os \
-mno-stack-arg-probe -Xlinker --stack=0x100000,0x100000 \
-o example.exe example.c \
-lkernel32
Notice I manually linked against kernel32.dll. The stripped final result is only 4kB, mostly PE padding. There are techniques to trim this down even further, but for a substantial program it wouldn’t make a significant difference.
From here you could create a GUI by linking against user32.dll
and
gdi32.dll
(both also part of Win32) and calling the appropriate
functions. I already ported my OpenGL demo to a freestanding
.exe, dropping GLFW and directly using Win32 and WGL. It’s much less
portable, but the final .exe is only 4kB, down from the original 104kB
(static linking against GLFW).
I may go this route for the upcoming 7DRL 2016 in March.
]]>The build system I use most often is GNU Make, either directly or
indirectly (Autoconf, CMake). It’s far from perfect, but it does what
I need. I almost always invoke it from within Emacs rather than in a
terminal. In fact, I do it so often that I’ve wrapped Emacs’ compile
command for rapid invocation.
I recently helped a co-worker set this up for himself, so it had me thinking about the problem again. The situation in my config is much more complicated than it needs to be, so I’ll share a simplified version instead.
First bring in the usual goodies (we’re going to be making closures):
;;; -*- lexical-binding: t; -*-
(require 'cl-lib)
We need a couple of configuration variables.
(defvar quick-compile-command "make -k ")
(defvar quick-compile-build-file "Makefile")
Then a couple of interactive functions to set these on the fly. It’s
not strictly necessary, but I like giving each a key binding. I also
like having a history available via read-string
, so I can switch
between a couple of different options with ease.
(defun quick-compile-set-command (command)
(interactive
(list (read-string "Command: " quick-compile-command)))
(setf quick-compile-command command))
(defun quick-compile-set-build-file (build-file)
(interactive
(list (read-string "Build file: " quick-compile-build-file)))
(setf quick-compile-build-file build-file))
Now finally to the good part. Below, quick-compile
is a
non-interactive function that returns an interactive closure ready to
be bound to any key I desire. It takes an optional target. This means
I don’t use the above quick-compile-set-command
to choose a target,
only for setting other options. That will make more sense in a moment.
(cl-defun quick-compile (&optional (target ""))
"Return an interaction function that runs `compile' for TARGET."
(lambda ()
(interactive)
(save-buffer) ; so I don't get asked
(let ((default-directory
(locate-dominating-file
default-directory quick-compile-build-file)))
(if default-directory
(compile (concat quick-compile-command " " target))
(error "Cannot find %s" quick-compile-build-file)))))
It traverses up (down?) the directory hierarchy towards root looking
for a Makefile — or whatever is set for quick-compile-build-file
— then invokes the build system there. I don’t believe in recursive
make
.
So how do I put this to use? I clobber some key bindings I don’t otherwise care about. A better choice might be the F-keys, but my muscle memory is already committed elsewhere.
(global-set-key (kbd "C-x c") (quick-compile)) ; default target
(global-set-key (kbd "C-x C") (quick-compile "clean"))
(global-set-key (kbd "C-x t") (quick-compile "test"))
(global-set-key (kbd "C-x r") (quick-compile "run"))
Each of those invokes a different target without second guessing me. Let me tell you, having “clean” at the tip of my fingers is wonderful.
An extension common to many different make
programs is -j
, which
asks make
to build targets in parallel where possible. These days
where multi-core machines are the norm, you nearly always want to use
this option, ideally set to the number of logical processor cores on
your system. It’s a huge time-saver.
My recent revelation was that my default build command could be
better: make -k
is minimal. It should at least include -j
, but
choosing an argument (number of processor cores) is a problem. Today I
use different machines with 2, 4, or 8 cores, so most of the time any
given number will be wrong. I could use a per-system configuration,
but I’d rather not. Unfortunately GNU Make will not automatically
detect the number of cores. That leaves the matter up to Emacs Lisp.
Emacs doesn’t currently have a built-in function that returns the number of processor cores. I’ll need to reach into the operating system to figure it out. My usual development environments are Linux, Windows, and OpenBSD, so my solution should work on each. I’ve ranked them by order of importance.
Linux has the /proc
virtual filesystem in the fashion of Plan 9,
allowing different aspects of the system to be explored through the
standard filesystem API. The relevant file here is /proc/cpuinfo
,
listing useful information about each of the system’s processors. To
get the number of processors, count the number of processor entries in
this file. I’ve wrapped it in a file-exists-p check
so that it returns
nil
on other operating systems instead of throwing an error.
(when (file-exists-p "/proc/cpuinfo")
(with-temp-buffer
(insert-file-contents "/proc/cpuinfo")
(how-many "^processor[[:space:]]+:")))
When I was first researching how to do this on Windows, I thought I
would need to invoke the wmic
command line program and hope the
output could be parsed the same way on different versions of the
operating system and tool. However, it turns out the solution for
Windows is trivial. The environment variable NUMBER_OF_PROCESSORS
gives every process the answer for free. Being an environment
variable, it will need to be parsed.
(let ((number-of-processors (getenv "NUMBER_OF_PROCESSORS")))
(when number-of-processors
(string-to-number number-of-processors)))
This seems to work the same across all the BSDs, including OS X,
though I haven’t yet tested it exhaustively. Invoke sysctl
, which
returns an undecorated number to be parsed.
(with-temp-buffer
(ignore-errors
(when (zerop (call-process "sysctl" nil t nil "-n" "hw.ncpu"))
(string-to-number (buffer-string)))))
Also not complicated, but it’s the heaviest solution of the three.
Join all these together with or
, call it numcores
, and ta-da.
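Here's a sketch of what the combined function might look like: essentially the three probes above wrapped in or, plus a fallback of 1 (my addition) in case every probe fails.
(defun numcores ()
  "Return the number of logical processor cores on this machine."
  (or ;; Linux: count processor entries in /proc/cpuinfo
      (when (file-exists-p "/proc/cpuinfo")
        (with-temp-buffer
          (insert-file-contents "/proc/cpuinfo")
          (how-many "^processor[[:space:]]+:")))
      ;; Windows: the environment variable is free
      (let ((number-of-processors (getenv "NUMBER_OF_PROCESSORS")))
        (when number-of-processors
          (string-to-number number-of-processors)))
      ;; BSD and OS X: ask sysctl
      (with-temp-buffer
        (ignore-errors
          (when (zerop (call-process "sysctl" nil t nil "-n" "hw.ncpu"))
            (string-to-number (buffer-string)))))
      ;; Fallback when every probe fails
      1))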
(setf quick-compile-command (format "make -kj%d" (numcores)))
Now make
is invoked correctly on any system by default.
FILE
buffer. The program had been running
for two days straight printing its results, but the last few kilobytes
of output were missing. It wouldn’t output these last bytes until the
program completed its day-long (or worse!) cleanup operation and
exited. This is easy to fix — and, honestly, the cleanup step was
unnecessary anyway — but I didn’t want to start all over and wait
two more days to recompute the result.
Here’s a minimal example of the situation. The first loop represents the long-running computation and the infinite loop represents a cleanup job that will never complete.
#include <stdio.h>
int
main(void)
{
/* Compute output. */
for (int i = 0; i < 10; i++)
printf("%d/%d ", i, i * i);
putchar('\n');
/* "Slow" cleanup operation ... */
for (;;)
;
return 0;
}
Both printf
and putchar
are C library functions and are usually
buffered in some way. That is, each call to these functions doesn’t
necessarily send data out of the program. This is in contrast to the
POSIX functions read
and write
, which are unbuffered system calls.
Since system calls are relatively expensive, buffered input and output
is used to change a large number of system calls on small buffers into
a single system call on a single large buffer.
Typically, stdout is line-buffered if connected to a terminal. When the program completes a line of output, the user probably wants to see it immediately. So, if you compile the example program and run it at your terminal you will probably see the output before the program hangs on the infinite loop.
$ cc -std=c99 example.c
$ ./a.out
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81
However, when stdout is connected to a file or pipe, it’s generally
buffered to something like 4kB. For this program, the output will
remain empty no matter how long you wait. It’s trapped in a FILE
buffer in process memory.
$ ./a.out > output.txt
The primary way to fix this is to use the fflush
function, to force
the buffer empty before starting a long, non-output operation.
Unfortunately for me I didn’t think of this two days earlier.
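In the minimal example above, the fix would be one extra line before the cleanup loop (only the relevant portion shown):
    putchar('\n');
    fflush(stdout); /* force the buffered output out before the long cleanup */

    /* "Slow" cleanup operation ... */
    for (;;)
        ;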
Fortunately there is a way to interrupt a running program and
manipulate its state: a debugger. First, find the process ID of the
running program (the one writing to output.txt
above).
$ pgrep a.out
12934
Now attach GDB, which will pause the program’s execution.
$ gdb ./a.out
Reading symbols from ./a.out...(no debugging symbols found)...done.
gdb> attach 12934
Attaching to program: /tmp/a.out, process 12934
... snip ...
0x0000000000400598 in main ()
gdb>
From here I could examine the stdout FILE
struct and try to extract
the buffer contents by hand. However, the easiest thing to do is
perform the call I forgot in the first place: fflush(stdout)
.
gdb> call fflush(stdout)
$1 = 0
gdb> quit
Detaching from program: /tmp/a.out, process 12934
The program is still running, but the output has been recovered.
$ cat output.txt
0/0 1/1 2/4 3/9 4/16 5/25 6/36 7/49 8/64 9/81
As I said, in my case the cleanup operation was entirely unnecessary,
so it would be safe to just kill the program at this point. It was
taking a really long time to tear down a humongous data structure (on
the order of 50GB) one little node at a time with free
. Obviously,
the memory would be freed much more quickly by the OS when the program
exited.
Freeing memory in the program was only to satisfy Valgrind, since it’s so incredibly useful for debugging. Not freeing the data structure would hide actual memory leaks in Valgrind’s final report. For the real “production” run, I should have disabled cleanup.
]]>If you want to take a look at my code before reading further:
Having multiple CPU cores allows different instructions to operate on (usually) different data independently. In contrast, under SIMD a specific operation (single instruction) acts upon several values (multiple data) at once. It’s another form of parallelization. For example, with image processing — perhaps the most common use case — this means multiple pixels could be computed within the same number of cycles it would normally take to compute just one. SIMD is generally implemented on CPUs through wide registers: 64, 128, 256, and even 512 bits wide. Values are packed into the register like an array and are operated on independently, generally with saturation arithmetic (clamped, non-wrapping).
Rather than hand-code all this in assembly, I’m using yet another technique I picked up from the always-educational Handmade Hero: compiler intrinsics. The code is all C, but in place of C’s operators are pseudo-function calls operating on special SIMD types. These aren’t actual function calls, they’re intrinsics. The compiler will emit a specific assembly instruction for each intrinsic, sort of like an inline function. This is more flexible for mixing with other C code, the compiler will manage all the registers, and the compiler will attempt to re-order and interleave instructions to maximize throughput. It’s a big win!
The first widely available consumer SIMD hardware was probably the MMX instruction set, introduced to 32-bit x86 in 1997. It provided 8 64-bit registers, mm0 - mm7, aliasing the older x87 floating point registers, which operated on packed integer values. This was extended by AMD with its 3DNow! instruction set, adding floating point instructions.
However, you don’t need to worry about any of that because these both
were superseded by Streaming SIMD Extensions (SSE) in 1999. SSE has
128-bit registers — confusingly named xmm0
- xmm7
— and a much
richer instruction set. SSE has been extended with SSE2 (2001), SSE3
(2004), SSSE3 (2006), SSE4.1 (2007), and SSE4.2 (2008). x86-64 doesn’t
have SSE2 as an extension but instead as a core component of the
architecture (adding xmm8
- xmm15
), baking it into its ABI.
In 2009, ARM introduced the NEON instruction set as part of ARMv7. Like SSE, it has 128-bit registers, but its instruction set is more consistent and uniform. One of its most visible features over SSE is a stride load parameter, making it flexible for a wider variety of data arrangements. NEON is available on your Raspberry Pi, which is why I’m using it here.
In 2011, Intel and AMD introduced the Advanced Vector Extensions
(AVX) instruction set. Essentially it’s SSE with 256-bit registers,
named ymm0
- ymm15
. That means operating on 8 single-precision
floats at once! As of this writing, this extension is just starting
to become commonplace on desktops and laptops. It also has extensions:
AVX2 (2013) and AVX-512 (2015).
Moving on to the code, in mandel.c
you’ll find mandel_basic
, a
straight C implementation that produces a monochrome image. Normally I
would post the code here within the article, but it’s 30 lines long
and most of it isn’t of any particular interest.
I didn’t use C99’s complex number support because — continuing to follow the approach of Handmade Hero — I intended to port this code directly into SIMD intrinsics. It’s much easier to work from a straight non-SIMD implementation towards one with compiler intrinsics than coding with compiler intrinsics right away. In fact, I’d say it’s almost trivial, since I got it right on the first attempt for all three.
There’s just one unusual part:
#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < s->height; y++) {
/* ... */
}
This is an Open Multi-Processing (OpenMP) pragma. It’s a higher-level
threading API than POSIX or Win32 threads. OpenMP takes care of all
thread creation, work scheduling, and cleanup. In this case, the for
loop is parallelized such that each row of the image will be scheduled
individually to a thread, with one thread spawned for each CPU core.
This one line saves all the trouble of managing a work queue and such.
I also use it in my SIMD implementations, composing both forms of
parallelization for maximum performance.
I did it in single precision because I really want to exploit SIMD. Obviously, being half as wide as double precision, twice as many single precision operands can fit in a SIMD register.
On my wife’s i7-4770 (8 logical cores), it takes 29.9ms to render one image using the defaults (1440x1080, real{-2.5, 1.5}, imag{-1.5, 1.5}, 256 iterations). I’ll use the same machine for both the SSE2 and AVX benchmarks.
The first translation I did was SSE2 (mandel_sse2.c
). As with just
about any optimization, it’s more complex and harder to read than the
straight version. Again, I won’t post the code here, especially when
this one has doubled to 60 lines long.
Porting to SSE2 (and SIMD in general) is simply a matter of converting all assignments and arithmetic operators to their equivalent intrinsics. The Intel Intrinsics Guide is a godsend for this step. It’s easy to search for specific operations and it tells you what headers they come from. Notice that there are no C arithmetic operators until the very end, after the results have been extracted from SSE and pixels are being written.
There are two new types present in this version, __m128
and
__m128i
. These will be mapped to SSE registers by the compiler, sort
of like the old (outdated) C register
keyword. One big difference is
that it’s legal to take the address of these values with &
, and the
compiler will worry about the store/load operations. The first type is
for floating point values and the second is for integer values. At
first it’s annoying for these to be separate types (the CPU doesn’t
care), but it becomes a set of compiler-checked rails for avoiding
mistakes.
Here’s how assignment was written in the straight C version:
float iter_scale = 1.0f / s->iterations;
And here’s the SSE version. SSE intrinsics are prefixed with _mm
,
and the “ps” stands for “packed single-precision.”
__m128 iter_scale = _mm_set_ps1(1.0f / s->iterations);
This sets all four lanes of the register to the same value (a broadcast). Lanes can also be assigned individually, such as at the beginning of the innermost loop.
__m128 mx = _mm_set_ps(x + 3, x + 2, x + 1, x + 0);
This next part shows why the SSE2 version is longer. Here’s the straight C version:
float zr1 = zr * zr - zi * zi + cr;
float zi1 = zr * zi + zr * zi + ci;
zr = zr1;
zi = zi1;
To make it easier to read in the absence of operator syntax, I broke out the intermediate values. Here’s the same operation across four different complex values simultaneously. The purpose of these intrinsics should be easy to guess from their names.
__m128 zr2 = _mm_mul_ps(zr, zr);
__m128 zi2 = _mm_mul_ps(zi, zi);
__m128 zrzi = _mm_mul_ps(zr, zi);
zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);
zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci);
There are a bunch of swizzle instructions added in SSSE3 and beyond for re-arranging bytes within registers. With those I could eliminate that last bit of non-SIMD code at the end of the function for packing pixels. In an earlier version I used them, but since pixel packing isn’t a hot spot in this code (it’s outside the tight, innermost loop), it didn’t impact the final performance, so I took it out for the sake of simplicity.
The running time is now 8.56ms per image, a 3.5x speedup. That’s close to the theoretical 4x speedup from moving to 4-lane SIMD. That’s fast enough to render fullscreen at 60FPS.
With SSE2 explained, there’s not much to say about AVX
(mandel_avx.c
). The only difference is the use of __m256
,
__m256i
, the _mm256
intrinsic prefix, and that this operates on 8
points on the complex plane instead of 4.
It’s interesting that the AVX naming conventions are subtly improved over SSE. For example, here are the SSE broadcast intrinsics.
_mm_set1_epi8
_mm_set1_epi16
_mm_set1_epi32
_mm_set1_epi64x
_mm_set1_pd
_mm_set_ps1
Notice the oddball at the end? That’s discrimination against sufferers of obsessive-compulsive personality disorder. This was fixed in AVX’s broadcast intrinsics:
_mm256_set1_epi8
_mm256_set1_epi16
_mm256_set1_epi32
_mm256_set1_epi64x
_mm256_set1_pd
_mm256_set1_ps
The running time here is 5.20ms per image, a 1.6x speedup from SSE2. That’s not too far from the theoretical 2x speedup from using twice as many lanes. We can render at 60FPS and spend most of the time waiting around for the next vsync.
NEON is ARM’s take on SIMD. It’s what you’d find on your phone and tablet rather than desktop or laptop. NEON behaves much like a co-processor: NEON instructions are (cheaply) dispatched asynchronously to their own instruction pipeline, but transferring data back out of NEON is expensive and will stall the ARM pipeline until the NEON pipeline catches up.
Going beyond __m128
and __m256
, NEON intrinsics have a
type for each of the possible packings. On x86, the old stack-oriented
x87 floating-point instructions are replaced with SSE single-value
(“ss”, “sd”) instructions. On ARM, there’s no reason to use NEON to
operate on single values, so these “packings” don’t exist. Instead
there are half-wide packings. Note the lack of double-precision
support.
float32x2_t, float32x4_t
int16x4_t, int16x8_t
int32x2_t, int32x4_t
int64x1_t, int64x2_t
int8x16_t, int8x8_t
uint16x4_t, uint16x8_t
uint32x2_t, uint32x4_t
uint64x1_t, uint64x2_t
uint8x16_t, uint8x8_t
Again, the CPU doesn’t really care about any of these types. It’s all
to help the compiler help us. For example, we don’t want to multiply a
float32x4_t
and a float32x2_t
since it wouldn’t have a meaningful
result.
Otherwise everything is similar (mandel_neon.c
). NEON intrinsics are
(less-cautiously) prefixed with v
and suffixed with a type (_f32
,
_u32
, etc.).
The performance on my model Raspberry Pi 2 (900 MHz quad-core ARM Cortex-A7) is 545ms per frame without NEON and 232ms with NEON, a 2.3x speedup. This isn’t nearly as impressive as SSE2, also at 4 lanes. My implementation almost certainly needs more work, especially since I know less about ARM than x86.
For the x86 build, I wanted the same binary to have AVX, SSE2, and
plain C versions, selected by a command line switch and feature
availability, so that I could easily compare benchmarks. Without any
special options, gcc and clang will make conservative assumptions
about the CPU features of the target machine. In order to build using
AVX intrinsics, I need the compiler to assume the target has AVX. The
-mavx
argument does this.
mandel_avx.o : mandel_avx.c
$(CC) -c $(CFLAGS) -mavx -o $@ $^
mandel_sse2.o : mandel_sse2.c
$(CC) -c $(CFLAGS) -msse2 -o $@ $^
mandel_neon.o : mandel_neon.c
$(CC) -c $(CFLAGS) -mfpu=neon -o $@ $^
All x86-64 CPUs have SSE2, but I included the flag anyway for clarity. It also ensures SSE2 is enabled for 32-bit x86 builds.
It’s absolutely critical that each is done in a separate translation unit. Suppose I compiled like so in one big translation unit,
gcc -msse2 -mavx mandel.c mandel_sse2.c mandel_avx.c
The compiler will likely use some AVX instructions outside of the explicit intrinsics, meaning it’s going to crash on a machine without AVX (“illegal instruction”). The main program needs to be compiled with AVX disabled. That’s where it will test for AVX before executing any special instructions.
Intrinsics are well-supported across different compilers (surprisingly,
even including the late-to-the-party Microsoft). Unfortunately testing
for CPU features differs across compilers. Intel advertises a
_may_i_use_cpu_feature
intrinsic, but it’s not supported in either
gcc or clang. gcc has a __builtin_cpu_supports
built-in, but it’s
only supported by gcc.
The most portable solution I came up with is cpuid.h
(x86 specific).
It’s supported by at least gcc and clang. The clang version of the
header is much better documented, so if you want to read up on how
this works, read that one.
#include <cpuid.h>
static inline int
is_avx_supported(void)
{
unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
__get_cpuid(1, &eax, &ebx, &ecx, &edx);
return ecx & bit_AVX ? 1 : 0;
}
And in use:
if (use_avx && is_avx_supported())
mandel_avx(image, &spec);
else if (use_sse2)
mandel_sse2(image, &spec);
else
mandel_basic(image, &spec);
I don’t know how to test for NEON, nor do I have the necessary hardware to test it, so on ARM I just assume it’s always available.
Using SIMD intrinsics for the Mandelbrot set was just an exercise to learn how to use them. Unlike in Handmade Hero, where it makes a 1080p 60FPS software renderer feasible, I don’t have an immediate, practical use for CPU SIMD, but, like so many similar techniques, I like having it ready in my toolbelt for the next time an opportunity arises.
]]>However, since much of the OpenGL-related content to be found online, even today, is outdated — and, worse, it’s not marked as such — good, modern core profile examples have been hard to come by. The relevant examples I could find at the time were more complicated than necessary, due to the common problem that full 3D graphics are too closely conflated with OpenGL. The examples would include matrix libraries, texture loading, etc. This is a big reason I ended up settling on WebGL: a clean slate in a completely different community. (The good news is that this situation has already improved dramatically over the last few years!)
Until recently, all of my OpenGL experience had been WebGL. Wanting to break out of that, earlier this year I set up a minimal OpenGL 3.3 core profile demo in C, using GLFW and gl3w. You can find it here:
No 3D graphics, no matrix library, no textures. It’s just a spinning red square.
It supports both Linux and Windows. The Windows build is static, so it compiles to a single, easily distributable, standalone binary. With some minor tweaking it would probably support the BSDs as well. For simplicity’s sake, the shaders are baked right into the source as strings, but if you’re extending the demo for your own use, you may want to move them out into their own source files.
I chose OpenGL 3.3 in particular for a few reasons:
It supports the GLSL layout keyword.
It can run entirely in software with LIBGL_ALWAYS_SOFTWARE=1. (The software renderer will take advantage of your CPU’s SIMD features.)
As far as “desktop” OpenGL goes, 3.3 is currently the prime target.
Until EGL someday fills this role, the process for obtaining an OpenGL context is specific to each operating system, where it’s generally a pain in the butt. GLUT, the OpenGL Utility Toolkit, was a library to make this process uniform across the different platforms. It also normalized user input (keyboard and mouse) and provided some basic (and outdated) utility functions.
The original GLUT isn’t quite open source (licensing issues) and it’s no longer maintained. The open source replacement for GLUT is FreeGLUT. It’s what you’d typically find on a Linux system in place of the original GLUT.
I just need a portable library that creates a window, handles keyboard and mouse events in that window, and gives me an OpenGL 3.3 core profile context. FreeGLUT does this well, but we can do better. One problem is that it includes a whole bunch of legacy cruft from GLUT: immediate mode rendering utilities, menus, spaceball support, lots of global state, and only one OpenGL context per process.
One of the biggest problems is that FreeGLUT doesn’t have a swap interval function. This is used to lock the application’s redraw rate to the system’s screen refresh rate, preventing screen tearing and excessive resource consumption. I originally used FreeGLUT for the demo and, as a workaround, found the system’s swap interval function myself, but it was a total hack.
The demo was initially written with FreeGLUT, but I switched over to GLFW since it’s smaller, simpler, cleaner, and more modern. GLFW also has portable joystick handling. With the plethora of modern context+window creation libraries out there, it seems there’s not much reason to use FreeGLUT anymore.
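For a sense of what this looks like, here’s a minimal sketch of requesting a 3.3 core profile context with GLFW. It’s not the demo’s exact code; the window size and title are placeholders.
#include <GLFW/glfw3.h>

int main(void)
{
    if (!glfwInit())
        return 1;
    /* Ask for a 3.3 core profile context. */
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    GLFWwindow *window = glfwCreateWindow(640, 640, "demo", NULL, NULL);
    if (!window) {
        glfwTerminate();
        return 1;
    }
    glfwMakeContextCurrent(window);
    glfwSwapInterval(1); /* the swap interval control FreeGLUT lacks */
    while (!glfwWindowShouldClose(window)) {
        /* ... render ... */
        glfwSwapBuffers(window);
        glfwPollEvents();
    }
    glfwTerminate();
    return 0;
}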
SDL 2.0 would also be an excellent choice. It goes beyond GLFW with threading, audio, networking, image loading, and timers: basically all the stuff you’d need when writing a game.
I’m sure there are some other good alternatives, especially when you’re not sticking to plain C, but these are the libraries I’m familiar with at the time of this article.
If you didn’t think the interface between OpenGL and the operating system was messy enough, I have good news for you. Neither the operating system nor the video card drivers are going to provide any of the correct headers, nor will you have anything meaningful to link against! For these, you’re on your own.
The OpenGL Extension Wrangler Library (GLEW) was invented to solve this problem. It dynamically loads the system’s OpenGL libraries and finds all the relevant functions at run time. That way your application avoids linking to anything too specific. At compile time, it provides the headers defining all of the OpenGL functions.
Over the years, GLEW has become outdated, to this day having no support for core profile. So instead I used a replacement called gl3w. It’s just like GLEW, but, as the name suggests, oriented around core profile … exactly what I needed. Unlike GLEW, it is generated directly from Khronos’ documentation by a script. In practice, you drop the generated code directly into your project (embedded) rather than rely on the system to provide it as a library.
A great (and probably better) alternative to gl3w is glLoadgen. It’s the same idea — an automatically generated OpenGL loader — but allows for full customization of the output, such as the inclusion of select OpenGL extensions.
While I hope it serves as an educational resource for others, I primarily have it for my own record-keeping, pedagogical, and reference purposes, born out of a weekend’s worth of research. It’s a starting point for future projects, and it’s somewhere easy to start when I want to experiment with an idea.
Plus, someday I want to write a sweet, standalone game with fancy OpenGL graphics.
]]>This article has a followup.
Linux has an elegant and beautiful design when it comes to threads: threads are nothing more than processes that share a virtual address space and file descriptor table. Threads spawned by a process are additional child processes of the main “thread’s” parent process. They’re manipulated through the same process management system calls, eliminating the need for a separate set of thread-related system calls. It’s elegant in the same way file descriptors are elegant.
Normally on Unix-like systems, processes are created with fork(). The new process gets its own address space and file descriptor table that starts as a copy of the original. (Linux uses copy-on-write to do this part efficiently.) However, this is too high level for creating threads, so Linux has a separate clone() system call. It works just like fork() except that it accepts a number of flags to adjust its behavior, primarily to share parts of the parent’s execution context with the child.
It’s so simple that it takes less than 15 instructions to spawn a thread with its own stack, no libraries needed, and no need to call Pthreads! In this article I’ll demonstrate how to do this on x86-64. All of the code will be written in NASM syntax since, IMHO, it’s by far the best (see: nasm-mode).
I’ve put the complete demo here if you want to see it all at once:
I want you to be able to follow along even if you aren’t familiar with x86_64 assembly, so here’s a short primer of the relevant pieces. If you already know x86-64 assembly, feel free to skip to the next section.
x86-64 has 16 64-bit general purpose registers, primarily used to manipulate integers, including memory addresses. There are many more registers than this with more specific purposes, but we won’t need them for threading.
rsp: stack pointer
rbp: “base” pointer (still used in debugging and profiling)
rax, rbx, rcx, rdx: general purpose (notice: a, b, c, d)
rdi, rsi: “destination” and “source”, now meaningless names
r8, r9, r10, r11, r12, r13, r14, r15: added for x86-64
The “r” prefix indicates that they’re 64-bit registers. It won’t be relevant in this article, but the same name prefixed with “e” indicates the lower 32-bits of these same registers, and no prefix indicates the lowest 16 bits. This is because x86 was originally a 16-bit architecture, extended to 32-bits, then to 64-bits. Historically each of these registers had a specific, unique purpose, but on x86-64 they’re almost completely interchangeable.
There’s also a “rip” instruction pointer register that conceptually walks along the machine instructions as they’re being executed, but, unlike the other registers, it can only be manipulated indirectly. Remember that data and code live in the same address space, so rip is not much different than any other data pointer.
The rsp register points to the “top” of the call stack. The stack keeps track of who called the current function, in addition to local variables and other function state (a stack frame). I put “top” in quotes because the stack actually grows downward on x86 towards lower addresses, so the stack pointer points to the lowest address on the stack. This piece of information is critical when talking about threads, since we’ll be allocating our own stacks.
The stack is also sometimes used to pass arguments to another function. This happens much less frequently on x86-64, especially with the System V ABI used by Linux, where the first 6 arguments are passed via registers. The return value is passed back via rax. When calling another function, integer/pointer arguments are passed in these registers in this order:
rdi, rsi, rdx, rcx, r8, r9
So, for example, to perform a function call like foo(1, 2, 3)
, store
1, 2 and 3 in rdi, rsi, and rdx, then call
the function. The mov
instruction stores the source (second) operand in its destination
(first) operand. The call
instruction pushes the current value of
rip onto the stack, then sets rip (jumps) to the address of the
target function. When the callee is ready to return, it uses the ret
instruction to pop the original rip value off the stack and back
into rip, returning control to the caller.
mov rdi, 1
mov rsi, 2
mov rdx, 3
call foo
Called functions must preserve the contents of these registers (the same value must be stored when the function returns):
rbx, rsp, rbp, r12, r13, r14, r15
When making a system call, the argument registers are slightly different. Notice rcx has been changed to r10.
rdi, rsi, rdx, r10, r8, r9
Each system call has an integer identifying it. This number is
different on each platform, but, in Linux’s case, it will never
change. Instead of call
, rax is set to the number of the
desired system call and the syscall
instruction makes the request to
the OS kernel. Prior to x86-64, this was done with an old-fashioned
interrupt. Because interrupts are slow, a special,
statically-positioned “vsyscall” page (now deprecated as a security
hazard), later vDSO, is provided to allow certain system
calls to be made as function calls. We’ll only need the syscall
instruction in this article.
So, for example, the write() system call has this C prototype.
ssize_t write(int fd, const void *buf, size_t count);
On x86-64, the write() system call is at the top of the system call
table as call 1 (read() is 0). Standard output is file
descriptor 1 by default (standard input is 0). The following bit of
code will write 10 bytes of data from the memory address buffer
(a
symbol defined elsewhere in the assembly program) to standard output.
The number of bytes written, or -1 for error, will be returned in rax.
mov rdi, 1 ; fd
mov rsi, buffer
mov rdx, 10 ; 10 bytes
mov rax, 1 ; SYS_write
syscall
There’s one last thing you need to know: registers often hold a memory
address (i.e. a pointer), and you need a way to read the data behind
that address. In NASM syntax, wrap the register in brackets (e.g.
[rax]
), which, if you’re familiar with C, would be the same as
dereferencing the pointer.
These bracket expressions, called an effective address, may be
limited mathematical expressions to offset that base address
entirely within a single instruction. This expression can include
another register (index), a power-of-two scalar (bit shift), and
an immediate signed offset. For example, [rax + rdx*8 + 12]
. If
rax is a pointer to a struct, and rdx is an array index to an element
in array on that struct, only a single instruction is needed to read
that element. NASM is smart enough to allow the assembly programmer to
break this mold a little bit with more complex expressions, so long as
it can reduce it to the [base + index*2^exp + offset]
form.
The details of addressing aren’t important for this article, so don’t worry too much about it if that didn’t make sense.
Threads share everything except for registers, a stack, and thread-local storage (TLS). The OS and underlying hardware will automatically ensure that registers are per-thread. Since it’s not essential, I won’t cover thread-local storage in this article. In practice, the stack is often used for thread-local data anyway. That leaves the stack, and before we can spawn a new thread, we need to allocate a stack, which is nothing more than a memory buffer.
The trivial way to do this would be to reserve some fixed .bss (zero-initialized) storage for threads in the executable itself, but I want to do it the Right Way and allocate the stack dynamically, just as Pthreads, or any other threading library, would. Otherwise the application would be limited to a compile-time fixed number of threads.
You can’t just read from and write to arbitrary addresses in virtual memory, you first have to ask the kernel to allocate pages. There are two system calls on Linux to do this:
brk(): Extends (or shrinks) the heap of a running process, typically located somewhere shortly after the .bss segment. Many allocators will do this for small or initial allocations. This is a less optimal choice for thread stacks because the stacks will be very near other important data, near other stacks, and lack a guard page (by default). It would be somewhat easier for an attacker to exploit a buffer overflow. A guard page is a locked-down page just past the absolute end of the stack that will trigger a segmentation fault on a stack overflow, rather than allow a stack overflow to trash other memory undetected. A guard page could still be created manually with mprotect(). There’s also no room for these stacks to grow.
mmap(): Use an anonymous mapping to allocate a contiguous set of pages at some randomized memory location. As we’ll see, you can even tell the kernel specifically that you’re going to use this memory as a stack. Also, this is simpler than using brk() anyway.
On x86-64, mmap() is system call 9. I’ll define a function to allocate a stack with this C prototype.
void *stack_create(void);
The mmap() system call takes 6 arguments, but when creating an anonymous memory map the last two arguments are ignored. For our purposes, it looks like this C prototype.
void *mmap(void *addr, size_t length, int prot, int flags);
For flags
, we’ll choose a private, anonymous mapping that, being a
stack, grows downward. Even with that last flag, the system call will
still return the bottom address of the mapping, which will be
important to remember later. It’s just a simple matter of setting the
arguments in the registers and making the system call.
%define SYS_mmap 9
%define STACK_SIZE (4096 * 1024) ; 4 MB
stack_create:
mov rdi, 0
mov rsi, STACK_SIZE
mov rdx, PROT_WRITE | PROT_READ
mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
mov rax, SYS_mmap
syscall
ret
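The remaining constants aren’t defined in this excerpt. Assuming the usual x86-64 Linux values from <sys/mman.h>, they could be filled in like so (double-check against your own headers):
%define PROT_READ 0x01
%define PROT_WRITE 0x02
%define MAP_PRIVATE 0x02
%define MAP_ANONYMOUS 0x20
%define MAP_GROWSDOWN 0x0100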
Now we can allocate new stacks (or stack-sized buffers) as needed.
Spawning a thread is so simple that it doesn’t even require a branch instruction! It’s a call to clone() with two arguments: clone flags and a pointer to the new thread’s stack. It’s important to note that, as in many cases, the glibc wrapper function has the arguments in a different order than the system call. With the set of flags we’re using, it takes two arguments.
long sys_clone(unsigned long flags, void *child_stack);
Our thread spawning function will have this C prototype. It takes a function as its argument and starts the thread running that function.
long thread_create(void (*)(void));
The function pointer argument is passed via rdi, per the ABI. Store
this for safekeeping on the stack (push
) in preparation for calling
stack_create(). When it returns, the address of the low end of the stack
will be in rax.
thread_create:
push rdi
call stack_create
lea rsi, [rax + STACK_SIZE - 8]
pop qword [rsi]
mov rdi, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
CLONE_PARENT | CLONE_THREAD | CLONE_IO
mov rax, SYS_clone
syscall
ret
The second argument to clone() is a pointer to the high address of
the stack (specifically, just above the stack). So we need to add
STACK_SIZE
to rax to get the high end. This is done with the lea
instruction: load effective address. Despite the brackets,
it doesn’t actually read memory at that address, but instead stores
the address in the destination register (rsi). I’ve moved it back by 8
bytes because I’m going to place the thread function pointer at the
“top” of the new stack in the next instruction. You’ll see why in a
moment.
Remember that the function pointer was pushed onto the stack for safekeeping. This is popped off the current stack and written to that reserved space on the new stack.
As you can see, it takes a lot of flags to create a thread with clone(). Most things aren’t shared with the callee by default, so lots of options need to be enabled. See the clone(2) man page for full details on these flags.
CLONE_THREAD: Put the new process in the same thread group.
CLONE_VM: Runs in the same virtual memory space.
CLONE_PARENT: Share a parent with the callee.
CLONE_SIGHAND: Share signal handlers.
CLONE_FS, CLONE_FILES, CLONE_IO: Share filesystem information.
A new thread will be created and the syscall will return in each of the two threads at the same instruction, exactly like fork(). All registers will be identical between the threads, except for rax, which will be 0 in the new thread, and rsp which has the same value as rsi in the new thread (the pointer to the new stack).
Now here’s the really cool part, and the reason branching isn’t
needed. There’s no reason to check rax to determine if we are the
original thread (in which case we return to the caller) or if we’re
the new thread (in which case we jump to the thread function).
Remember how we seeded the new stack with the thread function? When
the new thread returns (ret
), it will jump to the thread function
with a completely empty stack. The original thread, using the original
stack, will return to the caller.
The value returned by thread_create() is the process ID of the new
thread, which is essentially the thread object (e.g. Pthread’s
pthread_t
).
The thread function has to be careful not to return (ret
) since
there’s nowhere to return. It will fall off the stack and terminate
the program with a segmentation fault. Remember that threads are just
processes? It must use the exit() syscall to terminate. This won’t
terminate the other threads.
%define SYS_exit 60
exit:
mov rax, SYS_exit
syscall
Before exiting, it should free its stack with the munmap() system call, so that no resources are leaked by the terminated thread. The equivalent of pthread_join() by the main parent would be to use the wait4() system call on the thread process.
If you found this interesting, be sure to check out the full demo link
at the top of this article. Now with the ability to spawn threads,
it’s a great opportunity to explore and experiment with x86’s
synchronization primitives, such as the lock
instruction prefix,
xadd
, and compare-and-exchange (cmpxchg
). I’ll discuss
these in a future article.
Monday’s /r/dailyprogrammer challenge was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(u(0)
) and a sequence of operations, f
, to apply to the previous
term (u(n + 1) = f(u(n))
) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.
For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5
would be
input as +2 *3 -5
. If u(0) = 0
then,
u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.
Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).
Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.
The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).
Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, NULL
points to a special page with read, write, and execute disabled.
Memory returned by malloc()
and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through malloc()
, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an asmbuf
struct.
#define PAGE_SIZE 4096
struct asmbuf {
uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
uint64_t count;
};
To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use sysconf(_SC_PAGESIZE)
to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.
Instead of malloc()
, the compiler allocates memory as an anonymous
memory map (mmap()
). It’s anonymous because it’s not backed by a
file.
struct asmbuf *
asmbuf_create(void)
{
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}
Windows doesn’t have POSIX mmap()
, so on that platform we use
VirtualAlloc()
instead. Here’s the equivalent in Win32.
struct asmbuf *
asmbuf_create(void)
{
DWORD type = MEM_RESERVE | MEM_COMMIT;
return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}
Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.
The POSIX mprotect()
function is used to change memory protections.
void
asmbuf_finalize(struct asmbuf *buf)
{
mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}
Or on Win32 (that last parameter is not allowed to be NULL
),
void
asmbuf_finalize(struct asmbuf *buf)
{
DWORD old;
VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}
Finally, instead of free()
it gets unmapped.
void
asmbuf_free(struct asmbuf *buf)
{
munmap(buf, PAGE_SIZE);
}
And on Win32,
void
asmbuf_free(struct asmbuf *buf)
{
VirtualFree(buf, 0, MEM_RELEASE);
}
I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.
asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
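For a rough idea of their shape, here’s a sketch of plausible definitions. This is my guess, not necessarily the program’s exact code, and it assumes <string.h> for memcpy.
#include <string.h>

static void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    /* Emit the instruction bytes most significant first, so the hex
     * literals passed in below read left to right. */
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

static void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    /* Copy an immediate operand verbatim (little endian on x86). */
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}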
We’re only going to be concerned with three of x86-64’s many
registers: rdi
, rax
, and rdx
. These are 64-bit (r
) extensions
of the original 16-bit 8086 registers. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.
long recurrence(long);
The System V AMD64 ABI calling convention says that the first
integer/pointer function argument is passed in the rdi
register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in rax
when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy rdi
to rax
.
mov rax, rdi
There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in asmbuf
. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in rcx
rather than rdi
. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.
The very last thing it will do, assuming the result is in rax
, is
return to the caller.
ret
So we know the assembly, but what do we pass to asmbuf_ins()
? This
is where we get our hands dirty.
If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.
Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file peek.s
and hand it to nasm
. It will produce a raw binary
with the machine code, which we’ll disassemble with ndisasm
(the
NASM disassembler).
$ nasm peek.s
$ ndisasm -b64 peek
00000000 4889F8 mov rax,rdi
00000003 C3 ret
That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.
asmbuf_ins(buf, 3, 0x4889f8); // mov rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3); // ret
For each operation, we’ll set it up so the operand will already be
loaded into rdi
regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32-bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use 0x0123456789abcdef
as the
operand.
mov rdi, 0x0123456789abcdef
Which disassembled with ndisasm
is,
00000000 48BFEFCDAB896745 mov rdi,0x123456789abcdef
-2301
Notice the operand listed little endian immediately after the instruction. That’s also easy!
long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf); // mov rdi, operand
asmbuf_immediate(buf, 8, &operand);
Apply the same discovery process individually for each operator you
want to support, accumulating the result in rax
for each.
switch (operator) {
case '+':
asmbuf_ins(buf, 3, 0x4801f8); // add rax, rdi
break;
case '-':
asmbuf_ins(buf, 3, 0x4829f8); // sub rax, rdi
break;
case '*':
asmbuf_ins(buf, 4, 0x480fafc7); // imul rax, rdi
break;
case '/':
asmbuf_ins(buf, 3, 0x4831d2); // xor rdx, rdx
asmbuf_ins(buf, 3, 0x48f7ff); // idiv rdi
break;
}
As an exercise, try adding support for modulus operator (%
), XOR
(^
), and bit shifts (<
, >
). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the closed form solution to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.
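If you try the exercise, the same nasm/ndisasm discovery process should give you something like the following for two of the operators. These are my own workings, so verify them against your assembler before trusting them.
case '%':
    asmbuf_ins(buf, 3, 0x4831d2); // xor rdx, rdx
    asmbuf_ins(buf, 3, 0x48f7ff); // idiv rdi
    asmbuf_ins(buf, 3, 0x4889d0); // mov rax, rdx (remainder)
    break;
case '^':
    asmbuf_ins(buf, 3, 0x4831f8); // xor rax, rdi
    break;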
Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a void *
just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)
asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);
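I'm not showing the real asmbuf_finalize() here either. Assuming the buffer lives in an anonymous mmap() page, finalizing could be as simple as flipping the page from writable to executable — a POSIX-only sketch with error checking omitted:
#include <sys/mman.h>
struct asmbuf *asmbuf_create(void)
{
    /* One page, initially writable but not executable. */
    return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
}
void asmbuf_finalize(struct asmbuf *buf)
{
    /* W^X: drop write permission, add execute. */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);
}
void asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, 4096);
}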
That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.
I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.
telnet gcom.nullprogram.com
As with previous years, the ideas behind the game are not all that original. The goal was to be a fantasy version of classic X-COM with an ANSI terminal interface. You are the ruler of a fledgling human nation that is under attack by invading goblins. You hire heroes, operate squads, construct buildings, and manage resource income.
The inspiration this year came from watching BattleBunny play OpenXCOM, an open source clone of the original X-COM. It had its major 1.0 release last year. Like the early days of OpenTTD, it currently depends on the original game assets. But also like OpenTTD, it surpasses the original game in every way, so there’s no reason to bother running the original anymore. I’ve also recently been watching One F Jef play Silent Storm, which is another turn-based squad game with a similar combat simulation.
As in X-COM, the game is broken into two modes of play: the geoscape (strategic) and the battlescape (tactical). Unfortunately I ran out of time and didn’t get to the battlescape part, though I’d like to add it in the future. What’s left is a sort-of city-builder with some squad management. You can hire heroes and send them out in squads to eliminate goblins, but rather than dropping to the battlescape, battles always auto-resolve in your favor. Despite this, the game still has a story, a win state, and a lose state. I won’t say what they are, so you have to play it for yourself!
My previous entries were HTML5 games, but this entry is a plain old standalone application. C has been my preferred language for the past few months, so that’s what I used. Both UTF-8-capable ANSI terminals and the Windows console are supported, so it should be perfectly playable on any modern machine. Note, though, that some of the poorer-quality terminal emulators that you’ll find in your Linux distribution’s repositories (rxvt and its derivatives) are not Unicode-capable, which means they won’t work with G-COM.
I didn’t make use of ncurses, instead opting to write my own terminal graphics engine. That’s because I wanted a single, small binary that was easy to build, and I didn’t want to mess around with PDCurses. I’ve also been studying the Win32 API lately, so writing my own terminal platform layer would be rather easy to do anyway.
I experimented with a number of terminal emulators — LXTerminal, Konsole, GNOME/MATE terminal, PuTTY, xterm, mintty, Terminator — but the least capable “terminal” by far is the Windows console, so it was the one to dictate the capabilities of the graphics engine. Some ANSI terminals are capable of 256 colors, bold, underline, and strikethrough fonts, but a highly portable API is basically limited to 16 colors (RGBCMYKW with two levels of intensity) for each of the foreground and background, and no other special text properties.
ANSI terminals also have a concept of a default foreground color and a default background color. Most applications that output color (git, grep, ls) leave the background color alone and are careful to choose neutral foreground colors. G-COM always sets the background color, so that the game looks the same no matter what the default colors are. Also, the Windows console doesn’t really have default colors anyway, even if I wanted to use them.
I put in partial support for Unicode because I wanted to use
interesting characters in the game (≈, ♣, ∩, ▲). Windows has supported
Unicode for a long time now, but since they added it too early,
they’re locked into the outdated UTF-16. For me this wasn’t
too bad, because few computers, Linux included, are equipped to render
characters outside of the Basic Multilingual Plane anyway, so
there’s no need to deal with surrogate pairs. This is especially true
for the Windows console, which can only render a very small set of
characters: another limit on my graphics engine. Internally individual
codepoints are handled as uint16_t
and strings are handled as UTF-8.
I said partial support because, in addition to the above, it has no support for combining characters, or any other situation where a codepoint takes up something other than one space in the terminal. This requires lookup tables and dealing with pitfalls, but since I get to control exactly which characters are used, I didn’t need any of that.
In spite of the limitations, I’m really happy with the graphical results. The waves are animated continuously, even while the game is paused, and it looks great. Here’s GNOME Terminal’s rendering, which I think looked the best by default.
I’ll talk about how G-COM actually communicates with the terminal in
another article. The interface between the game and the graphics
engine is really clean (device.h
), so it would be an interesting
project to write a back end that renders the game to a regular window,
no terminal needed.
I came up with a format directive to help me colorize everything. It
runs in addition to the standard printf
directives. Here’s an example,
panel_printf(&panel, 1, 1, "Really save and quit? (Rk{y}/Rk{n})");
The color is specified by two characters, and the text it applies to
is wrapped in curly brackets. There are eight colors to pick from:
RGBCMYKW. That covers all the binary values for red, green, and blue.
To specify an “intense” (bright) color, capitalize it. That means the
Rk{...}
above makes the wrapped text bright red.
Nested directives are also supported. (And, yes, that K
means “high
intense black,” a.k.a. dark gray. A w
means “low intensity white,”
a.k.a. light gray.)
panel_printf(p, x, y++, "Kk{♦} wk{Rk{B}uild} Kk{♦}");
And it mixes with the normal printf
directives:
panel_printf(p, 1, y++, "(Rk{m}) Yk{Mine} [%s]", cost);
The GNU linker has a really nice feature for linking arbitrary binary
data into your application. I used this to embed my assets into a
single binary so that the user doesn’t need to worry about any sort of
data directory or anything like that. Here’s what the make
rule
would look like:
$(LD) -r -b binary -o $@ $^
The -r
specifies that output should be relocatable — i.e. it can be
fed back into the linker later when linking the final binary. The -b
binary
says that the input is just an opaque binary file (“plain”
text included). The linker will create three symbols for each input
file:
_binary_filename_start
_binary_filename_end
_binary_filename_size
You can then access these from your C program like so:
extern const char _binary_filename_txt_start[];
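For instance, for a hypothetical story.txt linked this way, the generated symbols and one possible use look like:
/* Produced by: ld -r -b binary -o story.o story.txt */
#include <stdio.h>
extern const char _binary_story_txt_start[];
extern const char _binary_story_txt_end[];
void print_story(void)
{
    size_t size = _binary_story_txt_end - _binary_story_txt_start;
    fwrite(_binary_story_txt_start, 1, size, stdout);
}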
I used this to embed the story texts, and I’ve used it in the past to embed images and textures. If you were to link zlib, you could easily compress these assets, too. I’m surprised this sort of thing isn’t done more often!
To save time, and because it doesn’t really matter, saves are just
memory dumps. I took another page from Handmade Hero and
allocate everything in a single, contiguous block of memory. With one
exception, there are no pointers, so the entire block is relocatable.
When references are needed, they’re stored as integer indexes into the embedded
arrays. This allows it to be cleanly reloaded in another process
later. As a side effect, it also means there are no dynamic
allocations (malloc()
) while the game is running. Here’s roughly
what it looks like.
typedef struct game {
uint64_t map_seed;
map_t *map;
long time;
float wood, gold, food;
long population;
float goblin_spawn_rate;
invader_t invaders[16];
squad_t squads[16];
hero_t heroes[128];
game_event_t events[16];
} game_t;
The map
pointer is that one exception, but that’s because it’s
generated fresh after loading from the map_seed
. Saving and loading
is trivial (error checking omitted) and very fast.
void
game_save(game_t *game, FILE *out)
{
fwrite(game, sizeof(*game), 1, out);
}
game_t *
game_load(FILE *in)
{
game_t *game = malloc(sizeof(*game));
fread(game, sizeof(*game), 1, in);
game->map = map_generate(game->map_seed);
return game;
}
The data isn’t important enough to bother with rename+fsync durability. I’ll risk the data if it makes savescumming that much harder!
The downside to this technique is that saves are generally not portable across architectures (particularly where endianness differs), and may not even be portable between different platforms on the same architecture. I only needed to persist a single game state on the same machine, so this isn’t a problem.
I’m definitely going to be reusing some of this code in future projects. The G-COM terminal graphics layer is nifty, and I already like it better than ncurses, whose API I’ve always thought was kind of ugly and old-fashioned. I like writing terminal applications.
Just like the last couple of years, the final game is a lot simpler than I had planned at the beginning of the week. Most things take longer to code than I initially expect. I’m still enjoying playing it, which is a really good sign. When I play, I’m having enough fun to deliberately delay the end of the game so that I can sprawl my nation out over the island and generate crazy income.
It’s incredibly simple and lives entirely in a header file, so
without further ado (ref.h
):
#pragma once
struct ref {
void (*free)(const struct ref *);
int count;
};
static inline void
ref_inc(const struct ref *ref)
{
((struct ref *)ref)->count++;
}
static inline void
ref_dec(const struct ref *ref)
{
if (--((struct ref *)ref)->count == 0)
ref->free(ref);
}
It has only two fields: the reference count and a “method” that knows
how to free the object once the reference count hits 0. Structs using
this reference counter will know how to free themselves, so callers
will never call a specific *_destroy()
/*_free()
function. Instead
they call ref_dec()
to decrement the reference counter and let it
happen on its own.
I decided to go with a signed count because it allows for better error
checking. It may be worth putting an assert()
in ref_inc()
and
ref_dec()
to ensure the count is always non-negative. I chose an
int
because it’s fast, and anything smaller will be padded out to
at least that size anyway. On x86-64, struct ref
is 16 bytes.
This is basically all there is to a C++ shared_ptr, except that shared_ptr leverages C++’s destructors to perform all the increment/decrement work automatically.
Those increments and decrements aren’t thread safe, so this won’t work as-is when data structures are shared between threads. If you’re sure that you’re using GCC on a capable platform, you can make use of its atomic builtins, making the reference counter completely thread safe.
static inline void
ref_inc(const struct ref *ref)
{
__sync_add_and_fetch((int *)&ref->count, 1);
}
static inline void
ref_dec(const struct ref *ref)
{
if (__sync_sub_and_fetch((int *)&ref->count, 1) == 0)
ref->free(ref);
}
Or if you’re using C11, make use of the new stdatomic.h.
static inline void
ref_inc(const struct ref *ref)
{
atomic_fetch_add((int *)&ref->count, 1);
}
static inline void
ref_dec(const struct ref *ref)
{
if (atomic_fetch_sub((int *)&ref->count, 1) == 1)
ref->free(ref);
}
There’s a very deliberate decision to make all of the function
arguments const
, for both reference counting functions and the
free()
method. This may seem wrong because these functions are
specifically intended to modify the reference count. There are
dangerous-looking casts in each case to remove the const
.
The reason for this is that it’s likely for someone holding a
const
pointer to one of these objects to want to keep their own
reference. Their promise not to modify the object doesn’t really
apply to the reference count, which is merely embedded metadata. They
would need to cast the const
away before being permitted to call
ref_inc()
and ref_dec()
. Rather than litter the program with
dangerous casts, the casts are all kept in one place — in the
reference counting functions — where they’re strictly limited to
mutating the reference counting fields.
On a related note, the stdlib.h
free()
function doesn’t take a
const
pointer, so the free()
method taking a const
pointer is a
slight departure from the norm. Taking a non-const
pointer was a
mistake in the C standard library. The free()
function
mutates the pointer itself — including all other pointers to that
object — making it invalid. Semantically, it doesn’t mutate the
memory behind the pointer, so it’s not actually violating the
const
. To compare, the Linux kernel kfree()
takes a
const void *
.
Just as users may need to increment and decrement the counters on
const
objects, they’ll also need to be able to free()
them, so
it’s also a const
.
So how does one use this generic reference counter? Embed a struct
ref
in your own structure and use our old friend: the
container_of()
macro. For anyone who’s forgotten, this macro is not
part of standard C, but you can define it with offsetof()
.
#define container_of(ptr, type, member) \
((type *)((char *)(ptr) - offsetof(type, member)))
Here’s a dumb linked list example where each node is individually reference counted. Adding an extra 16 bytes to each of your linked list nodes isn’t normally going to help with much, but if the tail of the linked list is being shared between different data structures (such as other lists), reference counting makes things a lot simpler.
struct node {
char id[64];
float value;
struct node *next;
struct ref refcount;
};
I put refcount
at the end so that we’ll have to use container_of()
in this example. It conveniently casts away the const
for us.
static void
node_free(const struct ref *ref)
{
struct node *node = container_of(ref, struct node, refcount);
struct node *child = node->next;
free(node);
if (child)
ref_dec(&child->refcount);
}
Notice that it recursively decrements its child’s reference count afterwards (intentionally tail recursive). A whole list will clean itself up when the head is freed and no part of the list is shared.
The allocation function sets up the free()
function pointer and
initializes the count to 1.
struct node *
node_create(char *id, float value)
{
struct node *node = malloc(sizeof(*node));
snprintf(node->id, sizeof(node->id), "%s", id);
node->value = value;
node->next = NULL;
node->refcount = (struct ref){node_free, 1};
return node;
}
(Side note: I used snprintf()
because strncpy()
is
broken and strlcpy()
is non-standard, so it’s the most
straightforward way to do this in standard C.)
And to start making some use of the reference counter, here’s push and pop.
void
node_push(struct node **nodes, char *id, float value)
{
struct node *node = node_create(id, value);
node->next = *nodes;
*nodes = node;
}
struct node *
node_pop(struct node **nodes)
{
struct node *node = *nodes;
*nodes = (*nodes)->next;
if (*nodes)
ref_inc(&(*nodes)->refcount);
return node;
}
Notice node_pop()
increments the reference count of the new head
node before returning. That’s because the node now has an additional
reference: from *nodes
and from the node that was just popped.
It’s up to the caller to free the returned node, which would decrement
the count of the new head node, but not free it. Alternatively
node_pop()
could set next
on the returned node to NULL rather than
increment the counter, which would also prevent the returned node from
freeing the new head when it gets freed. But it’s probably more useful
for the returned node to keep functioning as a list. That’s what the
reference counting is for, after all.
Finally, a simple program to exercise it all. It reads ID/value pairs from standard input.
void
node_print(struct node *node)
{
for (; node; node = node->next)
printf("%s = %f\n", node->id, node->value);
}
int main(void)
{
struct node *nodes = NULL;
char id[64];
float value;
while (scanf(" %63s %f", id, &value) == 2)
node_push(&nodes, id, value);
if (nodes != NULL) {
node_print(nodes);
struct node *old = node_pop(&nodes);
node_push(&nodes, "foobar", 0.0f);
node_print(nodes);
ref_dec(&old->refcount);
ref_dec(&nodes->refcount);
}
return 0;
}
I’ve used this technique several times over the past few months. It’s trivial to remember, so I just code it up from scratch each time I need it.
Last week in Handmade Hero (days 21-25), Casey Muratori added interactive programming to the game engine. This is especially useful in game development, where the developer might want to tweak, say, a boss fight without having to restart the entire game after each tweak. Now that I’ve seen it done, it seems so obvious. The secret is to build almost the entire application as a shared library.
This puts a serious constraint on the design of the program: it
cannot keep any state in global or static variables, though this
should be avoided anyway. Global state will be lost each
time the shared library is reloaded. In some situations, this can also
restrict use of the C standard library, including functions like
malloc()
, depending on how these functions are implemented or
linked. For example, if the C standard library is statically linked,
functions with global state may introduce global state into the shared
library. It’s difficult to know what’s safe to use. This works fine in
Handmade Hero because the core game, the part loaded as a shared
library, makes no use of external libraries, including the standard
library.
Additionally, the shared library must be careful with its use of function pointers. The functions being pointed at will no longer exist after a reload. This is a real issue when combining interactive programming with object oriented C.
To demonstrate how this works, let’s go through an example. I wrote a simple ncurses Game of Life demo that’s easy to modify. You can get the entire source here if you’d like to play around with it yourself on a Unix-like system.
Quick start:
1. make, then ./main. Press r to randomize and q to quit.
2. Edit game.c to change the Game of Life rules, add colors, etc.
3. make. Your changes will be reflected immediately in the original program!
As of this writing, Handmade Hero is being written on Windows, so Casey is using a DLL and the Win32 API, but the same technique can be applied on Linux, or any other Unix-like system, using libdl. That’s what I’ll be using here.
The program will be broken into two parts: the Game of Life shared library (“game”) and a wrapper (“main”) whose job is only to load the shared library, reload it when it updates, and call it at a regular interval. The wrapper is agnostic about the operation of the “game” portion, so it could be re-used almost untouched in another project.
To avoid maintaining a whole bunch of function pointer assignments in
several places, the API to the “game” is enclosed in a struct. This
also eliminates warnings from the C compiler about mixing data and
function pointers. The layout and contents of the game_state
struct is private to the game itself. The wrapper will only handle a
pointer to this struct.
struct game_state;
struct game_api {
struct game_state *(*init)();
void (*finalize)(struct game_state *state);
void (*reload)(struct game_state *state);
void (*unload)(struct game_state *state);
bool (*step)(struct game_state *state);
};
In the demo the API is made of 5 functions. The first 4 are primarily concerned with loading and unloading.
init()
: Allocate and return a state to be passed to every other
API call. This will be called once when the program starts and never
again, even after reloading. If we were concerned about using
malloc()
in the shared library, the wrapper would be responsible
for performing the actual memory allocation.
finalize()
: The opposite of init()
, to free all resources held
by the game state.
reload()
: Called immediately after the library is reloaded. This
is the chance to sneak in some additional initialization in the
running program. Normally this function will be empty. It’s only
used temporarily during development.
unload()
: Called just before the library is unloaded, before a new
version is loaded. This is a chance to prepare the state for use by
the next version of the library. This can be used to update structs
and such, if you wanted to be really careful. This would also
normally be empty.
step()
: Called at a regular interval to run the game. A real game
will likely have a few more functions like this.
The library will provide a filled out API struct as a global variable,
GAME_API
. This is the only exported symbol in the entire shared
library! All functions will be declared static, including the ones
referenced by the structure.
const struct game_api GAME_API = {
.init = game_init,
.finalize = game_finalize,
.reload = game_reload,
.unload = game_unload,
.step = game_step
};
The wrapper is focused on calling dlopen()
, dlsym()
, and
dlclose()
in the right order at the right time. The game will be
compiled to the file libgame.so
, so that’s what will be loaded. It’s
written in the source with a ./
to force the name to be used as a
filename. The wrapper keeps track of everything in a game
struct.
const char *GAME_LIBRARY = "./libgame.so";
struct game {
void *handle;
ino_t id;
struct game_api api;
struct game_state *state;
};
The handle
is the value returned by dlopen()
. The id
is the
inode of the shared library, as returned by stat()
. The rest is
defined above. Why the inode? We could use a timestamp instead, but
that’s indirect. What we really care about is if the shared object
file is actually a different file than the one that was loaded. The
file will never be updated in place, it will be replaced by the
compiler/linker, so the timestamp isn’t what’s important.
Using the inode is a much simpler situation than in Handmade Hero. Due to Windows’ broken file locking behavior, the game DLL can’t be replaced while it’s being used. To work around this limitation, the build system and the loader have to rely on randomly-generated filenames.
void game_load(struct game *game)
The purpose of the game_load()
function is to load the game API into
a game
struct, but only if either it hasn’t been loaded yet or if
it’s been updated. Since it has several independent failure
conditions, let’s examine it in parts.
struct stat attr;
if ((stat(GAME_LIBRARY, &attr) == 0) && (game->id != attr.st_ino)) {
First, use stat()
to determine if the library’s inode is different
than the one that’s already loaded. The id
field will be 0
initially, so as long as stat()
succeeds, this will load the library
the first time.
if (game->handle) {
game->api.unload(game->state);
dlclose(game->handle);
}
If a library is already loaded, unload it first, being sure to call
unload()
to inform the library that it’s being updated. It’s
critically important that dlclose()
happens before dlopen()
. On
my system, dlopen()
looks only at the string it’s given, not the
file behind it. Even though the file has been replaced on the
filesystem, dlopen()
will see that the string matches a library
already opened and return a pointer to the old library. (Is this a
bug?) The handles are reference counted internally by libdl.
void *handle = dlopen(GAME_LIBRARY, RTLD_NOW);
Finally load the game library. There’s a race condition here that
cannot be helped due to limitations of dlopen()
. The library may
have been updated again since the call to stat()
. Since we can’t
ask dlopen()
about the inode of the library it opened, we can’t
know. But as this is only used during development, not in production,
it’s not a big deal.
if (handle) {
game->handle = handle;
game->id = attr.st_ino;
/* ... more below ... */
} else {
game->handle = NULL;
game->id = 0;
}
If dlopen()
fails, it will return NULL
. In the case of ELF, this
will happen if the compiler/linker is still in the process of writing
out the shared library. Since the unload was already done, this means
no game will be loaded when game_load
returns. The user of the
struct needs to be prepared for this eventuality. It will need to try
loading again later (i.e. a few milliseconds). It may be worth filling
the API with stub functions when no library is loaded.
const struct game_api *api = dlsym(game->handle, "GAME_API");
if (api != NULL) {
game->api = *api;
if (game->state == NULL)
game->state = game->api.init();
game->api.reload(game->state);
} else {
dlclose(game->handle);
game->handle = NULL;
game->id = 0;
}
When the library loads without error, look up the GAME_API
struct
that was mentioned before and copy it into the local struct. Copying
rather than using the pointer avoids one more layer of redirection
when making function calls. The game state is initialized if it hasn’t
been already, and the reload()
function is called to inform the game
it’s just been reloaded.
If looking up the GAME_API
fails, close the handle and consider it
a failure.
The main loop calls game_load()
each time around. And that’s it!
int main(void)
{
struct game game = {0};
for (;;) {
game_load(&game);
if (game.handle)
if (!game.api.step(game.state))
break;
usleep(100000);
}
game_unload(&game);
return 0;
}
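The wrapper’s game_unload() isn’t shown above; a sketch of what it might do (the real source may differ) is to finalize the state and drop the library handle:
void game_unload(struct game *game)
{
    if (game->handle) {
        game->api.finalize(game->state);
        game->state = NULL;
        dlclose(game->handle);
        game->handle = NULL;
        game->id = 0;
    }
}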
Now that I have this technique in my toolbelt, it has me itching to develop a proper, full game in C with OpenGL and all, perhaps in another Ludum Dare. The ability to develop interactively is very appealing.
Update 2020: DOS Defender was featured on GET OFF MY LAWN.
This past weekend I participated in Ludum Dare #31. Before the theme was even announced, due to recent fascination I wanted to make an old school DOS game. DOSBox would be the target platform since it’s the most practical way to run DOS applications anymore, despite modern x86 CPUs still being fully backwards compatible all the way back to the 16-bit 8086.
I successfully created and submitted a DOS game called DOS Defender. It’s a 32-bit 80386 real mode DOS COM program. All assets are embedded in the executable and there are no external dependencies, so the entire game is packed into that 10kB binary.
You’ll need a joystick/gamepad in order to play. I included mouse support in the Ludum Dare release in order to make it easier to review, but this was removed because it doesn’t work well.
The most technically interesting part is that I didn’t need any
DOS development tools to create this! I only used my every day Linux
C compiler (gcc
). It’s not actually possible to build DOS Defender
in DOS. Instead, I’m treating DOS as an embedded platform, which is
the only form in which DOS still exists today. Along with
DOSBox and DOSEMU, this is a pretty comfortable toolchain.
If all you care about is how to do this yourself, skip to the “Tricking GCC” section, where we’ll write a “Hello, World” DOS COM program with Linux’s GCC.
I didn’t have GCC in mind when I started this project. What really triggered all of this was that I had noticed Debian’s bcc package, Bruce’s C Compiler, that builds 16-bit 8086 binaries. It’s kept around for compiling x86 bootloaders and such, but it can also be used to compile DOS COM files, which was the part that interested me.
For some background: the Intel 8086 was a 16-bit microprocessor released in 1978. It had none of the fancy features of today’s CPU: no memory protection, no floating point instructions, and only up to 1MB of RAM addressable. All modern x86 desktops and laptops can still pretend to be a 40-year-old 16-bit 8086 microprocessor, with the same limited addressing and all. That’s some serious backwards compatibility. This feature is called real mode. It’s the mode in which all x86 computers boot. Modern operating systems switch to protected mode as soon as possible, which provides virtual addressing and safe multi-tasking. DOS is not one of these operating systems.
Unfortunately, bcc is not an ANSI C compiler. It supports a subset of
K&R C, along with inline x86 assembly. Unlike other 8086 C compilers,
it has no notion of “far” or “long” pointers, so inline assembly is
required to access other memory segments (VGA, clock, etc.).
Side note: the remnants of these 8086 “long pointers” still exist
today in the Win32 API: LPSTR
, LPWORD
, LPDWORD
, etc. The inline
assembly isn’t anywhere near as nice as GCC’s inline assembly. The
assembly code has to manually load variables from the stack so, since
bcc supports two different calling conventions, the assembly ends up
being hard-coded to one calling convention or the other.
Given all its limitations, I went looking for alternatives.
DJGPP is the DOS port of GCC. It’s a very impressive project, bringing almost all of POSIX to DOS. The DOS ports of many programs are built with DJGPP. In order to achieve this, it only produces 32-bit protected mode programs. If a protected mode program needs to manipulate hardware (i.e. VGA), it must make requests to a DOS Protected Mode Interface (DPMI) service. If I used DJGPP, I couldn’t make a single, standalone binary as I had wanted, since I’d need to include a DPMI server. There’s also a performance penalty for making DPMI requests.
Getting a DJGPP toolchain working can be difficult, to put it kindly. Fortunately I found a useful project, build-djgpp, that makes it easy, at least on Linux.
Either there’s a serious bug or the official DJGPP binaries have become infected again, because in my testing I kept getting the “Not COFF: check for viruses” error message when running my programs in DOSBox. To double check that it’s not an infection on my own machine, I set up a DJGPP toolchain on my Raspberry Pi, to act as a clean room. It’s impossible for this ARM-based device to get infected with an x86 virus. It still had the same problem, and all the binary hashes matched up between the machines, so it’s not my fault.
So given the DPMI issue and the above, I moved on.
What I finally settled on is a neat hack that involves “tricking” GCC into producing real mode DOS COM files, so long as it can target 80386 (as is usually the case). The 80386 was released in 1985 and was the first 32-bit x86 microprocessor. GCC still targets this instruction set today, even in the x86-64 toolchain. Unfortunately, GCC cannot actually produce 16-bit code, so my main goal of targeting 8086 would not be achievable. This doesn’t matter, though, since DOSBox, my intended platform, is an 80386 emulator.
In theory this should even work unchanged with MinGW, but there’s a
long-standing MinGW bug that prevents it from working right (“cannot
perform PE operations on non PE output file”). It’s still do-able, and
I did it myself, but you’ll need to drop the OUTPUT_FORMAT
directive
and add an extra objcopy
step (objcopy -O binary
).
To demonstrate how to do all this, let’s make a DOS “Hello, World” COM program using GCC on Linux.
There’s a significant burden with this technique: there will be no
standard library. It’s basically like writing an operating system
from scratch, except for the few services DOS provides. This means no
printf()
or anything of the sort. Instead we’ll ask DOS to print a
string to the terminal. Making a request to DOS means firing an
interrupt, which means inline assembly!
DOS has nine interrupts: 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26,
0x27, 0x2F. The big one, and the one we’re interested in, is 0x21,
function 0x09 (print string). Between DOS and BIOS, there are
thousands of functions called this way. I’m not going to try
to explain x86 assembly, but in short the function number is stuffed
into register ah
and interrupt 0x21 is fired. Function 0x09 also
takes an argument, the pointer to the string to be printed, which is
passed in registers dx
and ds
.
Here’s the GCC inline assembly print()
function. Strings passed to
this function must be terminated with a $
. Why? Because DOS.
static void print(char *string)
{
asm volatile ("mov $0x09, %%ah\n"
"int $0x21\n"
: /* no output */
: "d"(string)
: "ah");
}
The assembly is declared volatile
because it has a side effect
(printing the string). To GCC, the assembly is an opaque hunk, and the
optimizer relies on the output/input/clobber constraints (the last
three lines). For DOS programs like this, all inline assembly will
have side effects. This is because it’s not being written for
optimization but to access hardware and DOS, things not accessible to
plain C.
Care must also be taken by the caller, because GCC doesn’t know that
the memory pointed to by string
is ever read. It’s likely the array
that backs the string needs to be declared volatile
too. This is all
foreshadowing into what’s to come: doing anything in this environment
is an endless struggle against the optimizer. Not all of these battles
can be won.
Now for the main function. The name of this function shouldn’t matter,
but I’m avoiding calling it main()
since MinGW has funny ideas
about mangling this particular symbol, even when it’s asked not to.
int dosmain(void)
{
print("Hello, World!\n$");
return 0;
}
COM files are limited to 65,279 bytes in size. This is because an x86 memory segment is 64kB and COM files are simply loaded by DOS to 0x0100 in the segment and executed. There are no headers, it’s just a raw binary. Since a COM program can never be of any significant size, and no real linking needs to occur (freestanding), the entire thing will be compiled as one translation unit. It will be one call to GCC with a bunch of options.
Here are the essential compiler options.
-std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding
Since no standard libraries are in use, the only difference between
gnu99 and c99 is that trigraphs are disabled (as they should be) and
inline assembly can be written as asm
instead of __asm__
. It’s a
no brainer. This project will be so closely tied to GCC that I don’t
care about using GCC extensions anyway.
I’m using -Os
to keep the compiled output as small as possible. It
will also make the program run faster. This is important when
targeting DOSBox because, by default, it will deliberately run as slow
as a machine from the 1980’s. I want to be able to fit in that
constraint. If the optimizer is causing problems, you may need to
temporarily make this -O0
to determine if the problem is your fault
or the optimizer’s fault.
You see, the optimizer doesn’t understand that the program will be
running in real mode, and under its addressing constraints. It will
perform all sorts of invalid optimizations that break your perfectly
valid programs. It’s not a GCC bug since we’re doing crazy stuff
here. I had to rework my code a number of times to stop the optimizer
from breaking my program. For example, I had to avoid returning
complex structs from functions because they’d sometimes be filled with
garbage. The real danger here is that a future version of GCC will be
more clever and will break more stuff. In this battle, volatile
is
your friend.
The next option is -nostdlib
, since there are no valid libraries for
us to link against, even statically.
The options -m32 -march=i386
set the compiler to produce 80386 code.
If I was writing a bootloader for a modern computer, targeting 80686
would be fine, too, but DOSBox is 80386.
The -ffreestanding
argument requires that GCC not emit code that
calls built-in standard library helper functions. Sometimes instead of
emitting code to do something, it emits code that calls a built-in
function to do it, especially with math operators. This was one of the
main problems I had with bcc, where this behavior couldn’t be
disabled. This is most commonly used in writing bootloaders and
kernels. And now DOS COM files.
The -Wl
option is used to pass arguments to the linker (ld
). We
need it since we’re doing all this in one call to GCC.
-Wl,--nmagic,--script=com.ld
The --nmagic
turns off page alignment of sections. One, we don’t
need this. Two, that would waste precious space. In my tests it
doesn’t appear to be necessary, but I’m including it just in case.
The --script
option tells the linker that we want to use a custom
linker script. This allows us to precisely lay out the sections
(text
, data
, bss
, rodata
) of our program. Here’s the com.ld
script.
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text);
}
.data :
{
*(.data);
*(.bss);
*(.rodata);
}
_heap = ALIGN(4);
}
The OUTPUT_FORMAT(binary)
says not to put this into an ELF (or PE,
etc.) file. The linker should just dump the raw code. A COM file is
just raw code, so this means the linker will produce a COM file!
I had said that COM files are loaded to 0x0100
. The fourth line
offsets the binary to this location. The first byte of the COM file
will still be the first byte of code, but it will be designed to run
from that offset in memory.
What follows is all the sections, text
(program), data
(static
data), bss
(zero-initialized data), rodata
(strings). Finally I
mark the end of the binary with the symbol _heap
. This will come in
handy later for writing sbrk()
, after we’re done with “Hello,
World.” I’ve asked for the _heap
position to be 4-byte aligned.
We’re almost there.
The linker is usually aware of our entry point (main
) and sets that
up for us. But since we asked for “binary” output, we’re on our own.
If the print()
function is emitted first, our program’s execution
will begin with executing that function, which is invalid. Our program
needs a little header stanza to get things started.
The linker script has a STARTUP
option for handling this, but to
keep it simple we’ll put that right in the program. This is usually
called crt0.o
or Boot.o
, in case those names ever come up in your
own reading. This inline assembly must be the very first thing in
our code, before any includes and such. DOS will do most of the setup
for us, we really just have to jump to the entry point.
asm (".code16gcc\n"
"call dosmain\n"
"mov $0x4C, %ah\n"
"int $0x21\n");
The .code16gcc
tells the assembler that we’re going to be running in
real mode, so that it makes the proper adjustment. Despite the name,
this will not make it produce 16-bit code! First it calls dosmain
,
the function we wrote above. Then it informs DOS, using function
0x4C
(terminate with return code), that we’re done, passing the exit
code along in the 1-byte register al
(already set by dosmain
).
This inline assembly is automatically volatile
because it has no
inputs or outputs.
Here’s the entire C program.
asm (".code16gcc\n"
"call dosmain\n"
"mov $0x4C,%ah\n"
"int $0x21\n");
static void print(char *string)
{
asm volatile ("mov $0x09, %%ah\n"
"int $0x21\n"
: /* no output */
: "d"(string)
: "ah");
}
int dosmain(void)
{
print("Hello, World!\n$");
return 0;
}
I won’t repeat com.ld
. Here’s the call to GCC.
gcc -std=gnu99 -Os -nostdlib -m32 -march=i386 -ffreestanding \
-o hello.com -Wl,--nmagic,--script=com.ld hello.c
And testing it in DOSBox:
From here if you want fancy graphics, it’s just a matter of making an interrupt and writing to VGA memory. If you want sound you can perform an interrupt for the PC speaker. I haven’t sorted out how to call Sound Blaster yet. It was from this point that I grew DOS Defender.
To cover one more thing, remember that _heap
symbol? We can use it
to implement sbrk()
for dynamic memory allocation within the main
program segment. This is real mode, and there’s no virtual memory, so
we’re free to write to any memory we can address at any time. Some of
this is reserved (i.e. low and high memory) for hardware. So using
sbrk()
specifically isn’t really necessary, but it’s interesting
to implement ourselves.
As is normal on x86, your text and segments are at a low address
(0x0100 in this case) and the stack is at a high address (around
0xffff in this case). On Unix-like systems, the memory returned by
malloc()
comes from two places: sbrk()
and mmap()
. What sbrk()
does is allocate memory just above the text/data segments, growing
“up” towards the stack. Each call to sbrk()
will grow this space (or
leave it exactly the same). That memory would then be managed by
malloc()
and friends.
Here’s how we can get sbrk()
in a COM program. Notice I have to
define my own size_t
, since we don’t have a standard library.
typedef unsigned short size_t;
extern char _heap;
static char *hbreak = &_heap;
static void *sbrk(size_t size)
{
char *ptr = hbreak;
hbreak += size;
return ptr;
}
It just sets a pointer to _heap
and grows it as needed. A slightly
smarter sbrk()
would be careful about alignment as well.
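That smarter version might round each request up to a 4-byte boundary, something like this sketch (the name is mine):
static void *sbrk_aligned(size_t size)
{
    size = (size + 3) & ~(size_t)3;  /* round up to a multiple of 4 */
    char *ptr = hbreak;
    hbreak += size;
    return ptr;
}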
In the making of DOS Defender an interesting thing happened. I was
(incorrectly) counting on the memory returned by my sbrk()
being
zeroed. This was the case the first time the game ran. However, DOS
doesn’t zero this memory between programs. When I would run my game
again, it would pick right up where it left off, because the same
data structures with the same contents were loaded back into place. A
pretty cool accident! It’s part of what makes this a fun embedded
platform.
Suppose you’re writing a function pass_match()
that takes an input
stream, an output stream, and a pattern. It works sort of like grep.
It passes to the output each line of input that matches the pattern.
The pattern string contains a shell glob pattern to be handled by
POSIX fnmatch()
. Here’s what the interface looks like.
void pass_match(FILE *in, FILE *out, const char *pattern);
Glob patterns are simple enough that pre-compilation, as would be done for a regular expression, is unnecessary. The bare string is enough.
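The body of pass_match() isn’t the interesting part here, but for concreteness it might look something like this sketch, with a fixed line-length limit of my choosing:
#include <fnmatch.h>
#include <stdio.h>
#include <string.h>
void pass_match(FILE *in, FILE *out, const char *pattern)
{
    char line[4096];
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\n")] = '\0';  /* trim the newline before matching */
        if (fnmatch(pattern, line, 0) == 0)
            fprintf(out, "%s\n", line);
    }
}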
Some time later the customer wants the program to support regular
expressions in addition to shell-style glob patterns. For efficiency’s
sake, regular expressions need to be pre-compiled and so will not be
passed to the function as a string. It will instead be a POSIX
regex_t
object. A quick-and-dirty approach might be to
accept both and match whichever one isn’t NULL.
void pass_match(FILE *in, FILE *out, const char *pattern, regex_t *re);
Bleh. This is ugly and won’t scale well. What happens when more kinds of filters are needed? It would be much better to accept a single object that covers both cases, and possibly even another kind of filter in the future.
One of the most common ways to customize the behavior of a
function in C is to pass a function pointer. For example, the final
argument to qsort()
is a comparator that determines how
objects get sorted.
For pass_match()
, this function would accept a string and return a
boolean value deciding if the string should be passed to the output
stream. It gets called once on each line of input.
void pass_match(FILE *in, FILE *out, bool (*match)(const char *));
However, this has one of the same problems as qsort()
:
the passed function lacks context. It needs a pattern string or
regex_t
object to operate on. In other languages these would be
attached to the function as a closure, but C doesn’t have closures. It
would need to be smuggled in via a global variable, which is not
good.
static regex_t regex; // BAD!!!
bool regex_match(const char *string)
{
return regexec(&regex, string, 0, NULL, 0) == 0;
}
Because of the global variable, in practice pass_match()
would be
neither reentrant nor thread-safe. We could take a lesson from GNU’s
qsort_r()
and accept a context to be passed to the filter function.
This simulates a closure.
void pass_match(FILE *in, FILE *out,
bool (*match)(const char *, void *), void *context);
The provided context pointer would be passed to the filter function as
the second argument, and no global variables are needed. This would
probably be good enough for most purposes and it’s about as simple as
possible. The interface to pass_match()
would cover any kind of
filter.
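As a quick illustration of that interface, a glob filter becomes a tiny adapter function plus a context argument (the names here are hypothetical):
static bool match_glob(const char *string, void *context)
{
    const char *pattern = context;
    return fnmatch(pattern, string, 0) == 0;
}
/* ... */
char pattern[] = "*.c";
pass_match(stdin, stdout, match_glob, pattern);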
But wouldn’t it be nice to package the function and context together as one object?
How about putting the context on a struct and making an interface out of that? Here’s a tagged union that behaves as one or the other.
enum filter_type { GLOB, REGEX };
struct filter {
enum filter_type type;
union {
const char *pattern;
regex_t regex;
} context;
};
There’s one function for interacting with this struct:
filter_match()
. It checks the type
member and calls the correct
function with the correct context.
bool filter_match(struct filter *filter, const char *string)
{
switch (filter->type) {
case GLOB:
return fnmatch(filter->context.pattern, string, 0) == 0;
case REGEX:
return regexec(&filter->context.regex, string, 0, NULL, 0) == 0;
}
abort(); // programmer error
}
And the pass_match()
API now looks like this. This will be the final
change to pass_match()
, both in implementation and interface.
void pass_match(FILE *input, FILE *output, struct filter *filter);
It still doesn’t care how the filter works, so it’s good enough to
cover all future cases. It just calls filter_match()
on the pointer
it was given. However, the switch
and tagged union aren’t friendly
to extension. Really, it’s outright hostile. We finally have some
degree of polymorphism, but it’s crude. It’s like building duct tape
into a design. Adding new behavior means adding another switch
case.
This is a step backwards. We can do better.
With the switch
we’re no longer taking advantage of function
pointers. So what about putting a function pointer on the struct?
struct filter {
bool (*match)(struct filter *, const char *);
};
The filter itself is passed as the first argument, providing context.
In object oriented languages, that’s the implicit this
argument. To
avoid requiring the caller to worry about this detail, we’ll hide it
in a new switch
-free version of filter_match()
.
bool filter_match(struct filter *filter, const char *string)
{
return filter->match(filter, string);
}
Notice we’re still lacking the actual context, the pattern string or the regex object. Those will be different structs that embed the filter struct.
struct filter_regex {
struct filter filter;
regex_t regex;
};
struct filter_glob {
struct filter filter;
const char *pattern;
};
For both the original filter struct is the first member. This is
critical. We’re going to be using a trick called type punning. The
first member is guaranteed to be positioned at the beginning of the
struct, so a pointer to a struct filter_glob
is also a pointer to a
struct filter
. Notice any resemblance to inheritance?
Each type, glob and regex, needs its own match method.
static bool
method_match_regex(struct filter *filter, const char *string)
{
struct filter_regex *regex = (struct filter_regex *) filter;
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static bool
method_match_glob(struct filter *filter, const char *string)
{
struct filter_glob *glob = (struct filter_glob *) filter;
return fnmatch(glob->pattern, string, 0) == 0;
}
I’ve prefixed them with method_
to indicate their intended usage. I
declared these static
because they’re completely private. Other
parts of the program will only be accessing them through a function
pointer on the struct. This means we need some constructors in order
to set up those function pointers. (For simplicity, I’m not error
checking.)
struct filter *filter_regex_create(const char *pattern)
{
struct filter_regex *regex = malloc(sizeof(*regex));
regcomp(&regex->regex, pattern, REG_EXTENDED);
regex->filter.match = method_match_regex;
return &regex->filter;
}
struct filter *filter_glob_create(const char *pattern)
{
struct filter_glob *glob = malloc(sizeof(*glob));
glob->pattern = pattern;
glob->filter.match = method_match_glob;
return &glob->filter;
}
Now this is real polymorphism. It’s really simple from the user’s
perspective. They call the correct constructor and get a filter object
that has the desired behavior. This object can be passed around
trivially, and no other part of the program worries about how it’s
implemented. Best of all, since each method is a separate function
rather than a switch
case, new kinds of filter subtypes can be
defined independently. Users can create their own filter types that
work just as well as the two “built-in” filters.
Oops, the regex filter needs to be cleaned up when it’s done, but the
user, by design, won’t know how to do it. Let’s add a free()
method.
struct filter {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
};
void filter_free(struct filter *filter)
{
return filter->free(filter);
}
And the methods for each. These would also be assigned in the constructor.
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = (struct filter_regex *) f;
regfree(&regex->regex);
free(f);
}
static void
method_free_glob(struct filter *f)
{
free(f);
}
The glob constructor should perhaps strdup()
its pattern as a
private copy, in which case it would be freed here.
A good rule of thumb is to prefer composition over inheritance. Having tidy filter objects opens up some interesting possibilities for composition. Here’s an AND filter that composes two arbitrary filter objects. It only matches when both its subfilters match. It supports short circuiting, so put the faster, or most discriminating, filter first in the constructor (user’s responsibility).
struct filter_and {
struct filter filter;
struct filter *sub[2];
};
static bool
method_match_and(struct filter *f, const char *s)
{
struct filter_and *and = (struct filter_and *) f;
return filter_match(and->sub[0], s) && filter_match(and->sub[1], s);
}
static void
method_free_and(struct filter *f)
{
struct filter_and *and = (struct filter_and *) f;
filter_free(and->sub[0]);
filter_free(and->sub[1]);
free(f);
}
struct filter *filter_and(struct filter *a, struct filter *b)
{
struct filter_and *and = malloc(sizeof(*and));
and->sub[0] = a;
and->sub[1] = b;
and->filter.match = method_match_and;
and->filter.free = method_free_and;
return &and->filter;
}
It can combine a regex filter and a glob filter, or two regex filters,
or two glob filters, or even other AND filters. It doesn’t care what
the subfilters are. Also, the free()
method here frees its
subfilters. This means that the user doesn’t need to keep hold of
every filter created, just the “top” one in the composition.
To make composition filters easier to use, here are two “constant” filters. These are statically allocated, shared, and are never actually freed.
static bool
method_match_any(struct filter *f, const char *string)
{
return true;
}
static bool
method_match_none(struct filter *f, const char *string)
{
return false;
}
static void
method_free_noop(struct filter *f)
{
}
struct filter FILTER_ANY = { method_match_any, method_free_noop };
struct filter FILTER_NONE = { method_match_none, method_free_noop };
The FILTER_NONE
filter will generally be used with a (theoretical)
filter_or()
and FILTER_ANY
will generally be used with the
previously defined filter_and()
.
Here’s a simple program that composes multiple glob filters into a single filter, one for each program argument.
int main(int argc, char **argv)
{
struct filter *filter = &FILTER_ANY;
for (char **p = argv + 1; *p; p++)
filter = filter_and(filter_glob_create(*p), filter);
pass_match(stdin, stdout, filter);
filter_free(filter);
return 0;
}
Notice only one call to filter_free()
is needed to clean up the
entire filter.
As I mentioned before, the filter struct must be the first member of filter subtype structs in order for type punning to work. If we want to “inherit” from two different types like this, they would both need to be in this position: a contradiction.
Fortunately type punning can be generalized so that the
first-member constraint isn’t necessary. This is commonly done through
a container_of()
macro. Here’s a C99-conforming definition.
#include <stddef.h>
#define container_of(ptr, type, member) \
((type *)((char *)(ptr) - offsetof(type, member)))
Given a pointer to a member of a struct, the container_of()
macro
allows us to back out to the containing struct. Suppose the regex
struct was defined differently, so that the regex_t
member came
first.
struct filter_regex {
regex_t regex;
struct filter filter;
};
The constructor remains unchanged. The casts in the methods change to the macro.
static bool
method_match_regex(struct filter *f, const char *string)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
regfree(&regex->regex);
free(f);
}
It’s a constant, compile-time computed offset, so there should be no practical performance impact. The filter can now participate freely in other intrusive data structures, like linked lists and such. It’s analogous to multiple inheritance.
Say we want to add a third method, clone()
, to the filter API, to
make an independent copy of a filter, one that will need to be
separately freed. It will be like the copy assignment operator in C++.
Each kind of filter will need to define an appropriate “method” for
it. As long as new methods like this are added at the end, this
doesn’t break the API, but it does break the ABI regardless.
struct filter {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
struct filter *(*clone)(struct filter *);
};
The filter object is starting to get big. It’s got three pointers — 24 bytes on modern systems — and these pointers are the same between all instances of the same type. That’s a lot of redundancy. Instead, these pointers could be shared between instances in a common table called a virtual method table, commonly known as a vtable.
Here’s a vtable version of the filter API. The overhead is now only one pointer regardless of the number of methods in the interface.
struct filter {
struct filter_vtable *vtable;
};
struct filter_vtable {
bool (*match)(struct filter *, const char *);
void (*free)(struct filter *);
struct filter *(*clone)(struct filter *);
};
Each type creates its own vtable and links to it in the constructor. Here’s the regex filter re-written for the new vtable API and clone method. This is all the tricks in one basket for a big object oriented C finale!
struct filter *filter_regex_create(const char *pattern);
struct filter_regex {
regex_t regex;
const char *pattern;
struct filter filter;
};
static bool
method_match_regex(struct filter *f, const char *string)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return regexec(&regex->regex, string, 0, NULL, 0) == 0;
}
static void
method_free_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
regfree(&regex->regex);
free(f);
}
static struct filter *
method_clone_regex(struct filter *f)
{
struct filter_regex *regex = container_of(f, struct filter_regex, filter);
return filter_regex_create(regex->pattern);
}
/* vtable */
struct filter_vtable filter_regex_vtable = {
method_match_regex, method_free_regex, method_clone_regex
};
/* constructor */
struct filter *filter_regex_create(const char *pattern)
{
struct filter_regex *regex = malloc(sizeof(*regex));
regex->pattern = pattern;
regcomp(&regex->regex, pattern, REG_EXTENDED);
regex->filter.vtable = &filter_regex_vtable;
return &regex->filter;
}
This is almost exactly what’s going on behind the scenes in C++. When
a method/function is declared virtual
, and therefore dispatches
based on the run-time type of its left-most argument, it’s listed in
the vtables for classes that implement it. Otherwise it’s just a
normal function. This is why functions need to be declared virtual
ahead of time in C++.
In conclusion, it’s relatively easy to get the core benefits of object oriented programming in plain old C. It doesn’t require heavy use of macros, nor do users of these systems need to know that underneath it’s an object system, unless they want to extend it for themselves.
Here’s the whole example program once if you’re interested in poking:
In this article I’m going to use two well-established C APIs to demonstrate why global state is bad for APIs: BSD regular expressions and POSIX Getopt.
The BSD regular expression API dates back to 4.3BSD, released in 1986. It’s just a pair of functions: one compiles the regex, the other executes it on a string.
char *re_comp(const char *regex);
int re_exec(const char *string);
It’s immediately obvious that there’s hidden internal state. Where else would the resulting compiled regex object be? Also notice there’s no re_free(), or similar, for releasing resources held by the compiled result. That’s because, due to its limited design, it doesn’t hold any. It’s entirely in static memory, which means there’s some upper limit on the complexity of the regex given to this API. Suppose an implementation does use dynamically allocated memory. It might seem this doesn’t matter when only one compiled regex is allowed at a time, but the leftover allocation would create warnings in Valgrind and make the tool harder to use for hunting real bugs.
This API is not thread-safe. Only one thread can use it at a time. It’s not reentrant: while using a regex, calling another function that might itself use a regex means you have to recompile yours when it returns, just in case. Because the global state is entirely hidden, there’s no way to tell whether another part of the program has used it.
This API has been deprecated for some time now, so hopefully no one’s using it anymore. 15 years after the BSD regex API came out, POSIX standardized a much better API. It operates on an opaque regex_t object, on which all state is stored. There’s no global state.
int regcomp(regex_t *preg, const char *regex, int cflags);
int regexec(const regex_t *preg, const char *string, ...);
size_t regerror(int errcode, const regex_t *preg, ...);
void regfree(regex_t *preg);
This is what a good API looks like.
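For illustration, here’s a minimal sketch of the POSIX API in use. Each regex_t carries its own state, so several can coexist, even across threads:
#include <regex.h>
#include <stdio.h>

int
main(void)
{
    regex_t re;
    if (regcomp(&re, "hell?o", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    int r = regexec(&re, "hello world", 0, NULL, 0);
    printf("%s\n", r == 0 ? "match" : "no match");
    regfree(&re);  /* all resources live in the regex_t, so cleanup is explicit */
    return 0;
}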
POSIX defines a C API called Getopt for parsing command line arguments. It’s a single function that operates on the argc and argv values provided to main(). An option string specifies which options are valid and whether or not they require an argument. Typical use looks like this:
int main(int argc, char **argv)
{
int option;
while ((option = getopt(argc, argv, "ab:c:d")) != -1) {
switch (option) {
case 'a':
/* ... */
}
}
/* ... */
return 0;
}
The b and c options require an argument, indicated by the colons. When encountered, this argument is passed through a global variable, optarg. There are four external global variables in total.
extern char *optarg;
extern int optind, opterr, optopt;
If an invalid option is found, getopt() will automatically print a locale-specific error message and return '?'. The opterr variable can be used to disable this message and the optopt variable is used to get the actual invalid option character.
The optind variable keeps track of Getopt’s progress. It slides along argv as each option is processed. In a minimal, strictly POSIX-compliant Getopt, this is all the global state required.
The argc value in main(), and therefore the same parameter in getopt(), is completely redundant and serves no real purpose. Just like the C strings it points to, the argv vector is guaranteed to be NULL-terminated. At best it’s a premature optimization.
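If the redundancy isn’t obvious, here’s a trivial sketch showing that the count can always be recovered from the vector itself:
/* Hypothetical helper: argc is recoverable from the NULL-terminated argv. */
static size_t
count_args(char **argv)
{
    size_t n = 0;
    while (argv[n])
        n++;
    return n;
}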
The most immediate problem is that the entire program can only parse one argument vector at a time. It’s not thread-safe. This leaves out the possibility of parsing argument vectors in other threads. For example, if the program is a server that exposes a shell-like interface to remote users, and multiple threads are used to handle those requests, it won’t be able to take advantage of Getopt.
The second problem is that, even in a single-threaded application, the program can’t pause to parse a different argument vector before returning. It’s not reentrant. For example, suppose one of the arguments to the program is a string containing more arguments to be parsed for some subsystem.
# -s Provide a set of sub-options to pass to XXX.
$ myprogram -s "-a -b -c foo"
In theory, the value of optind could be saved and restored. However, this isn’t portable. POSIX doesn’t explicitly declare that the entire state is captured by optind, nor is it required to be. Implementations are allowed to have internal, hidden global state. This has implications in resetting Getopt.
In a minimal, strict Getopt, resetting Getopt for parsing another argument vector is just a matter of setting optind back to its original value of 1. However, this idiom isn’t portable, and POSIX provides no portable method for resetting the global parser state.
Real implementations of Getopt go beyond POSIX. Probably the most popular extra feature is option grouping. Typically, multiple options can be grouped into a single argument, so long as only the final option requires an argument.
$ myprogram -adb foo
After processing a, optind cannot be incremented, because it’s still working on the first argument. This means there’s another internal counter for stepping across the group. In glibc this is called nextchar. Setting optind to 1 will not reset this internal counter, nor would it be detectable by Getopt if it was already set to 1. The glibc way to reset Getopt is to set optind to 0, which is otherwise an invalid value. Some other Getopt implementations follow this idiom, but it’s not entirely portable.
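As a sketch, a glibc-targeted reset before parsing a second vector might look like this; the optind = 0 assignment is the non-standard part:
#include <unistd.h>

/* Sketch only: relies on the glibc-specific reset, which also clears
   hidden state such as nextchar. POSIX only blesses optind = 1. */
static void
parse_again(int argc, char **argv)
{
    optind = 0;
    for (int option; (option = getopt(argc, argv, "ab:c:d")) != -1;) {
        /* ... handle option ... */
    }
}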
Not only does Getopt have nasty global state, the user has no way to reliably control it!
I mentioned that Getopt will automatically print an error message unless disabled with opterr. There’s no way to get at this error message, should you want to redirect it somewhere else. It’s more hidden, internal state. You could write your own message, but you’d lose out on the automatic locale support.
The way Getopt should have been designed was to accept a context argument and store all state on that context. Following other POSIX APIs (pthreads, regex), the context itself would be an opaque object. In typical use it would have automatic (i.e. stack) duration. The context would either be zero initialized or a function would be provided to initialize it. It might look something like this (in the zero-initialized case).
int getopt(getopt_t *ctx, char **argv, const char *optstring);
Instead of optarg and optopt global variables, these values would be obtained by interrogating the context. The same applies for optind and the diagnostic message.
const char *getopt_optarg(getopt_t *ctx);
int getopt_optopt(getopt_t *ctx);
int getopt_optind(getopt_t *ctx);
const char *getopt_opterr(getopt_t *ctx);
Alternatively, instead of getopt_optind() the API could have a function that continues processing, but returns non-option arguments instead of options. It would return NULL when no more arguments are left. This is the API I’d prefer, because it would allow for argument permutation (allow options to come after non-options, per GNU Getopt) without actually modifying the argument vector. This common extension to Getopt could be added cleanly. The real Getopt isn’t designed well for extension.
const char *getopt_next_arg(getopt_t *ctx);
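To make the proposal concrete, here’s a hypothetical usage sketch. None of these names exist in any real libc; they only illustrate how the parser state would stay local to the caller:
#include <stdio.h>

static void
parse(char **argv)
{
    getopt_t ctx = {0};  /* zero-initialized context holds all parser state */
    for (int option; (option = getopt(&ctx, argv, "ab:c:d")) != -1;) {
        switch (option) {
        case 'b':
            printf("b = %s\n", getopt_optarg(&ctx));
            break;
        /* ... */
        }
    }
    /* Gather non-option arguments without permuting argv. */
    for (const char *arg; (arg = getopt_next_arg(&ctx)) != NULL;)
        printf("argument: %s\n", arg);
}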
This API eliminates the global state and, as a result, solves all of the problems listed above. It’s essentially the same API defined by Popt and my own embeddable Optparse. They’re much better options if the limitations of POSIX-style Getopt are an issue.
To my surprise, this turned out to be harder than I expected. A straightforward scan with Boyer-Moore-Horspool across the entire text file is already pretty fast. On modern, COTS hardware it takes about 6 seconds. Comparing bytes is cheap and it’s largely an I/O-bound problem. This means building fancy indexes tends to make it slower because it’s more I/O demanding.
The challenge was inspired by The Pi-Search Page, which offers a search on the first 200 million digits. There’s also a little write-up about how their pi search works. I wanted to try to invent my own solution. I did eventually come up with something that worked, which can be found here. It’s written in plain old C.
You might want to give the challenge a shot on your own before continuing!
The first thing I tried was SQLite. I thought an index (B-tree) over fixed-length substrings would be efficient. A LIKE condition with a right-hand wildcard is sargable and would work well with the index. Here’s the schema I tried.
CREATE TABLE digits
(position INTEGER PRIMARY KEY, sequence TEXT NOT NULL)
There will be 1 row for each position, i.e. 1 billion rows. Using INTEGER PRIMARY KEY means position will be used directly for row IDs, saving some database space.
After the data has been inserted by sliding a window along pi, I build an index. It’s better to build an index after data is in the database than before.
CREATE INDEX sequence_index ON digits (sequence, position)
This takes several hours to complete. When it’s done the database is a whopping 60GB! Remember I said that this is very much an I/O-bound problem? I wasn’t kidding. This doesn’t work well at all. Here’s a search for the example sequence.
SELECT position, sequence FROM digits
WHERE sequence LIKE '141592653%'
You get your answers after about 15 minutes of hammering on the disk.
Sometime later I realized that sequences of up to 18 digits could be encoded into an integer, so that TEXT column could be a much simpler INTEGER. Unfortunately this doesn’t really improve anything. I also tried this in PostgreSQL but it was even worse. I gave up after 24 hours of waiting on it. These databases are not built for such long, skinny tables, at least not without beefy hardware.
A couple weeks later I had another idea. A query is just a sequence of digits, so it can be trivially converted into a unique number. As before, pick a fixed length for sequences (n) for the index and an appropriate stride. The database would be one big file. To look up a sequence, treat that sequence as an offset into the database and seek into the database file to that offset times the stride. The total size of the database is 10^n * stride.
In this quick and dirty illustration, n=4 and stride=4 (far too small for that n).
For example, if the fixed-length for sequences is 6 and the stride is 4,000 bytes, looking up “141592” is just a matter of seeking to byte 141,592 * 4,000 and reading in positions until some sort of sentinel. The stride must be long enough to store all the positions for any indexed sequence.
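Here’s a rough sketch of that lookup. The on-disk details — 32-bit positions and a zero sentinel — are assumptions for illustration, not the actual database format:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Print every pi position recorded for a fixed-length digit sequence. */
static void
lookup(FILE *db, const char *sequence, long stride)
{
    long offset = strtol(sequence, NULL, 10) * stride;
    fseek(db, offset, SEEK_SET);
    uint32_t position;
    while (fread(&position, sizeof(position), 1, db) == 1 && position != 0)
        printf("%lu\n", (unsigned long)position);
}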
For this purpose, the digits of pi are practically random numbers. The good news is that it means a fixed stride will work well. Any particular sequence appears just as often as any other. The chance a specific n-length sequence begins at a specific position is 1 / 10^n. There are 1 billion positions, so a particular sequence will have 1e9 / 10^n positions associated with it, which is a good place to start for picking a stride.
The bad news is that building the index means jumping around the database essentially at random for each write. This will break any sort of cache between the program and the hard drive. It’s incredibly slow, even mmap()ed. The workaround is to either do it entirely in RAM (needs at least 6GB of RAM for 1 billion digits!) or to build it up over many passes. I didn’t try it on an SSD but maybe the random access is more tolerable there.
Doing all the work in memory makes it easier to improve the database format anyway. It can be broken into an index section and a tables section. Instead of a fixed stride for the data, front-load the database with a similar index that points to the section (table) of the database file that holds that sequence’s pi positions. Each of the 10^n possible sequences gets a single integer in the index at the front of the file. Looking up the positions for a sequence means parsing the sequence as a number, seeking to that offset into the beginning of the database, reading in another offset integer, and then seeking to that new offset. Now the database is compact and there are no concerns about stride.
No sentinel mark is needed either. The tables are concatenated in order in the table part of the database. To determine where to stop, take a peek at the next sequence’s start offset in the index. Its table immediately follows, so this doubles as an end offset. For convenience, one final integer in the index will point just beyond the end of the database, so the last sequence (99999…) doesn’t require special handling.
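A lookup against this layout might be sketched like so; the 32-bit integer widths are assumptions, and the real database format in the repository may differ:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the positions for one fixed-length sequence using the
   front-loaded index: index[n] and index[n + 1] bracket its table. */
static void
lookup_indexed(FILE *db, const char *sequence)
{
    long n = strtol(sequence, NULL, 10);
    uint32_t range[2];
    fseek(db, n * (long)sizeof(uint32_t), SEEK_SET);
    if (fread(range, sizeof(range[0]), 2, db) != 2)
        return;
    fseek(db, range[0], SEEK_SET);
    for (uint32_t off = range[0]; off < range[1]; off += sizeof(uint32_t)) {
        uint32_t position;
        if (fread(&position, sizeof(position), 1, db) != 1)
            break;
        printf("%lu\n", (unsigned long)position);
    }
}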
If the database is built for fixed-length sequences, how is a sequence of a different length searched? The two cases, shorter and longer, are handled differently.
If the sequence is shorter, fill in the remaining digits, …000 to …999, and look up each sequence. For example, if n=6 and we’re searching for “1415”, get all the positions for “141500”, “141501”, “141502”, …, “141599” and concatenate them. Fortunately the database already has them stored this way! Look up the offsets for “141500” and “141600” and grab everything in between. The downside is that the pi positions are only partially sorted, so they may require sorting before presenting to the user.
If the sequence is longer, the original digits file will be needed. Get the table for the sequence’s fixed-length prefix, then seek into the digits file checking each of the pi positions for a full match. This requires lots of extra seeking, but a long sequence will naturally have fewer positions to test. For example, if n=7 and we’re looking for “141592653”, look up the “1415926” table in the database and check each of its 106 positions.
With this database searches are only a few milliseconds (though very much subject to cache misses). Here’s my program in action, from the repository linked above.
$ time ./pipattern 141592653
1: 14159265358979323
427238911: 14159265303126685
570434346: 14159265337906537
678096434: 14159265360713718
real 0m0.004s
user 0m0.000s
sys 0m0.000s
I call that challenge completed!
_Atomic type specifier and not paying enough attention to memory ordering constraints.
Still, this is a good opportunity to break new ground with a demonstration of C11. I’m going to use the new stdatomic.h portion of C11 to build a lock-free data structure. To compile this code you’ll need a C compiler and C library with support for both C11 and the optional stdatomic.h features. As of this writing, as far as I know only GCC 4.9, released April 2014, supports this. It’s in Debian unstable but not in Wheezy.
If you want to take a look before going further, here’s the source. The test code in the repository uses plain old pthreads because C11 threads haven’t been implemented by anyone yet.
I was originally going to write this article a couple weeks ago, but I was having trouble getting it right. Lock-free data structures are trickier and nastier than I expected, more so than traditional mutex locks. Getting it right requires very specific help from the hardware, too, so it won’t run just anywhere. I’ll discuss all this below. So sorry for the long article. It’s just a lot more complex a topic than I had anticipated!
A lock-free data structure doesn’t require the use of mutex locks. More generally, it’s a data structure that can be accessed from multiple threads without blocking. This is accomplished through the use of atomic operations — transformations that cannot be interrupted. Lock-free data structures will generally provide better throughput than mutex locks. And it’s usually safer, because there’s no risk of getting stuck on a lock that will never be freed, such as a deadlock situation. On the other hand there’s additional risk of starvation (livelock), where a thread is unable to make progress.
As a demonstration, I’ll build up a lock-free stack, a sequence with last-in, first-out (LIFO) behavior. Internally it’s going to be implemented as a linked-list, so pushing and popping is O(1) time, just a matter of consing a new element on the head of the list. It also means there’s only one value to be updated when pushing and popping: the pointer to the head of the list.
Here’s what the API will look like. I’ll define lstack_t shortly. I’m making it an opaque type because its fields should never be accessed directly. The goal is to completely hide the atomic semantics from the users of the stack.
int lstack_init(lstack_t *lstack, size_t max_size);
void lstack_free(lstack_t *lstack);
size_t lstack_size(lstack_t *lstack);
int lstack_push(lstack_t *lstack, void *value);
void *lstack_pop (lstack_t *lstack);
Users can push void pointers onto the stack, check the size of the stack, and pop void pointers back off the stack. Except for initialization and destruction, these operations are all safe to use from multiple threads. Two different threads will never receive the same item when popping. No elements will ever be lost if two threads attempt to push at the same time. Most importantly a thread will never block on a lock when accessing the stack.
Notice there’s a maximum size declared at initialization time. While lock-free allocation is possible [PDF], C makes no guarantees that malloc() is lock-free, so being truly lock-free means not calling malloc(). An important secondary benefit to pre-allocating the stack’s memory is that this implementation doesn’t require the use of hazard pointers, which would be far more complicated than the stack itself.
The declared maximum size should actually be the desired maximum size plus the number of threads accessing the stack. This is because a thread might remove a node from the stack and, before the node can be freed for reuse, another thread attempts a push. This other thread might not find any free nodes, causing it to give up without the stack actually being “full.”
The int return value of lstack_init() and lstack_push() is for error codes, returning 0 for success. The only way these can fail is by running out of memory. This is an issue regardless of being lock-free: systems can simply run out of memory. In the push case it means the stack is full.
Here’s the definition for a node in the stack. Neither field needs to be accessed atomically, so they’re not special in any way. In fact, the fields are never updated while on the stack and visible to multiple threads, so it’s effectively immutable (outside of reuse). Users never need to touch this structure.
struct lstack_node {
void *value;
struct lstack_node *next;
};
Internally a lstack_t is composed of two stacks: the value stack (head) and the free node stack (free). These will be handled identically by the atomic functions, so it’s really a matter of convention which stack is which. All nodes are initially placed on the free stack and the value stack starts empty. Here’s what an internal stack looks like.
struct lstack_head {
uintptr_t aba;
struct lstack_node *node;
};
There’s still no atomic declaration here because the struct is going to be handled as an entire unit. The aba field is critically important for correctness and I’ll go over it shortly. It’s declared as a uintptr_t because it needs to be the same size as a pointer. Now, this is not guaranteed by C11 — it’s only guaranteed to be large enough to hold any valid void * pointer, so it could be even larger — but this will be the case on any system that has the required hardware support for this lock-free stack. This struct is therefore the size of two pointers. If that’s not true for any reason, this code will not link. Users will never directly access or handle this struct either.
Finally, here’s the actual stack structure.
typedef struct {
struct lstack_node *node_buffer;
_Atomic struct lstack_head head, free;
_Atomic size_t size;
} lstack_t;
Notice the use of the new _Atomic qualifier. Atomic values may have different size, representation, and alignment requirements in order to satisfy atomic access. These values should never be accessed directly, even just for reading (use atomic_load()).
The size field is for convenience to check the number of elements on the stack. It’s accessed separately from the stack nodes themselves, so it’s not safe to read size and use the information to make assumptions about future accesses (e.g. checking if the stack is empty before popping off an element). Since there’s no way to lock the lock-free stack, there’s otherwise no way to estimate the size of the stack during concurrent access without completely disassembling it via lstack_pop().
There’s no reason to use volatile here. That’s a separate issue from atomic operations. The C11 stdatomic.h macros and functions will ensure atomic values are accessed appropriately.
As stated before, all nodes are initially placed on the internal free stack. During initialization they’re allocated in one solid chunk, chained together, and pinned on the free pointer. The initial assignments to atomic values are done through ATOMIC_VAR_INIT, which deals with memory access ordering concerns. The aba counters don’t actually need to be initialized. Garbage, indeterminate values are just fine, but not initializing them would probably look like a mistake.
int
lstack_init(lstack_t *lstack, size_t max_size)
{
struct lstack_head head_init = {0, NULL};
lstack->head = ATOMIC_VAR_INIT(head_init);
lstack->size = ATOMIC_VAR_INIT(0);
/* Pre-allocate all nodes. */
lstack->node_buffer = malloc(max_size * sizeof(struct lstack_node));
if (lstack->node_buffer == NULL)
return ENOMEM;
for (size_t i = 0; i < max_size - 1; i++)
lstack->node_buffer[i].next = lstack->node_buffer + i + 1;
lstack->node_buffer[max_size - 1].next = NULL;
struct lstack_head free_init = {0, lstack->node_buffer};
lstack->free = ATOMIC_VAR_INIT(free_init);
return 0;
}
The free nodes will not necessarily be used in the same order that they’re placed on the free stack. Several threads may pop off nodes from the free stack and, as a separate operation, push them onto the value stack in a different order. Over time with multiple threads pushing and popping, the nodes are likely to get shuffled around quite a bit. This is why a linked list is still necessary even though allocation is contiguous.
The reverse of lstack_init() is simple, and it’s assumed concurrent access has terminated. The stack is no longer valid, at least not until lstack_init() is used again. This one is declared inline and put in the header.
static inline void
lstack_free(lstack_t *lstack)
{
free(lstack->node_buffer);
}
To read an atomic value we need to use atomic_load(). Give it a pointer to an atomic value and it dereferences the pointer and returns the value. This is used in another inline function for reading the size of the stack.
static inline size_t
lstack_size(lstack_t *lstack)
{
return atomic_load(&lstack->size);
}
For operating on the two stacks there will be two internal, static functions, push and pop. These deal directly in nodes, accepting and returning them, so they’re not suitable to expose in the API (users aren’t meant to be aware of nodes). This is the most complex part of lock-free stacks. Here’s pop().
static struct lstack_node *
pop(_Atomic struct lstack_head *head)
{
struct lstack_head next, orig = atomic_load(head);
do {
if (orig.node == NULL)
return NULL; // empty stack
next.aba = orig.aba + 1;
next.node = orig.node->next;
} while (!atomic_compare_exchange_weak(head, &orig, next));
return orig.node;
}
It’s centered around the new C11 stdatomic.h function atomic_compare_exchange_weak(). This is an atomic operation more generally called compare-and-swap (CAS). On x86 there’s an instruction specifically for this, cmpxchg. Give it a pointer to the atomic value to be updated (head), a pointer to the value it’s expected to be (orig), and a desired new value (next). If the expected and actual values match, it’s updated to the new value. If not, it reports a failure and updates the expected value to the latest value. In the event of a failure we start all over again, which requires the while loop. This is an optimistic strategy.
The “weak” part means it will sometimes spuriously fail where the “strong” version would otherwise succeed. In exchange for more failures, calling the weak version is faster. Use the weak version when the body of your do ... while loop is fast and the strong version when it’s slow (when trying again is expensive), or if you don’t need a loop at all. You usually want to use weak.
The alternative to CAS is load-link/store-conditional. It’s a stronger primitive that doesn’t suffer from the ABA problem described next, but it’s also not available on x86-64. On other platforms, one or both of atomic_compare_exchange_*() will be implemented using LL/SC, but we still have to code for the worst case (CAS).
The aba field is here to solve the ABA problem by counting the number of changes that have been made to the stack. It will be updated atomically alongside the pointer. Reasoning about the ABA problem is where I got stuck last time writing this article.
Suppose aba didn’t exist and it was just a pointer being swapped. Say we have two threads, A and B.
Thread A copies the current head into orig, enters the loop body to update next.node to orig.node->next, then gets preempted before the CAS. The scheduler pauses the thread.
Thread B comes along and performs a pop(), changing the value pointed to by head. At this point A’s CAS will fail, which is fine. It would reconstruct a new updated value and try again. While A is still asleep, B puts the popped node back on the free node stack.
Some time passes with A still paused. The freed node gets re-used and pushed back on top of the stack, which is likely given that nodes are allocated FIFO. Now head has its original value again, but the head->node->next pointer is pointing somewhere completely new! This is very bad because A’s CAS will now succeed despite next.node having the wrong value.
A wakes up and its CAS succeeds. At least one stack value has been lost and at least one node struct was leaked (it will be on neither stack, nor currently being held by a thread). This is the ABA problem.
The core problem is that, unlike integral values, pointers have meaning beyond their intrinsic numeric value. The meaning of a particular pointer changes when the pointer is reused, making it suspect when used in CAS. The unfortunate effect is that, by itself, atomic pointer manipulation is nearly useless. It will work for append-only data structures, where pointers are never recycled, but that’s it.
The aba field solves the problem because it’s incremented every time the pointer is updated. Remember that this internal stack struct is two pointers wide? That’s 16 bytes on a 64-bit system. The entire 16 bytes is compared by CAS and they all have to match for it to succeed. Since B, or other threads, will increment aba at least twice (once to remove the node, and once to put it back in place), A will never mistake the recycled pointer for the old one. There’s a special double-width CAS instruction specifically for this purpose, cmpxchg16b. This is generally called DWCAS. It’s available on most x86-64 processors. On Linux you can check /proc/cpuinfo for support. It will be listed as cx16.
If it’s not available at compile-time this program won’t link. The function that wraps cmpxchg16b won’t be there. You can tell GCC to assume it’s there with the -mcx16 flag. The same rule here applies to C++11’s new std::atomic.
There’s still a tiny, tiny possibility of the ABA problem cropping up. On 32-bit systems A may get preempted for over 4 billion (2^32) stack operations, such that the ABA counter wraps around to the same value. There’s nothing we can do about this, but if you witness this in the wild you need to immediately stop what you’re doing and go buy a lottery ticket. Also avoid any lightning storms on the way to the store.
Another problem in pop() is dereferencing orig.node to access its next field. By the time we get to it, the node pointed to by orig.node may have already been removed from the stack and freed. If the stack was using malloc() and free() for allocations, it may even have had free() called on it. If so, the dereference would be undefined behavior — a segmentation fault, or worse.
There are three ways to deal with this.
Garbage collection. If memory is automatically managed, the node will never be freed as long as we can access it, so this won’t be a problem. However, if we’re interacting with a garbage collector we’re not really lock-free.
Hazard pointers. Each thread keeps track of what nodes it’s currently accessing and other threads aren’t allowed to free nodes on this list. This is messy and complicated.
Never free nodes. This implementation recycles nodes, but they’re never truly freed until lstack_free(). It’s always safe to dereference a node pointer because there’s always a node behind it. It may point to a node that’s on the free list or one that was even recycled since we got the pointer, but the aba field deals with any of those issues.
Reference counting on the node won’t work here because we can’t get to the counter fast enough (atomically). It too would require dereferencing in order to increment. The reference counter could potentially be packed alongside the pointer and accessed by a DWCAS, but we’re already using those bytes for aba.
Push is a lot like pop.
static void
push(_Atomic struct lstack_head *head, struct lstack_node *node)
{
struct lstack_head next, orig = atomic_load(head);
do {
node->next = orig.node;
next.aba = orig.aba + 1;
next.node = node;
} while (!atomic_compare_exchange_weak(head, &orig, next));
}
It’s counter-intuitive, but adding a few microseconds of sleep after CAS failures would probably increase throughput. Under high contention, threads wouldn’t take turns clobbering each other as fast as possible. It would be a bit like exponential backoff.
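As a rough sketch of that idea, here’s push() with a short sleep after each failed exchange. The 2 µs figure is arbitrary and untested; real tuning would require measurement:
#include <time.h>

static void
push_backoff(_Atomic struct lstack_head *head, struct lstack_node *node)
{
    struct lstack_head next, orig = atomic_load(head);
    for (;;) {
        node->next = orig.node;
        next.aba = orig.aba + 1;
        next.node = node;
        if (atomic_compare_exchange_weak(head, &orig, next))
            break;
        nanosleep(&(struct timespec){0, 2000}, NULL);  /* back off ~2 µs */
    }
}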
The API push and pop functions are built on these internal atomic functions.
int
lstack_push(lstack_t *lstack, void *value)
{
struct lstack_node *node = pop(&lstack->free);
if (node == NULL)
return ENOMEM;
node->value = value;
push(&lstack->head, node);
atomic_fetch_add(&lstack->size, 1);
return 0;
}
Push removes a node from the free stack. If the free stack is empty it reports an out-of-memory error. It assigns the value and pushes it onto the value stack where it will be visible to other threads. Finally, the stack size is incremented atomically. This means there’s an instant where the stack size is listed as one shorter than it actually is. However, since there’s no way to access both the stack size and the stack itself at the same instant, this is fine. The stack size is really only an estimate.
Popping is the same thing in reverse.
void *
lstack_pop(lstack_t *lstack)
{
struct lstack_node *node = pop(&lstack->head);
if (node == NULL)
return NULL;
atomic_fetch_sub(&lstack->size, 1);
void *value = node->value;
push(&lstack->free, node);
return value;
}
Remove the top node, subtract the size estimate atomically, put the node on the free list, and return the pointer. It’s really simple with the primitive push and pop.
The lstack repository linked at the top of the article includes a demo that searches for patterns in SHA-1 hashes (sort of like Bitcoin mining). It fires off one worker thread for each core and the results are all collected into the same lock-free stack. It’s not really exercising the library thoroughly because there are no contended pops, but I couldn’t think of a better example at the time.
The next thing to try would be implementing a C11, bounded, lock-free queue. It would also be more generally useful than a stack, particularly for common consumer-producer scenarios.
qsort().
It sorts homogeneous arrays of arbitrary type. The interface is exactly what you’d expect given the constraints of the language.
void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
It takes a pointer to the first element of the array, the number of members, the size of each member, and a comparator function. The comparator has to operate on void * pointers because C doesn’t have templates or generics or anything like that. That’s two interfaces where type safety is discarded: the arguments passed to qsort() and again when it calls the comparator function.
One of the significant flaws of this interface is the lack of context for the comparator. C doesn’t have closures, which in other languages would cover this situation. If the sort function depends on some additional data, such as in Graham scan where points are sorted relative to a selected point, the extra information needs to be smuggled in through a global variable. This is not reentrant and wouldn’t be safe in a multi-threaded environment. There’s a GNU extension here, qsort_r(), that takes an additional context argument, allowing for reentrant comparators.
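As a sketch of the glibc flavor (BSD’s qsort_r() has a different argument order), a context-dependent comparison might look like this, here sorting points by distance from a chosen origin passed as context instead of through a global:
#define _GNU_SOURCE
#include <stdlib.h>

struct point { double x, y; };

static int
compare_dist(const void *a, const void *b, void *arg)
{
    const struct point *pa = a, *pb = b, *origin = arg;
    double da = (pa->x - origin->x) * (pa->x - origin->x) +
                (pa->y - origin->y) * (pa->y - origin->y);
    double db = (pb->x - origin->x) * (pb->x - origin->x) +
                (pb->y - origin->y) * (pb->y - origin->y);
    return (da > db) - (da < db);
}

static void
sort_by_distance(struct point *points, size_t n, struct point origin)
{
    qsort_r(points, n, sizeof(points[0]), compare_dist, &origin);
}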
Quicksort has some really nice properties. It’s in-place, so no temporary memory needs to be allocated. If implemented properly it only consumes O(log n) space, which is the stack growth during recursion. Memory usage is localized, so it plays well with caching.
That being said, qsort() is also a classic example of an API naming mistake. Few implementations actually use straight quicksort. For example, glibc’s qsort() is merge sort (in practice), and the other major libc implementations use a hybrid approach. Programs using their language’s sort function shouldn’t be concerned with how it’s implemented. All that matters is the interface and whether or not it’s a stable sort. OpenBSD made the exact same mistake when they introduced arc4random(), which no longer uses RC4.
Since quicksort is an unstable sort — there are multiple possible results when the array contains equivalent elements — this means qsort() is not guaranteed to be stable, even if internally the C library is using a stable sort like merge sort. The C standard library has no stable sort function.
The unfortunate side effect of unstable sorts is that they hurt composability. For example, let’s say we have a person struct like this:
struct person {
const char *first, *last;
int age;
};
Here are a couple of comparators to sort either by name or by age. As a side note, strcmp() automatically works correctly with UTF-8 so this program isn’t limited to old-fashioned ASCII names.
#include <string.h>
int compare_name(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int last = strcmp(pa->last, pb->last);
return last != 0 ? last : strcmp(pa->first, pb->first);
}
int compare_age(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
return pa->age - pb->age;
}
And since we’ll need it later, here’s a COUNT_OF macro to get the length of arrays at compile time. There’s a less error prone version out there, but I’m keeping it simple.
#define COUNT_OF(x) (sizeof(x) / sizeof(0[x]))
Say we want to sort by name, then age. When using a stable sort, this is accomplished by sorting on each field separately in reverse order of preference: a composition of individual comparators. Here’s an attempt at using quicksort to sort an array of people by age then name.
struct person people[] = {
{"Joe", "Shmoe", 24},
{"John", "Doe", 30},
{"Alan", "Smithee", 42},
{"Jane", "Doe", 30}
};
qsort(people, COUNT_OF(people), sizeof(struct person), compare_name);
qsort(people, COUNT_OF(people), sizeof(struct person), compare_age);
But this doesn’t always work. Jane should come before John, but the original sort was completely lost in the second sort.
Joe Shmoe, 24
John Doe, 30
Jane Doe, 30
Alan Smithee, 42
This could be fixed by defining a new comparator that operates on both fields at once, compare_age_name(), and performing a single sort. But what if later you want to sort by name then age? Now you need compare_name_age(). If a third field was added, there would need to be 6 (3!) different comparator functions to cover all the possibilities. If you had 6 fields, you’d need 720 comparators! Composability has been lost to a combinatorial nightmare.
The GNU libc documentation claims that qsort() can be made stable by using pointer comparison as a fallback. That is, when the relevant fields are equivalent, use their array position to resolve the difference.
If you want the effect of a stable sort, you can get this result by writing the comparison function so that, lacking other reason distinguish between two elements, it compares them by their addresses.
This is not only false, it’s dangerous! Because elements may be sorted in-place, even in glibc, their position will change during the sort. The comparator will be using their current positions, not the starting positions. What makes it dangerous is that the comparator will return different orderings throughout the sort as elements are moved around the array. This could result in an infinite loop, or worse.
The most direct way to work around the unstable sort is to eliminate any equivalencies. Equivalent elements can be distinguished by adding an intrusive order field which is set after each sort. The comparators will fall back on this field to maintain the original ordering.
struct person {
const char *first, *last;
int age;
size_t order;
};
And the new comparators.
int compare_name_stable(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int last = strcmp(pa->last, pb->last);
if (last != 0)
return last;
int first = strcmp(pa->first, pb->first);
if (first != 0)
return first;
return pa->order - pb->order;
}
int compare_age_stable(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int age = pa->age - pb->age;
return age != 0 ? age : pa->order - pb->order;
}
The first sort doesn’t need to be stable, but there’s not much reason to keep around two definitions.
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_name_stable);
for (size_t i = 0; i < COUNT_OF(people); i++)
people[i].order = i;
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_age_stable);
And the result:
Joe Shmoe, 24
Jane Doe, 30
John Doe, 30
Alan Smithee, 42
Without defining any new comparators I can sort by name then age just by swapping the calls to qsort(). At the cost of an extra bookkeeping field, the number of comparator functions needed as fields are added is O(n) and not O(n!) despite using an unstable sort.
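Concretely, the name-then-age ordering is just the same two passes with the comparators exchanged:
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_age_stable);
for (size_t i = 0; i < COUNT_OF(people); i++)
    people[i].order = i;
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_name_stable);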
The solution to this problem is to run the password through a one-way cryptographic hash function before storing it in the database. When the database is compromised, it’s more difficult to work backwards to recover the passwords. Examples of one-way hash functions are MD5, SHA-1, and the SHA-2 family. Block ciphers can also be converted into hash functions (e.g. bcrypt from Blowfish), though it must be done carefully since block ciphers were generally designed with different goals in mind.
md5("foobar");
// => "3858f62230ac3c915f300c664312c63f"
sha1("foobar");
// => "8843d7f92416211de9ebb963ff4ce28125932878"
However, these particular functions (SHA-1 and SHA-2) alone are poor password hashing functions: they’re much too fast! An offline attacker can mount a rapid brute-force attack on these kinds of hashes. They also don’t include a salt, a unique, non-secret, per-hash value used as additional hash function input. Without this, an attacker could prepare the entire attack ahead of time — a rainbow table. Once the hashes are obtained, reversing them is just a matter of looking them up in the table.
Good password hashing needs to be slow and it needs support for salt. Examples of algorithms well-suited for this purpose are PBKDF2, bcrypt, and scrypt. These are the functions you’d want to use in a real application today. Each of these is also more generally a key derivation function. They can strengthen a relatively short human-memorable passphrase by running it through a long, slow procedure before making use of it as an encryption key. A brute-force attacker would need to perform this slow procedure for each individual guess.
Alternatively, if you’re stuck using a fast hash function anyway, it could be slowed down by applying the function thousands or even millions of times recursively to its own output. This is what I did in order to strengthen my GPG passphrase. However, you’re still left with the problem of applying the salt. The naive approach would be a plain string concatenation with the password, but this is likely to be vulnerable to a length extension attack. The proper approach would be to use HMAC.
For my solution to the challenge, I wasn’t looking for something strong enough to do key derivation. I just need a slow hash function that properly handles a salt. Another important goal was to keep the solution small enough to post as a reddit comment, and I wanted to do it without using any external crypto libraries. If I’m using a library, I might as well just include/import/require PBKDF2 and make it a 2-liner, but that would be boring. I wanted it to be a reasonably short C program with no external dependencies other than standard libraries. Not counting the porcelain, the final result weighs in at 115 lines of C, so I think I achieved all my goals.
So what’s the smallest, modern cryptographic algorithm I’m aware of? That would be RC4, my favorite random generator! Unlike virtually every other cryptographic algorithm, it’s easy to commit to memory and to implement from scratch without any reference documentation. Similarly, this password hashing function can be implemented entirely from memory (if you can imagine yourself in some outlandish situation where that would be needed).
Unfortunately, RC4 has had a lot of holes punched in it over the years. The initial output has been proven to be biased, leaking key material, and there’s even good reason to believe it may already be broken by nation state actors. Despite this, RC4 remains the most widely used stream cipher today due to its inclusion in TLS. Most importantly here, almost none of RC4’s weaknesses apply to this situation — we’re only using a few bytes of output — so it’s still a very strong algorithm. Besides, what I’m developing is a proof of concept, not something to be used in a real application. It would be interesting to see how long it takes for someone to break this (maybe even decades).
Before I dive into the details, I’ll link to the source repository. As of this writing there are C and Elisp implementations of the algorithm, and they will properly validate each other’s hashes. I call it RC4HASH.
Here are some example hashes for the password “foobar”. It’s different each time because each has a unique salt. Notice the repeated byte 12 in the 5th byte position of the hash.
$ ./rc4hash -p foobar
c56cdbe512c922a2f9682cc0dfa21259e4924304e9e9b486c49d
$ ./rc4hash -p foobar
a1ea954b1296052a7cd766eb989bfd52915ab267733503ef3e8d
$ ./rc4hash -p foobar
5603de351288547e12b89585171f40cf480001b21dcfbd25f3f4
Each also validates as correct.
$ ./rc4hash -p foobar -v c56cdbe5...b486c49d
valid
$ ./rc4hash -p foobar -v a1ea954b...03ef3e8d
valid
$ ./rc4hash -p foobar -v 5603de35...bd25f3f4
valid
RC4 is a stream cipher, which really just means it’s a fancy random number generator. How can we turn this into a hash function? The content to be hashed can be fed to the key schedule algorithm in 256-byte chunks, as if it were a key. The key schedule is a cipher initialization stage that shuffles up the cipher state without generating output. To put all this in terms of C, here’s what the RC4 struct and initialization looks like: a 256-element byte array initialized with 0-255, and two array indexes.
struct rc4 {
uint8_t S[256];
uint8_t i, j;
};
void rc4_init(struct rc4 *k) {
k->i = k->j = 0;
for (int i = 0; i < 256; i++) {
k->S[i] = i;
}
}
The key schedule shuffles this state according to a given key.
#define SWAP(a, b) if (a ^ b) {a ^= b; b ^= a; a ^= b;}
void rc4_schedule(struct rc4 *k, const uint8_t *key, size_t length) {
int j = 0;
for (int i = 0; i < 256; i++) {
j = (j + k->S[i] + key[i % length]) % 256;
SWAP(k->S[i], k->S[j]);
}
}
Notice it doesn’t touch the array indexes. It can be called over and over with different key material to keep shuffling the state. This is how I’m going to mix the salt into the password. The key schedule will first be run on the salt, then again on the password.
To produce the hash output, emit the desired number of bytes from the cipher. Ta da! It’s now an RC4-based salted hash function.
void rc4_emit(struct rc4 *k, uint8_t *buffer, size_t count) {
for (size_t b = 0; b < count; b++) {
k->j += k->S[++k->i];
SWAP(k->S[k->i], k->S[k->j]);
buffer[b] = k->S[(k->S[k->i] + k->S[k->j]) & 0xFF];
}
}
/* Throwaway 64-bit hash example. Assumes strlen(passwd) <= 256. */
uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
struct rc4 rc4;
rc4_init(&rc4);
rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
rc4_schedule(&rc4, (const uint8_t *)passwd, strlen(passwd));
uint64_t hash;
rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
return hash;
}
Both password and salt are the inputs to hash function. In order to validate a password against a hash, we need to keep track of the salt. The easiest way to do this is to concatenate it to the hash itself, making it part of the hash. Remember, it’s not a secret value, so this is safe. For my solution, I chose to use a 32-bit salt and prepend it to 20 bytes of generator output, just like an initialization vector (IV). To validate a user, all we need is a hash and a password provided by the user attempting to authenticate.
Right now there’s a serious flaw. If you want to find it for yourself, stop reading here. It will need to get fixed before this hash function is any good.
It’s trivial to find a collision, which is the death knell for any cryptographic hash function. Certain kinds of passwords will collapse down to the simplest case.
hash("x", "salt", 4);
// => 8622913094354299445
hash("xx", "salt", 4);
// => 8622913094354299445
hash("xxx", "salt", 4);
// => 8622913094354299445
hash("xxxx", "salt", 4);
// => 8622913094354299445
hash("abc", "salt", 4);
// => 8860606953758435703
hash("abcabc", "salt", 4);
// => 8860606953758435703
Notice a pattern? Take a look at the RC4 key schedule function. Using modular arithmetic, the password wraps around repeating itself over 256 bytes. This means passwords with repeating patterns will mutate the cipher identically regardless of the number of repeats, so they result in the same hash. A password “abcabcabc” will be accepted as “abc”.
The fix is to avoid wrapping the password. Instead, the RC4 generator, seeded only by the salt, is used to pad the password out to 256 bytes without repeating.
uint64_t hash(const char *passwd, const char *salt, size_t salt_len) {
struct rc4 rc4;
rc4_init(&rc4);
rc4_schedule(&rc4, (const uint8_t *)salt, salt_len);
uint8_t padded[256];
memcpy(padded, passwd, strlen(passwd));
rc4_emit(&rc4, padded + strlen(passwd), 256 - strlen(passwd));
rc4_schedule(&rc4, padded, sizeof(padded));
uint64_t hash;
rc4_emit(&rc4, (uint8_t *)&hash, sizeof(hash));
return hash;
}
This should also help mix the RC4 state up a bit more before generating the output. I’m no cryptanalyst, though, so I don’t know if it’s worth much.
The next problem is that this is way too fast! It shuffles bytes around for a few microseconds and it’s done. So far it also doesn’t address the problems of biases in RC4’s initial output. We’ll kill two birds with one stone for this one.
To fix this we’ll add an adaptive difficulty factor. It will be a value that determines how much work will be done to compute the hash. It’s adaptive because the system administrator can adjust it at any time without affecting previous hashes. To accomplish this, like the salt, the difficulty factor will be appended to the hash output. All the required information will come packaged together in the hash.
The difficulty factor comes into play in two areas. First, it determines how many times the key schedule is run. This is the same modification CipherSaber-2 uses in order to strengthen RC4’s weak key schedule. However, rather than run it on the order of 20 times, our hash function will be running it hundreds of thousands of times. Second, the difficulty will also determine how many initial bytes of output are discarded before we start generating the hash.
I decided on an unsigned 8-bit value for the difficulty. The number of key schedules will be 1 shifted left by this number of bits (i.e. pow(2, difficulty)). This makes the minimum number of key schedules 1, since any less doesn’t make sense. The number of bytes skipped is the same bitshift, minus 1, times 64 ((pow(2, difficulty) - 1) * 64); the multiplication is so that it can skip large swaths of output. Therefore the implementation so far has a difficulty of zero: one key schedule round and zero bytes of output skipped.
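A sketch of how the difficulty could drive the work, reusing the struct rc4 functions from earlier, might look like the following. It assumes a small difficulty (the shift would overflow past 63) and a 256-byte padded password; the real RC4HASH code in the repository differs in its details:
#include <stdint.h>

static void
apply_difficulty(struct rc4 *k, const uint8_t *padded, uint8_t difficulty)
{
    uint64_t rounds = UINT64_C(1) << difficulty;   /* pow(2, difficulty) */
    for (uint64_t i = 0; i < rounds; i++)
        rc4_schedule(k, padded, 256);              /* repeated key schedule */
    uint8_t discard[64];
    uint64_t skip = (rounds - 1) * 64;             /* drop early, biased output */
    for (uint64_t i = 0; i < skip; i += sizeof(discard))
        rc4_emit(k, discard, sizeof(discard));
}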
The dynamic range of the difficulty factor (0-255) puts the time needed on a modern computer to compute an RC4HASH between a few microseconds (0) and billions of years (255). That should be a more than sufficient amount of future proofing, especially considering that we’re using RC4, which will likely be broken before the difficulty factor ever tops out.
I won’t show the code for this step since it’s exactly how it’s implemented in the final version, so go look at the repository instead. The final hash is 26 bytes long: a 208-bit hash. The first 4 bytes are the salt (grabbed from /dev/urandom in my implementations), the next byte is the difficulty factor, and the final 21 bytes are RC4 output.
In the example hashes above, the 12 constant byte is the difficulty factor. The default difficulty factor is 18 (0x12). I’ve considered XORing this with some salt-seeded RC4 output just to make the hash look nice, but that just seems like arbitrary complexity for no real gains. With the default difficulty, it takes almost a second for my computers to compute the hash.
I believe RC4HASH should be quite resistant to GPGPU attacks. RC4 is software oriented, involving many random array reads and writes rather than SIMD-style operations. GPUs are really poor at this sort of thing, so they should take a significant performance hit when running RC4HASH.
For those interested in breaking RC4HASH, here are a couple of hashes of English language passphrases. Each is about one short sentence in length ([a-zA-Z !.,]+). I’m not keeping track of the sentences I used, so the only way to get them will be to break the hash, even for me.
f0622dde127f9ab5aaee710aa4bfb17a224f7e6e93745f7ae948
8ee9cdec12feabed5c2fde0a51a2381b522f5d2bd483717d4a96
If you can find a string that validates with these hashes, especially if it’s not the original passphrase, you win! I don’t have any prizes in mind right now, but perhaps some Bitcoin would be in order if your attack is interesting.
How many times, on average, must a random number from [0, 1] be drawn for the sum to exceed 1?
If you want to figure it out for yourself, stop reading now and come back when you’re done.
The answer is e. When I came across this question I took the lazy programmer route and, rather than work out the math, I estimated the answer using the Monte Carlo method. I used the language I always use for these scratchpad computations: Emacs Lisp. All I need to do is switch to the *scratch* buffer and start hacking. No external program needed.
The downside is that Elisp is incredibly slow. Fortunately, Elisp is so similar to Common Lisp that porting to it is almost trivial. My preferred Common Lisp implementation, SBCL, is very, very fast so it’s a huge speed upgrade with little cost, should I need it. As far as I know, SBCL is the fastest Common Lisp implementation.
Even though Elisp was fast enough to determine that the answer is probably e, I wanted to play around with it. This little test program doubles as a way to estimate the value of e, similar to estimating pi. The more trial runs I give it the more accurate my answer will get — to a point.
Here’s the Common Lisp version. (I love the loop macro, obviously.)
(defun trial ()
(loop for count upfrom 1
sum (random 1.0) into total
until (> total 1)
finally (return count)))
(defun monte-carlo (n)
(loop repeat n
sum (trial) into total
finally (return (/ total 1.0 n))))
Using SBCL 1.0.57.0.debian on an Intel Core i7-2600 CPU, once everything’s warmed up this takes about 9.4 seconds with 100 million trials.
(time (monte-carlo 100000000))
Evaluation took:
9.423 seconds of real time
9.388587 seconds of total run time (9.380586 user, 0.008001 system)
99.64% CPU
31,965,834,356 processor cycles
99,008 bytes consed
2.7185063
Since this makes for an interesting benchmark I gave it a whirl in JavaScript,
function trial() {
var count = 0, sum = 0;
while (sum <= 1) {
sum += Math.random();
count++;
}
return count;
}
function monteCarlo(n) {
var total = 0;
for (var i = 0; i < n; i++) {
total += trial();
}
return total / n;
}
I ran this on Chromium 24.0.1312.68 Debian 7.0 (180326) which uses V8, currently the fastest JavaScript engine. With 100 million trials, this only took about 2.7 seconds!
monteCarlo(100000000); // ~2.7 seconds, according to Skewer
// => 2.71850356
Whoa! It beat SBCL! I was shocked. Let’s try using C as a baseline. Surely C will be the fastest.
#include <stdio.h>
#include <stdlib.h>
int trial() {
int count = 0;
double sum = 0;
while (sum <= 1.0) {
sum += rand() / (double) RAND_MAX;
count++;
}
return count;
}
double monteCarlo(int n) {
int i, total = 0;
for (i = 0; i < n; i++) {
total += trial();
}
return total / (double) n;
}
int main() {
printf("%f\n", monteCarlo(100000000));
return 0;
}
I used the highest optimization setting on the compiler.
$ gcc -ansi -W -Wall -Wextra -O3 temp.c
$ time ./a.out
2.718359
real 0m3.782s
user 0m3.760s
sys 0m0.000s
Incredible! JavaScript was faster than C! That was completely unexpected.
Both the Common Lisp and C code could probably be carefully tweaked to improve performance. In Common Lisp’s case I could attach type information and turn down safety. For C I could use more compiler flags to squeeze out a bit more performance. Then maybe they could beat JavaScript.
In contrast, as far as I can tell the JavaScript code is already as optimized as it can get. There just aren’t many knobs to tweak. Note that minifying the code will make no difference, especially since I’m not measuring the parsing time. Except for the functions themselves, the variables are all local, so they are never “looked up” at run-time. Their name length doesn’t matter. Remember, in JavaScript global variables are expensive, because they’re (generally) hash table lookups on the global object at run-time. For any decent compiler, local variables are basically precomputed memory offsets — very fast.
The function names themselves are global variables, but the V8 compiler appears to eliminate this cost (inlining?). Wrapping the entire thing in another function, turning the two original functions into local variables, makes no difference in performance.
While Common Lisp and C may be able to beat JavaScript if time is invested in optimizing them — something to be done rarely — in a casual implementation of this algorithm, JavaScript beats them both. I find this really exciting.
Really, it’s not worth downloading, but I’m putting a link here for my own archival purposes.
I didn’t quite understand what I was doing so I screwed up the math. All the vector computations were done independently. Integration was done by Euler method — a sin I continue to commit regularly to this day but now I’m at least aware of the limitations. Despite this, it was still accurate enough to look interesting.
Probably the most advanced thing to come out of it, and something I did do correctly, was the display. I worked out my own graphics engine to project three-dimensional star coordinates onto the two-dimensional drawing surface, re-inventing perspective projection.
As I said, I recently came across it again while digging around my digital archives. Now that I’m a professional developer I wondered how much faster I could do the same thing with just a few hours of coding. I did it in C and my implementation was about an order of magnitude faster. Not as much as I hoped, but it’s something!
It’s still Euler method integration, the bodies are still point masses, and there are no collisions so there’s numerical instability when they get close. However, I did get the vector math right! My goal was to make something that looked interesting rather than an accurate simulation, so all of this is alright.
I only wrote the simulation, not a display. To display the output I just had GNU Octave plot it for me, which I turned into videos. This first video is a static view of the origin of the coordinate system. If you watch (or skip) all the way to the end you’ll see that the galaxy drifts out of view. This is due to a bias in the random number generator — the galaxy’s mass was lopsided.
After seeing this drift I added dynamic pan and zoom, so that the camera follows the action. It’s a bit excessive at the beginning (the camera is too dynamic) and the end (the camera is too far out).
I bit more tweaking of the galaxy start state (normal distribution, adding initial velocities) and the camera and I got this interesting result. The galaxy initially bunches into two globs, which then merge.
I wouldn’t have bothered with a post about this but I think these videos turned out to be interesting.
]]>I’ve been programming in C for seven years now but it seems there’s always something new for me to learn about it. The book cleared up some incomplete concepts I had about C, particularly the relationship between pointers and arrays as well as operator precedence — the reason why function pointers look so weird. By the end I re-gained an appreciation for the simplicity and power of C. All of the examples in the book are written without heap allocation (no malloc()), just static memory, and it manages to get by with rather few limitations.
As I was reading, I came up with a handful of “tricky” questions that I wouldn’t have been able to answer with confidence before reading the book. If you’re a C developer, pause and reflect just after each chunk of example code and try to answer the question as correctly as you can. Pretend you’re a compiler and think about what you need to do in each situation.
What is the output of this program?
#include <stdio.h>
int main()
{
register int foo;
printf("%p\n", &foo);
return 0;
}
The register keyword hints to the compiler that the automatic variable should be stored in a register rather than memory, making access to the variable faster. This is only a hint, so the compiler is free to ignore it.
In the example we take a pointer to the variable. However, we declared this variable to be stored in a register. Addresses only point to locations in memory, so registers can’t be addressed by a pointer. While the compiler can ignore the optimization hint and provide an address, this is ultimately an inconsistent request. The compiler will produce an error and the code will not compile.
Is this program valid?
struct {
int foo, bar;
} baz;
int *example()
{
return &baz.foo;
}
Here we’re creating a struct called baz and take a pointer to one of its fields. According to K&R C, this is invalid. (Update: I misunderstood. This is allowed.) Overall, structs are really limited in K&R C: they can’t be function arguments, nor can they be returned from functions, nor can pointers be taken to their fields. Only pointers to structs are first-class. They acknowledged that this was limiting and said they planned on fixing it in the future.
Fortunately, this was fixed with ANSI C and structs are first-class objects. This means the above program is valid in ANSI C.
How about this one?
struct {
int foo : 4;
} baz;
int *example()
{
return &baz.foo;
}
The foo field is a 4-bit wide bit-field — smaller than a single byte. Pointers can only address whole bytes, so this is invalid. Even if foo were 8 or 32 bits wide (full/aligned bytes on modern architectures) this would still be invalid.
We want to average two pointers to get a pointer in-between them. Is this reasonable code?
char *foo()
{
char *start = "hello";
char *end = start + 5;
return (start + end) / 2;
}
A thoughtful programmer should notice that adding together pointers is likely to be disastrous. Pointers tend to be very large, addressing high areas of memory. Adding two pointers together is very likely to lead to an overflow. When I posed this question to Brian, he realized this and came up with this solution to avoid the overflow.
return start / 2 + end / 2;
However, this is still invalid. To rule out the overflow problem entirely, pointer addition is forbidden, so neither of these will compile. Pointer subtraction is perfectly valid, so the midpoint can be computed like so.
return (end - start) / 2 + start;
Subtracting two pointers produces an integer. Adding integers to pointers is not only valid but also essential, so this is only a restriction about adding pointers together.
Is this valid?
void foo()
{
char hello[] = "hello";
char *foo = hello;
}
hello is an array of chars and foo is a pointer to a char. In general, arrays are interchangeable with pointers of the same type so this is valid. Now how about this one?
void foo()
{
char hello[6];
char *foo = "hello";
hello = foo;
}
Here we’ve inverted the relationship and are trying to assign the array as a pointer. This is invalid. Arrays are like pointer constants in that they can’t be used as lvalues — they can’t be reassigned to point to somewhere else. The closest you can get is to copy the contents of foo into hello.
I think that about sums my questions. I (foolishly) didn’t write them down as I came up with them and this is everything I can remember.
]]>git clone git://github.com/skeeto/perlin-noise.git
In short, Perlin noise is based on a grid of randomly-generated gradient vectors which describe how the arbitrarily-dimensional “surface” is sloped at that point. The noise at the grid points is always 0, though you’d never know it. When sampling the noise at some point between grid points, a weighted interpolation of the surrounding gradient vectors is calculated. Vectors are reduced to a single noise value by dot product.
Rather than waste time trying to explain it myself, I’ll link to an existing, great tutorial: The Perlin noise math FAQ. There’s also the original presentation by Ken Perlin, Making Noise, which is more concise but harder to grok.
When making my own implementation, I started by with Octave. It’s my “go to language” for creating a prototype when I’m doing something with vectors or matrices since it has the most concise syntax for these things. I wrote a two-dimensional generator and it turned out to be a lot simpler than I thought it would be!
Because it’s 2D, there are four surrounding grid points to consider and these are all hard-coded. This leads to an interesting property: there are no loops. The code is entirely vectorized, which makes it quite fast. It actually keeps up with my generalized Java solution (next) when given a grid of points, such as from meshgrid().
The grid gradient vectors are generated on the fly by a hash function. The integer x and y positions of the point are hashed using a bastardized version of Robert Jenkins’ 96 bit mix function (the one I used in my infinite parallax starfield) to produce a vector. This turned out to be the trickiest part to write, because any weaknesses in the hash function become very apparent in the resulting noise.
Using Octave, this took two seconds to generate on my laptop. You can’t really tell by looking at it, but, as with all Perlin noise, there is actually a grid pattern.
I then wrote a generalized version, perlin.m, that can generate arbitrarily-dimensional noise. This one is a lot shorter, but it’s not vectorized, can only sample one point at a time, and is incredibly slow. For a hash function, I use Octave’s hashmd5(), so this one won’t work in Matlab (which provides no hash function whatsoever). However, it is a lot shorter!
%% Returns the Perlin noise value for an arbitrary point.
function v = perlin(p)
v = 0;
%% Iterate over each corner
for dirs = [dec2bin(0:(2 ^ length(p) - 1)) - 48]'
q = floor(p) + dirs'; % This iteration's corner
g = qgradient(q); % This corner's gradient
m = dot(g, p - q);
t = 1.0 - abs(p - q);
v += m * prod(3 * t .^ 2 - 2 * t .^ 3);
end
end
%% Return the gradient at the given grid point.
function v = qgradient(q)
v = zeros(size(q));
for i = 1:length(q);
v(i) = hashmd5([i q]) * 2.0 - 1.0;
end
end
It took Octave an entire day to generate this “fire” video, which is ridiculously long. An old graphics card could probably do this in real time.
This was produced by viewing a slice of 3D noise. For animation, the viewing area moves in two dimensions (z and y). One dimension makes the fire flicker, the other makes it look like it’s rising. A simple gradient was applied to the resulting noise to fade away towards the top.
I wanted to achieve this same effect faster, so next I made a generalized Java implementation, which is the bulk of the repository. I wrote my own Vector class (completely unlike Java’s deprecated Vector but more like Apache Commons Math’s RealVector), so it looks very similar to the Octave version. It’s much, much faster than the generalized Octave version. It doesn’t use a hash function for gradients — instead it randomly generates them as needed and keeps track of them for later with a Map.
I wanted to go faster yet, so next I looked at OpenCL for the first time. OpenCL is an API that allows you to run C-like programs on your graphics processing unit (GPU), among other things. I was sticking to Java so I used lwjgl’s OpenCL bindings. In order to use this code you’ll need an OpenCL implementation available on your system, which, unfortunately, is usually proprietary. My OpenCL noise generator only generates 3D noise.
Why use the GPU? GPUs have a highly-parallel structure that makes them faster than CPUs at processing large blocks of data in parallel. This is really important when it comes to computer graphics, but it can be useful for other purposes as well, like generating Perlin noise.
I had to change my API a little to make this effective. Before, to generate noise samples, I passed points in individually to PerlinNoise. To properly parallelize this for OpenCL, an entire slice is specified by setting its width, height, step size, and z-level. This information, along with pre-computed grid gradients, is sent to the GPU.
This is all in the opencl branch in the repository. When run, it will produce a series of slices of 3D noise in a manner similar to the fire example above. For comparison, it will use the CPU by default, generating a series of simple-*.png. Give the program one argument, “opencl”, and it will use OpenCL instead, generating a series of opencl-*.png. You should notice a massive increase in speed when using OpenCL. In fact, it’s even faster than this. The vast majority of the time is spent creating these output PNG images. When I disabled image output for both, OpenCL was 200 times faster than the (single-core) CPU implementation, still spending a significant amount of time just loading data off the GPU.
And finally, I turned the OpenCL output into a video.
That’s pretty cool!
I still don’t really have a use for Perlin noise, especially not under constraints that require I use OpenCL to generate it. The big thing I got out of this project was my first experience with OpenCL, something that really is useful at work.
]]>The solution they were aiming for was to create a pair of virtual serial ports. The filter software would read data in on the real serial port, output the filtered data into a virtual serial port which would be virtually connected to a second virtual serial port. The analysis software would then read from this second serial port. They couldn’t figure out how to set this up, short of buying a couple of USB/serial port adapters and plugging them into each other.
It turns out this is very easy to do on Unix-like systems. POSIX defines two functions, posix_openpt(3) and ptsname(3). The first one creates a pseudo-terminal — a virtual serial port — and returns a “master” file descriptor used to talk to it. The second provides the name of the pseudo-terminal device on the filesystem, usually named something like /dev/pts/5.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

int main()
{
    int fd = posix_openpt(O_RDWR | O_NOCTTY);
    grantpt(fd);   /* set permissions on the slave device */
    unlockpt(fd);  /* allow the slave device to be opened */
    printf("%s\n", ptsname(fd));
    /* ... read and write to fd ... */
    return 0;
}
The printed device name can be opened by software that’s expecting to access a serial port, such as minicom, and it can be communicated with as if by a pipe. This could be useful in testing a program’s serial port communication logic virtually.
The reason for the unusually long name is that the function wasn’t added to POSIX until 1998 (Unix98). They were probably afraid of name collisions with software already using openpt() as a function name. The GNU C Library provides an extension, getpt(3), which is just shorthand for the above.
int fd = getpt();
Pseudo-terminal functionality was available much earlier, of course. It could be done through the poorly designed openpty(3), added in BSD Unix.
int openpty(int *amaster, int *aslave, char *name,
const struct termios *termp,
const struct winsize *winp);
It accepts NULL for the last three arguments, allowing the user to ignore them. What makes it so bad is the name string. The user would pass it a chunk of allocated space and hope it was long enough for the file name. If not, openpty() would overwrite the end of the string and trash some memory. It’s highly unlikely to ever exceed something like 32 bytes, but it’s still a correctness problem.
The newer ptsname() is only slightly better. It returns a string that doesn’t need to be free()d because it’s static memory. However, that means the function is not re-entrant; it has issues in multi-threaded programs, since that string could be trashed at any instant by another call to ptsname(). Consider this case,
int fd0 = getpt();
int fd1 = getpt();
printf("%s %s\n", ptsname(fd0), ptsname(fd1));
ptsname() will be returning the same char * pointer each time it’s called, merely filling the pointed-to space before returning. Rather than printing two different device filenames, the above would print the same filename twice. The GNU C Library provides an extension to correct this flaw, ptsname_r(), where the user provides the memory as before but also indicates its maximum size.
To make a one-way virtual connection between our pseudo-terminals, create two of them and do the typical buffer thing between the file descriptors (for succinctness, no checking for errors),
while (1) {
    char buffer;
    int in = read(pt0, &buffer, 1);
    write(pt1, &buffer, in);
}
Making a two-way connection would require the use of threads or select(2), but it wouldn’t be much more complicated.
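For instance, a rough sketch of the select(2) version might look like this, reusing the pt0 and pt1 masters from above (error handling still omitted; an illustration rather than tested code):

#include <sys/select.h>
#include <unistd.h>

/* Relay bytes in both directions between the two master descriptors. */
void relay(int pt0, int pt1)
{
    for (;;) {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(pt0, &fds);
        FD_SET(pt1, &fds);
        int nfds = (pt0 > pt1 ? pt0 : pt1) + 1;
        select(nfds, &fds, NULL, NULL, NULL);

        char buffer;
        if (FD_ISSET(pt0, &fds)) {
            ssize_t in = read(pt0, &buffer, 1);
            write(pt1, &buffer, in);
        }
        if (FD_ISSET(pt1, &fds)) {
            ssize_t in = read(pt1, &buffer, 1);
            write(pt0, &buffer, in);
        }
    }
}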
While all this was new and interesting to me, it didn’t help my dad at all because they’re using Windows. These functions don’t exist there and creating virtual serial ports is a highly non-trivial, less-interesting process. Buying the two adapters and connecting them together is my recommended solution for Windows.
]]>I've learned how to use the curses/ncurses library recently, and I decided to experiment with threading while using ncurses. This posed another learning opportunity: a chance to learn more about GNU Pth, a non-preemptive, userspace threading library. It's a really cool trick to get threading into any C program on any platform. I've used POSIX threads, Pthreads, before, so Pth is totally new to me.
The idea was this: a ticking timestamp string that I can move around the screen with the arrow keys. One thread will keep the clock up to date, and the other listens for user input and changes the clock coordinates.
The ncurses function for getting a key from the user is getch(). By default, it blocks waiting for the user to press a key, which is returned. Unfortunately, this doesn't interact with Pth well at all.
Pth threads are userspace threads rather than system threads. This means the operating system kernel is completely unaware of them, so they are managed by a userspace scheduler and the Pth threads all run inside a single system thread. Because of this, Pth threads can never take advantage of hardware parallelism. This disadvantage comes with the advantage of portability (hence the name "portable threads"). It can be used on systems that provide no threading support.
Pth threads are also non-preemptive. This means the thread currently in control must eventually choose to yield control to other threads, cooperating with them. Preemptive threads take control from each other, so they never have to yield. Fortunately, the Pth library sneaks in implicit yielding, so you usually don't have to worry about this when using Pth. You can generally treat the Pth threads as if they were preemptive.
As I was saying, Pth wasn't behaving well with ncurses. When I called getch(), my entire program, all threads included, was getting blocked, which defeats the whole purpose of threading. I switched to Pthreads to see if it was a mistake on my part, or an issue with Pth. Pthreads was working just fine, so I had to figure out what I was doing wrong with Pth.
I did manage to get Pth to behave the same way as the Pthreads version, but it took two significant extra changes. Here's a code listing. There's also a mutex-synchronized draw_clock() function not shown here. I've written it so that I could easily switch back and forth between the two threading libraries.
Notice the two (bolded) differences in the code between Pth and Pthreads, in the USE_PTH sections. The Pth implementation uses the halfdelay() function, which tells getch() to be less blocking. The argument tells it to return with an error if nothing happens within one tenth of a second. This means our main polling loop will execute about 10 times a second when nothing is happening.
The second change is an explicit yield, because Pth doesn't place any implicit yields in the loop. Without the yield the same problem remains.
I'm not happy with either of these changes. Making getch() behave like non-blocking input is very hackish. If I extend my program I'll always have to be careful when I call getch(), since it's (essentially) non-blocking. I'll also have to make sure I always yield when polling with getch(). So how do I fix this? First, let's see why we need that explicit yield. Pth is supposed to be hiding that from me.
How does Pth insert implicit yielding? I've been aware of Pth for years, and I've always wondered about this, but never bothered to look. I dug into the Pth sources.
Pth inserts yields before some of the common blocking operations. It has its own definitions for functions such as read(), write(), fork(), and system(). This is where the implicit yielding is injected. It steals your calls to these functions, using its own functions. Here's the relevant section of pth.h, where the "soft" version of this happens (the "hard" version uses syscall(), operating at a lower level),
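The actual listing isn't reproduced here, but the soft mapping is essentially a block of macros that rename the libc calls to Pth's wrappers, approximately like this (the exact list and guard macros vary between Pth versions):

/* Approximate sketch of the soft syscall mapping in pth.h */
#if PTH_SYSCALL_SOFT
#define fork     pth_fork
#define sleep    pth_sleep
#define system   pth_system
#define select   pth_select
#define poll     pth_poll
#define read     pth_read
#define write    pth_write
#define connect  pth_connect
#define accept   pth_accept
#endif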
Its own functions wrap the real deal, but suspend themselves on an awaiting event and yield. Any time the scheduler runs, it polls for these events, using select() in the case of input/output, and will wake these threads back up if the required event occurs.
The problem with getch() is that Pth doesn't know about it, so it doesn't get a chance to handle it properly. After taking a look at the implementation for pth_read(), I fixed getch() for my program,
It tells the Pth scheduler that the thread wants to be suspended until there's something to read on stdin (file descriptor 0). This prevents the system thread that everyone is counting on from blocking. With this redefinition I went back and removed the two Pth additions, halfdelay() and the yield, and it now behaves exactly the same way as the Pthreads version. Fixed!
If you use any other libraries, you'll need to do this for any long-blocking function that Pth doesn't already catch.
If you want to see this in action, here's the full source: clock.c. You can choose the threading library to use,
gcc -lncurses -lpth -DUSE_PTH clock.c -o clock_pth
gcc -lncurses -pthread clock.c -o clock_pthread
Update: I've just discovered that Debian and Debian-based systems have implicit yielding disabled by default. You may need to add this before the pth.h include. This is now in the linked source.
]]>Here's something I only learned recently, since it came up when working on Wisp. In C, function pointers are incompatible with normal pointers. For example, this is unportable C,
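The original snippet isn't shown here, but an example in the same spirit (hypothetical names) would be:

void greet(void)
{
    /* ... */
}

int main()
{
    void *p = (void *) greet;              /* function pointer to void *      */
    void (*f)(void) = (void (*)(void)) p;  /* void * back to function pointer */
    f();
    return 0;
}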
This is because function pointers are a different size than other pointers on some architectures, or even within the same architecture with different models (x86's compact and medium models). If the compiler in such a scenario allowed this, the pointer may be truncated and would likely point to the wrong place. It wasn't until I added the -pedantic flag to gcc that it started warning me about situations like the above. The -W -Wall flags are silent here.
The relevant part of the ANSI C standard lists the following as a common, but unportable, extension to the language,
A pointer to an object or to void may be cast to a pointer to a function, allowing data to be invoked as a function. A pointer to a function may be cast to a pointer to an object or to void, allowing a function to be inspected or modified (for example, by a debugger).
There is a discussion, including an example, on Stack Overflow: Can the Size of Pointers Vary Depending on what's Pointed To?. It also links to comp.lang.c FAQ Question 4.13, suggesting the use of a union, which is exactly what I did in Wisp.
I bet this issue only comes up very rarely. How often do you have to store a function pointer in a void pointer? It subverts the type system and is generally a bad idea. I had to do it in Wisp as part of its value polymorphism, which is why it bit me. This is probably why gcc doesn't get very picky over it.
This also means function pointers have less support than normal pointers. For example, printing them with printf()'s %p won't work, since %p expects a void pointer, so there's no printing them. You can't sort them with qsort(). You can't even treat the function pointer as a blob of data to manipulate manually, since there's no safe way to make a regular pointer to it. Really, almost any C library function that accepts pointers won't work with function pointers.
So if you want a tricky, unfair, interview question this could be one!
]]>If you are a crypto-anarchist like me, you should definitely take a look at CipherSaber. It is an extremely simple encryption protocol that even beginner programmers can implement. The protocol can also easily be memorized and quickly implemented from memory on the fly. In the case that cryptography was completely outlawed, CipherSaber would be a useful tool in allowing its users to continue to communicate privately.
I think the name is just perfect and captures everything CipherSaber is about. Here is the description right from the CipherSaber page,
In George Lucas' Star Wars trilogy, Jedi Knights were expected to make their own light sabers. The message was clear: a warrior confronted by a powerful empire bent on totalitarian control must be self-reliant.
CipherSaber is based on the arcfour stream cipher, but goes beyond it by defining the use of an initialization vector (IV) and how it is stored with the ciphertext. There are actually two versions: CipherSaber-1 and CipherSaber-2. The second one exists because of vulnerabilities in the first. The difference between them is small.
You want to make sure you generate a long enough passphrase for your encryption key. A normal password isn't good enough because an adversary will be able to throw all his available processing power at your ciphertext. Using Diceware would be a good idea here.
Here is the protocol.
1. Generate a 10-byte random IV. This need not be done using a very strong random number generator. It is only important that the same IV is not used more than once.
2. Concatenate a secret user selected key (i.e. passphrase) with the IV and use that concatenation as the key for an arcfour cipher.
3. Encrypt the message using the cipher.
4. Concatenate the IV and the arcfour ciphertext to create the CipherSaber ciphertext.
To decipher, remove the first ten bytes of the ciphertext and use it as an IV. Concatenate the secret passphrase with the IV, and use it as the key for an arcfour cipher. Decrypt the remaining ciphertext with the arcfour cipher.
Because of vulnerabilities in the arcfour cipher, CipherSaber-2 is an updated version that runs the arcfour key scheduler at least 20 times. The exact number of times is a secret that the sender and receiver must agree on. Notice that CipherSaber-1 is CipherSaber-2 with only 1 key schedule iteration.
Using a large number of iterations could be considered a form of key strengthening. An adversary who is making a brute force attack on the ciphertext has that much more work to do for each passphrase trial.
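As a minimal sketch of the cipher core (my own illustration with made-up names, not the implementation from the repository), the key is the passphrase concatenated with the IV, and CipherSaber-2 simply repeats the key schedule loop:

#include <stddef.h>

struct arcfour {
    unsigned char s[256];
    int i, j;
};

/* Run the arcfour key schedule `rounds` times over key = passphrase || IV.
 * rounds = 1 gives CipherSaber-1. */
void cs_init(struct arcfour *a, const unsigned char *key, size_t len, int rounds)
{
    for (int i = 0; i < 256; i++)
        a->s[i] = i;
    int j = 0;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < 256; i++) {
            j = (j + a->s[i] + key[i % len]) % 256;
            unsigned char t = a->s[i];
            a->s[i] = a->s[j];
            a->s[j] = t;
        }
    }
    a->i = a->j = 0;
}

/* Encrypt or decrypt (the operation is symmetric) a buffer in place. */
void cs_crypt(struct arcfour *a, unsigned char *buf, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        a->i = (a->i + 1) % 256;
        a->j = (a->j + a->s[a->i]) % 256;
        unsigned char t = a->s[a->i];
        a->s[a->i] = a->s[a->j];
        a->s[a->j] = t;
        buf[n] ^= a->s[(a->s[a->i] + a->s[a->j]) % 256];
    }
}

Details such as whether j carries over between key schedule passes should be checked against the CipherSaber specification before relying on this sketch.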
You should really implement your own, but here is one of my implementations, written in C. I put it in with the rest of my arcfour stuff. Get it with git,
git clone git://github.com/skeeto/arcfour.git
You can use it as a reference to make sure your first implementation is correct. You can use these two ciphertexts to test your implementation as well,
ciphersaber.png.cs
ciphersaber.png.cs2
This is the diagram image above (ciphersaber.png) encrypted with the key "nullprogram". The first one is CipherSaber-1 and the second is CipherSaber-2 with 20 key schedule iterations.
]]>I was reading through a website of "computer stupidities" today when I came across this,
This was quickly dismissed as being an obvious beginner mistake. I don't think this can be dismissed so quickly without thinking it through for a moment. Yes, in the example above we will never reach the last condition where we return z, but consider the following,
The same quick dismissal might drop the last "faz" print statement as being an impossible condition. Can you think of a situation where the program would print "faz"?
Our final condition will be reached if a or b is equal to NAN, which is defined by the IEEE floating-point standard. It is available in C99 from math.h. A NAN in any of the comparisons above will cause that comparison to evaluate to false.
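The original listing isn't included here, but a demonstration in the same spirit (hypothetical function and values) could look like this:

#include <math.h>
#include <stdio.h>

void compare(double a, double b)
{
    if (a < b)
        puts("foo");
    else if (a > b)
        puts("bar");
    else if (a == b)
        puts("baz");
    else
        puts("faz");
}

int main()
{
    compare(1.0, NAN);  /* every comparison with NAN is false, so this prints "faz" */
    return 0;
}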
So don't be so quick to dismiss code like this.
]]>I was at my fiancee's parents' house over Fourth of July weekend. Her family likes to leave plenty of reading material right by the toilet, which is something fairly new to me. They take their time on the john quite seriously.
While I was in there I saw a large book of Sudoku puzzles. Since the toilet is a good spot to think (I like to call it my "thinking chair"), I thought out an algorithm for solving Sudokus. I then left the bathroom and implemented it in order to verify that it worked.
The method is trial-and-error, which it does recursively: fill in the next available spot with a valid number as defined by the rules (cannot have the same number in a column, row, or partition), and recurse. The function reports success (true) when a solution was found, or failure (false), which means we try the next available number. If no more valid numbers are available for testing at the current position, then the puzzle is not solvable (we made an error at a previous position), so we stop recursing and return failure.
More formally,
Note that the recursion depth does not exceed 81, as it only recurses once per blank square. The "game tree" is broad rather than deep. It doesn't have to duplicate the puzzle matrix in memory either because all operations can be done in place.
Here is the implementation in C I typed up just after I left the bathroom,
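That listing isn't reproduced here; a minimal sketch of the same backtracking approach (my own reconstruction, using a 9x9 array where 0 marks an empty cell) looks like this:

#include <stdbool.h>

/* Can n legally be placed at row r, column c? */
bool valid(int g[9][9], int r, int c, int n)
{
    for (int i = 0; i < 9; i++)
        if (g[r][i] == n || g[i][c] == n)
            return false;
    int br = r / 3 * 3, bc = c / 3 * 3;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (g[br + i][bc + j] == n)
                return false;
    return true;
}

/* Fill the next empty cell with each candidate in turn and recurse. */
bool solve(int g[9][9])
{
    for (int r = 0; r < 9; r++) {
        for (int c = 0; c < 9; c++) {
            if (g[r][c] != 0)
                continue;
            for (int n = 1; n <= 9; n++) {
                if (valid(g, r, c, n)) {
                    g[r][c] = n;
                    if (solve(g))
                        return true;
                    g[r][c] = 0;  /* undo and try the next candidate */
                }
            }
            return false;  /* no candidate fits: backtrack */
        }
    }
    return true;  /* no empty cells left: solved */
}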
I assumed that it would be slow solving the puzzles, having to search a wide tree, but it turns out to be very fast. It solves normal human-solvable puzzles in a couple of milliseconds. Wikipedia has a near-worst case Sudoku that is designed to make algorithms like mine perform their worst.
On my laptop, my program solves this in 15 seconds, which means that it should take no more than 15 seconds to solve any given Sudoku puzzle. This provides me a nice upper limit.
There is a way to "defeat" this particular puzzle. For example, say an attacker was trying to perform a denial-of-service (DoS) attack on your Sudoku solver by giving it puzzles like this one, making your server spend lots of time solving only a few puzzles. However, these puzzles assume a certain guessing order. By simply randomizing the order of guessing, both in choosing positions and in the order that numbers are guessed, the attacker will have a much harder time creating a difficult puzzle. The worst case could very well be the best case. This is very similar to how Perl randomizes its hash functions.
Now suppose we kept our guess order random then "solved" an empty Sudoku puzzle. What we have is a solution to a randomly generated Sudoku. To turn it into a puzzle, we just back it off a bit. A Sudoku is only supposed to have a single unambiguous solution, so we can only back off until just before the point where two solutions becomes possible. If you imagine a solution tree, this would be backing up a branch until you hit a fork.
Normally, Sudokus are symmetric (in the matrix sense), but completely randomizing the position guessing order won't achieve this. To make this work, the randomizing process can be adjusted to only select random points on the upper triangle (including the diagonal). For each point it selects not on the diagonal, the mirror point is automatically selected next. This will preserve symmetry when generating puzzles.
One issue remains: there seems to be no way to control the difficulty of the puzzles it generates. Maybe a number of open spaces left behind is a good metric? This will require some further study (and another post!).
]]>A co-worker asked me a question today about C/C++ pointers,
If a pointer is declared inside a function with no explicit initialization, can I assume that the pointer is initialized to NULL?
We were down in the lab and, therefore, he had no Internet access to look it up himself, which is why he asked. When I code C, it is just a sort of mental habit to not use a non-static function variable without first initializing it, but is this accurate? I knew the answer was "no", but I wanted to be able to explain the "why".
Anyway, I quickly recalled some of my experimental C programs and thought carefully about the mechanics of what is going on behind-the-scenes, allowing me to confidently give him a "no" answer. I then threw this together in a few seconds to prove it,
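That listing isn't reproduced here; a reconstruction of the same idea (the particular values are just placeholders) looks roughly like this:

#include <stdio.h>

void a()
{
    char *x = (char *) 0x12345ff;
    double y = -63454;
    (void) x;  /* "use" the variables by casting them to void */
    (void) y;
}

void b()
{
    char *x;  /* deliberately left uninitialized */
    double y;
    printf("%p, %f\n", (void *) x, y);
}

int main()
{
    a();
    b();
    return 0;
}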
When you compile it, make sure you don't use the optimization options (-O, -O2, or -O3 for gcc) because they change the inner-workings of the program. It might do things like make those functions inline (so they won't be on the stack as I am intending), or even toss out a(), as it appears to do nothing. The compiler sees that, even though I "used" variables in a() by casting them to void, nothing is really happening, so a() can be ignored. We can probably get around this with a tacked-on volatile declaration, which you might see a lot of in a micro-controller program. In a micro-controller, some memory addresses are mapped to registers external to the software, so, from the compiler's point of view, access to these locations may look like nothing is really happening. Optimizing away variables that point to these memory locations will lead to an incorrect binary, so your robot or laser guided shark or whatever won't work.
Anyway, compiling with optimization will break my example! So don't do it here.
When compiling, you should get some warnings about using uninitialized variables, which is kind of the point of my example. Ignore it. That warning alone gives away the answer to the main question, really, but this example is a bit more fun!
Before you run it, study it and think about what the output should look like. When a() is called, its stack frame goes onto the call stack, which contains the two declared variables. These variables are then assigned as part of the function execution. When a() returns, the frame is popped off the stack. Then b() is called, and, as the variable declarations are exactly the same, it will fit right over top of a()'s old stack frame, and its variables will line up. x and y are not assigned any value, so they pick up whatever junk was lying around, which happens to be the values assigned in a().
When you run the program, this is the output,
0x12345ff, -63454.000000
The pointer is not initialized to NULL. If x is passed back uninitialized under the assumption that a NULL is being passed, some other poor function that handles the return value may dereference it, resulting in possibly some nasal demons, but most likely an annoying segmentation fault. Worse, this error may occur far, far away from where the actual problem is, and even worse than that, only sometimes (depending on the state of the call stack at just the right moment).
Note here that I am talking about non-static function variable declarations. Global variables and static function variables will not be on the stack. They are placed in a fixed location (in the data segment), and their values are implicitly initialized to 0 at compile time.
]]>The 3n + 1 conjecture, also known as the Collatz conjecture, is based around a simple recursive function: if n is even, halve it; if n is odd, replace it with 3n + 1.
The conjecture is this,
This process will eventually reach the number 1, regardless of which positive integer is chosen initially.
The way I am defining this may not be entirely accurate, as I took a shortcut to make it a bit simpler. I am not a mathematician (IANAM) — but sometimes I pretend to be one. For a really solid definition, click through to the Wikipedia article in the link above.
A sample run, starting at 7, would look like this: 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. The sequence starting at 7 contains 17 numbers. So 7 has a cycle-length of 17. Currently, there is no known positive integer that does not eventually lead to 1. If the conjecture is true, then none exists to be found.
I first found out about the problem when I saw it on UVa Online Judge. UVa Online Judge is a system that has a couple thousand programming problems to do. Users can submit solution programs written in C, C++, Java, or Pascal. For normal submissions, the fastest program wins.
Anyway, the way UVa Online Judge runs this problem is by providing the solution program pairs of integers on stdin as text. The integers define an inclusive range over which the program must return the longest Collatz cycle-length for all the integers inside that range. They don't tell you which ranges they are checking, except that all integers will be less than 1,000,000 and the sequences will never overflow a 32-bit integer (allowing shortcuts to be made to increase performance).
The simple approach would be defining a function that returns the cycle length (Lua programming language),
function collatz_len (n)
   local c = 1
   while n > 1 do
      c = c + 1
      if math.mod(n, 2) == 0 then
         n = n / 2
      else
         n = 3 * n + 1
      end
   end
   return c
end
Then we have a function check over a range (assuming n <= m here),
function check_range (n, m)
   local largest = 0
   for i = n, m do
      local len = collatz_len (i)
      if len > largest then
         largest = len
      end
   end
   return largest
end
And top it off with the i/o. (I am just learning Lua, so I hope I did this part properly!)
while not io.stdin.eof do
   n, m = io.stdin:read("*number", "*number")
   -- check for eof
   if n == nil or m == nil then break end
   print (n .. " " .. m .. " " .. check_range(n, m))
end
Notice anything extremely inefficient? We are doing the same work over and over again! Take, for example, this range: 7, 22. When we start with 7, we get the sequence shown above: 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. Eight of these numbers are part of the range that we are looking at. When we get up to 22, we are going to walk down the same range again, less the 7. To make things more efficient, we apply some dynamic programming and store previously calculated cycle-lengths in an array. Once we get to a value we already calculated, we just look it up.
I used dynamic programming in my submission, which I wrote up in C. You can grab my source here. It fills in a large array (1000000 entries) as values are found, so no cycle-length is calculated twice. When I submitted this program, it ranked 60 out of about 300,000 entries. There are probably a number of tweaks that can increase performance, such as increasing the size of the array, but I didn't care much about inching closer to the top. I would bet that the very top entries did some trial-and-error and determined what ranges are tested, using the results to seed their program accordingly. You could take my code and submit it yourself, but that wouldn't be very honest, would it?
So why am I going through all of this describing such a simple problem? Well, it is because of this neat feature of Lua that applies well to this problem. Lua is kind of like Lisp. In Lisp, everything is a list ("list processing" --> Lisp). In Lua, (almost) everything is an associative array (Maybe they should have called it Assp? Or Hashp? I am kidding.) An object is a hash with fields containing function references. There is even some syntactic sugar to help this along.
The cool thing is that we can create a hash with default entries that reference a function that calculates the Collatz cycle-length of its key. Once the cycle-length is calculated, the function reference is replaced with the value, so the function is never called again from that point. The function only actually determines the next integer, then references the hash to get the cycle-length of that next integer.
Now this hash looks like it is infinitely large. This is really a form of lazy evaluation: no values are calculated until they are needed (this is one of my favorite things about Haskell). We don't need to explicitly ask for it to be calculated, either. We just go along looking up values in the array as if they were always there. Here is how you do it,
collatz_len = { 1 }
setmetatable (collatz_len, { __index =
   function (name, n)
      if (math.mod (n, 2) == 0) then
         name[n] = name[n/2] + 1;
      else
         name[n] = name[3 * n + 1] + 1;
      end
      return name[n]
   end })
So we replace the collatz_len function with this array (and replace the call with an array reference) and we have applied dynamic programming to our old program. If I run the two programs with this sample input,
10 1000
1000 3000
300 500
and look at average running times, the dynamic programming version runs 87% faster than the original.
One problem with this, though, is the use of recursion. In Lua, it is really easy to hit recursion limits. For example, accessing element 10000 will cause the program to crash. This will probably get fixed someday, or in some implementation of Lua.
I thought there might be a way to do this in Perl, by changing the default hash value from undef to something else, but I was mildly disappointed to find out that this is not true.
Here is the source for the original program and the one with dynamic programming (BSD licenced): collatz_simple.lua and collatz.lua
]]>
wbf2c converts an esoteric programming language called brainfuck into C code which can be machine compiled. Several optimizations are done to make the resulting program extremely fast and efficient. The converter supports either a static (standard 30,000 cells) array or a dynamically-sized array. It also supports many different cell types, from the standard char to multi-precision cells using GMP.
The converter can also run several brainfuck programs on the same memory array at once by running each one in a thread. To make sure each brainfuck operation is atomic, each cell gets a mutex lock. The only other multi-threading brainfuck implementation I know of is Brainfork.
For an example of some brainfuck code I wrote,
+>+< [ [->>+>+<<<]>>> [-<<<+>>>]<< [->+>+<<]>> [-<<+>>]<< ]
This program fills the memory with the Fibonacci series. Make sure you use the dynamically sized array, along with the bignum cell type. After two or three seconds of running, my laptop (unmodified Dell Inspiron 1000) can calculate and spit out a 140MB text file containing the first 50,000 numbers in the series. I used the -d dump option to see this output.
Download information, as well as some more examples, including a multi-threaded one, are on the project website.
]]>This is the second part of my post about clusters. Finally, here are some more pretty pictures and video to look at! The program is an extremely parallel Mandelbrot set generator written in C.
This image was generated by my program on a cluster at my university, and it is my favorite image generated so far.
The reason I built my own cluster was to run this program. The generator forks off an arbitrary number of jobs (defined by a config file) to generate a single fractal, or a fractal zoom sequence. The cluster then automatically moves these jobs around to different nodes, making the fractal generation fast.
I wrote it with two goals in mind. I wanted it to be parallel so that it could easily take advantage of a cluster. I also wanted it to not use any external libraries. This is because a cluster is often a shared resource. Programs and libraries may only be available if installed by an administrator, meaning that extra libraries like libpng may not be available. For inter-process communication the generator uses simple pipes. So all you need here is a POSIX interface to the operating system, rather than some MPI implementation.
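The per-pixel work each job performs is just the standard escape-time iteration. A minimal sketch of that inner loop (an illustration, not the actual generator code):

/* Iteration count for the point c = cr + ci*i, used to color one pixel. */
int mandel(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

Each forked job computes its own share of pixels this way, which is what makes the whole thing so easy to parallelize.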
I used Andy Owen’s handy bitmap library for writing out the bitmaps. I don’t know how I could have done without it!
The only thing you need in order to run the fractal generator is a C compiler and a POSIX interface (GNU/Linux, *BSD, and other Unix-like systems). Extra capabilities are available if gzip is installed.
I use GNU Xaos to find good locations in the set for zoom sequences. It lets you zoom in real-time. Once I find a good spot, I tell my generator to render some nice images as a zoom sequence to it. Sometime I hope to write in an algorithm for auto-zooming. This way the generator could create zoom sequences automatically for hours on end. Xaos already has this capability for its real-time zooming.
An interesting thing I discovered: for these fractals at least, the gzipped bitmaps are (barely) smaller than the equivalent PNG versions. For the image above, the PNG version (produced by ImageMagick defaults) is 11586185 bytes. The gzipped bitmap version is 11586074 bytes. Plus, gzipping is faster. On my laptop, BMP to PNG (ImageMagick’s convert) took 13.678s. Gzipping (default options) took 5.167s. I am as surprised as you are.
]]>I read about memory allocation pools in the Subversion manual and decided to write one for fun. So here it is. Included is a small driver I ran overnight to test for memory leaks. lint will complain about memory leaks on this code, however. I also added thread safety, but I don’t suggest you use it.
A memory allocation pool is good for speeding up a program that needs to make many small memory requests quickly (many system calls). Instead, one system call is made in place of many.
It is also useful for semi-automatic memory management. Let’s say you build some large tree structure somewhere in your program. If you want to free all this memory used by the tree, you will need to traverse it to take it down. This takes time and code. If you use a memory pool, you can free all of the memory at once by freeing the entire memory pool.
The pool works by allocating a large chunk of memory and dishes it out as requested (a subpool). If a request is too large to take out of the current chunk, it allocates another chunk twice as large as the previous one (another subpool). This doubling allows the pool to quickly scale up to whatever size is needed. Memory will still be allocated from the old pool until that pool has too many misses in a row (hard coded to 10 in my sources). Once this happens, the subpool remains untouched and your pool will have some slight internal fragmentation.
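A hypothetical sketch of that doubling scheme (my own simplification: it ignores alignment, skips error checking, and omits the miss-counting heuristic just described):

#include <stdlib.h>

struct pool {
    struct pool *prev;  /* chain of earlier, smaller subpools */
    size_t size;        /* capacity of this subpool's chunk */
    size_t used;        /* bytes already handed out */
    char *chunk;
};

struct pool *pool_create(size_t size)
{
    struct pool *p = malloc(sizeof(*p));
    p->prev = NULL;
    p->size = size;
    p->used = 0;
    p->chunk = malloc(size);
    return p;
}

/* Hand out size bytes, growing into a new subpool twice as large when
 * the current chunk cannot satisfy the request. */
void *pool_alloc(struct pool **pool, size_t size)
{
    struct pool *p = *pool;
    if (p->used + size > p->size) {
        size_t grown = p->size * 2;
        while (grown < size)
            grown *= 2;
        struct pool *fresh = pool_create(grown);
        fresh->prev = p;
        *pool = p = fresh;
    }
    void *ptr = p->chunk + p->used;
    p->used += size;
    return ptr;
}

/* Free every subpool, and with it everything ever allocated from the pool. */
void pool_destroy(struct pool *p)
{
    while (p) {
        struct pool *prev = p->prev;
        free(p->chunk);
        free(p);
        p = prev;
    }
}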
]]>The original idea for this project came from Sean Howard’s Gameplay Mechanic’s #012. The basic idea here is that image files are the second easiest type of data to share on the Internet (the first being text). Sharing anything other than images may be difficult, so why not store files within an image as the image data? This is not steganography as the data is not being hidden. In fact, the data is quite obvious because we are trying to make the data as compact as possible in the image.
My “PNG Archiver” is usable but should still be considered alpha quality software. I am adding support for different types of PNGs (currently it does 8-bit RGB only), but I have found that using the libpng library gives me headaches. The archiver can actually only store a single file (just as gzip doesn’t know what a file is). This is because I do not want to duplicate the work of real file archivers like tar. To store multiple files, make a “png-tarball”.
The PNG Archiver stores a checksum in the image that allows it to verify that the data was received correctly. This also allows it to automatically scan the image for data. When it reads in a piece that fulfills the checksum it assumes that it found the data you are looking for. You can decorate the image with text or a border and the archiver should still find the data as long as you didn’t disturb it. (examples of this on the project page)
]]>