Let's write a setjmp

This article was discussed on Hacker News.

Yesterday I wrote that setjmp is handy and that it would be nice to have without linking the C standard library. It’s conceptually simple, after all. Today let’s explore some differently-portable implementation possibilities with distinct trade-offs. At the very least it should illuminate why setjmp sometimes requires the use of volatile.

First, a quick review: setjmp and longjmp are a form of non-local goto.

typedef void *jmp_buf[N];
int setjmp(jmp_buf);
void longjmp(jmp_buf, int);

Calling setjmp saves the execution context in a jmp_buf, and longjmp restores this context, returning the thread to this previous point of execution. This means setjmp returns twice: (1) after saving the context, and (2) from longjmp. To distinguish these cases, the first time it returns zero and the second time it returns the value passed to longjmp.
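
As a quick illustration, here's a minimal sketch of the two returns using the standard <setjmp.h>; the names env and fail are purely illustrative:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void fail(void)
{
    longjmp(env, 42);  // the setjmp below now returns a second time, with 42
}

int main(void)
{
    int r = setjmp(env);
    if (r == 0) {
        puts("first return");          // (1) context saved, r is 0
        fail();                        // never returns normally
    }
    printf("second return: %d\n", r);  // (2) r is 42
    return 0;
}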

jmp_buf is an array of some platform-specific type and length. I’ll be using void pointers in this article because it’s a register-sized type that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal addresses, which helped while working this out.

Using GCC intrinsics

Let’s start with the easiest option. GCC has two intrinsics doing all the hard work for us: __builtin_setjmp and __builtin_longjmp. Its worst case jmp_buf is length 5, but the most popular architectures only use the first 3 elements. Clang supports these intrinsics as well for GCC compatibility.

Be mindful that the semantics are slightly different from the standard C definition, namely that you cannot use longjmp from the same function as setjmp. It also doesn’t touch the signal mask. However, it’s easier to use and you don’t need to worry about volatile.

// NOTE to copy-pasters: semantics differ slightly from standard C
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp

If you only care about GCC and/or Clang, then that’s it! It works as-is on every supported target and nothing more is needed. As a bonus, it will be more efficient than the libc version, though I should hope that won’t matter in practice. These are so awesome and convenient that I’m already second-guessing myself: “Do I really need to support other compilers…?”
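
For example, here's a minimal sketch of how these macros might be used; the names are illustrative, the longjmp happens in a callee rather than in the same function as the setjmp, and __builtin_longjmp only accepts the constant 1 as its second argument:

typedef void *jmp_buf[5];
#define setjmp  __builtin_setjmp
#define longjmp __builtin_longjmp

static jmp_buf recover;

static void fail(void)
{
    longjmp(recover, 1);   // must come from a callee, value must be 1
}

int run(void)
{
    if (setjmp(recover)) {
        return -1;         // second return: fail() was reached
    }
    fail();
    return 0;              // unreachable in this sketch
}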

Using assembly

If I want to support more compilers I’ll need to write it myself. It’s also an excuse to dig into the details. The execution context is no more than an array of saved registers, and longjmp is merely restoring those registers. One of the registers is the instruction pointer, and setting the instruction pointer is called a jump.

Since we’re talking about registers, that means assembly. We’ll also need to know the target’s calling convention, so this really narrows things down. This implementation will target x86-64 (a.k.a. x64) Windows, but it will support MSVC as an additional compiler. So it’s a different kind of portability. I’ll start with GCC via w64devkit, then massage it into something MSVC can use.

I mentioned before that setjmp returns twice. So to return a second time we just need to simulate a normal function return. Obviously that includes restoring the stack pointer, like the ret instruction, but it also means restoring all the non-volatile registers a callee is supposed to preserve. These will all go in the execution context.

The x64 calling convention specifies 9 non-volatile registers: rsp, rbp, rbx, rdi, rsi, r12, r13, r14, and r15. We’ll also need the instruction pointer, rip, making it 10 total.

typedef void *jmp_buf[10];

setjmp assembly

The tricky issue is that we need to save the registers immediately inside setjmp before the compiler has manipulated them in a function prologue. That will take more than mere inline assembly. We’ll start with a naked function, which means that GCC will not create a prologue or epilogue. However, that means no local variables, and the function body will be limited to inline assembly, including a ret instruction for the epilogue.

__attribute__((naked))
int setjmp(jmp_buf buf)
{
    __asm(
        // ...
    );
}

The x64 calling convention uses rcx for the first pointer argument, so that’s where we’ll find buf. I’ve arbitrarily decided to store rip first, then the other registers in order. However, the current value of rip isn’t the one we need. The rip we need was just pushed on top of the stack by the caller. I’ll read that off the stack into a scratch register, rax, and then store it in the first element of buf.

    mov (%rsp), %rax
    mov %rax,  0(%rcx)

The stack pointer, rsp, is also indirect since I want the pointer just before rip was pushed, as it would be just after a ret. I use a lea, load effective address, to add 8 bytes (recall: stack grows down), placing the result in a scratch register, then write it into the second element of buf (i.e. 8 bytes into %rcx).

    lea 8(%rsp), %rax
    mov %rax,  8(%rcx)

Everything else is a matter of elbow grease.

    mov %rbp, 16(%rcx)
    mov %rbx, 24(%rcx)
    mov %rdi, 32(%rcx)
    mov %rsi, 40(%rcx)
    mov %r12, 48(%rcx)
    mov %r13, 56(%rcx)
    mov %r14, 64(%rcx)
    mov %r15, 72(%rcx)

With all work complete, return zero to the caller.

    xor %eax, %eax
    ret

Putting it all together, and avoiding an unused-parameter warning:

__attribute__((naked,returns_twice))
int setjmp(jmp_buf buf)
{
    (void)buf;
    __asm(
        "mov (%rsp), %rax\n"
        "mov %rax,  0(%rcx)\n"
        "lea 8(%rsp), %rax\n"
        "mov %rax,  8(%rcx)\n"
        "mov %rbp, 16(%rcx)\n"
        "mov %rbx, 24(%rcx)\n"
        "mov %rdi, 32(%rcx)\n"
        "mov %rsi, 40(%rcx)\n"
        "mov %r12, 48(%rcx)\n"
        "mov %r13, 56(%rcx)\n"
        "mov %r14, 64(%rcx)\n"
        "mov %r15, 72(%rcx)\n"
        "xor %eax, %eax\n"
        "ret\n"
    );
}

Also take note of the returns_twice attribute. It informs GCC of this function’s unusual nature, namely that it may return more than once, so the compiler avoids keeping values in registers across the call and enables -Wclobbered diagnostics. Technically this means we could get away with saving only rip, rsp, and rbp, exactly as __builtin_setjmp does, but we’ll need the others for MSVC anyway.

longjmp assembly

In longjmp we need to restore all those registers. For purely aesthetic reasons I’ve decided to do it in reverse order. Everything but rip is easy.

    mov 72(%rcx), %r15
    mov 64(%rcx), %r14
    mov 56(%rcx), %r13
    mov 48(%rcx), %r12
    mov 40(%rcx), %rsi
    mov 32(%rcx), %rdi
    mov 24(%rcx), %rbx
    mov 16(%rcx), %rbp
    mov  8(%rcx), %rsp

The instruction set doesn’t have direct access to rip. It will be a jmp instead of mov, but before jumping we’ll need to prepare the return value. The x64 calling convention says the second argument is passed in rdx, so move that to rax, then jmp to the caller. It’s only a 32-bit operand, C int, so edx instead of rdx.

    mov %edx, %eax
    jmp *0(%rcx)

Putting it all together, and adding the noreturn attribute:

__attribute__((naked,noreturn))
void longjmp(jmp_buf buf, int ret)
{
    (void)buf;
    (void)ret;
    __asm(
        "mov 72(%rcx), %r15\n"
        "mov 64(%rcx), %r14\n"
        "mov 56(%rcx), %r13\n"
        "mov 48(%rcx), %r12\n"
        "mov 40(%rcx), %rsi\n"
        "mov 32(%rcx), %rdi\n"
        "mov 24(%rcx), %rbx\n"
        "mov 16(%rcx), %rbp\n"
        "mov  8(%rcx), %rsp\n"
        "mov %edx, %eax\n"
        "jmp *0(%rcx)\n"
    );
}

The C standard says that if ret is zero then longjmp will return 1 from setjmp instead. I leave that detail as a reader exercise. Otherwise this is a complete, working setjmp. It works perfectly when I swap it in for setjmp.h in my u-config test suite.
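
For completeness, one way to handle that exercise (a sketch, not part of the implementation above) is to adjust the tail of longjmp so a zero argument becomes 1 before the jump:

    mov %edx, %eax
    test %eax, %eax        # was the argument zero?
    jnz 1f
    mov $1, %eax           # then setjmp must return 1 instead
1:  jmp *0(%rcx)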

Considering volatile

Now that you’ve seen the guts, let’s talk about volatile and why it’s necessary. Consider this function, example, which calls a work function that may return through setjmp (e.g. on failure).

void work(jmp_buf);

int example(void)
{
    int r = 0;
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;
}

It stores to r after the first setjmp return, then loads r after the second setjmp return. However, r may have been stored in the execution context. Since it’s used across function calls, it would be reasonable to keep this variable in a non-volatile register like ebx. If so, it will be restored to its value at the moment of the first call to setjmp, in which case the old r would be read after longjmp restores the context. If it’s not stored in a register, but on the stack, then on the second return the function will read the latest value from the stack. In practice, if work returns through longjmp, this function may return either 0 or 1, probably determined by the optimization level.

The solution is to qualify r with volatile, which forces the compiler to store the variable on the stack and never cache it in a register.

    volatile int r = 0;

Though since our setjmp is marked returns_twice, GCC will never keep r in a register across the setjmp call. This potentially hides a bug that would surface under other compilers, but GCC will (usually) warn about it.

Pure assembly and MSVC

MSVC doesn’t understand __attribute__ nor the inline assembly, so it cannot compile these functions. I could compile my setjmp with GCC and the rest of the program with MSVC, but that means I need two compilers. Instead, I’ll move to pure assembly and assemble with GNU as (TODO: port to MASM?), so we’ll only need a tiny piece of the GNU toolchain.

        .globl setjmp
setjmp:
        mov (%rsp), %rax
        mov %rax,  0(%rcx)
        lea 8(%rsp), %rax
        mov %rax,  8(%rcx)
        mov %rbp, 16(%rcx)
        mov %rbx, 24(%rcx)
        mov %rdi, 32(%rcx)
        mov %rsi, 40(%rcx)
        mov %r12, 48(%rcx)
        mov %r13, 56(%rcx)
        mov %r14, 64(%rcx)
        mov %r15, 72(%rcx)
        xor %eax, %eax
        ret

        .globl longjmp
longjmp:
        mov 72(%rcx), %r15
        mov 64(%rcx), %r14
        mov 56(%rcx), %r13
        mov 48(%rcx), %r12
        mov 40(%rcx), %rsi
        mov 32(%rcx), %rdi
        mov 24(%rcx), %rbx
        mov 16(%rcx), %rbp
        mov  8(%rcx), %rsp
        mov %edx, %eax
        jmp *0(%rcx)

Then some declarations in C:

typedef void *jmp_buf[10];
int setjmp(jmp_buf);
_Noreturn void longjmp(jmp_buf, int);

I’ll need to enable C11 for that _Noreturn in MSVC. Assemble, compile, and link:

$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj

That generally works! If I rename to xsetjmp and xlongjmp to avoid conflicting with the CRT definitions, drop them into the u-config test suite in place of setjmp.h, then compile with MSVC, it passes all tests using my alternate implementation in MSVC as well as GCC. Pretty cool!
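
For reference, the renamed declarations might look something like this; xjmp_buf is my own illustrative name, and the .globl symbols in setjmp.s are renamed to match:

// Hypothetical renamed declarations to avoid the CRT's setjmp/longjmp
typedef void *xjmp_buf[10];
int xsetjmp(xjmp_buf);
_Noreturn void xlongjmp(xjmp_buf, int);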

Takeaway

I’m not sure if I’ll ever use the assembly, but writing this article led me to try the GCC intrinsics, and I’m so impressed I’m still thinking about ways I can use them. My main thought is out-of-memory situations in arena allocators, using a non-local exit to roll back to a savepoint, even if just to return an error. This is nicer than either terminating the program or handling OOM errors on every allocation. Very roughly:

#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t cap;
    size_t off;
    void *jmp_buf[5];
} Arena;

// Place an arena and set a savepoint for out-of-memory jumps.
#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        __builtin_longjmp(a->jmp_buf, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}
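
The usage example below also assumes a PUSHSTRUCT convenience macro, which isn't defined in this rough sketch. It might look something like this:

// Hypothetical helper: allocate one struct of the given type from the arena
#define PUSHSTRUCT(a, t) ((t *)alloc(a, sizeof(t)))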

Usage would look like:

int compute(void *workmem, size_t memsize)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }

    Thing *t = PUSHSTRUCT(arena, Thing);
    // ...

    return COMPUTE_OK;
}

More granular snapshots can be made further down the stack by allocating subarenas out of the main arena. I have yet to try this out in a practical program.
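
For instance, a subarena with its own savepoint might be carved out of the parent like this; an untested sketch built only from the place and alloc above:

// Untested sketch: carve a child arena, with its own jmp_buf, out of a parent
Arena *subarena(Arena *parent, size_t size)
{
    return place(alloc(parent, size), size);
}

A deeper savepoint is then a __builtin_setjmp on the child's jmp_buf, while exhausting the parent still unwinds to the original savepoint.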
