WebAssembly: How to allocate your allocator

An early, small hurdle diving into WebAssembly was allocating my allocator. On a server or desktop with virtual memory, the allocator asks the operating system to map fresh pages into its address space (sbrk, anonymous mmap, VirtualAlloc), which it then dynamically allocates to different purposes. In an embedded context, dynamic allocation memory is typically a fixed, static region chosen at link time. The WASM execution environment more resembles an embedded system, but both kinds of obtaining raw memory are viable and useful in different situations.

For the purposes of this discussion, the actual allocator isn’t important. It could be a simple arena allocator, or a more general purpose buddy allocator. It could even be garbage collected with Boehm GC. Though WebAssembly’s linear memory is a poor fit for such a conservative garbage collector. In a compact address space starting at zero, and which doesn’t include code, memory addresses will be small numbers, and less distinguishable from common integer values. There’s also the issue that the garbage collector cannot scan the WASM stack, which is hidden from WASM programs by design. Only the ABI stack is visible. So a garage collector requires cooperation from the compiler — essentially as a distinct calling convention — to spill all heap pointers on the ABI stack before function calls. WASM C and C++ toolchains do not yet support this in a practical capacity.

Exporting a static heap

Let’s start with the embedded case because it’s simpler, and reserve a dynamic memory region at link time. WebAssembly is just reached 8 years old, so it’s early, and as we keep discovering, WASM tooling is still immature. wasm-ld doesn’t understand linker scripts, and there’s no stable, low-level assembly language on which to build, e.g. to reserve space, define symbols, etc. WAT is too high level and inflexible for this purpose, as we’ll soon see. So our only option is to brute force it in a high-level language:

char heap[16<<20];  // 16MiB

Plugging it into an arena allocator:

typedef struct {
    char *beg;
    char *end;
} Arena;

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    a.end = heap + sizeof(heap);
    return a;
}

Unfortunately heap isn’t generic memory, but a high-level variable with a specific, fixed type. That’s why it would have been nice to reserve the memory outside the high-level language. In practice this works fine so long as everything is aligned, but strictly speaking, allocating any variable except char from this arena involves incompatible loads and stores on a char array. Clang doesn’t document any inline assembly interface for WASM, but neither does Clang forbid it. That leaves just enough room to launder the pointer if you’re worried about this technicality:

Arena getarena(void)
{
    Arena a = {0};
    a.beg = heap;
    asm ("" : "+r"(a.beg));  // launder
    a.end = a.beg + sizeof(heap);
    return a;
}

The +r means a.beg is both input and output. The address of the heap goes into the black box, and as far as the compiler is concerned, some mystery address comes out which, critically, has no effective type. The assembly block is empty (""), so it’s just a no-op, and we know (wink wink) it’s really the same address. Because the heap was “used” by the black box, Clang won’t optimize the heap out of existence beneath us. Also note that a.end was derived from the laundered pointer.

This static variable technique works well only in an exported memory configuration, which is what wasm-ld uses by default. When a module exports its memory, it indicates how much linear memory it requires on start, and the WASM runtime allocates and zero-initializes it at module initialization time. C and C++ toolchains depend on that runtime zeroing to initialize static and global variables, which are defined to be so initialized. Compilers generate code assuming these variables are zero initialized. This same paradigm is used for .bss sections in hosted environments.

In an imported memory configuration, linear memory is uninitialized. The memory may be re-used from, say, a destroyed module without zeroing, and may contain arbitrary data. In that case, C and C++ toolchains must zero the memory explicitly. It could potentially be done with a memory.fill instruction in the start section, but LLVM does not support start sections. Instead it uses an active data segment — a chunk of data copied into linear memory by the WASM runtime during initialization, before running the start function.

That is, when importing memory, LLVM actually stores all those those zeros in the WASM module so that the runtime can copy it into linear memory. WASM has no built-in compression, so your WASM module will be at least as large as your heap! Exporting or importing memory is determined at link-time, so at compile-time the compiler must assume the worst case. If you compile the example above, you get a 16MiB “object” file (in WASM format):

$ clang --target=wasm32 -c -O example.c
$ du -h example.o
16.0M   example.o

The WAT version of this file is 48MiB — clearly unsuitable as a low-level assembler. If linking with exported memory, wasm-ld discards all-zero active data segments. If using an imported memory configuration, it’s copied into the final image, producing a huge WASM image, though highly compressible. As a rule, avoid importing memory when using an LLVM toolchain. Regardless, large heaps created this way will have a significant compile-time cost.

$ time echo 'char heap[256<<20];' | clang --target=wasm32 -c -xc -
real    0m0.334s
user    0m0.013s
sys     0m0.262s

(If only Clang had some sort of “noinit” variable attribute in order to allow heap to be uninitialized…)

Growing a dynamic heap

WASM programs can grow linear memory using an sbrk-like memory.grow instruction. It operates in quantities of pages (64kB), and returns the old memory size. Because memory starts at zero, the old memory size is also the base address of the new allocation. Clang provides access to this instruction via an undocumented built-in:

size_t __builtin_wasm_memory_grow(int, size_t);

The first parameter selects a memory because someday there might be more than one. From this built-in we can define sbrk:

void *sbrk(ptrdiff_t size)
{
    size_t npages = (size + 0xffffu) >> 16;  // round up
    size_t old    = __builtin_wasm_memory_grow(0, npages);
    if (old == -1ul) {
        return 0;
    }
    return (void *)(old << 16);
}

To which Clang compiles (note the memory.grow):

(func $sbrk (param i32) (result i32)
  (select
    (i32.const 0)
    (i32.shl
      (local.tee 0
        (memory.grow
          (i32.shr_u
            (i32.add
              (local.get 0)
              (i32.const 65535))
            (i32.const 16))))
      (i32.const 16))
    (i32.eq
      (local.get 0)
      (i32.const -1))))

Applying that to create an arena like before:

Arena newarena(ptrdiff_t cap)
{
    Arena a = {0};
    a.beg = sbrk(cap);
    if (a.beg) {
        a.end = a.beg + cap;
    }
    return a;
}

Now we can choose the size of the arena, and we can use this to create multiple arenas (e.g. permanent, scratch, etc.). We could even continue growing the last-created arena in-place when it’s full.

If there was no memory.grow instruction, it could be implemented as a request through an imported function. The embedder using the WASM runtime can grow the memory on the module’s behalf in the same manner. But as that documentation indicates, either way growing the memory comes with a downside in the most common WASM runtimes, browsers: It “detaches” the memory from references, which complicates its use for the embedder. If a WASM module may grow its memory at any time, the embedder must reacquire the memory handle after every call. It’s not difficult, but it’s easy to forget, and mistakes are likely to go unnoticed until later.

Importing a dynamic heap

There’s a middle ground where a WASM module imports a dynamic-sized heap. That is, linear memory beyond the module’s base initialization. This might be the case, for instance, in a programming competition, where contestants submit WASM modules which must complete a task using the supplied memory. In that case we don’t reserve a static heap, so we’re not facing the storing-zeros issue. However, how do we “find” the memory? Linear memory layout will look something like so:

0 <-- stack | data | heap --> ?
|-----------------------------|

This diagram reflects the more sensible wasm-ld --stack-first layout, where the ABI stack overflows off the bottom end of memory. The heap is just excess memory beyond the data. To find the upper bound, WASM has a memory.size instruction to query linear memory size, which again Clang provides as an undocumented built-in:

size_t __builtin_wasm_memory_size(int);

Like before, this returns the result in number of 64k pages. That’s the high end. How do we find the low end? Similar to __stack_pointer, the linker creates a __heap_base constant, which is the address delineating data and heap in the diagram above. To use it, we need to declare it:

extern char __heap_base[];

Notice how it’s an array, not a pointer. It doesn’t hold an address, it is an address. In an ELF context this would called an absolute symbol. That’s everything we need to find the bounds of the heap:

Arena getarena(void)
{
    Arena a = {0};
    a.beg = __heap_base;
    a.end = (char *)(__builtin_wasm_memory_size(0) << 16);
    return a;
}

Then we continue forward using whatever memory the embedder deigned to provide. Hopefully it’s enough!

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.

null program

Chris Wellons

wellons@nullprogram.com (PGP)
~skeeto/public-inbox@lists.sr.ht (view)