A Showerthoughts Fortune File

I have created a fortune file for the all-time top 10,000 /r/Showerthoughts posts, as of October 2016. As a word of warning: Many of these entries are adult humor and may not be appropriate for your work computer. These fortunes would be categorized as “offensive” (fortune -o).

Download: showerthoughts (1.3 MB)

The copyright status of this file is subject to each of its thousands of authors. Since it’s not possible to contact many of these authors (some may no longer even be alive), it’s obviously never going to be under an open source license (Creative Commons, etc.). Moreover, some quotes are probably from comedians and the like, rather than original to the redditor who made the post. I distribute it only for fun.

Installation

To install this into your fortune database, first process it with strfile to create a random-access index, showerthoughts.dat, then copy both files to the directory with the rest.

$ strfile showerthoughts
"showerthoughts.dat" created
There were 10000 strings
Longest string: 343 bytes
Shortest string: 39 bytes

$ cp showerthoughts* /usr/share/games/fortunes/

Alternatively, fortune can be told to use this file directly:

$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
        ―AndItsDeepToo, Aug 2016

If you didn’t already know, fortune is an old unix utility that displays a random quotation from a quotation database — a digital fortune cookie. I use it as an interactive login shell greeting on my ODROID-C2 server:

if shopt -q login_shell; then
    fortune ~/.fortunes
fi

How was it made?

Fortunately I didn’t have to do something crazy like scrape reddit for weeks on end. Instead, I downloaded the pushshift.io submission archives, currently around 70 GB compressed. Each file contains one month’s worth of JSON data, one object per submission, one submission per line, all compressed with bzip2.

Unlike so many other datasets, especially those built from arbitrary input by millions of people, the /r/Showerthoughts posts are surprisingly clean and require virtually no touching up. It’s some really fantastic data.

A nice feature of bzip2 is that concatenating compressed files produces a valid stream that decompresses to the concatenation of the original files. Additionally, bzip2 compression and decompression are easy to parallelize, which gives it an edge over xz. I strongly recommend using lbzip2 to decompress this data, should you want to process it yourself.

cat RS_*.bz2 | lbunzip2 > everything.json

jq is my favorite command line tool for processing JSON (and rendering fractals). To filter all the /r/Showerthoughts posts, it’s a simple select expression. Just mind the capitalization of the subreddit’s name. The -c option tells jq to emit each object on a single line.

cat RS_*.bz2 | \
    lbunzip2 | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

However, you’ll quickly find that jq becomes the bottleneck, parsing all that JSON, and lbzip2 can’t exploit your cores as it should. So I put grep in front to dramatically decrease the workload for jq.

cat *.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts")' \
    > showerthoughts.json

This will let some extra things through, but it’s a superset. The -a option is necessary because the data contains some null bytes. Without it, grep switches into binary mode and breaks everything. This is incredibly frustrating when you’ve already waited half an hour for results.

To reduce the workload further down the pipeline, I take advantage of the fact that only four fields will be needed: title, score, author, and created_utc. Everything else can be thrown away where it’s cheap to do so, and for efficiency’s sake, it should be.

cat *.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts") |
               {title, score, author, created_utc}' \
    > showerthoughts.json

This gathers all 1,199,499 submissions into a 185 MB JSON file (as of this writing). Most of these submissions are terrible, so the next step is narrowing it to the small set of good submissions and putting them into the fortune database format.

It turns out reddit already has a method for finding the best submissions: a voting system. Just pick the highest scoring posts. Through experimentation I arrived at 10,000 as the magic cut-off number. After this the quality really starts to drop off. Over time this should probably be scaled up with the total number of submissions.

I did both steps at the same time using a bit of Emacs Lisp, which is particularly well-suited to the task:

This Elisp program reads one JSON object at a time and inserts each into an AVL tree sorted by score (descending), then timestamp (ascending), then title (ascending). The AVL tree is capped at 10,000 items, with the lowest items dropped as it fills. This was a lot faster than the more obvious approach: collecting everything into a big list, sorting it, and keeping the top 10,000 items.
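For illustration, the same bounded-selection idea can be sketched in C with a binary min-heap in place of the AVL tree — a different structure than the one actually used, and with a hypothetical TOP_K shrunk to 4 so the example is easy to follow. The smallest retained score sits at the root, so each new item needs only one comparison to be accepted or rejected.

```c
#include <stddef.h>

#define TOP_K 4  /* 10,000 in the article; tiny here for illustration */

struct topk {
    size_t count;
    long heap[TOP_K];  /* min-heap: heap[0] is the smallest kept score */
};

static void
sift_down(struct topk *t, size_t i)
{
    for (;;) {
        size_t min = i, a = 2 * i + 1, b = 2 * i + 2;
        if (a < t->count && t->heap[a] < t->heap[min]) min = a;
        if (b < t->count && t->heap[b] < t->heap[min]) min = b;
        if (min == i) return;
        long tmp = t->heap[i];
        t->heap[i] = t->heap[min];
        t->heap[min] = tmp;
        i = min;
    }
}

/* Offer a score; keeps only the TOP_K largest seen so far. */
static void
topk_offer(struct topk *t, long score)
{
    if (t->count < TOP_K) {
        size_t i = t->count++;
        t->heap[i] = score;
        while (i && t->heap[(i - 1) / 2] > t->heap[i]) {  /* sift up */
            long tmp = t->heap[i];
            t->heap[i] = t->heap[(i - 1) / 2];
            t->heap[(i - 1) / 2] = tmp;
            i = (i - 1) / 2;
        }
    } else if (score > t->heap[0]) {
        t->heap[0] = score;  /* evict the current minimum */
        sift_down(t, 0);
    }
}
```

A heap doesn’t keep the secondary sort order (timestamp, title) the way the AVL tree does, which is one reason to prefer the tree when ties matter.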

Formatting

The most complicated part is actually paragraph wrapping the submissions. Most are too long for a single line, and letting the terminal hard wrap them is visually unpleasing. The submissions are encoded in UTF-8, some with characters beyond simple ASCII. Proper wrapping requires not just Unicode awareness, but also some degree of Unicode rendering. The algorithm needs to recognize grapheme clusters and know the size of the rendered text. This is not so trivial! Most paragraph wrapping tools and libraries get this wrong, some counting width by bytes, others counting width by codepoints.

Emacs’ M-x fill-paragraph knows how to do all of this (for a monospace font, which is all I needed), so I decided to leverage it when generating the fortune file. Here’s an example that paragraph-wraps a string:

(defun string-fill-paragraph (s)
  (with-temp-buffer
    (insert s)
    (fill-paragraph)
    (buffer-string)))

For the file format, items are delimited by a % on a line by itself. I put the wrapped content, followed by a quotation dash, the author, and the date. A surprising number of these submissions have date-sensitive content (“on this day X years ago”), so I found it was important to include a date.

April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
        ―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
        ―MKLV, Aug 2015
%

There’s the potential that a submission itself could end with a lone % and, with a bit of bad luck, it happens to wrap that onto its own line. Fortunately this hasn’t happened yet. But, now that I’ve advertised it, someone could make such a submission, popular enough for the top 10,000, with the intent to personally trip me up in a future update. I accept this, though it’s unlikely, and it would be fairly easy to work around if it happened.

The strfile program looks for the % delimiters and fills out a table of file offsets. The header of the .dat file gives the number of strings along with some other metadata. What follows is a table of 32-bit file offsets.

struct {
    uint32_t str_version;  /* version number */
    uint32_t str_numstr;   /* # of strings in the file */
    uint32_t str_longlen;  /* length of longest string */
    uint32_t str_shortlen; /* shortest string length */
    uint32_t str_flags;    /* bit field for flags */
    char str_delim;        /* delimiting character */
}

Note that the table doesn’t necessarily need to list the strings in the same order as they appear in the original file. In fact, recent versions of strfile can sort the strings by sorting the table, all without touching the original file. Though none of this is important to fortune.
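As an aside, the header fields can be pulled out of a raw byte buffer with the sort of extraction functions discussed in the next article. This is only a sketch, under the assumption (my recollection of strfile’s behavior, not stated above) that the fields are written in network byte order, i.e. big endian:

```c
#include <stdint.h>

struct strfile_header {
    uint32_t version, numstr, longlen, shortlen, flags;
    uint8_t delim;
};

/* Extract a big endian 32-bit integer from an octet buffer. */
static uint32_t
be32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3] <<  0;
}

/* Parse the on-disk header of a .dat file: five 32-bit fields,
 * the delimiter character, then padding. */
static struct strfile_header
parse_strfile_header(const uint8_t *buf)
{
    struct strfile_header h;
    h.version  = be32(buf +  0);
    h.numstr   = be32(buf +  4);
    h.longlen  = be32(buf +  8);
    h.shortlen = be32(buf + 12);
    h.flags    = be32(buf + 16);
    h.delim    = buf[20];
    return h;
}
```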

Now that you know how it all works, you can build your own fortune file from your own inputs!

Portable Structure Access with Member Offset Constants

Suppose you need to write a C program to access a long sequence of structures from a binary file in a specified format. These structures have different lengths and contents, but share a common header identifying the type and size of each. Here’s the definition of that header (no padding):

struct event {
    uint64_t time;   // unix epoch (microseconds)
    uint32_t size;   // including this header (bytes)
    uint16_t source;
    uint16_t type;
};

The size member is used to find the offset of the next structure in the file without knowing anything else about the current structure. Just add size to the offset of the current structure.

The type member indicates what kind of data follows this structure. The program is likely to switch on this value.

The actual structures might look something like this (in the spirit of X-COM). Note how each structure begins with struct event as header. All angles are expressed using binary scaling.

#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20
#define EVENT_TYPE_SUSPICIOUS_SIGNAL   30

struct observer {
    struct event event;
    uint32_t latitude;   // binary scaled angle
    uint32_t longitude;  //
    uint16_t source_id;  // later used for event source
    uint16_t name_size;  // not including null terminator
    char name[];
};

struct ufo_sighting {
    struct event event;
    uint32_t azimuth;    // binary scaled angle
    uint32_t elevation;  //
};

struct suspicious_signal {
    struct event event;
    uint16_t num_channels;
    uint16_t sample_rate;  // Hz
    uint32_t num_samples;  // per channel
    int16_t samples[];
};

If all integers are stored in little endian byte order (least significant byte first), there’s a strong temptation to lay the structures directly over the data. After all, this will work correctly on most computers.

struct event header;
fread(&header, sizeof(header), 1, file);
switch (header.type) {
    // ...
}

This code will not work correctly when:

  1. The host machine doesn’t use little endian byte order, though this is now uncommon. Sometimes developers will attempt to detect the byte order at compile time and use the preprocessor to byte-swap if needed. This is a mistake.

  2. The host machine has different alignment requirements and so introduces additional padding to the structure. Sometimes this can be resolved with a non-standard #pragma pack.

Integer extraction functions

Fortunately it’s easy to write fast, correct, portable code for this situation. First, define some functions to extract little endian integers from an octet buffer (uint8_t). These will work correctly regardless of the host’s alignment and byte order.

static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)buf[1] << 8 |
           (uint16_t)buf[0] << 0;
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[0] <<  0;
}

static inline uint64_t
extract_u64le(const uint8_t *buf)
{
    return (uint64_t)buf[7] << 56 |
           (uint64_t)buf[6] << 48 |
           (uint64_t)buf[5] << 40 |
           (uint64_t)buf[4] << 32 |
           (uint64_t)buf[3] << 24 |
           (uint64_t)buf[2] << 16 |
           (uint64_t)buf[1] <<  8 |
           (uint64_t)buf[0] <<  0;
}

The big endian version is identical, but with shifts in reverse order.

A common concern is that these functions are a lot less efficient than they could be. On x86 where alignment is very relaxed, each could be implemented as a single load instruction. However, on GCC 4.x and earlier, extract_u32le compiles to something like this:

extract_u32le:
        movzx   eax, [rdi+3]
        sal     eax, 24
        mov     edx, eax
        movzx   eax, [rdi+2]
        sal     eax, 16
        or      eax, edx
        movzx   edx, [rdi]
        or      eax, edx
        movzx   edx, [rdi+1]
        sal     edx, 8
        or      eax, edx
        ret

It’s tempting to fix the problem with the following definition:

// Note: Don't do this.
static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return *(uint32_t *)buf;
}

It’s unportable, it’s undefined behavior, and worst of all, it might not work correctly even on x86. Fortunately I have some great news. On GCC 5.x and above, the correct definition compiles to the desired, fast version. It’s the best of both worlds.

extract_u32le:
        mov     eax, [rdi]
        ret

It’s even smart about the big endian version:

static inline uint32_t
extract_u32be(const uint8_t *buf)
{
    return (uint32_t)buf[0] << 24 |
           (uint32_t)buf[1] << 16 |
           (uint32_t)buf[2] <<  8 |
           (uint32_t)buf[3] <<  0;
}

Is compiled to exactly what you’d want:

extract_u32be:
        mov     eax, [rdi]
        bswap   eax
        ret

Or, even better, if your system supports movbe (gcc -mmovbe):

extract_u32be:
        movbe   eax, [rdi]
        ret

Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m betting it will eventually learn how to do this, too.

Member offset constants

For this next technique, that struct event from above need not actually be in the source. It’s purely documentation. Instead, let’s define the structure in terms of member offset constants — a term I just made up for this article. I’ve included the integer types as part of the name to aid in their correct use.

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

Given a buffer, the integer extraction functions, and these offsets, structure members can be plucked out on demand.

uint8_t *buf;
// ...
uint64_t time   = extract_u64le(buf + EVENT_U64LE_TIME);
uint32_t size   = extract_u32le(buf + EVENT_U32LE_SIZE);
uint16_t source = extract_u16le(buf + EVENT_U16LE_SOURCE);
uint16_t type   = extract_u16le(buf + EVENT_U16LE_TYPE);

On x86 with GCC 5.x, each member access will be inlined and compiled to a one-instruction extraction. As far as performance is concerned, it’s identical to using a structure overlay, but this time the C code is clean and portable. A slight downside is the lack of type checking on member access: it’s easy to mismatch the types and accidentally read garbage.
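One way to claw back some of that type checking — a suggestion of mine, not something from above — is to pair each offset with its one correct extractor inside tiny inline accessors, so the pairing is made once, in one place:

```c
#include <stdint.h>

/* Same extraction functions and offsets as in the article. */
static inline uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)buf[1] << 8 | (uint16_t)buf[0];
}

static inline uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[3] << 24 | (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 | (uint32_t)buf[0];
}

#define EVENT_U32LE_SIZE  8
#define EVENT_U16LE_TYPE  14

/* Each accessor binds one offset to one extractor, so callers
 * can no longer apply the wrong-width extraction to a member. */
static inline uint32_t
event_size(const uint8_t *e)
{
    return extract_u32le(e + EVENT_U32LE_SIZE);
}

static inline uint16_t
event_type(const uint8_t *e)
{
    return extract_u16le(e + EVENT_U16LE_TYPE);
}
```

The accessors inline away completely, so this costs nothing over the raw offset arithmetic.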

Memory mapping and iteration

There’s a real advantage to memory mapping the input file and using its contents directly. On a system with a huge virtual address space, such as x86-64 or AArch64, this memory is almost “free.” Already being backed by a file, paging out this memory costs nothing (i.e. it’s discarded). The input file can comfortably be much larger than physical memory without straining the system.

Unportable structure overlay can take advantage of memory mapping this way, but has the previously-described issues. An approach with member offset constants will take advantage of it just as well, all while remaining clean and portable.

I like to wrap the memory mapping code into a simple interface, which makes porting to non-POSIX platforms, such as Windows, easier. Caveat: This won’t work with files whose size exceeds the available contiguous virtual memory of the system, a real problem for 32-bit systems.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

uint8_t *
map_file(const char *path, size_t *length)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    struct stat stat;
    if (fstat(fd, &stat) == -1) {
        close(fd);
        return 0;
    }

    *length = stat.st_size;  // TODO: possible overflow
    uint8_t *p = mmap(0, *length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

void
unmap_file(uint8_t *p, size_t length)
{
    munmap(p, length);
}

Next, here’s an example that iterates over all the structures in input_file, in this case counting each. The size member is extracted in order to stride to the next structure.

size_t length;
uint8_t *data = map_file(input_file, &length);
if (!data)
    FATAL();

size_t event_count = 0;
uint8_t *p = data;
while (p < data + length) {
    event_count++;
    uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
    if (size < 16 || size > length - (p - data))
        FATAL();  // invalid size: smaller than the header or past end of file
    p += size;
}
printf("I see %zu events.\n", event_count);

unmap_file(data, length);

This is the basic structure for navigating this kind of data. A deeper dive would involve a switch inside the loop, extracting the relevant members for whatever use is needed.
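That deeper dive might be sketched like so, with a hypothetical count_ufos() that dispatches on the type field using the constants defined earlier (the 16-byte minimum guards against a malformed size that would stall the loop):

```c
#include <stddef.h>
#include <stdint.h>

#define EVENT_TYPE_UFO_SIGHTING 20

#define EVENT_U32LE_SIZE  8
#define EVENT_U16LE_TYPE  14

static uint16_t
extract_u16le(const uint8_t *b)
{
    return (uint16_t)b[1] << 8 | (uint16_t)b[0];
}

static uint32_t
extract_u32le(const uint8_t *b)
{
    return (uint32_t)b[3] << 24 | (uint32_t)b[2] << 16 |
           (uint32_t)b[1] <<  8 | (uint32_t)b[0];
}

/* Count the UFO sightings among the events in buf[0..length). */
static long
count_ufos(const uint8_t *buf, size_t length)
{
    long ufos = 0;
    const uint8_t *p = buf;
    while (p < buf + length) {
        uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
        if (size < 16 || size > length - (p - buf))
            break;  /* invalid size */
        switch (extract_u16le(p + EVENT_U16LE_TYPE)) {
        case EVENT_TYPE_UFO_SIGHTING:
            /* a real program would extract azimuth/elevation here */
            ufos++;
            break;
        }
        p += size;  /* stride to the next structure */
    }
    return ufos;
}
```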

Fast, correct, simple. Pick three.

A Magnetized Needle and a Steady Hand

Now they’ve gone and done it. An unidentified agency has spread a potent computer virus across all the world’s computers and deleted the binaries for every copy of every software development tool. Even the offline copies — it’s that potent.

Most of the source code still exists, even for the compilers, and most computer systems will continue operating without disruption, but no new software can be developed unless it’s written byte by byte in raw machine code. Only real programmers can get anything done.

The world’s top software developers have been put to work bootstrapping a C compiler (and others) completely from scratch so that we can get back to normal. Without even an assembler, it’s a slow, tedious process.

In the mean time, rather than wait around for the bootstrap work to complete, the rest of us have been assigned individual programs hit by the virus. For example, many basic unix utilities have been wiped out, and the bootstrap would benefit from having them. Having different groups tackle each missing program will allow the bootstrap effort to move forward somewhat in parallel. At least that’s what the compiler nerds told us. The real reason is that they’re tired of being asked if they’re done yet, and these tasks will keep the rest of us quietly busy.

Fortunately you and I have been assigned the easiest task of all: We’re to write the true command from scratch. We’ll have to figure it out byte by byte. The target is x86-64 Linux, which means we’ll need the following documentation:

  1. Executable and Linking Format (ELF) Specification. This is the binary format used by modern Unix-like systems, including Linux. A more convenient way to access this document is man 5 elf.

  2. Intel 64 and IA-32 Architectures Software Developer’s Manual (Volume 2). This fully documents the instruction set and its encoding. It’s all the information needed to write x86 machine code by hand. The AMD manuals would work too.

  3. System V Application Binary Interface: AMD64 Architecture Processor Supplement. Only a few pieces of information are needed from this document, but more would be needed for a more substantial program.

  4. Some magic numbers from header files.

Manual Assembly

The program we’re writing is true, whose behavior is documented as “do nothing, successfully.” All command line arguments are ignored and no input is read. The program only needs to perform the exit system call, immediately terminating the process.

According to the ABI document (3) Appendix A, the registers for system call arguments are: rdi, rsi, rdx, r10, r8, r9. The system call number goes in rax. The exit system call takes only one argument, and that argument will be 0 (success), so rdi should be set to zero. It’s likely that it’s already zero when the program starts, but the ABI document says its contents are undefined (§3.4), so we’ll set it explicitly.

For Linux on x86-64, the system call number for exit is 60 (/usr/include/asm/unistd_64.h), so rax will be set to 60, followed by syscall.

    xor  edi, edi
    mov  eax, 60
    syscall

There’s no assembler available to turn this into machine code, so it has to be assembled by hand. For that we need the Intel manual (2).

The first instruction is xor, so look up that mnemonic in the manual. Like most x86 mnemonics, there are many different opcodes and many ways to encode the same operation. In this case, there are 22 options.

The operands are two 32-bit registers, so there are two options available (opcodes 0x31 and 0x33):

31 /r      XOR r/m32, r32
33 /r      XOR r32, r/m32

The “r/m32” means the operand can be either a register or the address of a 32-bit region of memory. With two register operands, both encodings are equally valid, both have the same length (2 bytes), and neither is canonical, so the decision is entirely arbitrary. Let’s pick the first one, opcode 0x31, since it’s listed first.

The “/r” after the opcode means the register-only operand (“r32” in both cases) will be specified in the ModR/M byte. This is the byte that immediately follows the opcode and specifies one of two of the operands.

The ModR/M byte is broken into three parts: mod (2 bits), reg (3 bits), r/m (3 bits). This gets a little complicated, but if you stare at Table 2-1 in the Intel manual for long enough it eventually makes sense. In short, two high bits (11) for mod indicates we’re working with a register rather than a load. Here’s where we’re at for ModR/M:

11 ??? ???

The order of the x86 registers is unintuitive: ax, cx, dx, bx, sp, bp, si, di. With 0-indexing, that gives di a value of 7 (111 in binary). With edi as both operands, this makes ModR/M:

11 111 111

Or, in hexadecimal, FF. And that’s it for this instruction. With the opcode (0x31) and the ModR/M byte (0xFF):

31 FF

The encoding for mov is a bit different. Look it up and match the operands. Like before, there are two possible options:

B8+rd id   MOV r32, imm32
C7 /0 id   MOV r/m32, imm32

The B8+rd notation means the 32-bit register operand (rd for “register double word”) is added to the opcode instead of being specified in a ModR/M byte. It’s followed by a 32-bit immediate value (id for “integer double word”). That’s a total of 5 bytes.

The “/0” in the second form means 0 goes in the “reg” field of ModR/M, and the whole instruction is followed by the 32-bit immediate (id). That’s a total of 6 bytes. Since this is longer, we’ll use the first encoding.

So, that’s opcode 0xB8 + 0, since eax is register number 0, followed by 60 (0x3C) as a little endian, 4-byte value. Here’s the encoding for the second instruction:

B8 3C 00 00 00

The final instruction is a cakewalk. There are no operands to encode, and it comes in only one form: two opcode bytes.

0F 05   SYSCALL

So the encoding for this instruction is:

0F 05

Putting it all together, the program is 9 bytes:

31 FF B8 3C 00 00 00 0F 05
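As a sanity check, the hand-assembly can be reproduced mechanically. This sketch rebuilds the same 9 bytes from the encoding rules derived above; modrm() packs the mod/reg/r/m fields, and all the opcode values come straight from the derivation:

```c
#include <stddef.h>
#include <stdint.h>

/* Pack mod (2 bits), reg (3 bits), r/m (3 bits) into a ModR/M byte. */
static uint8_t
modrm(int mod, int reg, int rm)
{
    return (uint8_t)(mod << 6 | reg << 3 | rm);
}

/* Emit the three-instruction program into out[9], returning its length. */
static size_t
assemble_true(uint8_t *out)
{
    size_t n = 0;
    int eax = 0, edi = 7;            /* register numbers */
    out[n++] = 0x31;                 /* XOR r/m32, r32 */
    out[n++] = modrm(3, edi, edi);   /* xor edi, edi -> mod=11, both edi */
    out[n++] = 0xB8 + eax;           /* MOV r32, imm32 (B8+rd) */
    out[n++] = 60;                   /* imm32: exit system call number, */
    out[n++] = 0;                    /* little endian                   */
    out[n++] = 0;
    out[n++] = 0;
    out[n++] = 0x0F;                 /* SYSCALL */
    out[n++] = 0x05;
    return n;
}
```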

Aren’t you glad you don’t normally have to assemble entire programs by hand?

Constructing the ELF

Back in the old days you might have been able to simply drop these bytes into a file and execute it. That’s how DOS COM programs worked. But that definitely won’t work on Linux. Binaries must be in the Executable and Linking Format (ELF). This format tells the loader how to initialize the program in memory and how to start it.

Fortunately for this program we’ll only need to fill out two structures: the ELF header and one program header. The binary will be the ELF header, followed immediately by the program header, followed immediately by the program.

To fill this binary out, we’d use whatever method the virus left behind for writing raw bytes to a file. For now I’ll assume the echo command is still available, and we’ll use hexadecimal \xNN escapes to write raw bytes. If this isn’t available, you might need to use the magnetic needle and steady hand method, or the butterflies.

The very first structure in an ELF file must be the ELF header, from the ELF specification (1):

    typedef struct {
        unsigned char e_ident[EI_NIDENT];
        uint16_t      e_type;
        uint16_t      e_machine;
        uint32_t      e_version;
        ElfN_Addr     e_entry;
        ElfN_Off      e_phoff;
        ElfN_Off      e_shoff;
        uint32_t      e_flags;
        uint16_t      e_ehsize;
        uint16_t      e_phentsize;
        uint16_t      e_phnum;
        uint16_t      e_shentsize;
        uint16_t      e_shnum;
        uint16_t      e_shstrndx;
    } ElfN_Ehdr;

No other data is at a fixed location because this header specifies where it can be found. If you’re writing a C program in the future, once compilers have been bootstrapped back into existence, you can access this structure in elf.h.

The ELF header

The EI_NIDENT macro is 16, so e_ident is 16 bytes. The first 4 bytes are fixed: 0x7F, E, L, F.

The 5th byte is called EI_CLASS: a 32-bit program (ELFCLASS32 = 1) or a 64-bit program (ELFCLASS64 = 2). This will be a 64-bit program (2).

The 6th byte indicates the integer format (EI_DATA). The one we want for x86-64 is ELFDATA2LSB (1), two’s complement, little-endian.

The 7th byte is the ELF version (EI_VERSION), always 1 as of this writing.

The 8th byte is the ABI (EI_OSABI), which in this case is ELFOSABI_SYSV (0).

The 9th byte is the ABI version (EI_ABIVERSION), which is just 0 again.

The rest is zero padding.

So writing the ELF header:

echo -ne '\x7FELF\x02\x01\x01\x00' > true
echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The next field is the e_type. This is an executable program, so it’s ET_EXEC (2). Other options are object files (ET_REL = 1), shared libraries (ET_DYN = 3), and core files (ET_CORE = 4).

echo -ne '\x02\x00' >> true

The value for e_machine is EM_X86_64 (0x3E). This value isn’t in the ELF specification but rather the ABI document (§4.1.1). On BSD this is instead named EM_AMD64.

echo -ne '\x3E\x00' >> true

For e_version it’s always 1, like in the header.

echo -ne '\x01\x00\x00\x00' >> true

The e_entry field will be 8 bytes because this is a 64-bit ELF. This is the virtual address of the program’s entry point. It’s where the loader will pass control and so it’s where we’ll load the program. For a reason I’ll explain shortly, our entry point will be 120 bytes (0x78) past a nice round number, at 0x40000078.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The e_phoff field holds the file offset of the program header table. The ELF header is 64 bytes (0x40), and the program header will immediately follow it. This field is also 8 bytes.

echo -ne '\x40\x00\x00\x00\x00\x00\x00\x00' >> true

The e_shoff field holds the offset of the section header table. In an executable program we don’t need sections, so this is zero.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The e_flags field has processor-specific flags, which in our case is just 0.

echo -ne '\x00\x00\x00\x00' >> true

The e_ehsize holds the size of the ELF header, which, as I said, is 64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_phentsize is the size of one program header, which is 56 bytes (0x38).

echo -ne '\x38\x00' >> true

The e_phnum field indicates how many program headers there are. We only need the one: the segment with the 9 program bytes, to be loaded into memory.

echo -ne '\x01\x00' >> true

The e_shentsize is the size of a section header. We’re not using this, but we’ll do our due diligence. These are 64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_shnum field is the number of sections (0).

echo -ne '\x00\x00' >> true

The e_shstrndx is the index of the section with the string table. It doesn’t exist, so it’s 0.

echo -ne '\x00\x00' >> true

The program header

Next is our program header.

    typedef struct {
        uint32_t   p_type;
        uint32_t   p_flags;
        Elf64_Off  p_offset;
        Elf64_Addr p_vaddr;
        Elf64_Addr p_paddr;
        uint64_t   p_filesz;
        uint64_t   p_memsz;
        uint64_t   p_align;
    } Elf64_Phdr;

The p_type field indicates the segment type. This segment will hold the program and will be loaded into memory, so we want PT_LOAD (1). Other kinds of segments set up dynamic loading and such.

echo -ne '\x01\x00\x00\x00' >> true

The p_flags field gives the memory protections. We want executable (PF_X = 1) and readable (PF_R = 4). These are ORed together to make 5.

echo -ne '\x05\x00\x00\x00' >> true

The p_offset is the file offset for the content of this segment: the program we assembled, which will immediately follow this header. The ELF header was 64 bytes, plus a 56-byte program header, putting the offset at 120 (0x78).

echo -ne '\x78\x00\x00\x00\x00\x00\x00\x00' >> true

The p_vaddr is the virtual address where this segment will be loaded. This is the entry point from before. A restriction is that this value must be congruent with p_offset modulo the page size. That’s why the entry point was offset by 120 bytes.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The p_paddr is unused for this platform.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The p_filesz is the size of the segment in the file: 9 bytes.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_memsz is the size of the segment in memory, also 9 bytes. It might sound redundant, but the two are allowed to differ: when p_memsz is larger than p_filesz, the remainder of the segment is zero-filled in memory.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_align indicates the segment’s alignment. We don’t care about alignment.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

Append the program

Finally, append the program we assembled at the beginning.

echo -ne '\x31\xFF\xB8\x3C\x00\x00\x00\x0F\x05' >> true

Set it executable (hopefully chmod survived!):

chmod +x true

And test it:

./true && echo 'Success'

Here’s the whole thing as a shell script:
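That script is simply the echo commands above concatenated into one file. As a cross-check, the same 129 bytes (64 + 56 + 9) can also be emitted from C; every value in this sketch is taken from the walkthrough above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build the 129-byte `true` binary: 64-byte ELF header,
 * 56-byte program header, then 9 bytes of machine code. */
static size_t
build_true(uint8_t *out)  /* out must hold 129 bytes */
{
    static const uint8_t ehdr[64] = {
        0x7F, 'E', 'L', 'F', 2, 1, 1, 0,  0, 0, 0, 0, 0, 0, 0, 0,
        2, 0,                          /* e_type: ET_EXEC */
        0x3E, 0,                       /* e_machine: EM_X86_64 */
        1, 0, 0, 0,                    /* e_version */
        0x78, 0, 0, 0x40, 0, 0, 0, 0,  /* e_entry: 0x40000078 */
        0x40, 0, 0, 0, 0, 0, 0, 0,     /* e_phoff: 64 */
        0, 0, 0, 0, 0, 0, 0, 0,        /* e_shoff: none */
        0, 0, 0, 0,                    /* e_flags */
        0x40, 0,                       /* e_ehsize: 64 */
        0x38, 0,                       /* e_phentsize: 56 */
        1, 0,                          /* e_phnum */
        0x40, 0,                       /* e_shentsize */
        0, 0,                          /* e_shnum */
        0, 0,                          /* e_shstrndx */
    };
    static const uint8_t phdr[56] = {
        1, 0, 0, 0,                    /* p_type: PT_LOAD */
        5, 0, 0, 0,                    /* p_flags: PF_R | PF_X */
        0x78, 0, 0, 0, 0, 0, 0, 0,     /* p_offset: 120 */
        0x78, 0, 0, 0x40, 0, 0, 0, 0,  /* p_vaddr: 0x40000078 */
        0, 0, 0, 0, 0, 0, 0, 0,        /* p_paddr: unused */
        9, 0, 0, 0, 0, 0, 0, 0,        /* p_filesz */
        9, 0, 0, 0, 0, 0, 0, 0,        /* p_memsz */
        0, 0, 0, 0, 0, 0, 0, 0,        /* p_align */
    };
    static const uint8_t code[9] = {
        0x31, 0xFF,                    /* xor edi, edi */
        0xB8, 0x3C, 0, 0, 0,           /* mov eax, 60  */
        0x0F, 0x05,                    /* syscall      */
    };
    memcpy(out, ehdr, 64);
    memcpy(out + 64, phdr, 56);
    memcpy(out + 120, code, 9);
    return 129;
}
```

Writing the buffer to a file and setting it executable should produce the same working true.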

Is the C compiler done bootstrapping yet?

Baking Data with Serialization

Suppose you want to bake binary data directly into a program’s executable. It could be image pixel data (PNG, BMP, JPEG), a text file, or some sort of complex data structure. Perhaps the purpose is to build a single executable with no extraneous data files — easier to install and manage, though harder to modify. Or maybe you’re lazy and don’t want to worry about handling the various complications and errors that arise when reading external data: Where to find it, and what to do if you can’t find it or can’t read it. This article is about two different approaches I’ve used a number of times for C programs.

The linker approach

The simpler, less portable option is to have the linker do it. Both the GNU linker and the gold linker (ELF only) can create object files from arbitrary files using the --format (-b) option set to binary (raw data). It’s combined with --relocatable (-r) to make it linkable with the rest of the program. MinGW supports all of this, too, so it’s fairly portable so long as you stick to GNU Binutils.

For example, to create an object file, my_msg.o with the contents of the text file my_msg.txt:

$ ld -r -b binary -o my_msg.o my_msg.txt

The object file will have three symbols, each named after the input file. Unfortunately there’s no control over the symbol names, section (.data), alignment, or protections (e.g. read-only). You’re completely at the whim of the linker, short of objcopy tricks.

$ nm my_msg.o
000000000000000e D _binary_my_msg_txt_end
000000000000000e A _binary_my_msg_txt_size
0000000000000000 D _binary_my_msg_txt_start

To access these in C, declare them as global variables like so:

extern char _binary_my_msg_txt_start[];
extern char _binary_my_msg_txt_end[];
extern char _binary_my_msg_txt_size;

The size symbol, _binary_my_msg_txt_size, is misleading. The “A” from nm means it’s an absolute symbol, not relocated. It doesn’t refer to an integer that holds the size of the raw data. The value of the symbol itself is the size of the data. That is, take the address of it and cast it to an integer.

size_t size = (size_t)&_binary_my_msg_txt_size;

Alternatively — and this is my own preference — just subtract the other two symbols. It’s cleaner and easier to understand.

size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;

Here’s the “Hello, world” for this approach (hello.c).

#include <stdio.h>

extern char _binary_my_msg_txt_start[];
extern char _binary_my_msg_txt_end[];
extern char _binary_my_msg_txt_size;

int
main(void)
{
    size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;
    fwrite(_binary_my_msg_txt_start, size, 1, stdout);
    return 0;
}

The program has to use fwrite() rather than fputs() because the data won’t necessarily be null-terminated. That is, unless a null is intentionally put at the end of the text file itself.

And for the build:

$ cat my_msg.txt
Hello, world!
$ ld -r -b binary -o my_msg.o my_msg.txt
$ gcc -o hello hello.c my_msg.o
$ ./hello
Hello, world!

If this were binary data, such as an image file, the program would instead read the array as if it were a memory mapped file. In fact, that’s what it really is: the raw data memory mapped by the loader before the program started.

How about a data structure dump?

This could be taken further to dump out some kinds of data structures. For example, this program (table_gen.c) fills out a table of the first 90 Fibonacci numbers and dumps it to standard output.

#include <stdio.h>

#define TABLE_SIZE 90

long long table[TABLE_SIZE] = {1, 1};

int
main(void)
{
    for (int i = 2; i < TABLE_SIZE; i++)
        table[i] = table[i - 1] + table[i - 2];
    fwrite(table, sizeof(table), 1, stdout);
    return 0;
}

Build and run this intermediate helper program as part of the overall build.

$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen > table.bin
$ ld -r -b binary -o table.o table.bin

And then the main program (print_fib.c) might look like:

#include <stdio.h>

extern long long _binary_table_bin_start[];
extern long long _binary_table_bin_end[];

int
main(void)
{
    long long *start = _binary_table_bin_start;
    long long *end   = _binary_table_bin_end;
    for (long long *x = start; x < end; x++)
        printf("%lld\n", *x);
    return 0;
}

However, there are some good reasons not to use this feature in this way:

  1. The format of table.bin is specific to the host architecture (byte order, size, padding, etc.). If the host is the same as the target then this isn’t a problem, but it will prohibit cross-compilation.

  2. The linker has no information about the alignment requirements of the data. To the linker it’s just a byte buffer. In the final program the long long array will not necessarily be aligned properly for its type, meaning the above program might crash. The Right Way is to never dereference the data directly but rather memcpy() it into a properly-aligned variable, just as if the data were an unaligned buffer.

  3. The data structure cannot use any pointers. Pointer values are meaningless to other processes and will be no different than garbage.
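The memcpy() fix from point 2 can be wrapped in a small helper. This is just a sketch of the idea, not code from the original programs, and read_ll() is a made-up name:

```c
#include <string.h>

/* Read the i-th long long from a possibly-unaligned byte buffer,
 * such as data embedded with `ld -r -b binary`. The memcpy() keeps
 * the access well-defined, and compilers typically lower it to a
 * plain (unaligned) load on targets that support one. */
static long long
read_ll(const char *buf, size_t i)
{
    long long x;
    memcpy(&x, buf + i * sizeof(x), sizeof(x));
    return x;
}
```

In print_fib.c this would mean iterating by index over the raw `_binary_table_bin_start` bytes rather than dereferencing them as a long long array.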

Towards a more portable approach

There’s an easy way to address all three of these problems and eliminate the reliance on GNU linkers: serialize the data into C code. It’s metaprogramming, baby.

In the Fibonacci example, change the fwrite() in table_gen.c to this:

    printf("int table_size = %d;\n", TABLE_SIZE);
    printf("long long table[] = {\n");
    for (int i = 0; i < TABLE_SIZE; i++)
        printf("    %lldLL,\n", table[i]);
    printf("};\n");

The output of the program becomes text:

int table_size = 90;
long long table[] = {
    1LL,
    1LL,
    2LL,
    3LL,
    /* ... */
    1779979416004714189LL,
    2880067194370816120LL,
};

And print_fib.c is changed to:

#include <stdio.h>

extern int table_size;
extern long long table[];

int
main(void)
{
    for (int i = 0; i < table_size; i++)
        printf("%lld\n", table[i]);
    return 0;
}

Putting it all together:

$ gcc -std=c99 -o table_gen table_gen.c
$ ./table_gen > table.c
$ gcc -std=c99 -o print_fib print_fib.c table.c

Any C compiler and linker could do all of this, no problem, making it more portable. The intermediate metaprogram isn’t a barrier to cross compilation. It would be compiled for the host (typically identified through HOST_CC) and the rest is compiled for the target (e.g. CC).

The output of table_gen.c isn’t dependent on any architecture, making it cross-compiler friendly. There are also no alignment problems because it’s all visible to the compiler. The type system isn’t being undermined.
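The same idea generalizes beyond tables of integers: a small host tool can turn any binary file into a C array, much in the spirit of `xxd -i`. Here’s a sketch of such a generator — emit_c_array() is a hypothetical name, not a tool from this article:

```c
#include <stdio.h>

/* Write the bytes of `in` to `out` as a C unsigned char array
 * definition named `name`, followed by a size constant. Returns the
 * number of bytes emitted. */
static long
emit_c_array(FILE *in, FILE *out, const char *name)
{
    long n = 0;
    int c;
    fprintf(out, "const unsigned char %s[] = {", name);
    while ((c = fgetc(in)) != EOF) {
        if (n % 12 == 0)
            fputs("\n   ", out);  /* wrap at 12 bytes per line */
        fprintf(out, " 0x%02x,", c);
        n++;
    }
    fprintf(out, "\n};\nconst long %s_size = %ldL;\n", name, n);
    return n;
}
```

The emitted .c file compiles anywhere, and the consuming program just declares `extern const unsigned char name[];`.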

Dealing with pointers

The Fibonacci example doesn’t address the pointer problem — it has no pointers to speak of. So let’s step it up to a trie using the trie from the previous article. As a reminder, here it is:

#define TRIE_ALPHABET_SIZE  4
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    struct trie *p;
    int i;
    unsigned flags;
};

Dumping these structures out raw would definitely be useless since they’re almost entirely pointer data. So instead, fill out an array of these structures, referencing the array itself to build up the pointers (later filled in by either the linker or the loader). This code uses the in-place breadth-first traversal technique from the previous article.

void
trie_serialize(struct trie *t, const char *name)
{
    printf("struct trie %s[] = {\n", name);
    struct trie *head = t;
    struct trie *tail = t;
    t->p = NULL;
    size_t count = 0;
    while (head) {
        printf("    {{");
        for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
            struct trie *next = head->next[i];
            const char *comma = i ? ", " : "";
            if (next) {
                /* Add child to the queue. */
                tail->p = next;
                tail = next;
                next->p = NULL;
                /* Print the pointer to the child. */
                printf("%s%s + %zu", comma, name, ++count);
            } else {
                printf("%s0", comma);
            }
        }
        printf("}, 0, 0, %u},\n", head->flags & TRIE_TERMINAL_FLAG);
        head = head->p;
    }
    printf("};\n");
}

Remember that list of strings from before?

AAAAA
ABCD
CAA
CAD
CDBD

Which looks like this?

That serializes to this C code:

struct trie root[] = {
    {{root + 1, 0, root + 2, 0}, 0, 0, 0},
    {{root + 3, root + 4, 0, 0}, 0, 0, 0},
    {{root + 5, 0, 0, root + 6}, 0, 0, 0},
    {{root + 7, 0, 0, 0}, 0, 0, 0},
    {{0, 0, root + 8, 0}, 0, 0, 0},
    {{root + 9, 0, 0, root + 10}, 0, 0, 0},
    {{0, root + 11, 0, 0}, 0, 0, 0},
    {{root + 12, 0, 0, 0}, 0, 0, 0},
    {{0, 0, 0, root + 13}, 0, 0, 0},
    {{0, 0, 0, 0}, 0, 0, 1},
    {{0, 0, 0, 0}, 0, 0, 1},
    {{0, 0, 0, root + 14}, 0, 0, 0},
    {{root + 15, 0, 0, 0}, 0, 0, 0},
    {{0, 0, 0, 0}, 0, 0, 1},
    {{0, 0, 0, 0}, 0, 0, 1},
    {{0, 0, 0, 0}, 0, 0, 1},
};

This trie can be immediately used at program startup without initialization, and it can even have new nodes inserted into it. It’s not without its downsides, particularly because it’s a trie:

  1. It’s really going to blow up the size of the binary, especially when it holds lots of strings. These nodes are anything but compact.

  2. If the code is compiled to be position-independent (-fPIC), each of those nodes is going to hold multiple dynamic relocations, further exploding the size of the binary and preventing the trie from being shared between processes. It’s 24 bytes per relocation on x86-64. This will also slow down program start up time. With just a few thousand strings, the simple test program was taking 5x longer to start (25ms instead of 5ms) than with an empty trie.

  3. Even without being position-independent, the linker will have to resolve all the compile-time relocations. I was able to overwhelm the linker and run it out of memory with just some tens of thousands of strings. This would make for a decent linker stress test.

This technique obviously doesn’t scale well with trie data. You’re better off baking in the flat string list and building the trie at run time — though you could compute the exact number of needed nodes at compile time and statically allocate them (in .bss). I’ve personally had much better luck with other sorts of lookup tables. It’s a useful tool for the C programmer’s toolbelt.
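That alternative can be sketched like so: bake in the flat key list, statically allocate a node pool in .bss, and build the trie at startup with no malloc(). The names and the pool size here are my own assumptions, not code from this article:

```c
#include <stddef.h>

#define TRIE_ALPHABET_SIZE 4
#define TRIE_TERMINAL_FLAG (1U << 0)
#define POOL_SIZE 64  /* the exact count could be computed at build time */

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

/* Baked-in key list: one compile-time relocation per string. */
static const char *const keys[] = {"AAAAA", "ABCD", "CAA", "CAD", "CDBD"};

static struct trie pool[POOL_SIZE];  /* zero-initialized, lives in .bss */
static size_t pool_used = 1;         /* pool[0] is the root */

static int
trie_insert_static(const char *s)
{
    struct trie *t = pool;
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            if (pool_used == POOL_SIZE)
                return 0;  /* pool exhausted */
            t->next[i] = &pool[pool_used++];
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}
```

Inserting the five example keys at startup uses exactly 16 nodes, the same count as the serialized array above, but with only five relocations (the string pointers) instead of one per child edge.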

Zero-allocation Trie Traversal

As part of a demonstration in an upcoming article, I wrote a simple trie implementation. A trie is a search tree where the keys are a sequence of symbols (i.e. strings). Strings with a common prefix share an initial path down the trie, and the keys themselves are stored implicitly by the structure of the trie. It’s commonly used as a sorted set or, when values are associated with nodes, an associative array.

This wasn’t my first time writing a trie. The curse of programming in C is rewriting the same data structures and algorithms over and over. It’s the problem C++ templates are intended to solve. This rewriting isn’t always bad since each implementation is typically customized for its specific use, often resulting in greater performance and a smaller resource footprint.

Every time I’ve rewritten a trie, my implementation is a little bit better than the last. This time around I discovered an approach for traversing, both depth-first and breadth-first, an arbitrarily-sized trie without memory allocation. I’m definitely not the first to discover something like this. There’s Deutsch-Schorr-Waite pointer reversal for binary graphs (1965) — which I originally learned from reading the Scheme 9 from Outer Space garbage collector source — and Morris in-order traversal (1979) for binary trees. The former requires two extra tag bits per node and the latter requires no modifications at all.

What’s a trie?

But before I go further, some background. A trie can come in many shapes and sizes, but in the simple case each node of a trie has as many pointers as its alphabet. For illustration purposes, imagine a trie for strings of only four characters: A, B, C, and D. Each node is essentially four pointers.

#define TRIE_ALPHABET_SIZE  4
#define TRIE_STATIC_INIT    {.flags = 0}
#define TRIE_TERMINAL_FLAG  (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

It includes a flags field, where a single bit tracks whether or not a node is terminal — that is, a key terminates at this node. Terminal nodes are not necessarily leaf nodes, which is the case when one key is a prefix of another key. I could instead have used a 1-bit bit-field (e.g. int is_terminal : 1;) but I don’t like bit-fields.

A trie with the following keys, inserted in any order:

AAAAA
ABCD
CAA
CAD
CDBD

Looks like this (terminal nodes illustrated as small black squares):

The root of the trie is the empty string, and each child represents a trie prefixed with one of the symbols from the alphabet. This is a nice recursive definition, and it’s tempting to write recursive functions to process it. For example, here’s a recursive insertion function.

int
trie_insert_recursive(struct trie *t, const char *s)
{
    if (!*s) {
        t->flags |= TRIE_TERMINAL_FLAG;
        return 1;
    }

    int i = *s - 'A';
    if (!t->next[i]) {
        t->next[i] = malloc(sizeof(*t->next[i]));
        if (!t->next[i])
            return 0;
        *t->next[i] = (struct trie)TRIE_STATIC_INIT;
    }
    return trie_insert_recursive(t->next[i], s + 1);
}

If the string is empty (!*s), mark the current node as terminal. Otherwise recursively insert the substring under the appropriate child. That’s a tail call, and any optimizing compiler would optimize this call into a jump back to the beginning of the function (tail-call optimization), reusing the stack frame as if it were a simple loop.

If that’s not good enough, such as when optimization is disabled for debugging and the recursive definition is blowing the stack, this is trivial to convert to a safe, iterative function. I prefer this version anyway.

int
trie_insert(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = malloc(sizeof(*t->next[i]));
            if (!t->next[i])
                return 0;
            *t->next[i] = (struct trie)TRIE_STATIC_INIT;
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return 1;
}

Finding a particular prefix in the trie iteratively is also easy. This would be used to narrow the trie to a chosen prefix before iterating over the keys (e.g. find all strings matching a prefix).

struct trie *
trie_find(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i])
            return NULL;
        t = t->next[i];
    }
    return t;
}
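As a quick illustration of narrowing by prefix, here’s trie_find() run against a few hand-linked static nodes. This is a standalone sketch for a single key, “CA” — not test code from the article:

```c
#include <stddef.h>

#define TRIE_ALPHABET_SIZE 4

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};

static struct trie *
trie_find(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i])
            return NULL;
        t = t->next[i];
    }
    return t;
}

/* Hand-linked trie holding only the key "CA". */
static struct trie ca   = {.flags = 1};             /* terminal */
static struct trie c    = {.next = {[0] = &ca}};    /* 'A' edge */
static struct trie root = {.next = {[2] = &c}};     /* 'C' edge */
```

Finding “CA” returns the terminal node, finding a missing prefix returns NULL, and the empty prefix returns the root itself — the whole trie.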

Depth-first traversal is stack-oriented. The stack represents the path through the graph, and each new vertex is pushed into this stack as it’s visited. A recursive traversal function can implicitly use the call stack for storing this information, so no additional data structure is needed.

The downside is that the call is no longer tail-recursive, so a large trie will blow the stack. Also, the caller needs to provide a callback function because the stack cannot unwind to return a value: The stack has important state on it. Here’s a typedef for the callback.

typedef void (*trie_visitor)(const char *key, void *arg);

And here’s the recursive depth-first traversal function. The top-level caller passes the same buffer for buf and bufend, which must be at least as large as the largest key. The visited key will be written to this buffer and passed to the visitor.

void
trie_dfs_recursive(struct trie *t,
                   char *buf,
                   char *bufend,
                   trie_visitor v,
                   void *arg)
{
    if (t->flags & TRIE_TERMINAL_FLAG) {
        *bufend = 0;
        v(buf, arg);
    }

    for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
        if (t->next[i]) {
            *bufend = 'A' + i;
            trie_dfs_recursive(t->next[i], buf, bufend + 1, v, arg);
        }
    }
}

Heap-allocated Traversal Stack

Moving the traversal stack to the heap would eliminate the stack overflow problem and it would allow control to return to the caller. This is going to be a lot of code for an article, but bear with me.

First define an iterator object. The stack will need two pieces of information: which node did we come from (p) and through which pointer (i). When a node has been exhausted, this will allow return to the parent. The root field tracks when traversal is complete.

struct trie_iter {
    struct trie *root;
    char *buf;
    char *bufend;
    struct {
        struct trie *p;
        int i;
    } *stack;
};

A special value of -1 in i means it’s the first visit for this node and it should be visited by the callback if it’s terminal.

The iterator is initialized with trie_iter_init. The max indicates the maximum length of any key. A more elaborate implementation could automatically grow the stack to accommodate (e.g. realloc()), but I’m keeping it as simple as possible.

int
trie_iter_init(struct trie_iter *it, struct trie *t, size_t max)
{
    it->root = t;
    it->stack = malloc(sizeof(*it->stack) * max);
    if (!it->stack)
        return 0;
    it->buf = it->bufend = malloc(max);
    if (!it->buf) {
        free(it->stack);
        return 0;
    }
    it->stack->p = t;
    it->stack->i = -1;
    return 1;
}

void
trie_iter_destroy(struct trie_iter *it)
{
    free(it->stack);
    it->stack = NULL;
    free(it->buf);
    it->buf = NULL;
}

And finally the complicated part. This uses the allocated stack to explore the trie in a loop until it hits a terminal, at which point it returns. A further call continues the traversal from where it left off. It’s like a hand-coded generator. With the way it’s written, the caller is obligated to follow through with the entire iteration before destroying the iterator, but this would be easy to correct.

int
trie_iter_next(struct trie_iter *it)
{
    for (;;) {
        struct trie *current = it->stack->p;
        int i = it->stack->i++;

        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->bufend = 0;
                return 1;
            }
            continue;
        }

        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (current == it->root)
                return 0;  // back at root, done
            it->stack--;
            it->bufend--;
            continue;
        }

        if (current->next[i]) {
            /* Push on next child node. */
            *it->bufend = 'A' + i;
            it->stack++;
            it->bufend++;
            it->stack->p = current->next[i];
            it->stack->i = -1;
        }
    }
}

This is much nicer for the caller since there’s no inversion of control.

struct trie_iter it;
trie_iter_init(&it, &trie_root, KEY_MAX);
while (trie_iter_next(&it)) {
    // ... do something with it.buf ...
}
trie_iter_destroy(&it);

There are a few downsides to this:

  1. Initialization could fail (not checked in the example) since it allocates memory.

  2. Either the caller has to keep track of the maximum key length, or the iterator grows the stack automatically, which would mean iteration could fail at any point in the middle.

  3. In order to destroy the trie, it needs to be traversed: Freeing memory first requires allocating memory. If the program is out of memory, it cannot destroy the trie to clean up before handling the situation, nor to make more memory available. It’s not good for resilience.

Wouldn’t it be nice to traverse the trie without memory allocation?

Modifying the Trie

Rather than allocate a separate stack, the stack can be allocated across the individual nodes of the trie. Remember those p and i fields from before? Put them on the trie.

struct trie_v2 {
    struct trie_v2 *next[TRIE_ALPHABET_SIZE];
    struct trie_v2 *p;
    int i;
    unsigned flags;
};

This automatically scales with the size of the trie, so there will always be enough of this stack. With the stack “pre-allocated” like this, traversal requires no additional memory allocation.

The iterator itself becomes a little simpler. It cannot fail and it doesn’t need a destructor.

struct trie_v2_iter {
    struct trie_v2 *current;
    char *buf;
};

void
trie_v2_iter_init(struct trie_v2_iter *it, struct trie_v2 *t, char *buf)
{
    t->p = NULL;
    t->i = -1;
    it->current = t;
    it->buf = buf;
}

The iteration function itself is almost identical to before. Rather than increment a stack pointer, it uses p to chain the nodes as a linked list.

int
trie_v2_iter_next(struct trie_v2_iter *it)
{
    for (;;) {
        struct trie_v2 *current = it->current;
        int i = it->current->i++;

        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->buf = 0;
                return 1;
            }
            continue;
        }

        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (!current->p)
                return 0;
            it->current = current->p;
            it->buf--;
            continue;
        }

        if (current->next[i]) {
            /* Push on next child node. */
            *it->buf = 'A' + i;
            it->buf++;
            it->current = current->next[i];
            it->current->p = current;
            it->current->i = -1;
        }

    }
}

During traversal the iteration pointers look something like this:

This is not without its downsides:

  1. Traversal is not re-entrant nor thread-safe. It’s not possible to run multiple in-place iterators side by side on the same trie since they’ll clobber each other.

  2. It uses more memory — O(n) rather than O(max-key-length) — and sits on this extra memory for its entire lifetime.

Breadth-first Traversal

The same technique can be used for breadth-first search, which is queue-oriented rather than stack-oriented. The p pointers are instead chained into a queue, with a head and tail pointer variable for each end. As each node is visited, its children are pushed into the queue linked list.

This isn’t good for visiting keys by name. buf was itself a stack and played nicely with depth-first traversal, but there’s no easy way to build up a key in a buffer breadth-first. So instead here’s a function to destroy a trie breadth-first.

void
trie_v2_destroy(struct trie_v2 *t)
{
    struct trie_v2 *head = t;
    struct trie_v2 *tail = t;
    while (head) {
        for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
            struct trie_v2 *next = head->next[i];
            if (next) {
                next->p = NULL;
                tail->p = next;
                tail = next;
            }
        }
        struct trie_v2 *dead = head;
        head = head->p;
        free(dead);
    }
}

During its traversal the p pointers link up like so:

Further Research

In my real code there’s also a flag to indicate the node’s allocation type: static or heap. This allows a trie to be composed of nodes from both kinds of allocations while still safe to destroy. It might also be useful to pack a reference counter into this space so that a node could be shared by more than one trie.

For a production implementation it may be worth packing i into the flags field since it only needs a few bits, even with larger alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the p field could be eliminated and instead one of the child pointers is temporarily reversed. With these changes, this technique would fit into the original struct trie without changes, eliminating the extra memory usage.
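Since i only ranges over -1 to TRIE_ALPHABET_SIZE (six values here), it fits in three bits next to the terminal flag. One possible encoding, storing i biased by +1 — my sketch, not the article’s code:

```c
#define TRIE_TERMINAL_FLAG (1U << 0)
#define TRIE_ITER_SHIFT    1
#define TRIE_ITER_MASK     (7U << TRIE_ITER_SHIFT)

/* Pack the iterator index i (-1 to 4) into bits 1-3 of flags,
 * leaving bit 0 for the terminal flag. */
static unsigned
iter_set(unsigned flags, int i)
{
    return (flags & ~TRIE_ITER_MASK) |
           ((unsigned)(i + 1) << TRIE_ITER_SHIFT);
}

static int
iter_get(unsigned flags)
{
    return (int)((flags & TRIE_ITER_MASK) >> TRIE_ITER_SHIFT) - 1;
}
```

Larger alphabets just widen the mask; even 26 symbols needs only five bits.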

Update: Over on Hacker News, psi-squared has interesting suggestions such as leaving the traversal pointers intact, particularly in the case of a breadth-first search, which, until the next trie modification, allows for concurrent follow-up traversals.

Emacs, Dynamic Modules, and Joysticks

Two months ago Emacs 25 was released and introduced a new dynamic module feature. Emacs can now load shared libraries built against Emacs’ module API, defined in emacs-module.h. What’s interesting about this API is that it doesn’t require linking against Emacs or any sort of library. Instead, at run time Emacs supplies the module’s initialization function with function pointers for the entire API.

As a demonstration, in this article I’ll build an Emacs joystick interface (Linux only) using a dynamic module. It will allow Emacs to read events from any joystick on the system. All the source code is here:

It includes a calibration interface (M-x joydemo) within Emacs:

Currently, Emacs’ emacs-module.h header is the entirety of the module documentation. It’s a bit thin and leaves ambiguities that require some reading of the Emacs source code. Even reading the source, it’s not clear which behaviors are a reliable part of the interface. For example, if there’s a pending non-local exit, it’s safe for a function to return NULL since the return value is never inspected (Emacs 25.1), but will this always be the case? While mistakes are unforgiving (a hard crash), the API is mostly intuitive and it’s been pretty easy to feel my way around it.

Dynamic Module Types

All Emacs values — integers, floats, cons cells, vectors, strings, etc. — are represented as the polymorphic, pointer-valued type, emacs_value. Despite being a pointer, NULL is not a valid value, as convenient as that would be. The API includes functions for creating and extracting the fundamental types: integers, floats, strings. Almost all other object types can only be accessed by making Lisp function calls to regular Emacs functions from the module.

Modules also introduce a brand new Emacs object type: a user pointer. These are non-readable, opaque pointer values returned by modules, typically representing a handle to some resource, be it a memory block, database connection, or a joystick. These objects include a finalizer function pointer — which, surprisingly, is not permitted to be NULL — and their lifetime is managed by Emacs’ garbage collector.

User pointers are a somewhat dangerous feature since there’s little to stop Emacs Lisp code from misusing them. A Lisp program can take a user pointer from one module and pass it to a function in a different module. Since it’s just a pointer, there’s no way to type check it. At best, a module could maintain a table of all its live pointers, checking all user pointer arguments against the table before dereferencing. But I don’t expect this to be normal practice.
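Such a table might look like this — a sketch of what a defensive module could do before trusting an incoming user pointer. All of the names are hypothetical:

```c
#include <stddef.h>

#define LIVE_MAX 64

/* Registry of live pointers handed out by this module. Before
 * dereferencing a user pointer argument, check that it's one of
 * ours rather than a handle from some other module. */
static void *live[LIVE_MAX];

static int
live_register(void *p)
{
    for (size_t i = 0; i < LIVE_MAX; i++) {
        if (!live[i]) {
            live[i] = p;
            return 1;
        }
    }
    return 0;  /* table full */
}

static void
live_unregister(void *p)
{
    for (size_t i = 0; i < LIVE_MAX; i++)
        if (live[i] == p)
            live[i] = NULL;
}

static int
live_contains(void *p)
{
    for (size_t i = 0; i < LIVE_MAX; i++)
        if (live[i] == p)
            return 1;
    return 0;
}
```

The module would call live_register() when creating a user pointer, live_unregister() in the finalizer, and live_contains() at the top of every function that accepts one.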

Module Initialization

After loading the module through the platform’s mechanism, the first thing Emacs does is check for the symbol plugin_is_GPL_compatible. While tacky, this is not surprising given the culture around Emacs.

Next it calls emacs_module_init(), passing it the first function pointer. From this, the module can get a Lisp environment and start doing Emacs things, such as binding module functions to Lisp symbols.

Here’s a complete “Hello, world!” example:

#include "emacs-module.h"

int plugin_is_GPL_compatible;

int
emacs_module_init(struct emacs_runtime *ert)
{
    emacs_env *env = ert->get_environment(ert);
    emacs_value message = env->intern(env, "message");
    const char hi[] = "Hello, world!";
    emacs_value string = env->make_string(env, hi, sizeof(hi) - 1);
    env->funcall(env, message, 1, &string);
    return 0;
}

In a real module, it’s common to create function objects for native functions, then fetch the fset symbol and make a Lisp call on it to bind the newly-created function object to a name. You’ll see this in action later.

Joystick API

The joystick API will closely resemble Linux’s own joystick API, making for a fairly thin wrapper. It’s so thin that Emacs almost doesn’t even need a dynamic module. This is because, on Linux, joysticks are just files under /dev/input/. Want to see the input events on the first joystick? Just read /dev/input/js0. So Plan 9.

Emacs already knows how to read files, but these virtual files are a little too special for that. The header linux/joystick.h defines a struct js_event:

struct js_event {
    uint32_t time;  /* event timestamp in milliseconds */
    int16_t value;
    uint8_t type;
    uint8_t number; /* axis/button number */
};

The idea is to read from the joystick device into this structure. The first several reads are initialization that define the axes and buttons of the joystick and their initial state. Further events are queued up for the file descriptor. This all means that the file can’t just be opened each time joystick input is needed. It has to be held open for the duration, and is typically configured non-blocking.

The Emacs package will be called joymacs and there will be three functions:

(joymacs-open N)
(joymacs-close JOYSTICK)
(joymacs-read JOYSTICK EVENT-VECTOR)

joymacs-open

The joymacs-open function will take an integer, opening the Nth joystick (/dev/input/jsN). It will create a file descriptor for the joystick device, returning it as a user pointer. Think of it as a sort of “joystick handle.” Now, it could instead return the file descriptor as an integer, but the user pointer has two significant benefits:

  1. The resource will be garbage collected. If the caller loses track of a file descriptor returned as an integer, the joystick device will be held open until Emacs shuts down, using up one of Emacs’ file descriptors. By putting it in a user pointer, the garbage collector will have the module release the file descriptor if the user loses track of it.

  2. It should be difficult for the user to make a dangerous call. Emacs Lisp can’t create user pointers — they only come from modules — and so the module is less likely to get passed the wrong thing. In the case of joymacs-close, the module will be calling close(2) on the argument. We definitely don’t want to make that system call on file descriptors owned by Emacs. Further, since user pointers are mutable, the module can ensure it doesn’t call close(2) twice.

Here’s the implementation for joymacs-open. I’ll go over each part in detail.

static emacs_value
joymacs_open(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)ptr;
    (void)n;
    int id = env->extract_integer(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    char buf[64];
    int buflen = sprintf(buf, "/dev/input/js%d", id);
    int fd = open(buf, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        emacs_value signal = env->intern(env, "file-error");
        emacs_value message = env->make_string(env, buf, buflen);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }
    return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);
}

The C function name doesn’t matter to Emacs. It’s static because it doesn’t even matter if the function is visible to Emacs. It will get the function pointer later as part of initialization.

This is the prototype for all functions callable by Emacs Lisp, regardless of its arity. It has four arguments:

  1. It gets an environment, env, through which to call back into Emacs.

  2. It gets n, the number of arguments. This is guaranteed to be the correct number of arguments, as specified later when creating the function object, so only variadic functions need to inspect this argument.

  3. The Lisp arguments are passed as an array of values, args. There’s no type declaration when declaring a function object, so these may be of the wrong type. I’ll go over how to deal with this.

  4. Finally, it gets an arbitrary pointer, supplied at function object creation time. This allows the module to create closures, but will usually be ignored.

The first thing the function does is extract its integer argument. This is actually an intmax_t, but I don’t think anyone has that many USB ports. An int will suffice.

    int id = env->extract_integer(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;

As for not underestimating fools, what if the user passed a value that isn’t an integer? Will the world come crashing down? Fortunately Emacs checks that in extract_integer and, if there’s a mismatch, sets a pending error signal in the environment. This is really great because checking types directly in the module is a real pain in the ass. So, before committing to anything further, such as opening a file, I check for this signal and bail out early if necessary. In Emacs 25.1 it’s safe to return NULL since the return value will be completely ignored, but I’d rather hedge my bets.

By the way, the nil here is a global variable set in initialization. You don’t just get that for free!

The next step is opening the joystick device, read-only and non-blocking. The non-blocking is vital because the module would otherwise hang Emacs later if there are no events (well, except for the read being quickly interrupted by a POSIX signal).

    char buf[64];
    int buflen = sprintf(buf, "/dev/input/js%d", id);
    int fd = open(buf, O_RDONLY | O_NONBLOCK);

If the joystick fails to open (e.g. it doesn’t exist, or the user lacks permission), manually set an error signal for a non-local exit. I chose the file-error signal and I’m just using the filename as the signal data.

    if (fd == -1) {
        emacs_value signal = env->intern(env, "file-error");
        emacs_value message = env->make_string(env, buf, buflen);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }

Otherwise create the user pointer. No need to allocate any memory; just stuff it in the pointer itself. If the user mistakenly passes it to another module, that module will be in for a surprise when it tries to dereference it.

    return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);

The fin_close() function is defined as:

static void
fin_close(void *fdptr)
{
    int fd = (intptr_t)fdptr;
    if (fd != -1)
        close(fd);
}

The garbage collector will call this function when the user pointer is lost. If the user closes it early with joymacs-close, that function will set the user pointer to -1, an invalid file descriptor, so that it doesn’t get closed a second time here.

joymacs-close

Here’s joymacs-close, which is a bit simpler.

static emacs_value
joymacs_close(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)ptr;
    (void)n;
    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    if (fd != -1) {
        close(fd);
        env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
    }
    return nil;
}

Again, it starts by extracting its argument, relying on Emacs to do the check:

    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;

If the user pointer hasn’t been closed yet, close the file descriptor and replace it with -1 in the user pointer to prevent further closes.

    if (fd != -1) {
        close(fd);
        env->set_user_ptr(env, args[0], (void *)(intptr_t)-1);
    }

joymacs-read

The joymacs-read function is doing something a little unusual for an Emacs Lisp function. It takes two arguments: the joystick handle and a 5-element vector. Instead of returning the event in some representation, it fills the vector with the event details. There are two reasons for this:

  1. The API has no function for creating vectors … though the module could intern the make-vector symbol and call it to create one.

  2. The idiom for event pumps is for the caller to supply a buffer to the pump. This has better performance by avoiding lots of unnecessary allocations, especially since events tend to be message-like objects with a short, well-defined extent.

Here’s the full definition:

static emacs_value
joymacs_read(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)n;
    (void)ptr;
    int fd = (intptr_t)env->get_user_ptr(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    struct js_event e;
    int r = read(fd, &e, sizeof(e));
    if (r == -1 && errno == EAGAIN) {
        /* No more events. */
        return nil;
    } else if (r == -1) {
        /* An actual read error (joystick unplugged, etc.). */
        emacs_value signal = env->intern(env, "file-error");
        const char *error = strerror(errno);
        size_t len = strlen(error);
        emacs_value message = env->make_string(env, error, len);
        env->non_local_exit_signal(env, signal, message);
        return nil;
    } else {
        /* Fill out event vector. */
        emacs_value v = args[1];
        emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
        emacs_value value;
        if (type == button)
            value = e.value ? t : nil;
        else
        value = env->make_float(env, e.value / (double)INT16_MAX);
        env->vec_set(env, v, 0, env->make_integer(env, e.time));
        env->vec_set(env, v, 1, type);
        env->vec_set(env, v, 2, value);
        env->vec_set(env, v, 3, env->make_integer(env, e.number));
        env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
        return args[1];
    }
}

As before, extract the first argument and check for a signal. Then call read(2) to get an event. If the read fails with EAGAIN, it’s not a real failure. There are just no more events, so return nil.

    struct js_event e;
    int r = read(fd, &e, sizeof(e));
    if (r == -1 && errno == EAGAIN) {
        /* No more events. */
        return nil;
    }

If the read failed with something else — perhaps the joystick was unplugged — signal an error. The strerror(3) string is used for the signal data.

    if (r == -1) {
        /* An actual read error (joystick unplugged, etc.). */
        emacs_value signal = env->intern(env, "file-error");
        const char *error = strerror(errno);
        emacs_value message = env->make_string(env, error, strlen(error));
        env->non_local_exit_signal(env, signal, message);
        return nil;
    }

Otherwise fill out the event vector. If the second argument isn’t a vector, or if it’s too short, the signal will automatically get raised by Emacs. The module can keep plowing through the vec_set() calls safely since it’s not committing to anything.

        /* Fill out event vector. */
        emacs_value v = args[1];
        emacs_value type = e.type & JS_EVENT_BUTTON ? button : axis;
        emacs_value value;
        if (type == button)
            value = e.value ? t : nil;
        else
            value = env->make_float(env, e.value / (double)INT16_MAX);
        env->vec_set(env, v, 0, env->make_integer(env, e.time));
        env->vec_set(env, v, 1, type);
        env->vec_set(env, v, 2, value);
        env->vec_set(env, v, 3, env->make_integer(env, e.number));
        env->vec_set(env, v, 4, e.type & JS_EVENT_INIT ? t : nil);
        return args[1];

The Linux event struct has four fields and the function fills out five values of the vector. This is because the type field has a bit flag indicating initialization events. This is split out into an extra t/nil value. It also normalizes axis values and converts button values into t/nil, which makes more sense for Emacs Lisp. The event itself is returned since it’s a truthy value and it’s convenient for the caller.

The astute programmer might notice that the negative side of the axis could go just below -1.0, since INT16_MIN has one extra value over INT16_MAX (two’s complement). It doesn’t seem to be documented, but the joystick drivers I’ve seen never exactly return INT16_MIN, so this is in fact the correct way to normalize it.

Initialization

All that’s left is the initialization function. First declare some global variables to keep track of frequently-used symbols.

static emacs_value nil;
static emacs_value t;
static emacs_value button;
static emacs_value axis;

These are interned at the very beginning of initialization. The symbols :button and :axis are given global references so that the garbage collector doesn’t rip them out from under the module. It’s unclear from the API, but the make_global_ref() function returns the object being referenced. I trust that the t and nil symbols will never be garbage collected, so these don’t need global references.

    nil = env->intern(env, "nil");
    t = env->intern(env, "t");
    button = env->make_global_ref(env, env->intern(env, ":button"));
    axis = env->make_global_ref(env, env->intern(env, ":axis"));

    emacs_value fset = env->intern(env, "fset");

It also grabs fset locally since it will soon be needed.

Finally, bind the functions. The second and third arguments to make_function are the minimum and maximum number of arguments, which may look familiar. The last argument is that closure pointer I mentioned at the beginning.

    emacs_value args[2];
    args[0] = env->intern(env, "joymacs-open");
    args[1] = env->make_function(env, 1, 1, joymacs_open, doc, 0);
    env->funcall(env, fset, 2, args);

If the module is to be loaded with require like any other package, it needs to provide: (provide 'joymacs).

    emacs_value provide = env->intern(env, "provide");
    emacs_value joymacs = env->intern(env, "joymacs");
    env->funcall(env, provide, 1, &joymacs);

And that’s it!

The source repository now includes a port to Windows (XInput). If you’re on Linux or Windows, have Emacs 25 with modules enabled, and a joystick is plugged in, then make run in the repository should bring up Emacs running a joystick calibration demonstration. The module can’t poke at Emacs when events are ready, so instead there’s a timer that polls the module for events.

I’d like to someday see an Emacs Lisp game well-suited for a joystick.

An Array of Pointers vs. a Multidimensional Array

In a C program, suppose I have a table of color names of similar length. There are two straightforward ways to construct this table. The most common would be an array of char *.

char *colors_ptr[] = {
    "red",
    "orange",
    "yellow",
    "green",
    "blue",
    "violet"
};

The other is a two-dimensional char array.

char colors_2d[][7] = {
    "red",
    "orange",
    "yellow",
    "green",
    "blue",
    "violet"
};

The initializers are identical, and the syntax by which these tables are used is the same, but the underlying data structures are very different. For example, suppose I had a lookup() function that searches the table for a particular color.

int
lookup(const char *color)
{
    int ncolors = sizeof(colors) / sizeof(colors[0]);
    for (int i = 0; i < ncolors; i++)
        if (strcmp(colors[i], color) == 0)
            return i;
    return -1;
}

Thanks to array decay — array arguments are implicitly converted to pointers (§6.9.1-10) — it doesn’t matter if the table is char colors[][7] or char *colors[]. It’s a little bit misleading because the compiler generates different code depending on the type.

Memory Layout

Here’s what colors_ptr, a jagged array, typically looks like in memory.

The array of six pointers will point into the program’s string table, usually stored in a separate page. The strings aren’t in any particular order and will be interspersed with the program’s other string constants. The type of the expression colors_ptr[n] is char *.

On x86-64, suppose the base of the table is in rax, the index of the string I want to retrieve is rcx, and I want to put the string’s address back into rax. It’s one load instruction.

mov   rax, [rax + rcx*8]

Contrast this with colors_2d: six 7-byte elements in a row. No pointers or addresses. Only strings.

The strings are in their defined order, packed together. The type of the expression colors_2d[n] is char [7], an array rather than a pointer. If this was a large table used by a hot function, it would have friendlier cache characteristics — both in locality and predictability.

In the same x86-64 scenario as before, it takes two instructions to put the string’s address in rax, but neither is a load.

imul  rcx, rcx, 7
add   rax, rcx

In this particular case, the generated code can be slightly improved by increasing the string size to 8 (e.g. char colors_2d[][8]). The multiply turns into a simple shift and the ALU no longer needs to be involved, cutting it to one instruction. This looks like a load due to the LEA (Load Effective Address), but it’s not.

lea   rax, [rax + rcx*8]

Relocation

There’s another factor to consider: relocation. Nearly every process running on a modern system takes advantage of a security feature called Address Space Layout Randomization (ASLR). The virtual address of code and data is randomized at process load time. For shared libraries, it’s not just a security feature, it’s essential to their basic operation. Libraries cannot possibly coordinate their preferred load addresses with every other library on the system, and so must be relocatable.

If the program is compiled with GCC or Clang configured for position independent code — -fPIC (for libraries) or -fpie + -pie (for programs) — extra work has to be done to support colors_ptr. Those are all addresses in the pointer array, but the compiler doesn’t know what those addresses will be. The compiler fills the elements with temporary values and adds six relocation entries to the binary, one for each element. The loader will fill out the array at load time.

However, colors_2d doesn’t have any addresses other than the address of the table itself. The loader doesn’t need to be involved with each of its elements. Score another point for the two-dimensional array.

On x86-64, in both cases the table itself typically doesn’t need a relocation entry because it will be RIP-relative (in the small code model). That is, code that uses the table will be at a fixed offset from the table no matter where the program is loaded. It won’t need to be looked up using the Global Offset Table (GOT).

In case you’re ever reading compiler output, in Intel syntax the assembly for putting the table’s RIP-relative address in rax looks like so:

;; NASM:
lea    rax, [rel address]
;; Some others:
lea    rax, [rip + address]

Or in AT&T syntax:

lea    address(%rip), %rax

Virtual Memory

Besides (trivially) more work for the loader, there’s another consequence to relocations: Pages containing relocations are not shared between processes (except after fork()). When loading a program, the loader doesn’t copy programs and libraries to memory so much as it memory maps their binaries with copy-on-write semantics. If another process is running with the same binaries loaded (e.g. libc.so), they’ll share the same physical memory so long as those pages haven’t been modified by either process. Modifying the page creates a unique copy for that process.

Relocations modify parts of the loaded binary, so these pages aren’t shared. This means colors_2d has the possibility of being shared between processes, but colors_ptr (and its entire page) definitely does not. Shucks.

This is one of the reasons why the Procedure Linkage Table (PLT) exists. The PLT is an array of function stubs for shared library functions, such as those in the C standard library. Sure, the loader could go through the program and fill out the address of every library function call, but this would modify lots and lots of code pages, creating a unique copy of large parts of the program. Instead, the dynamic linker lazily supplies jump addresses for PLT function stubs, one per accessed library function.

However, as I’ve written it above, it’s unlikely that even colors_2d will be shared. It’s still missing an important ingredient: const.

Const

They say const isn’t for optimization but, darnit, this situation keeps coming up. Since colors_ptr and colors_2d are both global, writable arrays, the compiler puts them in the same writable data section of the program, and, in my test program, they end up right next to each other in the same page. The other relocations doom colors_2d to being a local copy.

Fortunately it’s trivial to fix by adding a const:

const char colors_2d[][7] = { /* ... */ };

Writing to this memory is now undefined behavior, so the compiler is free to put it in read-only memory (.rodata) and separate from the dirty relocations. On my system, this is close enough to the code to wind up in executable memory.

Note, the equivalent for colors_ptr requires two const qualifiers, one for the array and another for the strings. (Obviously the const doesn’t apply to the loader.)

const char *const colors_ptr[] = { /* ... */ };

String literals are already effectively const, though the C specification (unlike C++) doesn’t actually define them to be this way. But, like setting your relationship status on Facebook, declaring it makes it official.

It’s just micro-optimization

These little details are all deep down the path of micro-optimization and will rarely ever matter in practice, but perhaps you learned something broader from all this. This stuff fascinates me.

Small-Size Optimization in C

I’ve worked on many programs that frequently require small, short-lived buffers for use as a temporary workspace, perhaps to construct a string or array. In C this is often accomplished with arrays of automatic storage duration (i.e. allocated on the stack). This is dirt cheap, much cheaper than a heap allocation, and, unlike a typical general-purpose allocator, involves no thread contention. However, the catch is that there may be no hard bound on the buffer size. For correctness, the scratch space must scale appropriately to match its input, so whatever arbitrary buffer size I pick may be too small.

A widespread extension to C is the alloca() pseudo-function. It’s like malloc(), but allocates memory on the stack, just like an automatic variable. The allocation is automatically freed when the function (not its scope!) exits, even with a longjmp() or other non-local exit.

void *alloca(size_t size);

Besides its portability issues, the most dangerous property is the complete lack of error detection. If size is too large, the program simply crashes, or worse.

For example, suppose I have an intern() function that finds or creates the canonical representation/storage for a particular string. My program needs to intern a string composed of multiple values, and will construct a temporary string to do so.

const char *intern(const char *);

const char *
intern_identifier(const char *prefix, long id)
{
    size_t size = strlen(prefix) + 32;
    char *buffer = alloca(size);
    sprintf(buffer, "%s%ld", prefix, id);
    return intern(buffer);
}

I expect the vast majority of these prefix strings to be very small, perhaps on the order of 10 to 80 bytes, and this function will handle them extremely efficiently. But should this function get passed a huge prefix, perhaps by a malicious actor, the program will misbehave without warning.

A portable alternative to alloca() is variable-length arrays (VLA), introduced in C99. Arrays with automatic storage duration need not have a fixed, compile-time size. It’s just like alloca(), having exactly the same dangers, but at least it’s properly scoped. It was rejected for inclusion in C++11 due to this danger.

const char *
intern_identifier(const char *prefix, long id)
{
    char buffer[strlen(prefix) + 32];
    sprintf(buffer, "%s%ld", prefix, id);
    return intern(buffer);
}

There’s a middle-ground to this, using neither VLAs nor alloca(). Suppose the function always allocates a small, fixed size buffer — essentially a free operation — but only uses this buffer if it’s large enough for the job. If it’s not, a normal heap allocation is made with malloc().

const char *
intern_identifier(const char *prefix, long id)
{
    char temp[256];
    char *buffer = temp;
    size_t size = strlen(prefix) + 32;
    if (size > sizeof(temp))
        if (!(buffer = malloc(size)))
            return NULL;
    sprintf(buffer, "%s%ld", prefix, id);
    const char *result = intern(buffer);
    if (buffer != temp)
        free(buffer);
    return result;
}

Since the function can now detect allocation errors, this version has an error condition. Though, intern() itself would presumably return NULL for its own allocation errors, so this is probably transparent to the caller.

We’ve now entered the realm of small-size optimization. The vast majority of cases are small and will therefore be very fast, but we haven’t given up on the odd large case either. In fact, the large case has been made a little bit worse (it pays for an extra size test and the unused stack buffer), selling it out to make the common case fast. That’s sound engineering.

Visual Studio has a pair of functions that nearly automate this solution: _malloca() and _freea(). It’s like alloca(), but allocations beyond a certain threshold go on the heap. This allocation is freed with _freea(), which does nothing in the case of a stack allocation.

void *_malloca(size_t);
void _freea(void *);

I said “nearly” because Microsoft screwed it up: instead of returning NULL on failure, it generates a stack overflow structured exception (for a heap allocation failure).

I haven’t tried it yet, but I bet something similar to _malloca() / _freea() could be implemented using a couple of macros.

Toward Structured Small-Size Optimization

CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the talks. I don’t like developing in C++, but I always learn new, interesting concepts from this conference, many of which apply directly to C. I look forward to Chandler Carruth’s talks the most, having learned so much from his past talks, all of which I recommend.

After writing this article, I saw Nicholas Ormrod’s talk, The strange details of std::string at Facebook, which is also highly relevant.

Chandler’s talk this year was the one on hybrid data structures. I’d already been mulling over small-size optimization for months, and the first 5–10 minutes of his talk showed me I was on the right track. In his talk he describes LLVM’s SmallVector class (among others), which is basically a small-size-optimized version of std::vector, which, due to constraints on iterators under std::move() semantics, can’t itself be small-size optimized.

I picked up a new trick from this talk, which I’ll explain in C’s terms. Suppose I have a dynamically growing buffer “vector” of long values. I can keep pushing values into the buffer, doubling the storage in size each time it fills. I’ll call this one “simple.”

struct vec_simple {
    size_t size;
    size_t count;
    long *values;
};

Initialization is obvious. Though for easy overflow checks, and for another reason I’ll explain later, I’m going to require the starting size, hint, to be a power of two. It returns 1 on success and 0 on error.

int
vec_simple_init(struct vec_simple *v, size_t hint)
{
    assert(hint && (hint & (hint - 1)) == 0);  // power of 2
    v->size = hint;
    v->count = 0;
    v->values = malloc(sizeof(v->values[0]) * v->size);
    return !!v->values;
}

Pushing is straightforward, using realloc() when the buffer fills, returning 0 for integer overflow or allocation failure.

int
vec_simple_push(struct vec_simple *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values = realloc(v->values, new_size * value_size);
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

And finally, cleaning up. I hadn’t thought about this before, but if the compiler manages to inline vec_simple_free(), that NULL pointer assignment will probably get optimized out, possibly even in the face of a use-after-free bug.

void
vec_simple_free(struct vec_simple *v)
{
    free(v->values);
    v->values = 0;  // trap use-after-free bugs
}

And here’s an example of its use (without checking for errors).

long
example(long (*f)(void *), void *arg)
{
    struct vec_simple v;
    vec_simple_init(&v, 16);
    long n;
    while ((n = f(arg)) > 0)
        vec_simple_push(&v, n);
    long result = 0;
    // ... process vector, computing result ...
    vec_simple_free(&v);
    return result;
}

If the common case is only a handful of long values, and this function is called frequently, we’re doing a lot of heap allocation that could be avoided. Wouldn’t it be nice to put all that on the stack?

Applying Small-Size Optimization

Modify the struct to add this temp field. It’s probably obvious what I’m getting at here. This is essentially the technique in SmallVector.

struct vec_small {
    size_t size;
    size_t count;
    long *values;
    long temp[16];
};

The values field is initially pointed at the small buffer. Notice that unlike the “simple” vector above, this initialization function cannot fail. It’s one less thing for the caller to check. It also doesn’t take a hint since the buffer size is fixed.

void
vec_small_init(struct vec_small *v)
{
    v->size = sizeof(v->temp) / sizeof(v->temp[0]);
    v->count = 0;
    v->values = v->temp;
}

Pushing gets a little more complicated. If it’s the first time the buffer has grown, the realloc() has to be done “manually” with malloc() and memcpy().

int
vec_small_push(struct vec_small *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow

        void  *new_values;
        if (v->temp == v->values) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->temp, sizeof(v->temp));
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }

        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

Finally, only call free() if the buffer was actually allocated on the heap.

void
vec_small_free(struct vec_small *v)
{
    if (v->values != v->temp)
        free(v->values);
    v->values = 0;
}

If 99% of these vectors never exceed 16 elements, then 99% of the time the heap isn’t touched. That’s much better than before. The 1% case is still covered, too, at what is probably an insignificant cost.

An important difference from SmallVector is that its small buffer size is parameterized through the template. In C we’re stuck with fixed sizes or macro hacks. Or are we?

Using a Caller-Provided Buffer

This time remove the temporary buffer, making it look like the simple vector from before.

struct vec_flex {
    size_t size;
    size_t count;
    long *values;
};

The user will provide the initial buffer, which will presumably be an adjacent, stack-allocated array, but whose size is under the user’s control.

void
vec_flex_init(struct vec_flex *v, long *init, size_t nmemb)
{
    assert(nmemb > 1); // we need that low bit!
    assert(nmemb && (nmemb & (nmemb - 1)) == 0); // power of 2
    v->size = nmemb | 1;
    v->count = 0;
    v->values = init;
}

The power-of-two size, greater than one, means the size will always be an even number. Why is this important? There’s one piece of information missing from the struct: Is the buffer currently heap allocated or not? That’s just one bit of information, but adding even one more bit to the struct would typically pad it out by another 31 or 63 bits. What a waste! Since the size is always even, its lowest bit is unused, so I can smuggle the flag in there. Hence the nmemb | 1, the 1 indicating that the buffer is not heap allocated.

When pushing, the actual size is extracted by clearing the bottom bit (size & ~1), and the indicator is read with a one-bit mask (size & 1). After the first heap allocation, the flag is cleared simply by never setting it again.

int
vec_flex_push(struct vec_flex *v, long x)
{
    size_t actual_size = v->size & ~(size_t)1; // clear bottom bit
    if (v->count == actual_size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = actual_size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; /* overflow */

        void *new_values;
        if (v->size & 1) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->values, actual_size * value_size);
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }

        if (!new_values)
            return 0; /* out of memory */
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

Only free() when it’s been allocated, like before.

void
vec_flex_free(struct vec_flex *v)
{
    if (!(v->size & 1))
        free(v->values);
    v->values = 0;
}

And here’s what it looks like in action.

long
example(long (*f)(void *), void *arg)
{
    struct vec_flex v;
    long buffer[16];
    vec_flex_init(&v, buffer, sizeof(buffer) / sizeof(buffer[0]));
    long n;
    while ((n = f(arg)) > 0)
        vec_flex_push(&v, n);
    long result = 0;
    // ... process vector, computing result ...
    vec_flex_free(&v);
    return result;
}

If you were to log all final vector sizes as part of profiling, and the assumption about their typically small number of elements held, you could easily tune the buffer size in each case to eliminate the vast majority of vector heap allocations.

Now that I’ve learned this optimization trick, I’ll be looking out for good places to apply it. It’s also a good reason for me to stop abusing VLAs.

null program

Chris Wellons
