Portable Structure Access with Member Offset Constants

Suppose you need to write a C program to access a long sequence of structures from a binary file in a specified format. These structures have different lengths and contents, but also a common header identifying its type and size. Here’s the definition of that header (no padding):

struct event {
    uint64_t time;   // unix epoch (microseconds)
    uint32_t size;   // including this header (bytes)
    uint16_t source;
    uint16_t type;

The size member is used to find the offset of the next structure in the file without knowing anything else about the current structure. Just add size to the offset of the current structure.

The type member indicates what kind of data follows this structure. The program is likely to switch on this value.

The actual structures might look something like this (in the spirit of X-COM). Note how each structure begins with struct event as header. All angles are expressed using binary scaling.

#define EVENT_TYPE_OBSERVER            10
#define EVENT_TYPE_UFO_SIGHTING        20

struct observer {
    struct event event;
    uint32_t latitude;   // binary scaled angle
    uint32_t longitude;  //
    uint16_t source_id;  // later used for event source
    uint16_t name_size;  // not including null terminator
    char name[];

struct ufo_sighting {
    struct event event;
    uint32_t azimuth;    // binary scaled angle
    uint32_t elevation;  //

struct suspicious_signal {
    struct event event;
    uint16_t num_channels;
    uint16_t sample_rate;  // Hz
    uint32_t num_samples;  // per channel
    int16_t samples[];

If all integers are stored in little endian byte order (least significant byte first), there’s a strong temptation to lay the structures directly over the data. After all, this will work correctly on most computers.

struct event header;
fread(buffer, sizeof(header), 1, file);
switch (header.type) {
    // ...

This code will not work correctly when:

  1. The host machine doesn’t use little endian byte order, though this is now uncommon. Sometimes developers will attempt to detect the byte order at compile time and use the preprocessor to byte-swap if needed. This is a mistake.

  2. The host machine has different alignment requirements and so introduces additional padding to the structure. Sometimes this can be resolved with a non-standard #pragma pack.

Integer extraction functions

Fortunately it’s easy to write fast, correct, portable code for this situation. First, define some functions to extract little endian integers from an octet buffer (uint8_t). These will work correctly regardless of the host’s alignment and byte order.

static inline uint16_t
extract_u16le(const uint8_t *buf)
    return (uint16_t)buf[1] << 8 |
           (uint16_t)buf[0] << 0;

static inline uint32_t
extract_u32le(const uint8_t *buf)
    return (uint32_t)buf[3] << 24 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[0] <<  0;

static inline uint64_t
extract_u64le(const uint8_t *buf)
    return (uint64_t)buf[7] << 56 |
           (uint64_t)buf[6] << 48 |
           (uint64_t)buf[5] << 40 |
           (uint64_t)buf[4] << 32 |
           (uint64_t)buf[3] << 24 |
           (uint64_t)buf[2] << 16 |
           (uint64_t)buf[1] <<  8 |
           (uint64_t)buf[0] <<  0;

The big endian version is identical, but with shifts in reverse order.

A common concern is that these functions are a lot less efficient than they could be. On x86 where alignment is very relaxed, each could be implemented as a single load instruction. However, on GCC 4.x and earlier, extract_u32le compiles to something like this:

        movzx   eax, [rdi+3]
        sal     eax, 24
        mov     edx, eax
        movzx   eax, [rdi+2]
        sal     eax, 16
        or      eax, edx
        movzx   edx, [rdi]
        or      eax, edx
        movzx   edx, [rdi+1]
        sal     edx, 8
        or      eax, edx

It’s tempting to fix the problem with the following definition:

// Note: Don't do this.
static inline uint32_t
extract_u32le(const uint8_t *buf)
    return *(uint32_t *)buf;

It’s unportable, it’s undefined behavior, and worst of all, it might not work correctly even on x86. Fortunately I have some great news. On GCC 5.x and above, the correct definition compiles to the desired, fast version. It’s the best of both worlds.

        mov     eax, [rdi]

It’s even smart about the big endian version:

static inline uint32_t
extract_u32be(const uint8_t *buf)
    return (uint32_t)buf[0] << 24 |
           (uint32_t)buf[1] << 16 |
           (uint32_t)buf[2] <<  8 |
           (uint32_t)buf[3] <<  0;

Is compiled to exactly what you’d want:

        mov     eax, [rdi]
        bswap   eax

Or, even better, if your system supports movbe (gcc -mmovbe):

        movbe   eax, [rdi]

Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m betting it will eventually learn how to do this, too.

Member offset constants

For this next technique, that struct event from above need not actually be in the source. It’s purely documentation. Instead, let’s define the structure in terms of member offset constants — a term I just made up for this article. I’ve included the integer types as part of the name to aid in their correct use.

#define EVENT_U64LE_TIME    0
#define EVENT_U32LE_SIZE    8
#define EVENT_U16LE_SOURCE  12
#define EVENT_U16LE_TYPE    14

Given a buffer, the integer extraction functions, and these offsets, structure members can be plucked out on demand.

uint8_t *buf;
// ...
uint64_t time   = extract_u64le(buf + EVENT_U64LE_TIME);
uint32_t size   = extract_u32le(buf + EVENT_U32LE_SIZE;
uint16_t source = extract_u16le(buf + EVENT_U16LE_SOURCE);
uint16_t type   = extract_u16le(buf + EVENT_U16LE_TYPE);

On x86 with GCC 5.x, each member access will be inlined and compiled to a one-instruction extraction. As far as performance is concerned, it’s identical to using a structure overlay, but this time the C code is clean and portable. A slight downside is the lack of type checking on member access: it’s easy to mismatch the types and accidentally read garbage.

Memory mapping and iteration

There’s a real advantage to memory mapping the input file and using its contents directly. On a system with a huge virtual address space, such as x86-64 or AArch64, this memory is almost “free.” Already being backed by a file, paging out this memory costs nothing (i.e. it’s discarded). The input file can comfortably be much larger than physical memory without straining the system.

Unportable structure overlay can take advantage of memory mapping this way, but has the previously-described issues. An approach with member offset constants will take advantage of it just as well, all while remaining clean and portable.

I like to wrap the memory mapping code into a simple interface, which makes porting to non-POSIX platforms, such Windows, easier. Caveat: This won’t work with files whose size exceeds the available contiguous virtual memory of the system — a real problem for 32-bit systems.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

uint8_t *
map_file(const char *path, size_t *length)
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return 0;

    struct stat stat;
    if (fstat(fd, &stat) == -1) {
        return 0;

    *length = stat.st_size;  // TODO: possible overflow
    uint8_t *p = mmap(0, *length, PROT_READ, MAP_PRIVATE, fd, 0);
    return p != MAP_FAILED ? p : 0;

unmap_file(uint8_t *p, size_t length)
    munmap(p, length);

Next, here’s an example that iterates over all the structures in input_file, in this case counting each. The size member is extracted in order to stride to the next structure.

size_t length;
uint8_t *data = map_file(input_file, &length);
if (!data)

size_t event_count = 0;
uint8_t *p = data;
while (p < data + length) {
    uint32_t size = extract_u32le(p + EVENT_U32LE_SIZE);
    if (size > length - (p - data))
        FATAL();  // invalid size
    p += size;
printf("I see %zu events.\n", event_count);

unmap_file(data, length);

This is the basic structure for navigating this kind of data. A deeper dive would involve a switch inside the loop, extracting the relevant members for whatever use is needed.

Fast, correct, simple. Pick three.

blog comments powered by Disqus

null program

Chris Wellons