Chunking Optimizations: Let the Knife Do the Work

December 09, 2019

nullprogram.com/blog/2019/12/09/

There’s an old saying, let the knife do the work. Whether preparing food in the kitchen or whittling a piece of wood, don’t push your weight into the knife. Not only is it tiring, you’re much more likely to hurt yourself. Use the tool properly and little force will be required.

The same advice also often applies to compilers.

Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of data. The simplest approach would be to do it a byte at a time:

/* XOR src into dst */
void
xor512a(void *dst, void *src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}

Maybe you benchmark it or you look at the assembly output, and the results are disappointing. Your compiler did exactly what you asked of it and produced code that performs 64 single-byte XOR operations (GCC 9.2.0, x86-64, -Os):

xor512a:
        xor    eax, eax
.L0:    mov    cl, [rsi+rax]
        xor    [rdi+rax], cl
        inc    rax
        cmp    rax, 64
        jne    .L0
        ret

The target architecture has wide registers so it could be doing at least 8 bytes at a time. Since your compiler isn’t doing it, you decide to chunk the work into 8 byte blocks yourself in an attempt to manually implement a chunking operation. Here’s some real world code that does so:

/* WARNING: Broken, do not use! */
void
xor512b(void *dst, void *src)
{
    uint64_t *pd = dst;
    uint64_t *ps = src;
    for (int i = 0; i < 8; i++) {
        pd[i] ^= ps[i];
    }
}

You check the assembly output of this function, and it looks much better. It’s now processing 8 bytes at a time, so it should be about 8 times faster than before.

xor512b:
        xor    eax, eax
.L0:    mov    rcx, [rsi+rax*8]
        xor    [rdi+rax*8], rcx
        inc    rax
        cmp    rax, 8
        jne    .L0
        ret

Still, this machine has 16-byte wide registers (SSE2 xmm), so there could be another doubling in speed. Oh well, this is good enough, so you plug it into your program. But something strange happens: The output is now wrong!

int
main(void)
{
    uint32_t dst[32] = {
        1, 2, 3, 4, 5, 6, 7, 8,
        9, 10, 11, 12, 13, 14, 15, 16
    };
    uint32_t src[32] = {
        1, 4, 9, 16, 25, 36, 49, 64,
        81, 100, 121, 144, 169, 196, 225, 256,
    };
    xor512b(dst, src);
    for (int i = 0; i < 16; i++) {
        printf("%d\n", (int)dst[i]);
    }
}

Your program prints 1..16 as if xor512b() was never called. You check over everything a dozen times, and you can’t find anything wrong. Even crazier, if you disable optimizations then the bug goes away. It must be some kind of compiler bug!

Investigating a bit more, you learn that the -fno-strict-aliasing option also fixes the bug. That’s because this program violates C strict aliasing rules. An array of uint32_t was accessed as a uint64_t. As an important optimization, compilers are allowed to assume such variables do not alias and generate code accordingly. Otherwise every memory store could potentially modify any variable, which limits the compiler’s ability to produce decent code.

The original version is fine because char *, including both signed and unsigned, has a special exemption and may alias with anything. For the same reason, using char * unnecessarily can also make your programs slower.

What could you do to keep the chunking operation while not running afoul of strict aliasing? Counter-intuitively, you could use memcpy(). Copy the chunks into legitimate, local uint64_t variables, do the work, and copy the result back out.

void
xor512c(void *dst, void *src)
{
    for (int i = 0; i < 8; i++) {
        uint64_t buf[2];
        memcpy(buf + 0, (char *)dst + i*8, 8);
        memcpy(buf + 1, (char *)src + i*8, 8);
        buf[0] ^= buf[1];
        memcpy((char *)dst + i*8, buf, 8);
    }
}

Since memcpy() is a built-in function, your compiler knows its semantics and can ultimately elide all that copying. The assembly listing for xor512c is identical to xor512b, but it won’t go haywire when integrated into a real program.

It works and it’s correct, but you can still do much better than this!

Letting your compiler do the work

The problem is you’re forcing the knife and not letting it do the work. There’s a constraint on your compiler that hasn’t been considered: It must work correctly for overlapping inputs.

char buf[74] = {...};
xor512a(buf, buf + 10);

In this situation, the byte-by-byte and chunked versions of the function will have different results. That’s exactly why your compiler can’t do the chunking operation itself. However, you don’t care about this situation because the inputs never overlap.

Let’s revisit the first, simple implementation, but this time being smarter about it. The restrict keyword indicates that the inputs will not overlap, freeing your compiler of this unwanted concern.

void
xor512d(void *restrict dst, void *restrict src)
{
    unsigned char *pd = dst;
    unsigned char *ps = src;
    for (int i = 0; i < 64; i++) {
        pd[i] ^= ps[i];
    }
}

(Side note: Adding restrict to the manually chunked function, xor512b(), will not fix it. Using restrict can never make an incorrect program correct.)

Compiled with GCC 9.2.0 and -O3, the resulting unrolled code processes 16-byte chunks at a time (pxor):

xor512d:
        movdqu  xmm0, [rdi+0x00]
        movdqu  xmm1, [rsi+0x00]
        movdqu  xmm2, [rsi+0x10]
        movdqu  xmm3, [rsi+0x20]
        pxor    xmm0, xmm1
        movdqu  xmm4, [rdi+0x30]
        movups  [rdi+0x00], xmm0
        movdqu  xmm0, [rdi+0x10]
        pxor    xmm0, xmm2
        movups  [rdi+0x10], xmm0
        movdqu  xmm0, [rdi+0x20]
        pxor    xmm0, xmm3
        movups  [rdi+0x20], xmm0
        movdqu  xmm0, [rsi+0x30]
        pxor    xmm0, xmm4
        movups  [rdi+0x30], xmm0
        ret

Compiled with Clang 9.0.0 with AVX-512 enabled in the target (-mavx512bw), it does the entire operation in a single, big chunk!

xor512d:
        vmovdqu64   zmm0, [rdi]
        vpxorq      zmm0, zmm0, [rsi]
        vmovdqu64   [rdi], zmm0
        vzeroupper
        ret

“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.

Chunking Optimizations: Let the Knife Do the Work

Letting your compiler do the work

null program

Chris Wellons