## Chunking Optimizations: Let the Knife Do the Work

There’s an old saying, let the knife do the work. Whether preparing food in the kitchen or whittling a piece of wood, don’t push your weight into the knife. Not only is it tiring, you’re much more likely to hurt yourself. Use the tool properly and little force will be required.

The same advice also often applies to compilers.

Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of data. The simplest approach would be to do it a byte at a time:

``````/* XOR src into dst */
void
xor512a(void *dst, void *src)
{
unsigned char *pd = dst;
unsigned char *ps = src;
for (int i = 0; i < 64; i++) {
pd[i] ^= ps[i];
}
}
``````

Maybe you benchmark it or you look at the assembly output, and the results are disappointing. Your compiler did exactly what you asked of it and produced code that performs 64 single-byte XOR operations (GCC 9.2.0, x86-64, `-Os`):

``````xor512a:
xor    eax, eax
.L0:    mov    cl, [rsi+rax]
xor    [rdi+rax], cl
inc    rax
cmp    rax, 64
jne    .L0
ret
``````

The target architecture has wide registers so it could be doing at least 8 bytes at a time. Since your compiler isn’t doing it, you decide to chunk the work into 8 byte blocks yourself in an attempt to manually implement a chunking operation. Here’s some real world code that does so:

``````/* WARNING: Broken, do not use! */
void
xor512b(void *dst, void *src)
{
uint64_t *pd = dst;
uint64_t *ps = src;
for (int i = 0; i < 8; i++) {
pd[i] ^= ps[i];
}
}
``````

You check the assembly output of this function, and it looks much better. It’s now processing 8 bytes at a time, so it should be about 8 times faster than before.

``````xor512b:
xor    eax, eax
.L0:    mov    rcx, [rsi+rax*8]
xor    [rdi+rax*8], rcx
inc    rax
cmp    rax, 8
jne    .L0
ret
``````

Still, this machine has 16-byte wide registers (SSE2 `xmm`), so there could be another doubling in speed. Oh well, this is good enough, so you plug it into your program. But something strange happens: The output is now wrong!

``````int
main(void)
{
uint32_t dst[32] = {
1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16
};
uint32_t src[32] = {
1, 4, 9, 16, 25, 36, 49, 64,
81, 100, 121, 144, 169, 196, 225, 256,
};
xor512b(dst, src);
for (int i = 0; i < 16; i++) {
printf("%d\n", (int)dst[i]);
}
}
``````

Your program prints 1..16 as if `xor512b()` was never called. You check over everything a dozen times, and you can’t find anything wrong. Even crazier, if you disable optimizations then the bug goes away. It must be some kind of compiler bug!

Investigating a bit more, you learn that the `-fno-strict-aliasing` option also fixes the bug. That’s because this program violates C strict aliasing rules. An array of `uint32_t` was accessed as a `uint64_t`. As an important optimization, compilers are allowed to assume such variables do not alias and generate code accordingly. Otherwise every memory store could potentially modify any variable, which limits the compiler’s ability to produce decent code.

The original version is fine because `char *`, including both `signed` and `unsigned`, has a special exemption and may alias with anything. For the same reason, using `char *` unnecessarily can also make your programs slower.

What could you do to keep the chunking operation while not running afoul of strict aliasing? Counter-intuitively, you could use `memcpy()`. Copy the chunks into legitimate, local `uint64_t` variables, do the work, and copy the result back out.

``````void
xor512c(void *dst, void *src)
{
for (int i = 0; i < 8; i++) {
uint64_t buf[2];
memcpy(buf + 0, (char *)dst + i*8, 8);
memcpy(buf + 1, (char *)src + i*8, 8);
buf[0] ^= buf[1];
memcpy((char *)dst + i*8, buf, 8);
}
}
``````

Since `memcpy()` is a built-in function, your compiler knows its semantics and can ultimately elide all that copying. The assembly listing for `xor512c` is identical to `xor512b`, but it won’t go haywire when integrated into a real program.

It works and it’s correct, but you can still do much better than this!

### Letting your compiler do the work

The problem is you’re forcing the knife and not letting it do the work. There’s a constraint on your compiler that hasn’t been considered: It must work correctly for overlapping inputs.

``````char buf[74] = {...};
xor512a(buf, buf + 10);
``````

In this situation, the byte-by-byte and chunked versions of the function will have different results. That’s exactly why your compiler can’t do the chunking operation itself. However, you don’t care about this situation because the inputs never overlap.

Let’s revisit the first, simple implementation, but this time being smarter about it. The `restrict` keyword indicates that the inputs will not overlap, freeing your compiler of this unwanted concern.

``````void
xor512d(void *restrict dst, void *restrict src)
{
unsigned char *pd = dst;
unsigned char *ps = src;
for (int i = 0; i < 64; i++) {
pd[i] ^= ps[i];
}
}
``````

(Side note: Adding `restrict` to the manually chunked function, `xor512b()`, will not fix it. Using `restrict` can never make an incorrect program correct.)

Compiled with GCC 9.2.0 and `-O3`, the resulting unrolled code processes 16-byte chunks at a time (`pxor`):

``````xor512d:
movdqu  xmm0, [rdi+0x00]
movdqu  xmm1, [rsi+0x00]
movdqu  xmm2, [rsi+0x10]
movdqu  xmm3, [rsi+0x20]
pxor    xmm0, xmm1
movdqu  xmm4, [rdi+0x30]
movups  [rdi+0x00], xmm0
movdqu  xmm0, [rdi+0x10]
pxor    xmm0, xmm2
movups  [rdi+0x10], xmm0
movdqu  xmm0, [rdi+0x20]
pxor    xmm0, xmm3
movups  [rdi+0x20], xmm0
movdqu  xmm0, [rsi+0x30]
pxor    xmm0, xmm4
movups  [rdi+0x30], xmm0
ret
``````

Compiled with Clang 9.0.0 with AVX-512 enabled in the target (`-mavx512bw`), it does the entire operation in a single, big chunk!

``````xor512d:
vmovdqu64   zmm0, [rdi]
vpxorq      zmm0, zmm0, [rsi]
vmovdqu64   [rdi], zmm0
vzeroupper
ret
``````

“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht , or see existing discussions.