nullprogram.com/blog/2018/02/07/
This article is an expanded email I wrote in response to a question
from Frank Muller. He asked how I arrived at my solution to a
branchless UTF-8 decoder:
I mean, when you started, I’m pretty the initial solution was using
branches, right? Then, you’ve applied some techniques to eliminate
them.
A bottom-up approach that begins with branches and then proceeds to
eliminate them one at a time sounds like a plausible story. However,
this story is the inverse of how it actually played out. It began when I
noticed a branchless decoder could probably be done, then I put together
the pieces one at a time without introducing any branches. But what
sparked that initial idea?
The two prior posts reveal my train of thought at the time: a look at
the Blowfish cipher and a 64-bit PRNG shootout. My
layman’s study of Blowfish was actually part of an examination of a
number of different block ciphers. For example, I also read the NSA’s
Speck and Simon paper and then implemented the 128/128 variant
of Speck — a 128-bit key and 128-bit block. I didn’t take the
time to write an article about it, but note how the entire cipher —
key schedule, encryption, and decryption — is just 40 lines of code:
struct speck {
uint64_t k[32];
};
void
speck_init(struct speck *ctx, uint64_t x, uint64_t y)
{
ctx->k[0] = y;
for (uint64_t i = 0; i < 31; i++) {
x = (x >> 8) | (x << 56);
x += y;
x ^= i;
y = (y << 3) | (y >> 61);
y ^= x;
ctx->k[i + 1] = y;
}
}
void
speck_encrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
for (int i = 0; i < 32; i++) {
*x = (*x >> 8) | (*x << 56);
*x += *y;
*x ^= ctx->k[i];
*y = (*y << 3) | (*y >> 61);
*y ^= *x;
}
}
static void
speck_decrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
for (int i = 0; i < 32; i++) {
*y ^= *x;
*y = (*y >> 3) | (*y << 61);
*x ^= ctx->k[31 - i];
*x -= *y;
*x = (*x << 8) | (*x >> 56);
}
}
Isn’t that just beautiful? It’s so tiny and fast. Other than the
not-very-arbitrary selection of 32 rounds, and the use of 3-bit and
8-bit rotations, there are no magic values. One could fairly
reasonably commit this cipher to memory if necessary, similar to the
late RC4. Speck is probably my favorite block cipher right now,
except that I couldn’t figure out the key schedule for any of the
other key/block size variants.
Another cipher I studied, though in less depth, was RC5 (1994),
a block cipher by (obviously) Ron Rivest. The most novel part of RC5
is its use of data dependent rotations. This was a very deliberate
decision, and the paper makes this clear:
RC5 should highlight the use of data-dependent rotations, and
encourage the assessment of the cryptographic strength data-dependent
rotations can provide.
What’s a data-dependent rotation. In the Speck cipher shown above,
notice how the right-hand side of all the rotation operations is a
constant (3, 8, 56, and 61). Suppose that these operands were not
constant, instead they were based on some part of the value of the
block:
int r = *y & 0x0f;
*x = (*x >> r) | (*x << (64 - r));
Such “random” rotations “frustrate differential cryptanalysis” according
to the paper, increasing the strength of the cipher.
Another algorithm that uses data-dependent shift is the PCG family of
PRNGs. Honestly, the data-dependent “permutation” shift is the
defining characteristic of PCG. As a reminder, here’s the simplified PCG
from my shootout:
uint32_t
spcg32(uint64_t s[1])
{
uint64_t m = 0x9b60933458e17d7d;
uint64_t a = 0xd737232eeccdf7ed;
*s = *s * m + a;
int shift = 29 - (*s >> 61);
return *s >> shift;
}
Notice how the final shift depends on the high order bits of the PRNG
state. (This one weird trick from Melissa O’Neil will significantly
improve your PRNG. Xorshift experts hate her.)
I think this raises a really interesting question: Why did it take until
2014 for someone to apply a data-dependent shift to a PRNG? Similarly,
why are data-dependent rotations not used in many ciphers?
My own theory is that this is because many older instruction set
architectures can’t perform data-dependent shift operations efficiently.
Many instruction sets only have a fixed shift (e.g. 1-bit), or can
only shift by an immediate (e.g. a constant). In these cases, a
data-dependent shift would require a loop. These loops would be a ripe
source of side channel attacks in ciphers due to the difficultly of
making them operate in constant time. It would also be relatively slow
for video game PRNGs, which often needed to run on constrained
hardware with limited instruction sets. For example, the 6502 (Atari,
Apple II, NES, Commodore 64) and the Z80 (too many to list) can only
shift/rotate one bit at a time.
Even on an architecture with an instruction for data-dependent shifts,
such as the x86, those shifts will be slower than constant shifts, at
least in part due to the additional data dependency.
It turns out there are also some patent issues (ex. 1, 2).
Fortunately most of these patents have now expired, and one in
particular is set to expire this June. I still like my theory better.
To branchless decoding
So I was thinking about data-dependent shifts, and I had also noticed I
could trivially check the length of a UTF-8 code point using a small
lookup table — the first step in my decoder. What if I combined these: a
data-dependent shift based on a table lookup. This would become the last
step in my decoder. The idea for a branchless UTF-8 decoder was
essentially borne out of connecting these two thoughts, and then filling
in the middle.