Luhn algorithm using SWAR and SIMD

April 30, 2022

nullprogram.com/blog/2022/04/30/

Ever been so successful that credit card processing was your bottleneck? Perhaps you’ve wondered, “If only I could compute check digits three times faster using the same hardware!” Me neither. But if that ever happens someday, then this article is for you. I will show how to compute the Luhn algorithm in parallel using SIMD within a register, or SWAR.

If you want to skip ahead, here’s the full source, tests, and benchmark: luhn.c

The Luhn algorithm isn’t just for credit card numbers, but they do make a nice target for a SWAR approach. The major payment processors use 16 digit numbers — i.e. 16 ASCII bytes — and typical machines today have 8-byte registers, so the input fits into two machine registers. In this context, the algorithm works like so:

Consider the digits number as an array, and double every other digit starting with the first. For example, 6543 becomes 12, 5, 8, 3.
Sum individual digits in each element. The example becomes 3 (i.e. 1+2), 5, 8, 3.
Sum the array mod 10. Valid inputs sum to zero. The example sums to 9.

I will implement this algorithm in C with this prototype:

int luhn(const char *s);

It assumes the input is 16 bytes and only contains digits, and it will return the Luhn sum. Callers either validate a number by comparing the result to zero, or use it to compute a check digit when generating a number. (Read: You could use SWAR to rapidly generate valid numbers.)

The plan is to process the 16-digit number in two halves, and so first load the halves into 64-bit registers, which I’m calling hi and lo:

uint64_t hi =
    (uint64_t)(s[ 0]&255) <<  0 | (uint64_t)(s[ 1]&255) <<  8 |
    (uint64_t)(s[ 2]&255) << 16 | (uint64_t)(s[ 3]&255) << 24 |
    (uint64_t)(s[ 4]&255) << 32 | (uint64_t)(s[ 5]&255) << 40 |
    (uint64_t)(s[ 6]&255) << 48 | (uint64_t)(s[ 7]&255) << 56;
uint64_t lo =
    (uint64_t)(s[ 8]&255) <<  0 | (uint64_t)(s[ 9]&255) <<  8 |
    (uint64_t)(s[10]&255) << 16 | (uint64_t)(s[11]&255) << 24 |
    (uint64_t)(s[12]&255) << 32 | (uint64_t)(s[13]&255) << 40 |
    (uint64_t)(s[14]&255) << 48 | (uint64_t)(s[15]&255) << 56;

This looks complicated and possibly expensive, but it’s really just an idiom for loading a little endian 64-bit integer from a buffer. Breaking it down:

The input, *s, is char, which may be signed on some architectures. I chose this type since it’s the natural type for strings. However, I do not want sign extension, so I mask the low byte of the possibly-signed result by ANDing with 255. It’s as though *s was unsigned char.
The shifts assemble the 64-bit result in little endian byte order regardless of the host machine byte order. In other words, this will produce correct results even on big endian hosts.
I chose little endian since it’s the natural byte order for all the architectures I care about. Big endian hosts may pay a cost on this load (byte swap instruction, etc.). The rest of the function could just as easily be computed over a big endian load if I was primarily targeting a big endian machine instead.
I could have used unsigned long long (i.e. at least 64 bits) since no part of this function requires exactly 64 bits. I chose uint64_t since it’s succinct, and in practice, every implementation supporting long long also defines uint64_t.

Both GCC and Clang figure this all out and produce perfect code. On x86-64, just one instruction for each statement:

    mov  rax, [rdi+0]
    mov  rdx, [rdi+8]

Or, more impressively, loading both using a single instruction on ARM64:

    ldp  x0, x1, [x0]

The next step is to decode ASCII into numeric values. This is trivial and common in SWAR, and only requires subtracting '0' (0x30). So long as there is no overflow, this can be done lane-wise.

hi -= 0x3030303030303030;
lo -= 0x3030303030303030;

Each byte of the register now contains values in 0–9. Next, double every other digit. Multiplication in SWAR is not easy, but doubling just means adding the odd lanes to themselves. I can mask out the lanes that are not doubled. Regarding the mask, recall that the least significant byte is the first byte (little endian).

hi += hi & 0x00ff00ff00ff00ff;
lo += lo & 0x00ff00ff00ff00ff;

Each byte of the register now contains values in 0–18. Now for the tricky problem of folding the tens place into the ones place. Unlike 8 or 16, 10 is not a particularly convenient base for computers, especially since SWAR lacks lane-wide division or modulo. Perhaps a lane-wise binary-coded decimal could solve this. However, I have a better trick up my sleeve.

Consider that the tens place is either 0 or 1. In other words, we really only care if the value in the lane is greater than 9. If I add 6 to each lane, the 5th bit (value 16) will definitely be set in any lanes that were previously at least 10. I can use that bit as the tens place.

hi += (hi + 0x0006000600060006)>>4 & 0x0001000100010001;
lo += (lo + 0x0006000600060006)>>4 & 0x0001000100010001;

This code adds 6 to the doubled lanes, shifts the 5th bit to the least significant position in the lane, masks for just that bit, and adds it lane-wise to the total. Only applying this to doubled lanes is a style decision, and I could have applied it to all lanes for free.

The astute might notice I’ve strayed from the stated algorithm. A lane that was holding, say, 12 now hold 13 rather than 3. Since the final result of the algorithm is modulo 10, leaving the tens place alone is harmless, so this is fine.

At this point each lane contains values in 0–19. Now that the tens processing is done, I can combine the halves into one register with a lane-wise sum:

hi += lo;

Each lane contains values in 0–38. I would have preferred to do this sooner, but that would have complicated tens place handling. Even if I had rotated the doubled lanes in one register to even out the sums, some lanes may still have had a 2 in the tens place.

The final step is a horizontal sum reduction using the typical SWAR approach. Add the top half of the register to the bottom half, then the top half of what’s left to the bottom half, etc.

hi += hi >> 32;
hi += hi >> 16;
hi += hi >>  8;

Before the sum I said each lane was 0–38, so couldn’t this sum be as high as 304 (8x38)? It would overflow the lane, giving an incorrect result. Fortunately the actual range is 0–18 for normal lanes and 0–38 for doubled lanes. That’s a maximum of 224, which fits in the result lane without overflow. Whew! I’ve been tracking the range all along to guard against overflow like this.

Finally mask the result lane and return it modulo 10:

return (hi&255) % 10;

On my machine, SWAR is around 3x faster than a straightforward digit-by-digit implementation.

Usage examples

int is_valid(const char *s)
{
    return luhn(s) == 0;
}

void random_credit_card(char *s)
{
    sprintf(s, "%015llu0", rand64()%1000000000000000);
    s[15] = '0' + 10 - luhn(s);
}

SIMD

Conveniently, all the SWAR operations translate directly into SSE2 instructions. If you understand the SWAR version, then this is easy to follow:

int luhn(const char *s)
{
    __m128i r = _mm_loadu_si128((void *)s);

    // decode ASCII
    r = _mm_sub_epi8(r, _mm_set1_epi8(0x30));

    // double every other digit
    __m128i m = _mm_set1_epi16(0x00ff);
    r = _mm_add_epi8(r, _mm_and_si128(r, m));

    // extract and add tens digit
    __m128i t = _mm_set1_epi16(0x0006);
    t = _mm_add_epi8(r, t);
    t = _mm_srai_epi32(t, 4);
    t = _mm_and_si128(t, _mm_set1_epi8(1));
    r = _mm_add_epi8(r, t);

    // horizontal sum
    r = _mm_sad_epu8(r, _mm_set1_epi32(0));
    r = _mm_add_epi32(r, _mm_shuffle_epi32(r, 2));
    return _mm_cvtsi128_si32(r) % 10;
}

On my machine, the SIMD version is around another 3x increase over SWAR, and so nearly an order of magnitude faster than a digit-by-digit implementation.

Update: Const-me on Hacker News suggests a better option for handling the tens digit in the function above, shaving off 7% of the function’s run time on my machine:

    // if (digit > 9) digit -= 9
    __m128i nine = _mm_set1_epi8(9);
    __m128i gt = _mm_cmpgt_epi8(r, nine);
    r = _mm_sub_epi8(r, _mm_and_si128(gt, nine));

Update: u/aqrit on reddit has come up with a more optimized SSE2 solution, 12% faster than mine on my machine:

int luhn(const char *s)
{
    __m128i v = _mm_loadu_si128((void *)s);
    __m128i m = _mm_cmpgt_epi8(_mm_set1_epi16('5'), v);
    v = _mm_add_epi8(v, _mm_slli_epi16(v, 8));
    v = _mm_add_epi8(v, m);  // subtract 1 if less than 5
    v = _mm_sad_epu8(v, _mm_setzero_si128());
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 2));
    return (_mm_cvtsi128_si32(v) - 4) % 10;
    // (('0' * 24) - 8) % 10 == 4
}

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.