Articles tagged crypto at null program

My brave new code-signing world

2026-04-25T18:12:29Z

The new w64devkit release two weeks ago is the first to be code-signed with my identity, verified by Microsoft’s certificate chain. Currently only the release packaging is signed — the self-extracting archive and its payload — but I will soon code-sign individual EXEs and DLLs within the distribution. In fact, all Windows builds of my project releases have been code-signed the past two weeks, including dcmake, and so should everything going forward. My signing identity builds reputation with each download, so users will have an easier time with SmartScreen, and security software generally. Azure Artifact Signing creates the actual signature, but the rest is done with new infrastructure I built myself, aas-sign. As is often the case, the existing options were deficient for my needs, so I had to build it myself.

This code-signing is not free, and simply having aas-sign on hand, or using the GitHub Actions action, is insufficient. You must be serious enough to spend US$10/month for the Azure subscription. After that you are subjected to the labyrinth that is the Azure portal, the most confusing UI I’ve ever used. Luckily we live in an age of wonders, and I could describe to Claude in Chrome what I wanted and it would happen (Sonnet works better than Opus for this). It took as much time to figure out Azure as I spent creating a fully-functional, native debugger front-end. Clear your schedule if you’re going to try it yourself. If it weren’t for AI assistance I would have given up.

The one-time setup process is only open to North America, and involves sharing identity documents (i.e. driver’s license) with Microsoft. Unlike the rest of Azure, that part was streamlined and fairly painless. Between the cost and this requirement, this is a niche space.

However, if this is your niche, aas-sign is currently the best software available. It’s the tool Microsoft should have written, but didn’t due to ongoing institutional failures. The alternatives are a pair of tools: Azure CLI (Python) combined with either Jsign (Java) or SignTool.exe (Windows only). All impose artificial runtime constraints hostile to build pipeline composability. Poor engineering. In contrast, aas-sign is a native, multi-platform, single-file application.

If you know this space, osslsigncode probably comes to mind, but it produces signatures itself. It doesn’t interface with Azure and so has no role here aside from semi-reliable validation. The most popular use case is code-signing with self-signed certificates, but that actually makes everything worse.

There are two modes for aas-sign: Laptop and Action. Laptop mode is the most compelling, so we’ll start with that, but Action mode is the most useful in practice.

Laptop/desktop mode

Suppose you built an EXE or DLL, and would like to code-sign and publish it. Typically that looks like this:

$ aas-sign sign myapp.exe myapp.dll

It computes an Authenticode digest for each (concurrently), sends it off to Azure, gets back a signature, then a countersignature, and embeds the signatures in the images. If you have multiple signing identities then you might use --as (“sign as”):

$ aas-sign sign --as eus:contoso:jdoe myapp.exe myapp.dll

The colon-delimited triple is my own invention to combine region (East US), tenant (Contoso), and profile (J. Doe) into one string. The first time you use it, and every ~90 days thereafter, you’ll need to authenticate with Azure first:

$ aas-sign login

This will open a browser (just like az login) to log in, from which it will obtain a token that can be used to obtain signing tokens. (Yes, a token to get tokens; I’m concealing as much complexity as possible.) You might also want to establish a default identity, as typically you’d only have one:

$ aas-sign config eus:contoso:jdoe

Or all at once:

$ aas-sign login eus:contoso:jdoe

My goal was, after enduring the Azure portal sign-up, to maximally streamline code-signing.

Action mode

Manually building, signing, and publishing releases is easy and might be fine if you’re not releasing too frequently (or so infrequently that you forget how to do it), but likely you’d want to automate this process. I was stubborn about it myself, until Peter0x44 pushed me hard enough to take it seriously, for which I’m grateful. There’s an official GitHub Action to code-sign with Azure, but it requires a Windows runner, fatally limiting for my own needs. So aas-sign also defines a code-signing action. The previous example would have this in its own action:

  - name: Sign
    uses: skeeto/aas-sign@v1.0.0
    with:
      endpoint:  ${{ secrets.TRUSTED_SIGNING_ENDPOINT }}
      account:   ${{ secrets.TRUSTED_SIGNING_ACCOUNT }}
      profile:   ${{ secrets.CERTIFICATE_PROFILE }}
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      files: |
        myapp.exe
        myapp.dll

The secrets are a bunch of strings you (or your AI agent) retrieve from the Azure portal. You also need to create a Federated Identity Credential (FIC) for each repository, which I suggest triggering on an environment. (This all may sound like a joke but it’s real.) Again, just ask an AI to do all this stuff. The mandatory Azure interfacing limits how much I can streamline this process. Then aas-sign combines these with per-job tokens GitHub injects into the runner to authenticate (via the FIC) and sign.

I’ve gone through this a number of times, and the AI breezes through the GitHub UI, but struggles through the Azure portal — objective evidence of how awful it is. Idea for a UI benchmark: How many AI tokens does it take to accomplish typical activities?

For w64devkit, my plan is to run aas-sign inside the Docker build and sign executables in the container before it’s SFX-packaged. This is impossible with SignTool.exe and needlessly frictional with Jsign (requires at least a JRE if not a JDK). The easiest path forward was to literally build my own tool from scratch.

I’m considering aas-sign as a new w64devkit command, but it’s so niche that I’m likely to be its sole user. On the other hand, those already running w64devkit in GitHub Actions could use it in Action mode to code-sign their builds without any additional tools.

Billions of Code Name Permutations in 32 bits

2021-09-14T21:06:59Z

My friend over at Possibly Wrong created a code name generator. By coincidence I happened to be thinking about code names myself while recently replaying XCOM: Enemy Within (2012/2013). The game generates a random code name for each mission, and I wondered how often it repeats. The UFOpaedia page on the topic gives the word lists: 53 adjectives and 76 nouns, for a total of 4028 possible code names. A typical game has around 60 missions, and if code names are generated naively on the fly, then per the birthday paradox more than a third of all games will see a repeated mission code name! Fortunately this is easy to avoid, and the particular configuration here lends itself to an interesting implementation.
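The birthday-paradox figure can be checked directly. The following is a minimal sketch, assuming each mission draws its name uniformly and independently from all 4,028 combinations; the function name `collision_probability` is made up for this illustration.

```c
#include <assert.h>

/* Probability that at least two of `draws` uniformly random picks from
 * `total` equally likely names coincide (the birthday problem). */
double collision_probability(int draws, int total)
{
    assert(draws >= 0 && total > 0);
    double unique = 1.0;  /* probability that all picks so far differ */
    for (int k = 0; k < draws; k++) {
        unique *= (double)(total - k) / total;
    }
    return 1.0 - unique;
}
```

For 60 missions and 4,028 names this comes out to roughly 0.36, and the even-odds point is reached at about 75 missions.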

Mission code names are built using “adjective noun”. Some examples from the game’s word list:

Fading Hammer
Fallen Jester
Hidden Crown

To generate a code name, we could select a random adjective and a random noun, but as discussed it wouldn’t take long for a collision. The naive approach is to keep a database of previously-generated names, and to consult this database when generating new names. That works, but there’s an even better solution: use a random permutation. Done well, we don’t need to keep track of previous names, and the generator won’t repeat until it’s exhausted all possibilities.

Further, the total number of possible code names, 4028, is suspiciously shy of 4,096, a power of two (2**12). That makes designing and implementing an efficient permutation that much easier.

A linear congruential generator

A classic, obvious solution is a linear congruential generator (LCG). A full-period, 12-bit LCG is nothing more than a permutation of the numbers 0 to 4,095. When generating names, we can skip over the extra 68 values and pretend it’s a permutation of 4,028 elements. An LCG is constructed like so:

f(n) = (f(n-1)*A + C) % M

Typically the seed is used for f(0). M is selected based on the problem space or implementation efficiency, and usually a power of two. In this case it will be 4,096. Then there are some rules for choosing A and C.

Simply choosing a random f(0) per game isn’t great. The code name order will always be the same, and we’re only choosing where in the cycle to start. It would be better to vary the permutation itself, which we can do by also choosing unique A and C constants per game.

Choosing C is easy: It must be relatively prime with M, i.e. it must be odd. Since it’s addition modulo M, there’s no reason to choose C >= M since the results are identical to a smaller C. If we think of C as a 12-bit integer, 1 bit is locked in, and the other 11 bits are free to vary:

xxxxxxxxxxx1

Choosing A is more complicated: it must be odd, A-1 must be divisible by 4, and A-1 should not be divisible by 8 (better results). Again, thinking of this in terms of a 12-bit number, this locks in 3 bits and leaves 9 bits free:

xxxxxxxxx101

This ensures all the must and should properties of A.
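The full-period claim is cheap to verify by brute force at this size. Here is a small sketch that counts how many steps a 12-bit LCG takes to return to its starting point; `lcg12_period` is a hypothetical helper, not part of the generator itself.

```c
#include <assert.h>

/* Count steps until a 12-bit LCG that starts at 0 returns to 0. A
 * full-period generator takes exactly 4096 steps. The multiplier must be
 * odd so that each step is a bijection and the walk is guaranteed to come
 * back to where it started. */
int lcg12_period(unsigned a, unsigned c)
{
    assert(a % 2 == 1);
    unsigned s = 0;
    int steps = 0;
    do {
        s = (s*a + c) & 0xfff;
        steps++;
    } while (s != 0);
    return steps;
}
```

Any A ending in the bits 101 paired with any odd C should report 4096, while a multiplier such as 3, where A-1 is not divisible by 4, falls short.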

Finally 0 <= f(0) < M. Because of modular arithmetic, larger values are redundant, and all possible values are valid since the LCG, being full-period, will cycle through all of them. This is just choosing the starting point in a particular permutation cycle. As a 12-bit number, all 12 bits are free:

xxxxxxxxxxxx

That’s 9 + 11 + 12 = 32 free bits to fill randomly: again, how incredibly convenient! Every 32-bit integer defines some unique code name permutation… almost. Any 32-bit descriptor where f(0) >= 4028 will collide with at least one other due to skipping, and so around 1.7% of the state space is redundant. A small loss that should shrink with slightly better word list planning. I don’t think anyone will notice.

Slice and dice

I love compact state machines, and this is an opportunity to put one to good use. My code name generator will be just one function:

uint32_t codename(uint32_t state, char *buf);

This takes one of those 32-bit permutation descriptors, writes the first code name to buf, and returns a descriptor for another permutation that starts with the next name. All we have to do is keep track of that 32-bit number and we’ll never need to worry about repeating code names until all have been exhausted.

First, let’s extract A, C, and f(0), which I’m calling S. The low bits are A, middle bits are C, and high bits are S. Note the OR with 1 and 5 to lock in the hard-set bits.

long a = (state <<  3 | 5) & 0xfff;  //  9 bits
long c = (state >>  8 | 1) & 0xfff;  // 11 bits
long s =  state >> 20;               // 12 bits

Next iterate the LCG until we have a number in range:

do {
    s = (s*a + c) & 0xfff;
} while (s >= 4028);

Once we have an appropriate LCG state, compute the adjective/noun indexes and build a code name:

int i = s % 53;
int j = s / 53;
sprintf(buf, "%s %s", adjvs[i], nouns[j]);

Finally assemble the next 32-bit state. Since A and C don’t change, these are passed through while the old S is masked out and replaced with the new S.

return (state & 0xfffff) | (uint32_t)s<<20;

Putting it all together:

static const char *adjvs[] = { /* ... */ };
static const char *nouns[] = { /* ... */ };

uint32_t codename(uint32_t state, char *buf)
{
    long a = (state <<  3 | 5) & 0xfff;  //  9 bits
    long c = (state >>  8 | 1) & 0xfff;  // 11 bits
    long s =  state >> 20;               // 12 bits

    do {
        s = (s*a + c) & 0xfff;
    } while (s >= COUNTOF(adjvs)*COUNTOF(nouns));

    int i = s % COUNTOF(adjvs);
    int j = s / COUNTOF(adjvs);
    sprintf(buf, "%s %s", adjvs[i], nouns[j]);
    return (state & 0xfffff) | (uint32_t)s<<20;
}
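To check the no-repeat property without needing the word lists, the same LCG walk can be run on raw indexes. This is a test sketch under that assumption: `next_index` and `visits_every_name_once` are hypothetical names, and the count 4028 is hardcoded in place of the COUNTOF expressions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAMES 4028  /* 53 adjectives * 76 nouns */

/* Same LCG walk as codename(), but yielding the raw index instead of
 * formatting a string. Returns the next 32-bit state. */
uint32_t next_index(uint32_t state, int *index)
{
    long a = (state <<  3 | 5) & 0xfff;
    long c = (state >>  8 | 1) & 0xfff;
    long s =  state >> 20;
    do {
        s = (s*a + c) & 0xfff;
    } while (s >= NAMES);
    *index = (int)s;
    return (state & 0xfffff) | (uint32_t)s<<20;
}

/* Returns 1 if NAMES consecutive calls yield NAMES distinct indexes. */
int visits_every_name_once(uint32_t state)
{
    static unsigned char seen[NAMES];
    memset(seen, 0, sizeof(seen));
    for (int n = 0; n < NAMES; n++) {
        int index;
        state = next_index(state, &index);
        assert(index >= 0 && index < NAMES);
        if (seen[index]) {
            return 0;
        }
        seen[index] = 1;
    }
    return 1;
}
```

Because the underlying 12-bit LCG is full-period, any 4,028 consecutive in-range outputs are distinct, whatever the starting descriptor.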

The caller just needs to generate an initial 32-bit integer. Any 32-bit integer is valid — even zero — so this could just be, say, the unix epoch (time(2)), but adjacent values will have similar-ish permutations. I intentionally placed S in the high bits, which are least likely to vary, since it only affects where the cycle begins, while A and C have a much more dramatic impact and so are placed at more variable locations.

Regardless, it would be better to hash such an input so that adjacent time values map to distant states. It also helps hide poorer (less random) choices for A multipliers. I happen to have designed some great functions for exactly this purpose. Here’s one of my best:

static uint32_t
hash32(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 15;
    x *= 0xd168aaadU; x ^= x >> 15;
    x *= 0xaf723597U; x ^= x >> 15;
    return x;
}

This would be perfectly reasonable for generating all possible names in a random order:

uint32_t state = hash32(time(0));
for (int i = 0; i < 4028; i++) {
    char buf[32];
    state = codename(state, buf);
    puts(buf);
}

To further help cover up poorer A multipliers, it’s better for the word list to be pre-shuffled in its static storage. If that underlying order happens to show through, at least it will be less obvious (i.e. not in alphabetical order). Shuffling the string list in my source is just a few keystrokes in Vim, so this is easy enough.

Robustness

If you’re set on making the codename function easier to use such that consumers don’t need to think about hashes, you could “encode” and “decode” the descriptor going in and out of the function:

uint32_t codename(uint32_t state, char *buf)
{
    state += 0x3243f6a8U; state ^= state >> 17;
    state *= 0x9e485565U; state ^= state >> 16;
    state *= 0xef1d6b47U; state ^= state >> 16;

    // ...

    state = (state & 0xfffff) | (uint32_t)s<<20;
    state ^= state >> 16; state *= 0xeb00ce77U;
    state ^= state >> 16; state *= 0x88ccd46dU;
    state ^= state >> 17; state -= 0x3243f6a8U;
    return state;
}

This permutes the state coming in, and reverses that permutation on the way out (read: inverse hash). This breaks up similar starting points.
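The two halves can be pulled out and tested as a pair. In this sketch the function names `state_encode`, `state_decode`, and `check_constants` are invented for illustration; the constants are the ones from the listing. Each xorshift by 16 or 17 bits is its own inverse on a 32-bit word, and each multiplier in the second half is the modular inverse of one in the first.

```c
#include <assert.h>
#include <stdint.h>

/* The three mixing steps applied on entry. */
uint32_t state_encode(uint32_t x)
{
    x += 0x3243f6a8U; x ^= x >> 17;
    x *= 0x9e485565U; x ^= x >> 16;
    x *= 0xef1d6b47U; x ^= x >> 16;
    return x;
}

/* The same steps undone in reverse order on exit. */
uint32_t state_decode(uint32_t x)
{
    x ^= x >> 16; x *= 0xeb00ce77U;
    x ^= x >> 16; x *= 0x88ccd46dU;
    x ^= x >> 17; x -= 0x3243f6a8U;
    return x;
}

/* Each multiplier pair must multiply to 1 modulo 2^32. */
void check_constants(void)
{
    assert((uint32_t)(0x9e485565U * 0x88ccd46dU) == 1);
    assert((uint32_t)(0xef1d6b47U * 0xeb00ce77U) == 1);
}
```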

A random-access code name permutation

Of course this isn’t the only way to build a permutation. I recently picked up another trick: Kensler permutation. The key insight is cycle-walking, allowing for random-access to a permutation of a smaller domain (e.g. 4,028 elements) through permutation of a larger domain (e.g. 4096 elements).

Here’s such a code name generator built around a bespoke 12-bit xorshift-multiply permutation. I used 4 “rounds” since xorshift-multiply is less effective the smaller the permutation.

// Generate the nth code name for this seed.
void codename_n(char *buf, uint32_t seed, int n)
{
    uint32_t i = n;
    do {
        i ^= i >> 6; i ^= seed >>  0; i *= 0x325; i &= 0xfff;
        i ^= i >> 6; i ^= seed >>  8; i *= 0x3f5; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 16; i *= 0xa89; i &= 0xfff;
        i ^= i >> 6; i ^= seed >> 24; i *= 0x85b; i &= 0xfff;
        i ^= i >> 6;
    } while (i >= COUNTOF(adjvs)*COUNTOF(nouns));

    int a = i % COUNTOF(adjvs);
    int b = i / COUNTOF(adjvs);
    snprintf(buf, 22, "%s %s", adjvs[a], nouns[b]);
}

While this is more flexible, avoids poorer permutations, and doesn’t have state space collisions, I still have a soft spot for my LCG-based state machine generator.

Source code

You can find the complete, working source code with both generators here: codename.c. I used real US Secret Service code names for my word list. Some sample outputs:

PLASTIC HUMMINGBIRD
BLACK VENUS
SILENT SUNBURN
BRONZE AUTHOR
FADING MARVEL

Single-primitive authenticated encryption for fun

2021-01-30T03:39:10Z

Just as a fun exercise, I designed and implemented from scratch a standalone, authenticated encryption tool, including key derivation with stretching, using a single cryptographic primitive. Or, more specifically, half of a primitive. That primitive is the encryption function of the XXTEA block cipher. The goal was to pare both design and implementation down to the bone without being broken in practice — I hope — and maybe learn something along the way. This article is the tour of my design. Everything here will be nearly the opposite of the right answers.

The tool itself is named xxtea (lowercase), and it’s supported on all unix-like and Windows systems. It’s trivial to compile, even on the latter. The code should be easy to follow from top to bottom, with commentary about specific decisions along the way, though I’ll quote the most important stuff inline here.

The command line options follow the usual conventions. The two modes of operation are encrypt (-E) and decrypt (-D). It defaults to using standard input and standard output so it works great in pipelines. Supplying -o sends output elsewhere (automatically deleted if something goes wrong), and the optional positional argument indicates an alternate input source.

usage: xxtea <-E|-D> [-h] [-o FILE] [-p PASSWORD] [FILE]

examples:
    $ xxtea -E -o file.txt.xxtea file.txt
    $ xxtea -D -o file.txt file.txt.xxtea

If no password is provided (-p), it prompts for a UTF-8-encoded password. Of course it’s not normally a good idea to supply a password via command line argument, but it’s been useful for testing.

XXTEA block cipher

TEA stands for Tiny Encryption Algorithm and XXTEA is the second attempt at fixing weaknesses in the cipher, with partial success. The remaining weaknesses should not be a problem for this particular application. XXTEA supports a variable block size, but I’ve hardcoded my implementation to a 128-bit block size, along with some unrolling. I’ve also discarded the unneeded decryption function. There are no data-dependent lookups or branches so it’s immune to speculation attacks.

XXTEA operates on 32-bit words and has a 128-bit key, meaning both block and key are four words apiece. My implementation is about a dozen lines long. Its prototype:

// Encrypt a 128-bit block using 128-bit key
void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4]);

All cryptographic operations are built from this function. Another way to think about it is that it accepts two 128-bit inputs and returns a 128-bit result:

uint128 r = f(uint128 key, uint128 block);

Tuck that away in the back of your head since this will be important later.
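For reference, here is a sketch of what such a function could look like. It follows the published XXTEA reference algorithm with the word count fixed at four, which gives 6 + 52/4 = 19 rounds. It is not the unrolled version from the tool, and it has not been checked against that implementation's output.

```c
#include <assert.h>
#include <stdint.h>

/* XXTEA encryption fixed to four 32-bit words (a 128-bit block),
 * written as a plain loop after the reference algorithm. */
void xxtea128_encrypt(const uint32_t key[4], uint32_t block[4])
{
    assert(key && block);
    const uint32_t delta = 0x9e3779b9U;
    uint32_t sum = 0;
    uint32_t z = block[3];
    for (int round = 0; round < 6 + 52/4; round++) {
        sum += delta;
        uint32_t e = (sum >> 2) & 3;
        for (int p = 0; p < 4; p++) {
            uint32_t y = block[(p + 1) & 3];  /* next word, wrapping around */
            uint32_t mx = (((z >> 5) ^ (y << 2)) + ((y >> 3) ^ (z << 4)))
                        ^ ((sum ^ y) + (key[(p & 3) ^ e] ^ z));
            z = block[p] += mx;
        }
    }
}
```

The only operations are shifts, XORs, additions, and a key index derived from the round counter, so nothing in the control flow or memory access pattern depends on the data.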

Encryption

If I tossed the decryption function, how are messages decrypted? I’m sure many have already guessed: XXTEA will be used in counter mode, or CTR mode. Rather than encrypt the plaintext directly, encrypt a 128-bit block counter and treat it like a stream cipher. The message is XORed with the encrypted counter values for both encryption and decryption.

Only half the cipher is needed.
No padding scheme is necessary. With other block modes, if message lengths may not be exactly a multiple of the block size then you need some scheme for padding the last block.

A 128-bit increment with 32-bit limbs is easy:

void
increment(uint32_t ctr[4])
{
    /* 128-bit increment, first word changes fastest */
    if (!++ctr[0]) if (!++ctr[1]) if (!++ctr[2]) ++ctr[3];
}

In xxtea, words are always marshalled in little endian byte order (least significant byte first). With the first word as the least significant limb, the entire 128-bit counter is itself little endian.
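Little endian marshalling can be written portably with shifts, independent of the host byte order. This is a sketch with assumed signatures: `loadu32` is the name used elsewhere in the tool's code, while `storeu32` and `counter_bytes` are hypothetical helpers added for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit word stored least significant byte first. */
uint32_t loadu32(const unsigned char *p)
{
    return (uint32_t)p[0] <<  0 | (uint32_t)p[1] <<  8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Write a 32-bit word least significant byte first. */
void storeu32(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >>  0);
    p[1] = (unsigned char)(x >>  8);
    p[2] = (unsigned char)(x >> 16);
    p[3] = (unsigned char)(x >> 24);
}

/* Serialize the 128-bit counter: word 0 first, each word little endian,
 * so the 16 bytes as a whole are one little endian integer. */
void counter_bytes(unsigned char out[16], const uint32_t ctr[4])
{
    for (int i = 0; i < 4; i++) {
        storeu32(out + 4*i, ctr[i]);
        assert(loadu32(out + 4*i) == ctr[i]);  /* round trip */
    }
}
```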

The counter doesn’t start at zero, but at some randomly-selected 128-bit nonce called the initialization vector (IV), wrapping around to zero if necessary (incredibly unlikely). The IV will be included with the message in the clear. This nonce allows one key (password) to be used with multiple messages, as they’ll all be encrypted using different, randomly-chosen regions of an enormous keystream. It also provides semantic security: encrypt the same file more than once and the ciphertext will always be completely different.

for (/* ... */) {
    uint32_t cover[4] = {ctr[0], ctr[1], ctr[2], ctr[3]};
    xxtea128_encrypt(key, cover);
    block[i+0] ^= cover[0];
    block[i+1] ^= cover[1];
    block[i+2] ^= cover[2];
    block[i+3] ^= cover[3];
    increment(ctr);
}

Hash function

That’s encryption, but there’s still a matter of authentication and key derivation function (KDF). To deal with both I’ll need to devise a hash function. Since I’m only using the one primitive, somehow I need to build a hash function from a block cipher. Fortunately there’s a tool for doing just that: the Merkle–Damgård construction.

Recall that xxtea128_encrypt accepts two 128-bit inputs and returns a 128-bit result. In other words, it compresses 256 bits into 128 bits: a compression function. The two 128-bit inputs are cryptographically combined into one 128-bit result. I can repeat this operation to fold an arbitrary number of 128-bit inputs into a 128-bit hash result.

uint32_t *input = /* ... */;
uint32_t hash[4] = {0, 0, 0, 0};
xxtea128_encrypt(input +  0, hash);
xxtea128_encrypt(input +  4, hash);
xxtea128_encrypt(input +  8, hash);
xxtea128_encrypt(input + 12, hash);
// ...

Note how the input is the key, not the block. The hash state is repeatedly encrypted using the hash inputs as the key, mixing hash state and input. When the input is exhausted, that block is the result. Sort of.

I used zero for the initial hash state in my example, but it will be more challenging to attack if the starting input is something random. Like Blowfish, in xxtea I chose the first 128 bits of the decimals of pi:

void
xxtea128_hash_init(uint32_t ctx[4])
{
    /* first 32 hexadecimal digits of pi */
    ctx[0] = 0x243f6a88; ctx[1] = 0x85a308d3;
    ctx[2] = 0x13198a2e; ctx[3] = 0x03707344;
}

/* Mix one block into the hash state. */
void
xxtea128_hash_update(uint32_t ctx[4], const uint32_t block[4])
{
    xxtea128_encrypt(block, ctx);
}

There are still a couple of problems. First, what if the input isn’t a multiple of the block size? This time I do need a padding scheme to fill out that last block. In this case I pad it with bytes where each byte is the number of padding bytes. For instance, helloworld becomes, roughly speaking, helloworld666666.

That creates a different problem: This will have the same hash result as an input that actually ends with these bytes. So the second rule is that there is always a padding block, even if that block is 100% padding.
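The effect of the always-pad rule is easiest to see on whole messages. In this sketch, `pad_message` is a hypothetical helper that pads an entire buffer at once rather than streaming blocks into the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pad a whole message to a multiple of 16 bytes. Every padding byte holds
 * the number of padding bytes, and there is always at least one, so a
 * message that is already block-aligned gains a full block of 16s.
 * `out` must have room for (len/16 + 1)*16 bytes. Returns the padded size. */
size_t pad_message(unsigned char *out, const void *msg, size_t len)
{
    assert(out && msg);
    size_t padded = (len/16 + 1) * 16;
    memcpy(out, msg, len);
    memset(out + len, (int)(padded - len), padded - len);
    return padded;
}
```

A 10-byte helloworld pads to one block, while a 16-byte input that already ends in six bytes of value 6 pads to two blocks, so the two no longer collide.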

Another problem is that the Merkle–Damgård construction is prone to length-extension attacks. Anyone can take my hash result and continue appending additional data without knowing what came before. If I’m using this hash to authenticate the ciphertext, someone could, for example, use this attack to append arbitrary data to the end of messages.

Some important hash functions, such as the most common forms of SHA-2, are vulnerable to length-extension attacks. Keeping this issue in mind, I could address it later using HMAC, but I have an idea for nipping this in the bud now. Before mixing the padding block into the hash state, I swap the two middle words:

/* Append final raw-byte block to hash state. */
void
xxtea128_hash_final(uint32_t ctx[4], const void *buf, int len)
{
    assert(len < 16);
    unsigned char tmp[16];
    memset(tmp, 16-len, 16);
    memcpy(tmp, buf, len);
    uint32_t k[4] = {
        loadu32(tmp +  0), loadu32(tmp +  4),
        loadu32(tmp +  8), loadu32(tmp + 12),
    };
    /* swap middle words to break length extension attacks */
    uint32_t swap = ctx[1];
    ctx[1] = ctx[2];
    ctx[2] = swap;
    xxtea128_encrypt(k, ctx);
}

This operation “ties off” the last block so that the hash can’t be extended with more input. Or so I hope. This is my own invention, and so it may not actually work right. Again, this is for fun and learning!

Update: Aristotle Pagaltzis pointed out that when these two words are identical the hash result will be unchanged, leaving it vulnerable to length extension attack. This occurs about once every 2³² messages, which is far too small a security margin.

Caveats

Despite all that care, there are still two more potential weaknesses.

First, XXTEA was never designed to be used with the Merkle–Damgård construction. I assume attackers can modify files I will decrypt, and so the hash input is usually and mostly under control of attackers, meaning they control the cipher key. Ciphers are normally designed assuming the key is not under hostile control. This might be vulnerable to related-key attacks.

As will be discussed below, I use this custom hash function in two ways. In one the input is not controlled by attackers, so this is a non-issue. In the second, the hash state is completely unknown to the attacker before they control the input, which I believe mitigates any issues.

Second, a 128-bit hash state is a bit small these days. For very large inputs, the chance of collision via the birthday paradox is a practical issue.

In xxtea, digests are only computed over a few megabytes of input at a time at most, even when encrypting giant files, so a 128-bit state should be fine.

Key derivation

The user will supply a password and somehow I need to turn that into a 128-bit key.

What if the password is shorter than 128 bits?
What if the password is longer than 128 bits?
It’s safer for the cipher if the raw password isn’t used directly.
I’d like offline, brute force attacks to be expensive.

The first three can be resolved by running the passphrase through the hash function, using it as a key derivation function. What about the last item? Rather than hash the password once, I concatenate it, including null terminator, repeatedly until it reaches a certain number of bytes (hardcoded to 64 MiB, see COST), and hash that. That’s a computational workload that attackers must repeat when guessing passwords.

To avoid timing attacks based on the password length, I precompute all possible block arrangements before starting the hash — all the different ways the password might appear concatenated across 16-byte blocks. Blocks may be redundantly computed if necessary to make this part constant time. The hash is fed entirely from these precomputed blocks.
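The block arrangements are simply 16-byte windows into the endless stream password\0password\0 and so on. The sketch below computes one window on demand; `password_block` is a hypothetical helper, and it makes no attempt at the precomputation or constant-time behavior described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill `out` with 16-byte block number `block` of the endless stream
 * formed by repeating the password together with its null terminator. */
void password_block(unsigned char out[16], const char *password, uint64_t block)
{
    assert(password);
    size_t period = strlen(password) + 1;  /* terminator is part of the stream */
    size_t offset = (size_t)(block * 16 % period);
    for (int i = 0; i < 16; i++) {
        out[i] = (unsigned char)password[offset];
        offset = offset + 1 == period ? 0 : offset + 1;
    }
}
```

The starting offset cycles with the block number, so there are at most `period` distinct blocks, which is what makes precomputing all of them feasible.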

To defend against rainbow tables and the like — as well as make it harder to attack other parts of the message construction — the initialization vector is used as a salt, fed into the hash before the password concatenation.

Unfortunately this KDF isn’t memory-hard, and attackers can use economy of scale to strengthen their attacks (GPUs, custom hardware). A memory-hard KDF requires lots of memory to compute the key, making memory an expensive and limiting factor for attackers. Memory-hard KDFs are complex and difficult to design, and I made the trade-off for simplicity.

Authentication

When I say the encryption is authenticated I mean that it should not be possible for anyone to tamper with the ciphertext undetected without already knowing the key. This is typically accomplished by computing a keyed hash digest and appending it to the message, a message authentication code (MAC). Since it’s keyed, only someone who knows the key can compute the digest, and so attackers can’t spoof the MAC.

This is where length-extension attacks come into play: With an improperly constructed MAC, an attacker could append input without knowing the key. Fortunately my hash function isn’t vulnerable to length-extension attacks!

An alternative is to use an authenticated block mode such as GCM, which is still CTR mode at its core. Unfortunately, this is complicated, and, unlike plain CTR, it would take me a long time to convince myself I got it right. So instead I used CTR mode and my hash function in a straightforward way.

At this point there’s a question of what exactly you input into the hash function. Do you hash the plaintext or do you hash the ciphertext? It’s tempting to do the former since it’s (generally) not available to attackers, and would presumably make it harder to attack. This is a mistake. Always compute the MAC over the ciphertext, a.k.a. encrypt then authenticate.

This is called the Doom Principle. Computing the MAC on the plaintext means that recipients must decrypt untrusted ciphertext before authenticating it. This is bad because messages should be authenticated before decryption. So that’s exactly what xxtea does. It also happens to be the simplest option.

We have a hash function, but to compute a MAC we need a keyed hash function. Again, I do the simplest thing that I believe isn’t broken: concatenate the key with the ciphertext. Or more specifically:

MAC = hash(key || ctr || ciphertext)

Update: Dimitrije Erdeljan explains why this is broken and how to fix it. Given a valid MAC, attackers can forge arbitrary messages.

The counter is there because xxtea uses chunked authentication with one megabyte chunks. It can authenticate a chunk at a time, which allows it to decrypt, with authentication, arbitrary amounts of ciphertext in a fixed amount of memory. The worst that can happen is truncation between chunks, an acceptable trade-off. The counter ensures each chunk MAC is uniquely keyed, and that they appear in order.

It’s also important to note that the counter is appended after the key. The counter is under hostile control — they can choose the IV — and having the key there first means they have no information about the hash state.

All chunks are one megabyte except for the last chunk, which is always shorter, signaling the end of the message. It may even be just a MAC and zero-length ciphertext. This avoids nasty issues with parsing potentially unauthenticated length fields and whatnot. Just stop successfully at the first short, authenticated chunk.

Some will likely have spotted it, but a potential weakness is that I’m using the same key for both encryption and authentication. These are normally two different keys. This is disastrous in certain cases like CBC-MAC, but I believe it’s alright here. It would be easy to compute a separate MAC key, but I opted for simple.

File format

In my usual style, encrypted files have no distinguishing headers or fields. They just look like a random block of data. A file begins with the 16-byte IV, then a sequence of zero or more one megabyte chunks, ending with a short chunk. It’s indistinguishable from /dev/random.

[IV][1MiB || MAC][1MiB || MAC][<1MiB || MAC]
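The layout makes the output size a simple function of the input size. This sketch assumes each MAC is one 128-bit hash result (16 bytes), which follows from the hash design but is not stated as a format constant here; `encrypted_size` and the macro names are hypothetical.

```c
#include <assert.h>

#define CHUNK   (1LL << 20)  /* plaintext bytes per full chunk (1 MiB) */
#define IV_LEN  16
#define MAC_LEN 16           /* assumed: one 128-bit digest per chunk */

/* Size of the encrypted file for a given plaintext length. There is
 * always a final short chunk, even if it carries zero ciphertext bytes. */
long long encrypted_size(long long plaintext_len)
{
    assert(plaintext_len >= 0);
    long long full = plaintext_len / CHUNK;  /* complete 1 MiB chunks */
    long long tail = plaintext_len % CHUNK;  /* short chunk payload, may be 0 */
    return IV_LEN + full*(CHUNK + MAC_LEN) + (tail + MAC_LEN);
}
```

Under that assumption an empty input becomes 32 bytes: the IV followed by a single MAC over zero-length ciphertext.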

If the user types the incorrect password, it will be discovered when authenticating the first chunk (read: immediately). This saves on a dedicated check at the beginning of the file, though it means it’s not possible to distinguish between a bad password and a modified file.

I know my design has weaknesses as a result of artificial, self-imposed constraints and deliberate trade-offs, but I’m curious if I’ve made any glaring mistakes with practical consequences.

On-the-fly Linear Congruential Generator Using Emacs Calc

2019-11-19T01:17:50Z

I regularly make throwaway “projects” and do a surprising amount of programming in /tmp. For Emacs Lisp, the equivalent is the *scratch* buffer. These are places where I can make a mess, and the mess usually gets cleaned up before it becomes a problem. A lot of my established projects (ex.) start out in volatile storage and only graduate to more permanent storage once the concept has proven itself.

Throughout my whole career, this sort of throwaway experimentation has been an important part of my personal growth, and I try to encourage it in others. Even if the idea I’m trying doesn’t pan out, I usually learn something new, and occasionally it translates into an article here.

I also enjoy small programming challenges. One of the most abused tools in my mental toolbox is the Monte Carlo method, and I readily apply it to solve toy problems. Even beyond this, random number generators are frequently a useful tool (1, 2), so I find myself reaching for one all the time.

Nearly every programming language comes with a pseudo-random number generation function or library. Unfortunately the language’s standard PRNG is usually a poor choice (C, C++, C#, Go). It’s probably mediocre quality, slower than it needs to be (also), lacks reliable semantics or behavior between implementations, or is missing some other property I want. So I’ve long been a fan of BYOPRNG: Bring Your Own Pseudo-random Number Generator. Just embed a generator with the desired properties directly into the program. The best non-cryptographic PRNGs today are tiny and exceptionally friendly to embedding. Though, depending on what you’re doing, you might need to be creative about seeding.

Crafting a PRNG

On occasion I don’t have an established, embeddable PRNG in reach, and I have yet to commit xoshiro256** to memory. Or maybe I want to use a totally unique PRNG for a particular project. In these cases I make one up. With just a bit of know-how it’s not too difficult.

Probably the easiest decent PRNG to code from scratch is the venerable Linear Congruential Generator (LCG). It’s a simple recurrence relation:

x[1] = (x[0] * A + C) % M

That’s trivial to remember once you know the details. You only need to choose appropriate values for A, C, and M. Done correctly, it will be a full-period generator: a generator that visits a permutation of each of the numbers between 0 and M - 1. The seed, the value of x[0], chooses a starting position in this (looping) permutation.

M has a natural, obvious choice: a power of two matching the range of operands, such as 2^32 or 2^64. With this the modulo operation is free as a natural side effect of the computer architecture.

Choosing C also isn’t difficult. It must be co-prime with M, and since M is a power of two, any odd number is valid. Even 1. In theory choosing a small value like 1 is faster since the compiler won’t need to embed a large integer in the code, but this difference doesn’t show up in any micro-benchmarks I tried. If you want a cool, unique generator, then choose a large random integer. More on that below.

The tricky value is A, and getting it right is the linchpin of the whole LCG. It must be coprime with M (i.e. not even), and, for a full-period generator, A-1 must be divisible by four. For better results, A-1 should not be divisible by 8. A good choice is a prime number that satisfies these properties.
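The rules for A and C can be checked by brute force on a small modulus. The following is a minimal sketch, not part of the original procedure: the helper names `fullPeriod` and `validLCG` are made up for illustration, and the 16-bit modulus is chosen only so the exhaustive check finishes instantly.

```go
package main

import "fmt"

// fullPeriod reports whether x = x*a + c (mod 2^bits), started at 0,
// visits every value once before returning to 0. Brute force.
func fullPeriod(a, c uint64, bits uint) bool {
	mask := uint64(1)<<bits - 1
	x := uint64(0)
	for i := uint64(0); i < mask; i++ {
		x = (x*a + c) & mask
		if x == 0 {
			return false // came back to the start too early
		}
	}
	return (x*a+c)&mask == 0
}

// validLCG applies the rules from the text for a power-of-two
// modulus: C must be odd and A-1 must be divisible by four.
func validLCG(a, c uint64) bool {
	return c%2 == 1 && (a-1)%4 == 0
}

func main() {
	for _, a := range []uint64{5, 7, 9, 13} {
		fmt.Println(a, validLCG(a, 1), fullPeriod(a, 1, 16))
	}
}
```

For each multiplier the two columns should agree: the rule predicts exactly which generators the exhaustive walk finds to be full-period. Note that A = 9 passes even though A-1 is divisible by 8, since that extra condition is about quality, not period.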

If your operands are 64-bit integers, or larger, how are you going to generate a prime number?

Primes from Emacs Calc

Emacs Calc can solve this problem. I’ve noted before how featureful it is. It has arbitrary precision, random number generation, and primality testing. It’s everything we need to choose A. (In fact, this is nearly identical to the process I used to implement RSA.) For this example I’m going to generate a 64-bit LCG for the C programming language, but it’s easy to use whatever width you like and mostly whatever language you like. If you wanted a minimal standard 128-bit LCG, this will still work.

Start by opening up Calc with M-x calc, then:

1. Push 2 on the stack
2. Push 64 on the stack
3. Press ^, computing 2^64 and pushing it on the stack
4. Press k r to generate a random number in this range
5. Press d r 16 to switch to hexadecimal display
6. Press k n to find the next prime following the random value
7. Repeat step 6 until you get a number that ends with 5 or D
8. Press k p a few times to avoid false positives.

What’s left on the stack is your A! If you want a random value for C, you can follow a similar process. Heck, make it prime, too!

The reason for using hexadecimal (step 5) and looking for 5 or D (step 7) is that such numbers satisfy both of the important properties for A-1.

Calc doesn’t try to factor your random integer. Instead it uses the Miller–Rabin primality test, a probabilistic test that, itself, requires random numbers. It has false positives but no false negatives. The false positives can be mitigated by repeating the test multiple times, hence step 8.

Trying this all out right now, I got this implementation (in C):

uint64_t lcg1(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    return s;
}

However, we can still do a little better. Outputting the entire state doesn’t have great results, so instead it’s better to create a truncated LCG and only return some portion of the most significant bits.

uint32_t lcg2(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    return s >> 32;
}

This won’t quite pass BigCrush in 64-bit form, but the results are pretty reasonable for most purposes.

But we can still do better without needing to remember much more than this.

Appending permutation

A Permuted Congruential Generator (PCG) is really just a truncated LCG with a permutation applied to its output. Like LCGs themselves, there are arbitrarily many variations. The “official” implementation has a data-dependent shift, for which I can never remember the details. Fortunately a couple of simple, easy-to-remember transformations are sufficient. Basically anything I used while prospecting for hash functions. I love xorshifts, so let’s add one of those:

uint32_t pcg1(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    uint32_t r = s >> 32;
    r ^= r >> 16;
    return r;
}

This is a big improvement, but it still fails one BigCrush test. As they say, when xorshift isn’t enough, use xorshift-multiply! Below I generated a 32-bit prime for the multiply, but any odd integer is a valid permutation.

uint32_t pcg2(void)
{
    static uint64_t s = 0;
    s = s*UINT64_C(0x7c3c3267d015ceb5) + UINT64_C(0x24bd2d95276253a9);
    uint32_t r = s >> 32;
    r ^= r >> 16;
    r *= UINT32_C(0x60857ba9);
    return r;
}

This passes BigCrush, and I can reliably build a new one entirely from scratch using Calc any time I need it.
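To see why the xorshift-multiply step is a permutation, it helps to write its inverse. The sketch below is illustrative only: the names `permute`, `unpermute`, and `inverse` are made up, and the modular inverse is computed with Newton's iteration, which is one common way to do it and not something the article prescribes.

```go
package main

import "fmt"

const mul = 0x60857ba9 // the odd multiplier from pcg2

// permute is the output step of pcg2: xorshift, then multiply.
func permute(r uint32) uint32 {
	r ^= r >> 16
	r *= mul
	return r
}

// inverse returns the multiplicative inverse of odd a modulo 2^32.
// Each Newton round doubles the number of correct low bits.
func inverse(a uint32) uint32 {
	x := a // correct to 3 bits, since a*a = 1 (mod 8) for odd a
	for i := 0; i < 4; i++ {
		x *= 2 - a*x
	}
	return x
}

// unpermute undoes permute, so no two inputs share an output.
func unpermute(r uint32) uint32 {
	r *= inverse(mul)
	r ^= r >> 16 // a 16-bit xorshift on 32 bits is its own inverse
	return r
}

func main() {
	x := uint32(0x24bd2d95)
	y := permute(x)
	fmt.Printf("%08x -> %08x -> %08x\n", x, y, unpermute(y))
}
```

An even multiplier has no inverse modulo 2^32, which is why only odd integers qualify.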

Bonus: Adapting to other languages

Sometimes it’s not so straightforward to adapt this technique to other languages. For example, JavaScript has limited support for 32-bit integer operations (enough for a poor 32-bit LCG) and no 64-bit integer operations. Though BigInt is now a thing, and should make a great 96- or 128-bit LCG easy to build.

function lcg(seed) {
    let s = BigInt(seed);
    return function() {
        s *= 0xef725caa331524261b9646cdn;
        s += 0x213734f2c0c27c292d814385n;
        s &= 0xffffffffffffffffffffffffn;
        return Number(s >> 64n);
    }
}

Java doesn’t have unsigned integers, so how could you build the above PCG in Java? Easy! First, remember that Java has two’s complement semantics, including wrap around, and that two’s complement doesn’t care about unsigned or signed for multiplication (or addition, or subtraction). The result is identical. Second, the oft-forgotten >>> operator does an unsigned right shift. With these two tips:

long s = 0;

int pcg2() {
    s = s*0x7c3c3267d015ceb5L + 0x24bd2d95276253a9L;
    int r = (int)(s >>> 32);
    r ^= r >>> 16;
    r *= 0x60857ba9;
    return r;
}

So, in addition to the Calc step list above, you may need to know some of the finer details of your target language.

Keyringless GnuPG

2019-08-09T23:52:39Z

This article was discussed on Hacker News.

My favorite music player is Audacious. It follows the Winamp Classic tradition of not trying to manage my music library. Instead it waits patiently for me to throw files and directories at it. These selections will be informally grouped into transient, disposable playlists of whatever I fancy that day.

This matters to me because my music collection is the result of around 25 years of hoarding music files from various sources including CD rips, Napster P2P sharing, and, most recently, YouTube downloads. It’s not well-organized, but it’s organized well enough. Each album has its own directory, and related albums are sometimes grouped together under a directory for a particular artist.

Over the years I’ve tried various music players, and some have either wanted to manage this library or hide the underlying file-organized nature of my collection. Both situations are annoying because I really don’t want or need that abstraction. I’m going just fine thinking of my music library in terms of files, thank you very much. Same goes for ebooks.

GnuPG is like a media player that wants to manage your whole music library. Rather than MP3s, it’s crypto keys on a keyring. Nearly every operation requires keys that have been imported into the keyring. Until GnuPG 2.2.8 (June 2018), which added the --show-keys command, you couldn’t even be sure what you were importing until after it was already imported. Hopefully it wasn’t garbage.

GnuPG does have a pretty good excuse. It’s oriented around the Web of Trust model, and it can’t follow this model effectively without having all the keys at once. However, even if you don’t buy into the Web of Trust, the GnuPG interface still requires you to play by its rules. Sometimes I’ve got a message, a signature, and a public key and I just want to verify that they’re all consistent with each other, damnit.

$ gpg --import foo.asc
gpg: key 1A719EF63AEB2CFE: public key "foo" imported
gpg: Total number processed: 1
gpg:               imported: 1
$ gpg --verify --trust-model always message.txt.sig message.txt
gpg: Signature made Fri 09 Aug 2019 05:44:43 PM EDT
gpg:                using EDDSA key ...1A719EF63AEB2CFE
gpg: Good signature from "foo" [unknown]
gpg: WARNING: Using untrusted key!
$ gpg --batch --yes --delete-key 1A719EF63AEB2CFE

Three commands and seven lines of output when one of each would do. Plus there’s a false warning: Wouldn’t an “always” trust model mean that this key is indeed trusted?

Signify

Compare this to OpenBSD’s signify (also). There’s no keyring, and it’s up to the user — or the program shelling out to signify — to supply the appropriate key. It’s like the music player that just plays whatever I give it. Here’s a simplified usage overview:

signify -G [-c comment] -p pubkey -s seckey
signify -S [-x sigfile] -s seckey -m message
signify -V [-x sigfile] -p pubkey -m message

When generating a new keypair (-G), the user must choose the destination files for the public and secret keys. When signing a message (a file), the user must supply the secret key and the message. When verifying a file, the user must supply the public key and the message. This is a popular enough model that other, compatible implementations with the same interface have been developed.

Signify is deliberately incompatible with OpenPGP and uses its own simpler, and less featureful, format. Wouldn’t it be nice to have a similar interface to verify OpenPGP signatures?

SimpleGPG

Well, I thought so. So I put together a shell script that wraps GnuPG and provides such an interface:

SimpleGPG

The interface is nearly identical to signify, and the GnuPG keyring is hidden away as if it didn’t exist. The main difference is that the keys and signatures produced and consumed by this tool are fully compatible with OpenPGP. You could use this script without requiring anyone else to adopt something new or different.

To avoid touching your real keyring, the script creates a temporary keyring directory each time it’s run. The GnuPG option --homedir instructs it to use this temporary keyring and ignore the usual one. The temporary keyring is destroyed when the script exits. This is kind of clunky, but there’s no way around it.

Verification looks roughly like this in the script:

$ tmp=$(mktemp -d simplegpg-XXXXXX)
$ gpg --homedir $tmp
$ gpg --homedir $tmp --import foo.asc
$ gpg --homedir $tmp --verify message.txt.sig message.txt
$ rm -rf $tmp

Generating a key is trivial, and there’s only a prompt for the protection passphrase. Like signify, it will generate an Ed25519 key and all outputs are ASCII-armored.

$ simplegpg -G -p keyname.asc -s keyname.pgp
passphrase:
passphrase (confirm):

Since signify doesn’t have a concept of a user ID for a key, just an “untrusted comment”, the user ID is not emphasized here. The default user ID will be “simplegpg key”, so, if you plan to share the key with regular GnuPG users who will need to import it into a keyring, you probably want to use -c to give it a more informative name.

Unfortunately, due to GnuPG’s very limited, keyring-oriented interface, key generation is about three times slower than it should be. That’s because the passphrase is run through the String-to-Key (S2K) algorithm three times:

1. Immediately after the key is generated, the passphrase is converted to a key, the key is encrypted, and it’s put onto the temporary keyring.
2. When exporting, the key passphrase is again run through the S2K to get the protection key to decrypt it.
3. The export format uses a slightly different S2K algorithm, so this export S2K is now used to create yet another protection key.

Technically the second could be avoided since gpg-agent, which is always required, could be holding the secret key material. As far as I can tell, gpg-agent simply does not learn freshly-generated keys. I do not know why this is the case.

This is related to another issue. If you’re accustomed to GnuPG, you may notice that the passphrase prompt didn’t come from pinentry, a program specialized for passphrase prompts. GnuPG normally uses it for this. Instead, the script handles the passphrase prompt and passes the passphrase to GnuPG (via a file descriptor). This would not be necessary if gpg-agent did its job. Without this part of the script, users are prompted three times, via pinentry, for their passphrase when generating a key.

When signing messages, the passphrase prompt comes from pinentry since it’s initiated by GnuPG.

$ simplegpg -S -s keyname.pgp -m message.txt
passphrase:

This will produce message.txt.sig with an OpenPGP detached signature.

The passphrase prompt is for --import, not --detach-sign. As with key generation, the S2K is run more than necessary: twice instead of once. First to generate the decryption key, then a second time to generate a different encryption key for the keyring since the export format and keyring use different algorithms. Ugh.

But at least gpg-agent does its job this time, so only one passphrase prompt is necessary. In general, a downside of these temporary keyrings is that gpg-agent treats each as different keys, and you will need to enter your passphrase once for each message signed. Just like signify.

Verification, of course, requires no prompting and no S2K.

$ simplegpg -V -p keyname.asc -m message.txt

That’s all there is to keyringless OpenPGP signatures. Since I’m not interested in the Web of Trust or keyservers, I wish GnuPG was more friendly to this model of operation.

passphrase2pgp

I mentioned that SimpleGPG is fully compatible with other OpenPGP systems. This includes my own passphrase2pgp, where your secret key is stored only in your brain. No need for a secret key file. In the time since I first wrote about it, passphrase2pgp has gained the ability to produce signatures itself!

I’ve got my environment set up — $REALNAME, $EMAIL, and $KEYID per the README — so I don’t need to supply a user ID argument, nor will I be prompted to confirm my passphrase since it’s checked against a known fingerprint. Generating the public key, for sharing, looks like this:

$ passphrase2pgp -K --armor --public >keyname.asc

Or just:

$ passphrase2pgp -ap >keyname.asc

Like with signify and SimpleGPG, to sign a message I’m prompted for my passphrase. It takes longer since the “S2K” here is much stronger by necessity. The passphrase is used to generate the secret key, then from that the signature on the message:

$ passphrase2pgp -S message.txt

For the SimpleGPG user on the other side it all looks the same as before:

$ simplegpg -V -p keyname.asc -m message.txt

I’m probably going to start signing my open source software releases, and this is how I intend to do it.

The Long Key ID Collider

2019-07-22T21:27:02Z

Over the last couple weeks I’ve spent a lot more time working with OpenPGP keys. It’s a consequence of polishing my passphrase-derived PGP key generator. I’ve tightened up the internals, and it’s enabled me to explore the corners of the format, try interesting things, and observe how various OpenPGP implementations respond to weird inputs.

For one particularly cool trick, take a look at these two (private) keys I generated yesterday. Here’s the first:

And the second:

Concatenate these and then import them into GnuPG to have a look at them. To avoid littering in your actual keyring, especially with private keys, use the --homedir option to set up a temporary keyring. I’m going to omit that option in the examples.

$ gpg --import < keys.asc
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: Total number processed: 2
gpg:               imported: 2
gpg:       secret keys read: 2
gpg:   secret keys imported: 2

The user ID is “nullprogram.com” since I made these and that’s me taking credit. “992A5EEE1D1049FA” is called the long key ID: a 64-bit value that identifies the key. It’s the lowest 64 bits of the full key ID, a 160-bit SHA-1 hash. In the old days everyone used a short key ID to identify keys, which was the lowest 32 bits of the key. For these keys, that would be “1D1049FA”. However, this was deemed way too short, and everyone has since switched to long key IDs, or even the full 160-bit key ID.

The key ID is nothing more than a SHA-1 hash of the key creation date — unsigned 32-bit unix epoch seconds — and the public key material. So secret keys have the same key ID as their associated public key. This makes sense since they’re a key pair and they go together.

Look closely and you’ll notice that both keypairs have the same long key ID. If you hadn’t already guessed from the title of this article, these are two different keys with the same long key ID. In other words, I’ve created a long key ID collision. The GnuPG --list-keys command prints the full key ID since it’s so important:

$ gpg --list-keys
---------------------
pub   ed25519 2019-07-22 [SCA]
      A422F8B0E1BF89802521ECB2992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

pub   ed25519 2019-07-22 [SCA]
      F43BC80C4FC2603904E7BE02992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

I was only targeting the lower 64 bits, but I actually managed to collide the lowest 68 bits by chance. So a long key ID still isn’t enough to truly identify any particular key.

This isn’t news, of course. Nor am I the first person to create a long key ID collision. In 2013, David Leon Gil published a long key ID collision for two 4096-bit RSA public keys. However, that is the only other example I was able to find. He did not include the private keys and did not elaborate on how he did it. I know he did generate viable keys, not just garbage for the public key portions, since they’re both self-signed.

Creating these keys was trickier than I had anticipated, and there’s an old, clever trick that makes it work. Building atop the work I did for passphrase2pgp, I created a standalone tool that will create a long key ID collision and print the two keypairs to standard output:

https://github.com/skeeto/pgpcollider

Example usage:

$ go get -u github.com/skeeto/pgpcollider
$ pgpcollider --verbose > keys.asc

This can take up to a day to complete when run like this. The tool can optionally coordinate many machines (see the --server / -S and --client / -C options) to work together, greatly reducing the total time. It took around 4 hours to create the keys above on a single machine, generating around 1 billion extra keys in the process. As discussed below, I actually got lucky that it only took 1 billion. If you modify the program to do short key ID collisions, it only takes a few seconds.

The rest of this article is about how it works.

Birthday Attacks

An important detail is that this technique doesn’t target any specific key ID. Cloning someone’s long key ID is still very expensive. No, this is a birthday attack. To find a collision in a space of 2^64, on average I only need to generate 2^32 samples — the square root of that space. That’s perfectly feasible on a regular desktop computer. To collide long key IDs, I need only generate about 4 billion IDs and efficiently do membership tests on that set as I go.

That last step is easier said than done. Naively, that might look like this (pseudo-code):

seen := map of long key IDs to keys
loop forever {
    key := generateKey()
    longID := key.ID[12:20]
    if longID in seen {
        output seen[longID]
        output key
        break
    } else {
        seen[longID] = key
    }
}

Consider the size of that map. Each long ID is 8 bytes, and we expect to store around 2^32 of them. That’s at minimum 32 GB of storage just to track all the long IDs. The map itself is going to have some overhead, too. Since these are literally random lookups, this all mostly needs to be in RAM or else lookups are going to be very slow and impractical.

And I haven’t even counted the keys yet. As a saving grace, these are Ed25519 keys, so that’s 32 bytes for the public key and 32 bytes for the private key, which I’ll need if I want to make a self-signature. (The signature itself will be larger than the secret key.) That’s around 256GB more storage, though at least this can be stored on the hard drive. However, to address these from the map I’d need at least 38 bits, plus some more in case it goes over. Just call it another 8 bytes.

So that’s, at a bare minimum, 64GB of RAM plus 256GB of other storage. Since nothing is ideal, we’ll need more than this. This is all still feasible, but will require expensive hardware. We can do a lot better.
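At the scale of short key IDs the naive map is perfectly workable, which makes it a convenient way to watch a birthday attack succeed. The following is a toy sketch, not pgpcollider: `shortID` is a made-up stand-in that hashes an 8-byte seed with SHA-1 and keeps the lowest 32 bits, where the real thing would generate a key and take its ID.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// shortID stands in for "generate a key and take its short ID":
// the lowest 32 bits of the SHA-1 hash of an 8-byte seed.
func shortID(seed uint64) uint32 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], seed)
	sum := sha1.Sum(buf[:])
	return binary.BigEndian.Uint32(sum[16:])
}

// collide returns two distinct seeds with the same shortID, plus
// the number of IDs generated along the way.
func collide() (a, b uint64, n int) {
	seen := make(map[uint32]uint64)
	for seed := uint64(0); ; seed++ {
		id := shortID(seed)
		if prev, ok := seen[id]; ok {
			return prev, seed, len(seen) + 1
		}
		seen[id] = seed
	}
}

func main() {
	a, b, n := collide()
	fmt.Printf("seeds %d and %d share ID %08x after %d IDs\n",
		a, b, shortID(a), n)
}
```

The count should land in the neighborhood of 2^16, the square root of the 32-bit space, rather than anywhere near 2^32.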

Keys from seeds

The first thing you might notice is that we can jettison that 256GB of storage by being a little more clever about how we generate keys. Since we don’t actually care about the security of these keys, we can generate each key from a seed much smaller than the key itself. Instead of using 8 bytes to reference a key in storage, just use those 8 bytes to store the seed used to make the key.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    key := generateKey(seed)
    longID := key.ID[12:20]
    if longID in seen {
        output generateKey(seen[longID])
        output key
        break
    } else {
        seen[longID] = seed
    }
}

I’m incrementing a counter to generate the seeds because I don’t want the birthday paradox to apply to my seeds. Each really must be unique. I’m using SplitMix64 for the PRNG since I learned it’s the fastest for Go, so a simple increment to generate seeds is perfectly fine.

Ultimately, this still uses utterly excessive amounts of memory. Wouldn’t it be crazy if we could somehow get this 64GB map down to just a few MBs of RAM? Well, we can!

Rainbow tables

For decades, password crackers have faced a similar problem. They want to precompute the hashes for billions of popular passwords so that they can efficiently reverse those password hashes later. However, storing all those hashes would be unnecessarily expensive, or even infeasible.

So they don’t. Instead they use rainbow tables. Password hashes are chained together into a hash chain, where a password hash leads to a new password, then to a hash, and so on. Then only store the beginning and the end of each chain.

To look up a hash in the rainbow table, run the hash chain algorithm starting from the target hash and, for each hash, check if it matches the end of one of the chains. If so, recompute that chain and note the step just before the target hash value. That’s the corresponding password.

For example, suppose the password “foo” hashes to 9bfe98eb, and we have a reduction function that maps a hash to some password. In this case, it maps 9bfe98eb to “bar”. A trivial reduction function could just be an index into a list of passwords. A hash chain starting from “foo” might look like this:

foo -> 9bfe98eb -> bar -> 27af0841 -> baz -> d9d4bbcb

In reality a chain would be a lot longer. Another chain starting from “apple” might look like this:

apple -> 7bbc06bc -> candle -> 82a46a63 -> dog -> 98c85d0a

We only store the tuples (foo, d9d4bbcb) and (apple, 98c85d0a) in our database. If the chains had been one million hashes long, we’d still only store those two tuples. That’s literally a 1:1000000 compression ratio!

Later on we’re faced with reversing the hash 27af0841, which isn’t listed directly in the database. So we run the chain forward from that hash until either we hit the maximum chain length (i.e. password not in the table), or we recognize a hash:

27af0841 -> baz -> d9d4bbcb

That d9d4bbcb hash is listed as being in the “foo” hash chain. So we regenerate that hash chain to discover that “bar” leads to 27af0841. Password cracked!
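The whole scheme fits in a short toy program. Everything here is an assumption made for illustration: the password space is the 10,000 strings pw0000 through pw9999, the hash is SHA-1 truncated to 64 bits, the reduction function is a modulo, and chains have a fixed length. Real rainbow tables also vary the reduction function per step, which this sketch skips.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

const chainLen = 100 // hashes per chain
const space = 10000  // number of candidate passwords

func hash(pw string) uint64 {
	sum := sha1.Sum([]byte(pw))
	return binary.BigEndian.Uint64(sum[:8])
}

// reduce maps a hash back to some candidate password.
func reduce(h uint64) string {
	return fmt.Sprintf("pw%04d", h%space)
}

// build keeps only the start of each chain, keyed by its last hash.
func build(starts []string) map[uint64][]string {
	table := make(map[uint64][]string)
	for _, start := range starts {
		h := hash(start)
		for i := 1; i < chainLen; i++ {
			h = hash(reduce(h))
		}
		table[h] = append(table[h], start)
	}
	return table
}

// lookup walks forward from target until it reaches a stored chain
// end, then replays that chain to find the password before target.
func lookup(table map[uint64][]string, target uint64) (string, bool) {
	h := target
	for i := 0; i < chainLen; i++ {
		for _, start := range table[h] {
			pw := start
			for j := 0; j < chainLen; j++ {
				next := hash(pw)
				if next == target {
					return pw, true
				}
				pw = reduce(next)
			}
		}
		h = hash(reduce(h))
	}
	return "", false
}

func main() {
	var starts []string
	for i := 0; i < 200; i++ {
		starts = append(starts, reduce(uint64(i)))
	}
	table := build(starts)

	// Take a password from the middle of one chain and crack its hash.
	pw := starts[7]
	for i := 0; i < 50; i++ {
		pw = reduce(hash(pw))
	}
	found, ok := lookup(table, hash(pw))
	fmt.Println(pw, found, ok)
}
```

The table holds at most 200 entries yet can answer for up to 20,000 hashes, minus whatever is lost to chains that merge.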

Collider rainbow table

My collider works very similarly. A hash chain works like this: Start with a 64-bit seed as before, generate a key, get the long key ID, then use the long key ID as the seed for the next key.

There’s one big difference. In the rainbow table the purpose is to run the hash function backwards by looking at the previous step in the chain. For the collider, I want to know if any of the hash chains collide. So long as each chain starts from a unique seed, it would mean we’ve found two different seeds that lead to the same long key ID.

Alternatively, it could be two different seeds that lead to the same key, which wouldn’t be useful, but that’s trivial to avoid.

A simple and efficient way to check if two chains contain the same sequence is to stop them at the same place in that sequence. Rather than run the hash chains for some fixed number of steps, they stop when they reach a distinguishing point. In my collider a distinguishing point is where the long key ID ends with at least N 0 bits, where N determines the average chain length. I chose 17 bits.

func computeChain(seed) {
    loop forever {
        key := generateKey(seed)
        longID := key.ID[12:20]
        if distinguished(longID) {
            return longID
        }
        seed = longID
    }
}

If two different hash chains end on the same distinguishing point, they’re guaranteed to have collided somewhere in the middle.

To determine where two chains collided, regenerate each chain and find the first long key ID that they have in common. The keys from the step just before it, one in each chain, are the colliding keys.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    longID := computeChain(seed)
    if longID in seen {
        output findCollision(seed, seen[longID])
        break
    } else {
        seen[longID] = seed
    }
}

Hash chain computation is embarrassingly parallel, so the load can be spread efficiently across CPU cores. With these rainbow(-like) tables, my tool can generate and track billions of keys in mere megabytes of memory. The additional computational cost is the time it takes to generate a couple more chains than otherwise necessary.
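The pieces above can be assembled into a small runnable sketch. It is scaled down and simplified, so treat it as an approximation of the idea and not the pgpcollider implementation: `id` is a made-up stand-in that hashes the seed with SHA-1 and keeps the lowest 32 bits, distinguished points use 8 zero bits instead of 17, and `findCollision` is one possible way to locate the merge point (align the chains by length, then step them together).

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

const zeroBits = 8 // distinguished IDs end with this many 0 bits

// id stands in for "generate a key from the seed, take its ID".
func id(seed uint64) uint64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], seed)
	sum := sha1.Sum(buf[:])
	return uint64(binary.BigEndian.Uint32(sum[16:]))
}

type chain struct {
	seed  uint64 // starting seed
	steps int    // IDs generated to reach the end
}

// computeChain follows IDs until one is distinguished. It gives up
// (ok == false) on the rare chain that loops without reaching one.
func computeChain(seed uint64) (end uint64, steps int, ok bool) {
	for steps = 1; steps < 1<<16; steps++ {
		seed = id(seed)
		if seed&(1<<zeroBits-1) == 0 {
			return seed, steps, true
		}
	}
	return 0, 0, false
}

// findCollision aligns two chains sharing an end point, then steps
// them together until their next IDs match.
func findCollision(a, b chain) (uint64, uint64) {
	for ; a.steps > b.steps; a.steps-- {
		a.seed = id(a.seed)
	}
	for ; b.steps > a.steps; b.steps-- {
		b.seed = id(b.seed)
	}
	for id(a.seed) != id(b.seed) {
		a.seed, b.seed = id(a.seed), id(b.seed)
	}
	return a.seed, b.seed
}

func collide() (uint64, uint64) {
	seen := make(map[uint64]chain)
	// Starting seeds have the top bit set, so none can equal a
	// 32-bit ID and no chain can begin partway along another.
	for seed := uint64(1) << 63; ; seed++ {
		end, steps, ok := computeChain(seed)
		if !ok {
			continue
		}
		if prev, hit := seen[end]; hit {
			return findCollision(prev, chain{seed, steps})
		}
		seen[end] = chain{seed, steps}
	}
}

func main() {
	a, b := collide()
	fmt.Printf("%x and %x both map to %08x\n", a, b, id(a))
}
```

The map only ever holds one entry per chain, so with chains averaging 2^8 steps it stays a few hundred entries long while tens of thousands of IDs are examined.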

Predictable, Passphrase-Derived PGP Keys

2019-07-10T04:18:29Z

tl;dr: passphrase2pgp.

One of my long-term concerns has been losing my core cryptographic keys, or just not having access to them when I need them. I keep my important data backed up, and if that data is private then I store it encrypted. My keys are private, but how am I supposed to encrypt them? The chicken or the egg?

The OpenPGP solution is to (optionally) encrypt secret keys using a key derived from a passphrase. GnuPG prompts the user for this passphrase when generating keys and when using secret keys. This protects the keys at rest, and, with some caution, they can be included as part of regular backups. The OpenPGP specification, RFC 4880, has many options for deriving a key from this passphrase, called String-to-Key, or S2K, algorithms. None of the options are great.

In 2012, I selected the strongest S2K configuration at the time and, along with a very strong passphrase, put my GnuPG keyring on the internet as part of my public dotfiles repository. It was a kind of super-backup that would guarantee their availability anywhere I’d need them.

My timing was bad because, with the release of GnuPG 2.1 in 2014, GnuPG fundamentally changed its secret keyring format. S2K options are now (quietly!) ignored when deriving the protection keys. Instead it auto-calibrates to much weaker settings. With this new version of GnuPG, I could no longer update the keyring in my dotfiles repository without significantly downgrading its protection.

By 2017 I was pretty irritated with the whole situation. I let my OpenPGP keys expire, and then I wrote my own tool to replace the only feature of GnuPG I was actively using: encrypting my backups with asymmetric encryption. One of its core features is that the asymmetric keypair can be derived from a passphrase using a memory-hard key derivation function (KDF). Attackers must commit a significant quantity of memory (expensive) when attempting to crack the passphrase, making the passphrase that much more effective.

Since the asymmetric keys themselves, not just the keys protecting them, are derived from a passphrase, I never need to back them up! They’re also always available whenever I need them. My keys are essentially stored entirely in my brain as if I was a character in a William Gibson story.

Tackling OpenPGP key generation

At the time I had expressed my interest in having this feature for OpenPGP keys. It’s something I’ve wanted for a long time. I first took a crack at it in 2013 (now the old-version branch) for generating RSA keys. RSA isn’t that complicated but it’s very easy to screw up. Since I was rolling it from scratch, I didn’t really trust myself not to subtly get it wrong. Plus I never figured out how to self-sign the key. GnuPG doesn’t accept secret keys that aren’t self-signed, so it was never useful.

I took another crack at it in 2018 with a much more brute force approach. When a program needs to generate keys, it will either read from /dev/u?random or, on more modern systems, call getentropy(3). These are all ultimately system calls, and I know how to intercept those with Ptrace. If I want to control key generation for any program, not just GnuPG, I could intercept these inputs and replace them with the output of a CSPRNG keyed by a passphrase.

Keyed: Linux Entropy Interception

In practice this doesn’t work at all. Real programs like GnuPG and OpenSSH’s ssh-keygen don’t rely solely on these entropy inputs. They also grab entropy from other places, like getpid(2), gettimeofday(2), and even extract their own scheduler and execution time noise. Without modifying these programs I couldn’t realistically control their key generation.

Besides, even if it did work, it would still be fragile and unreliable since these programs could always change how they use the inputs. So, ultimately, it was more of an experiment than something practical.

passphrase2pgp

For regular readers, it’s probably obvious that I recently learned Go. While searching for good project ideas for cutting my teeth, I noticed that Go’s “extended” standard library has a lot of useful cryptographic support, so the idea of generating the keys myself may be worth revisiting.

Something else also happened since my previous attempt: The OpenPGP ecosystem now has widespread support for elliptic curve cryptography. So instead of RSA, I could generate a Curve25519 keypair, which, by design, is basically impossible to screw up. Not only would I be generating keys on my own terms, I’d be doing it in style, baby.

There are two different ways to use Curve25519:

Digital signatures: Ed25519 (EdDSA)
Diffie–Hellman (encryption): X25519 (ECDH)

In GnuPG terms, the first would be a “sign only” key and the second is an “encrypt only” key. But can’t you usually do both after you generate a new OpenPGP key? If you’ve used GnuPG, you’ve probably seen the terms “primary key” and “subkey”, but you probably haven’t had to think about them since it’s all usually automated.

The primary key is the one associated directly with your identity. It’s always a signature key. The OpenPGP specification says this is a signature key only by convention, but, practically speaking, it really must be since signatures are what hold everything together. Like packaging tape.

If you want to use encryption, independently generate an encryption key, then sign that key with the primary key, binding that key as a subkey to the primary key. This all happens automatically with GnuPG.

Fun fact: Two different primary keys can have the same subkey. Anyone could even bind any of your subkeys to their primary key! They only need to sign the public key! Though, of course, they couldn’t actually use your key since they’d lack the secret key. It would just be really confusing, and could, perhaps in certain situations, even cause some OpenPGP clients to malfunction. (Note to self: This demands investigation!)

It’s also possible to have signature subkeys. What good is that? Paranoid folks will keep their primary key only on a secure, air-gapped system, then use only subkeys on regular systems. The subkeys can be revoked and replaced independently of the primary key if something were to go wrong.

In Go, generating an X25519 key pair is this simple (yes, it actually takes array pointers, which is rather weird):

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	var seckey, pubkey [32]byte
	rand.Read(seckey[:]) // FIXME: check for error
	seckey[0] &= 248
	seckey[31] &= 127
	seckey[31] |= 64
	curve25519.ScalarBaseMult(&pubkey, &seckey)
	fmt.Printf("pub %x\n", pubkey[:])
	fmt.Printf("sec %x\n", seckey[:])
}

The three bitwise operations are optional since it will do these internally, but it ensures that the secret key is in its canonical form. The actual Diffie–Hellman exchange requires just one more function call: curve25519.ScalarMult().

For Ed25519, the API is higher-level:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/ed25519"
)

func main() {
	seed := make([]byte, ed25519.SeedSize)
	rand.Read(seed) // FIXME: check for error
	key := ed25519.NewKeyFromSeed(seed)
	fmt.Printf("pub %x\n", key[32:])
	fmt.Printf("sec %x\n", key[:32])
}

Signing a message with this key is just one function call: ed25519.Sign().

Unfortunately that’s the easy part. The other 400 lines of the real program are concerned only with encoding these values in the complex OpenPGP format. That’s the hard part. GnuPG’s --list-packets option was really useful for debugging this part.

OpenPGP specification

(Feel free to skip this section if the OpenPGP wire format isn’t interesting to you.)

Following the specification was a real challenge, especially since many of the details for Curve25519 only appear in still incomplete (and still erroneous) updates to the specification. I certainly don’t envy the people who have to parse arbitrary OpenPGP packets. It’s finicky and has arbitrary parts that don’t seem to serve any purpose, such as redundant prefix and suffix bytes on signature inputs. Fortunately I only had to worry about the subset that represents an unencrypted secret key export.

OpenPGP data is broken up into packets. Each packet begins with a tag identifying its type, followed by a length, which is itself encoded with a variable number of bytes. All the packets produced by passphrase2pgp are short, so I could pretend lengths were all a single byte long.
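As a rough illustration of the single-byte shortcut, here is a minimal sketch of a "new format" packet header as described in RFC 4880 section 4.2. Whether passphrase2pgp emits old-format or new-format headers isn't stated here, so the function name and the choice of format are assumptions made for the example.

```c
#include <stddef.h>

enum {  /* packet tags from RFC 4880 section 4.3 */
    TAG_SIGNATURE     = 2,
    TAG_SECRET_KEY    = 5,
    TAG_SECRET_SUBKEY = 7,
    TAG_USER_ID       = 13
};

/* Write a new-format packet header for a body shorter than 192 bytes,
 * the range where the length fits in a single byte. Returns the header
 * size (2), or 0 if the tag is invalid or the body needs a longer
 * length encoding. (Hypothetical helper, not from passphrase2pgp.) */
size_t
packet_header(unsigned char out[2], int tag, size_t bodylen)
{
    if (tag < 0 || tag > 63 || bodylen >= 192)
        return 0;
    out[0] = (unsigned char)(0xc0 | tag);  /* 0b11tttttt: new format */
    out[1] = (unsigned char)bodylen;       /* one-byte length */
    return 2;
}
```

A User ID packet holding a 3-byte string would start with the bytes 0xcd 0x03, followed by the string itself.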

For a secret key export with one subkey, we need the following packets in this order:

1. Secret-Key: Public-Key packet with secret key appended
2. User ID: just a length-prefixed, UTF-8 string
3. Signature: binds Public-Key packet (1) and User ID packet (2)
4. Secret-Subkey: Public-Subkey packet with secret subkey appended
5. Signature: binds Public-Key packet (1) and Public-Subkey packet (4)

A Public-Key packet contains the creation date, key type, and public key data. A Secret-Key packet is the same, but with the secret key literally appended on the end and a different tag. The Key ID is (essentially) a SHA-1 hash of the Public-Key packet, meaning the creation date is part of the Key ID. That’s important for later.
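To make the role of the creation date concrete, here is a rough sketch of a version 4 Public-Key packet body for an Ed25519 key, based on a reading of RFC 4880 and the draft updates that add EdDSA (algorithm 22). The function name is hypothetical and the layout is not taken from passphrase2pgp's source. The four creation-date bytes sit right behind the version byte, so they are part of whatever gets hashed into the Key ID.

```c
#include <stddef.h>
#include <string.h>

/* Write a v4 Public-Key packet body for an Ed25519 key into out, which
 * must have room for 51 bytes. Returns the number of bytes written.
 * Layout assumed from RFC 4880 plus the EdDSA draft: a sketch only. */
size_t
ed25519_pubkey_body(unsigned char *out, unsigned long created,
                    const unsigned char pubkey[32])
{
    static const unsigned char oid[] = {  /* 1.3.6.1.4.1.11591.15.1 */
        0x2b, 0x06, 0x01, 0x04, 0x01, 0xda, 0x47, 0x0f, 0x01
    };
    size_t n = 0;
    out[n++] = 4;                               /* packet version */
    out[n++] = (unsigned char)(created >> 24);  /* creation date, */
    out[n++] = (unsigned char)(created >> 16);  /* big endian     */
    out[n++] = (unsigned char)(created >>  8);
    out[n++] = (unsigned char)(created >>  0);
    out[n++] = 22;                              /* algorithm: EdDSA */
    out[n++] = (unsigned char)sizeof(oid);      /* curve OID length */
    memcpy(out + n, oid, sizeof(oid));
    n += sizeof(oid);
    out[n++] = 0x01;                            /* MPI bit count: 263 */
    out[n++] = 0x07;
    out[n++] = 0x40;                            /* prefix byte for the point */
    memcpy(out + n, pubkey, 32);
    n += 32;
    return n;
}
```

Calling it twice with the same public key but different `created` values produces different bodies, and therefore different hashes.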

I had wondered if the SHAttered attack could be used to create two different keys with the same full Key ID. However, there’s no slack space anywhere in the input, so I doubt it.

User IDs are usually a RFC 2822 name and email address, but that’s only convention. It can literally be an empty string, though that wouldn’t be useful. OpenPGP clients that require anything more than an empty string, such as GnuPG during key generation, are adding artificial restrictions.

The first Signature packet indicates the signature date, the signature issuer’s Key ID, and then optional metadata about how the primary key is to be used and the capabilities of the key owner’s client. The signature itself is formed by appending the Public-Key packet portion of the Secret-Key packet, the User ID packet, and the previously described contents of the signature packet. The concatenation is hashed, the hash is signed, and the signature is appended to the packet. Since the options are included in the signature, they can’t be changed by another person.
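The concatenation is where the "redundant prefix and suffix bytes" show up. The following is a sketch of the hash input for a version 4 User ID certification, following a reading of RFC 4880 section 5.2.4. The function and parameter names are made up for the example, and the hashing and signing steps are left out.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes from src at offset len, returning the new length. */
static size_t
put(unsigned char *out, size_t len, const void *src, size_t n)
{
    memcpy(out + len, src, n);
    return len + n;
}

/* Build the bytes that get hashed for a v4 User ID certification.
 *   key: Public-Key packet body, without its packet header
 *   uid: User ID bytes
 *   sig: Signature packet fields from the version byte through the
 *        end of the hashed subpackets
 * out needs room for keylen + uidlen + siglen + 14 bytes. */
size_t
cert_hash_input(unsigned char *out,
                const unsigned char *key, size_t keylen,
                const unsigned char *uid, size_t uidlen,
                const unsigned char *sig, size_t siglen)
{
    unsigned char keyhdr[3] = {
        0x99, (unsigned char)(keylen >> 8), (unsigned char)keylen
    };
    unsigned char uidhdr[5] = {
        0xb4,
        (unsigned char)(uidlen >> 24), (unsigned char)(uidlen >> 16),
        (unsigned char)(uidlen >>  8), (unsigned char)(uidlen >>  0)
    };
    unsigned char trailer[6] = {
        0x04, 0xff,
        (unsigned char)(siglen >> 24), (unsigned char)(siglen >> 16),
        (unsigned char)(siglen >>  8), (unsigned char)(siglen >>  0)
    };
    size_t len = 0;
    len = put(out, len, keyhdr, sizeof(keyhdr));
    len = put(out, len, key, keylen);
    len = put(out, len, uidhdr, sizeof(uidhdr));
    len = put(out, len, uid, uidlen);
    len = put(out, len, sig, siglen);
    len = put(out, len, trailer, sizeof(trailer));
    return len;
}
```

Note that the secret key never enters this buffer, which is consistent with the signatures surviving a public key export untouched.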

In theory the signature is redundant. A client could accept the Secret-Key packet and User ID packet and consider the key imported. It would then create its own self-signature since it has everything it needs. However, my primary target for passphrase2pgp is GnuPG, and it will not accept secret keys that are not self-signed.

The Secret-Subkey packet is exactly the same as the Secret-Key packet except that it uses a different tag to indicate it’s a subkey.

The second Signature packet is constructed the same as the previous signature packet. However, it signs the concatenation of the Public-Key and Public-Subkey packets, binding the subkey to that primary key. This key may similarly have its own options.

To create a public key export from this input, a client need only chop off the secret keys and fix up the packet tags and lengths. The signatures remain untouched since they didn’t include the secret keys. That’s essentially what other people will receive about your key.

If someone else were to create a Signature packet binding your Public-Subkey packet with their Public-Key packet, they could set their own options on their version of the key. So my question is: Do clients properly track this separate set of options and separate owner for the same key? If not, they have a problem!

The format may not sound so complex from this description, but there are a ton of little details that all need to be correct. To make matters worse, the feedback is usually just a binary “valid” or “invalid”. The world could use an OpenPGP format debugger.

Usage

There is one required argument: either --uid (-u) or --load (-l). The former specifies a User ID since a key with an empty User ID is pretty useless. It’s my own artificial restriction on the User ID. The latter loads a previously-generated key which will come with a User ID.

To generate a key for use in GnuPG, just pipe the output straight into GnuPG:

$ passphrase2pgp --uid "Foo " | gpg --import

You will be prompted for a passphrase. That passphrase is run through Argon2id, a memory-hard KDF, with the User ID as the salt. Deriving the key requires 8 passes over 1GB of state, which takes my current computers around 8 seconds. With the --paranoid (-x) option enabled, that becomes 16 passes over 2GB (perhaps not paranoid enough?). The output is 64 bytes: 32 bytes to seed the primary key and 32 bytes to seed the subkey.

Despite the aggressive KDF settings, you will still need to choose a strong passphrase. Anyone who has your public key can mount an offline attack. A 10-word Diceware or Pokerware passphrase is more than sufficient (~128 bits) while also being quite reasonable to memorize.
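The ~128 bits figure can be checked with a little arithmetic. This sketch assumes the standard Diceware list of 7,776 words (five dice rolls per word, 6^5) and words picked uniformly at random. Pokerware is left out since its list size isn't stated here. The helper name is made up, and it avoids the math library by tracking the exponent by hand.

```c
/* Whole bits of entropy in a passphrase of `words` words drawn
 * uniformly from a list of `listsize` words, i.e. the floor of
 * words * log2(listsize). */
int
passphrase_bits(int words, int listsize)
{
    int bits = 0;
    double x = 1.0;  /* kept in [1, 2); bits counts the halvings */
    for (int i = 0; i < words; i++) {
        x *= listsize;
        while (x >= 2.0) {
            x /= 2.0;
            bits++;
        }
    }
    return bits;
}
```

For ten Diceware words, `passphrase_bits(10, 7776)` comes out to 129, since each word contributes log2(7776), or about 12.9 bits.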

Since the User ID is the salt, an attacker couldn’t build a single rainbow table to attack passphrases for different people. (Though your passphrase really should be strong enough that this won’t matter!) The cost is that you’ll need to use exactly the same User ID again to reproduce the key. In theory you could change the User ID afterward to whatever you like without affecting the Key ID, though it will require a new self-signature.

The keys are not encrypted (no S2K), and there are few options you can choose when generating the keys. If you want to change any of this, use GnuPG’s --edit-key tool after importing. For example, to set a protection passphrase:

$ gpg --edit-key Foo
gpg> passwd

There’s a lot that can be configured from this interface.

If you just need the public key to publish or share, the --public (-p) option will suppress the private parts and output only a public key. It works well in combination with ASCII armor, --armor (-a). For example, to put your public key on the clipboard:

$ passphrase2pgp -u '...' -ap | xclip

The tool can create detached signatures (--sign, -S) entirely on its own, too, so you don’t need to import the keys into GnuPG just to make signatures:

$ passphrase2pgp --sign --uid '...' program.exe

This would create a file named program.exe.sig with the detached signature, ready to be verified by another OpenPGP implementation. In fact, you can hook it directly up to Git for signing your tags and commits without GnuPG:

$ git config --global gpg.program passphrase2pgp

This only works for signing, and it cannot verify (verify-tag or verify-commit).

It’s pretty tedious to enter the --uid option all the time, so, if omitted, passphrase2pgp will infer the User ID from the environment variables REALNAME and EMAIL. Combined with the KEYID environment variable (see the README for details), you can easily get away with never storing your keys: only generate them on demand when needed.

That’s how I intend to use passphrase2pgp. When I want to sign a file, I’ll only need one option, one passphrase prompt, and a few seconds of patience:

$ passphrase2pgp -S path/to/file

January 1, 1970

The first time you run the tool you might notice one offensive aspect of its output: Your key will be dated January 1, 1970 — i.e. unix epoch zero. This predates PGP itself by more than two decades, so it might alarm people who receive your key.

Why do this? As I noted before, the creation date is part of the Key ID. Use a different date, and, as far as OpenPGP is concerned, you have a different key. Since users probably don’t want to remember a specific datetime, at seconds resolution, in addition to their passphrase, passphrase2pgp uses the same hard-coded date by default. A date of January 1, 1970 is like NULL in a database: no data.

If you don’t like this, you can override it with the --time (-t) or --now (-n) options, but it’s up to you to remain consistent.

Vanity Keys

If you’re interested in vanity keys — e.g. where the Key ID spells out words or looks unusual — it wouldn’t take much work to hack up the passphrase2pgp source into generating your preferred vanity keys. It would easily beat anything else I could find online.

Reconsidering limited OpenPGP

Initially my intention was never to output an encryption subkey, and passphrase2pgp would only be useful for signatures. By default it still only produces a sign key, but you can still get an encryption subkey with the --subkey (-s) option. I figured it might be useful to generate an encryption key, even if it’s not output by default. Users can always ask for it later if they have a need for it.

Why only a signing key? Nobody should be using OpenPGP for encryption anymore. Use better tools instead and retire the 20th century cryptography. If you don’t have an encryption subkey, nobody can send you OpenPGP-encrypted messages.

In contrast, OpenPGP signatures are still kind of useful and lack a practical alternative. The Web of Trust failed to reach critical mass, but that doesn’t seem to matter much in practice. Important OpenPGP keys can be bootstrapped off TLS by strategically publishing them on HTTPS servers. Keybase.io has done interesting things in this area.

Further, GitHub officially supports OpenPGP signatures, and I believe GitLab does too. This is another way to establish trust for a key. IMHO, there’s generally too much emphasis on binding a person’s legal identity to their OpenPGP key (e.g. the idea behind key-signing parties). I suppose that’s useful for holding a person legally accountable if they do something wrong. I’d prefer to trust a key that has an established history of valuable community contributions, even if done so only under a pseudonym.

So sometime in the future I may again advertise an OpenPGP public key. If I do, those keys would certainly be generated with passphrase2pgp. I may not even store the secret keys on a keyring, and instead generate them on the fly only when I occasionally need them.

Looking for Entropy in All the Wrong Places

2019-04-30T22:50:09Z

Imagine we’re writing a C program and we need some random numbers. Maybe it’s for a game, or for a Monte Carlo simulation, or for cryptography. The standard library has a rand() function for some of these purposes.

int r = rand();

There are some problems with this. Typically the implementation is a rather poor PRNG, and we can do much better. It’s a poor choice for Monte Carlo simulations, and outright dangerous for cryptography. Furthermore, it’s usually a dynamic function call, which has a high overhead compared to how little the function actually does. In glibc, it’s also synchronized, adding even more overhead.

But, more importantly, this function returns the same sequences of values each time the program runs. If we want different numbers each time the program runs, it needs to be seeded — but seeded with what? Regardless of what PRNG we ultimately use, we need inputs unique to this particular execution.

The right places

On any modern unix-like system, the classical approach is to open /dev/urandom and read some bytes. It’s not part of POSIX but it is a de facto standard. These random bits are seeded from the physical world by the operating system, making them highly unpredictable and uncorrelated. They’re suitable for keying a CSPRNG and, from there, generating all the secure random bits you will ever need (perhaps with fast-key-erasure). Why not /dev/random? Because on Linux it’s pointlessly superstitious, which has basically ruined that path for everyone.

/* Returns zero on failure. */
int
getbits(void *buf, size_t len)
{
    int result = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        result = fread(buf, len, 1, f);
        fclose(f);
    }
    return result;
}

int
main(void)
{
    unsigned seed;
    if (getbits(&seed, sizeof(seed))) {
        srand(seed);
    } else {
        die();
    }

    /* ... */
}

Note how there are two different places getbits() could fail, with multiple potential causes.

It could fail to open the file. Perhaps the program isn’t running on a modern unix-like system. Perhaps it’s running in a chroot and /dev/urandom wasn’t created. Perhaps there are too many file descriptors already open. Perhaps there isn’t enough memory available to open a file. Perhaps the file permissions disallow it or it’s blocked by Mandatory Access Control (MAC).
It could fail to read the file. This essentially can’t happen unless the system is severely misconfigured, in which case a successful read would be suspect anyway. In this case it’s probably still a good idea to check the result.

The need to create a file descriptor is a serious issue for libraries. Libraries that quietly create and close file descriptors can interfere with the main program, especially if it’s asynchronous. The main program might rely on file descriptors being consecutive, predictable, or monotonic (example). File descriptors are also a limited resource, so it may exhaust a file descriptor slot needed for the main program. For a network service, a remote attacker could perhaps open enough sockets to deny a file descriptor to getbits(), blocking the program from gathering entropy.

/dev/urandom is simple, but it’s not an ideal API.

getentropy(2)

Wouldn’t it be nicer if our program could just directly ask the operating system to fill a buffer with random bits? That’s what the OpenBSD folks thought, so they introduced a getentropy(2) system call. When called correctly it cannot fail!

int getentropy(void *buf, size_t buflen);

Other operating systems followed suit, including Linux, though on Linux getentropy(2) is a library function implemented using getrandom(2), the actual system call. It’s been in the Linux kernel since version 3.17 (October 2014), but the libc wrapper didn’t appear in glibc until version 2.25 (February 2017). So as of this writing, there are still many systems where it’s still not practical to use even if their kernel is new enough.

For now on Linux you may still want to check, and have a strategy in place, for an ENOSYS result. Some systems are still running kernels that are 5 years old, or older.
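One possible strategy for the ENOSYS case is to fall back to /dev/urandom. This is only a sketch: it assumes a libc that declares getentropy(2) in <unistd.h> (glibc 2.25 or later, for instance) running on a kernel that may predate getrandom(2). The wrapper name is made up for the example.

```c
#define _DEFAULT_SOURCE  /* exposes getentropy() in glibc's <unistd.h> */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Fill buf with len random bytes (len at most 256, the getentropy
 * limit). Returns 0 on success and -1 on failure. */
int
getentropy_or_urandom(void *buf, size_t len)
{
    if (getentropy(buf, len) == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;  /* a real error, e.g. len too large */

    /* The kernel lacks the system call: use the classical approach. */
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f)
        return -1;
    size_t ok = fread(buf, len, 1, f);
    fclose(f);
    return ok ? 0 : -1;
}
```

The fallback inherits every file descriptor caveat discussed earlier, so it is a stopgap for old kernels rather than an improvement.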

OpenBSD also has another trick up its trick-filled sleeves: the .openbsd.randomdata section. Just as the .bss section is filled with zeros, the .openbsd.randomdata section is filled with securely-generated random bits. You could put your PRNG state in this section and it will be seeded as part of loading the program. Cool!

RtlGenRandom()

Windows doesn’t have /dev/urandom. Instead it has:

CryptGenRandom()
CryptAcquireContext()
CryptReleaseContext()

Though in typical Win32 fashion, the API is ugly, overly-complicated, and has multiple possible failure points. It’s essentially impossible to use without referencing documentation. Ugh.

However, Windows 98 and later has RtlGenRandom(), which has a much more reasonable interface. Looks an awful lot like getentropy(2), eh?

BOOLEAN RtlGenRandom(
  PVOID RandomBuffer,
  ULONG RandomBufferLength
);

The problem is that it’s not quite an official API, and no promises are made about it. In practice, so much software now depends on it that the API is unlikely to ever break. Despite the prototype above, this function is actually named SystemFunction036(), and you have to supply your own prototype. Here’s my little drop-in snippet that turns it nearly into getentropy(2):

#ifdef _WIN32
#  define WIN32_LEAN_AND_MEAN
#  include <windows.h>
#  pragma comment(lib, "advapi32.lib")
   BOOLEAN NTAPI SystemFunction036(PVOID, ULONG);
#  define getentropy(buf, len) (SystemFunction036(buf, len) ? 0 : -1)
#endif

It works in Wine, too, where, at least in my version, it reads from /dev/urandom.

The wrong places

That’s all well and good, but suppose we’re masochists. We want our program to be maximally portable so we’re sticking strictly to functionality found in the standard C library. That means no getentropy(2) and no RtlGenRandom(). We can still try to open /dev/urandom, but it might fail, or it might not actually be useful, so we’ll want a backup.

The usual approach found in a thousand tutorials is time(3):

srand(time(NULL));

It would be better to use an integer hash function to mix up the result from time(0) before using it as a seed. Otherwise two programs started close in time may have similar initial sequences.

srand(triple32(time(NULL)));
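The triple32() function isn't defined in this article. A plausible stand-in, assuming the name refers to the same three-multiply integer hash that serves as the finalizer of hash32s() further down, would look like this sketch:

```c
#include <stdint.h>

/* Integer hash built from the same shifts and constants as the
 * finalizer in hash32s(). Assumed, not confirmed, to be what the
 * name triple32 refers to. */
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}
```

Every step is reversible, so two different timestamps always produce two different seeds, just far less similar-looking ones.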

The more pressing issue is that time(3) has a resolution of one second. If the program is run twice inside of a second, both runs will have the same sequence of numbers. It would be better to use a higher resolution clock, but standard C doesn’t provide a clock with greater than one second resolution. That normally requires calling into POSIX or Win32.

So, we need to find some other sources of entropy unique to each execution of the program.

Quick and dirty “string” hash function

Before we get into that, we need a way to mix these different sources together. Here’s a small, 32-bit “string” hash function. The loop is the same algorithm as Java’s hashCode(), and I appended my own integer hash as a finalizer for much better diffusion.

uint32_t
hash32s(const void *buf, size_t len, uint32_t h)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    h ^= h >> 17;
    h *= UINT32_C(0xed5ad4bb);
    h ^= h >> 11;
    h *= UINT32_C(0xac4c1b51);
    h ^= h >> 15;
    h *= UINT32_C(0x31848bab);
    h ^= h >> 14;
    return h;
}

It accepts a starting hash value, which is essentially a “context” for the digest that allows different inputs to be appended together. The finalizer acts as an implicit “stop” symbol in between inputs.

I used fixed-width integers, but it could be written nearly as concisely using only unsigned long and some masking to truncate to 32 bits. I leave this as an exercise to the reader.
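For readers who would rather check their answer, here is one way the exercise might go. It relies only on the C guarantee that unsigned long is at least 32 bits wide, and masks after every step that could set higher bits. The name hash32s_ul is invented for this sketch.

```c
#include <stddef.h>

/* hash32s() written with unsigned long and explicit masking instead
 * of uint32_t. Results always fit in 32 bits. */
unsigned long
hash32s_ul(const void *buf, size_t len, unsigned long h)
{
    const unsigned char *p = buf;
    h &= 0xffffffffUL;  /* ignore any high bits in the context */
    for (size_t i = 0; i < len; i++)
        h = (h * 31 + p[i]) & 0xffffffffUL;
    h ^= h >> 17;
    h = (h * 0xed5ad4bbUL) & 0xffffffffUL;
    h ^= h >> 11;
    h = (h * 0xac4c1b51UL) & 0xffffffffUL;
    h ^= h >> 15;
    h = (h * 0x31848babUL) & 0xffffffffUL;
    h ^= h >> 14;
    return h;
}
```

The right shifts need no mask of their own because the value going into each one is already truncated.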

Some of the values to be mixed in will be pointers themselves. These could instead be cast to integers and passed through an integer hash function, but using string hash avoids various caveats. Besides, one of the inputs will be a string, so we’ll need this function anyway.

Randomized pointers (ASLR, random stack gap, etc.)

Attackers can use predictability to their advantage, so modern systems use unpredictability to improve security. Memory addresses for various objects and executable code are randomized since some attacks require an attacker to know their addresses. We can skim entropy from these pointers to seed our PRNG.

Address Space Layout Randomization (ASLR) is when executable code and its associated data are loaded to a random offset by the loader. Code designed for this is called Position Independent Code (PIC). This has long been used when loading dynamic libraries so that all of the libraries on a system don’t have to coordinate with each other to avoid overlapping.

To improve security, it has more recently been extended to programs themselves. On both modern unix-like systems and Windows, position-independent executables (PIE) are now the default.

To skim entropy from ASLR, we just need the address of one of our functions. All the functions in our program will have the same relative offset, so there’s no reason to use more than one. An obvious choice is main():

    uint32_t h = 0;  /* initial hash value */
    int (*mainptr)() = main;
    h = hash32s(&mainptr, sizeof(mainptr), h);

Notice I had to store the address of main() in a variable, and then treat the pointer itself as a buffer for the hash function? It’s not hashing the machine code behind main, just its address. The symbol main doesn’t store an address, so it can’t be given to the hash function to represent its address. This is analogous to an array versus a pointer.

On a typical x86-64 Linux system, and when this is a PIE, that’s about 3 bytes worth of entropy. On 32-bit systems, virtual memory is so tight that it’s worth a lot less. We might want more entropy than that, and we want to cover the case where the program isn’t compiled as a PIE.

On unix-like systems, programs are typically dynamically linked against the C library, libc. Each shared object gets its own ASLR offset, so we can skim more entropy from each shared object by picking a function or variable from each. Let’s do malloc(3) for libc ASLR:

    void *(*mallocptr)() = malloc;
    h = hash32s(&mallocptr, sizeof(mallocptr), h);

Allocators themselves often randomize the addresses they return so that data objects are stored at unpredictable addresses. In particular, glibc uses different strategies for small (brk(2)) versus big (mmap(2)) allocations. That’s two different sources of entropy:

    void *small = malloc(1);        /* 1 byte */
    h = hash32s(&small, sizeof(small), h);
    free(small);

    void *big = malloc(1UL << 20);  /* 1 MB */
    h = hash32s(&big, sizeof(big), h);
    free(big);

Finally the stack itself is often mapped at a random address, or at least started with a random gap, so that local variable addresses are also randomized.

    void *ptr = &ptr;
    h = hash32s(&ptr, sizeof(ptr), h);

Time sources

We haven’t used time(3) yet! Let’s still do that, using the full width of time_t this time around:

    time_t t = time(0);
    h = hash32s(&t, sizeof(t), h);

We do have another time source to consider: clock(3). It returns an approximation of the processor time used by the program. There’s a tiny bit of noise and inconsistency between repeated calls. We can use this to extract a little bit of entropy over many repeated calls.

Naively we might try to use it like this:

    /* Note: don't use this */
    for (int i = 0; i < 1000; i++) {
        clock_t c = clock();
        h = hash32s(&c, sizeof(c), h);
    }

The problem is that the resolution for clock() is typically rough enough that modern computers can execute multiple instructions between ticks. On Windows, where CLOCKS_PER_SEC is low, that entire loop will typically complete before the result from clock() increments even once. With that arrangement we’re hardly getting anything from it! So here’s a better version:

    for (int i = 0; i < 1000; i++) {
        unsigned long counter = 0;
        clock_t start = clock();
        while (clock() == start)
            counter++;
        h = hash32s(&start, sizeof(start), h);
        h = hash32s(&counter, sizeof(counter), h);
    }

The counter makes the resolution of the clock no longer important. If it’s low resolution, then we’ll get lots of noise from the counter. If it’s high resolution, then we get noise from the clock value itself. Running the hash function an extra time between overall clock(3) samples also helps with noise.

A legitimate use of tmpnam(3)

We’ve got one more source of entropy available: tmpnam(3). This function generates a unique, temporary file name. It’s dangerous to use as intended because it doesn’t actually create the file. There’s a race between generating the name for the file and actually creating it.

Fortunately we don’t actually care about the name as a filename. We’re using this to sample entropy not directly available to us. In an attempt to get a unique name, the standard C library draws on its own sources of entropy.

    char buf[L_tmpnam] = {0};
    tmpnam(buf);
    h = hash32s(buf, sizeof(buf), h);

The rather unfortunate downside is that lots of modern systems produce a linker warning when they see tmpnam(3) being linked, even though in this case it’s completely harmless.

So what goes into a temporary filename? It depends on the implementation.

glibc and musl

Both get a high resolution timestamp and generate the filename directly from the timestamp (no hashing, etc.). Unfortunately glibc does a very poor job of also mixing getpid(2) into the timestamp before using it, and probably makes things worse by doing so.

On these platforms, this is a way to sample a high resolution timestamp without calling anything non-standard.

dietlibc

In the latest release as of this writing it uses rand(3), which makes this useless. It’s also a bug since the C library isn’t allowed to affect the state of rand(3) outside of rand(3) and srand(3). I submitted a bug report and this has since been fixed.

In the next release it will use a generator seeded by the ELF AT_RANDOM value if available, or ASLR otherwise. This makes it moderately useful.

libiberty

Generated from getpid(2) alone, with a counter to handle multiple calls. It’s basically a way to sample the process ID without actually calling getpid(2).

BSD libc / Bionic (Android)

Actually gathers real entropy from the operating system (via arc4random(2)), which means we’re getting a lot of mileage out of this one.

uclibc

Its implementation is obviously forked from glibc. However, it first tries to read entropy from /dev/urandom, and only if that fails does it fall back to glibc’s original high resolution clock XOR getpid(2) method (still not hashing it).

Finishing touches

Finally, still use /dev/urandom if it’s available. This doesn’t require us to trust that the output is anything useful since it’s just being mixed into the other inputs.

    char rnd[4];
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(rnd, sizeof(rnd), 1, f))
            h = hash32s(rnd, sizeof(rnd), h);
        fclose(f);
    }

When we’re all done gathering entropy, set the seed from the result.

    srand(h);   /* or whatever you're seeding */

That’s bound to find some entropy on just about any host. Though definitely don’t rely on the results for cryptography.

Lua

I recently tackled this problem in Lua. It has a no-batteries-included design, demanding very little of its host platform: nothing more than an ANSI C implementation. Because of this, a Lua program has even fewer options for gathering entropy than C. But it’s still not impossible!

To further complicate things, Lua code is often run in a sandbox with some features removed. For example, Lua has os.time() and os.clock() wrapping the C equivalents, allowing for the same sorts of entropy sampling. When run in a sandbox, os might not be available. Similarly, io might not be available for accessing /dev/urandom.

Have you ever printed a table, though? Or a function? It evaluates to a string containing the object’s address.

$ lua -e 'print(math)'
table: 0x559577668a30
$ lua -e 'print(math)'
table: 0x55e4a3679a30

Since the raw pointer values are leaked to Lua, we can skim allocator entropy like before. Here’s the same hash function in Lua 5.3:

local function hash32s(buf, h)
    for i = 1, #buf do
        h = h * 31 + buf:byte(i)
    end
    h = h & 0xffffffff
    h = h ~ (h >> 17)
    h = h * 0xed5ad4bb
    h = h & 0xffffffff
    h = h ~ (h >> 11)
    h = h * 0xac4c1b51
    h = h & 0xffffffff
    h = h ~ (h >> 15)
    h = h * 0x31848bab
    h = h & 0xffffffff
    h = h ~ (h >> 14)
    return h
end

Now hash a bunch of pointers in the global environment:

local h = hash32s(tostring({}), 0)  -- hash a new table's address
for varname, value in pairs(_G) do
    h = hash32s(varname, h)
    h = hash32s(tostring(value), h)
    if type(value) == 'table' then
        for k, v in pairs(value) do
            h = hash32s(tostring(k), h)
            h = hash32s(tostring(v), h)
        end
    end
end

math.randomseed(h)

Unfortunately this doesn’t actually work well on one platform I tested: Cygwin. Cygwin has few security features, notably lacking ASLR, and having a largely deterministic allocator.

When to use it

In practice it’s not really necessary to use these sorts of tricks of gathering entropy from odd places. It’s something that comes up more in coding challenges and exercises than in real programs. I’m probably already making platform-specific calls in programs substantial enough to need it anyway.

On a few occasions I have thought about these things when debugging. ASLR makes return pointers on the stack slightly randomized on each run, which can change the behavior of some kinds of bugs. Allocator and stack randomization does similar things to most of your pointers. GDB tries to disable some of these features during debugging, but it doesn’t get everything.

Prospecting for Hash Functions

2018-07-31T22:32:45Z

Update 2022: TheIronBorn has found even better permutations using a smarter technique. That thread completely eclipses my efforts in this article.

I recently got an itch to design my own non-cryptographic integer hash function. Firstly, I wanted to better understand how hash functions work, and the best way to learn is to do. For years I’d been treating them like magic, shoving input into it and seeing random-looking, but deterministic, output come out the other end. Just how is the avalanche effect achieved?

Secondly, could I apply my own particular strengths to craft a hash function better than the handful of functions I could find online? Especially the classic ones from Thomas Wang and Bob Jenkins. Instead of struggling with the mathematics, maybe I could software engineer my way to victory, working from the advantage of access to the excessive computational power of today.

Suppose, for example, I wrote a tool to generate a random hash function definition, then JIT compile it to a native function in memory, then execute that function across various inputs to evaluate its properties. My tool could rapidly repeat this process in a loop until it stumbled upon an incredible hash function the world had never seen. That’s what I actually did. I call it the Hash Prospector:

https://github.com/skeeto/hash-prospector

It only works on x86-64 because it uses the same JIT compiling technique I’ve discussed before: allocate a page of memory, write some machine instructions into it, set the page to executable, cast the page pointer to a function pointer, then call the generated code through the function pointer.
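Those steps can be sketched in a few lines. This is not the prospector's code, just a toy under stated assumptions: an x86-64 unix-like system with the System V calling convention, mmap(2) and mprotect(2) available, and a made-up function name. It generates a function that multiplies its argument by a constant.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef uint32_t (*hashfn)(uint32_t);

/* Returns a freshly generated function computing x * mul, or a null
 * pointer on failure or on anything other than x86-64. */
hashfn
jit_multiply(uint32_t mul)
{
#if defined(__x86_64__) && !defined(_WIN32)
    unsigned char code[] = {
        0x89, 0xf8,              /* mov  eax, edi        */
        0x69, 0xc0, 0, 0, 0, 0,  /* imul eax, eax, imm32 */
        0xc3                     /* ret                  */
    };
    memcpy(code + 4, &mul, 4);   /* patch in the immediate */

    /* Allocate a page, write the instructions, then make it executable. */
    void *page = mmap(0, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 0;
    memcpy(page, code, sizeof(code));
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC)) {
        munmap(page, 4096);
        return 0;
    }
    return (hashfn)page;         /* cast the page to a function pointer */
#else
    (void)mul;
    return 0;
#endif
}
```

The page is never unmapped in this sketch. A tool that generates functions in a loop would want to reuse or release it.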

Generating a hash function

My focus is on integer hash functions: a function that accepts an n-bit integer and returns an n-bit integer. One of the important properties of an integer hash function is that it maps its inputs to outputs 1:1. In other words, there are no collisions. If there’s a collision, then some outputs aren’t possible, and the function isn’t making efficient use of its entropy.

This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, then the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input.

For example, XOR with a constant is trivially reversible: XOR the output with the same constant to reverse it. Addition with a constant is reversed by subtraction with the same constant. Since the integer operations are modular arithmetic, modulo 2^n for n-bit integers, multiplication by an odd number is reversible. Odd numbers are coprime with the power-of-two modulus, so there is some modular multiplicative inverse that reverses the operation.
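
To make the multiplication case concrete, here is a minimal sketch that computes the inverse of an odd 32-bit multiplier using Newton's iteration. The function name `modinv32` is made up for this example, and this is just one of several ways to find the inverse (the extended Euclidean algorithm is another).

```c
#include <stdint.h>

// Multiplicative inverse of an odd number modulo 2^32.
// If a*inv == 1 + e, then a*(inv*(2 - a*inv)) == 1 - e*e, so every
// step doubles the number of correct low bits: 3, 6, 12, 24, 48.
uint32_t modinv32(uint32_t a)
{
    uint32_t inv = a;  // a*a == 1 (mod 8) for any odd a: 3 bits correct
    for (int i = 0; i < 4; i++) {
        inv *= 2 - a * inv;
    }
    return inv;
}
```

With this, `x *= c` is undone by `x *= modinv32(c)` for any odd `c`. An even `c` has no inverse, which is why the multiplier must be odd.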

Bret Mulvey’s hash function article provides a convenient list of some reversible operations available for constructing integer hash functions. This list was the catalyst for my little project. Here are the ones used by the hash prospector:

x  = ~x;
x ^= constant;
x *= constant | 1; // i.e. only odd constants
x += constant;
x ^= x >> constant;
x ^= x << constant;
x += x << constant;
x -= x << constant;
x <<<= constant; // left rotation

I’ve come across a couple more useful operations while studying existing integer hash functions, but I didn’t put these in the prospector.

hash += ~(hash << constant);
hash -= ~(hash << constant);

The prospector picks some operations at random and fills in their constants randomly within their proper constraints. For example, here’s an awful hash function I made it generate as an example:

// do NOT use this!
uint32_t
badhash32(uint32_t x)
{
    x *= 0x1eca7d79U;
    x ^= x >> 20;
    x  = (x << 8) | (x >> 24);
    x  = ~x;
    x ^= x << 5;
    x += 0x10afe4e7U;
    return x;
}

That function is reversible, and it would be relatively straightforward to define its inverse. However, it has awful biases and poor avalanche. How do I know this?

The measure of a hash function

There are two key properties I’m looking for in randomly generated hash functions.

High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.
Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.

Initially I screwed up and only measured the first property. This led to some hash functions that seemed to be amazing before close inspection, since, for a 32-bit hash function, it was flipping over 15 output bits on average. However, the particular bits being flipped were heavily biased, resulting in obvious patterns in the output.

For example, when hashing a counter starting from zero, the high bits would follow a regular pattern. 15 to 16 bits were being flipped each time, but it was always the same bits.

Conveniently it’s easy to measure both properties at the same time. For an n-bit integer hash function, create an n by n table initialized to zero. The rows are input bits and the columns are output bits. The ith row and jth column track the correlation between the ith input bit and jth output bit.

Then exhaustively iterate over all 2^n inputs, and flip each bit one at a time. Increment the appropriate element in the table if the output bit flips.

When you’re done, ideally each element in the table is exactly 2^(n-1). That is, each output bit was flipped exactly half the time by each input bit. Therefore the bias of the hash function is the distance (the error) of the computed table from the ideal table.

For example, the ideal bias table for an 8-bit hash function would be:

128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128

The hash prospector computes the standard deviation in order to turn this into a single, normalized measurement. Lower scores are better.

However, there’s still one problem: the input space for a 32-bit hash function is over 4 billion values. The full test takes my computer about an hour and a half. Evaluating a 64-bit hash function is right out.

Again, Monte Carlo to the rescue! Rather than sample the entire space, just sample a random subset. This provides a good estimate in less than a second, allowing lots of terrible hash functions to be discarded early. The full test can be saved only for the known good 32-bit candidates. 64-bit functions will only ever receive the estimate.
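
The sampling approach can be sketched as follows. This is not the prospector's code: the function names, the inline xorshift generator used to pick inputs, and the choice to return the mean squared relative error (rather than whatever normalization the prospector applies) are all assumptions for illustration. `badhash32` is copied from above so there is something to measure.

```c
#include <stdint.h>

// Degenerate reference point: each input bit flips exactly one output bit.
uint32_t identity32(uint32_t x)
{
    return x;
}

uint32_t badhash32(uint32_t x)
{
    x *= 0x1eca7d79U;
    x ^= x >> 20;
    x  = (x << 8) | (x >> 24);
    x  = ~x;
    x ^= x << 5;
    x += 0x10afe4e7U;
    return x;
}

// Estimates bias from `samples` (> 0) pseudo-random inputs. The result
// is the mean squared relative error of the table against the ideal
// value of samples/2. Zero is ideal; the square root gives an RMS figure.
double estimate_bias32(uint32_t (*hash)(uint32_t), uint32_t samples,
                       uint64_t seed)
{
    uint32_t bins[32][32] = {{0}};  // rows: input bits, columns: output bits
    uint64_t s = seed | 1;          // xorshift state must never be zero

    for (uint32_t n = 0; n < samples; n++) {
        s ^= s >> 12;
        s ^= s << 25;
        s ^= s >> 27;
        uint32_t x = (uint32_t)((s * UINT64_C(0x2545f4914f6cdd1d)) >> 32);
        uint32_t h = hash(x);
        for (int i = 0; i < 32; i++) {
            // Flip input bit i and record which output bits changed.
            uint32_t flipped = h ^ hash(x ^ (UINT32_C(1) << i));
            for (int j = 0; j < 32; j++) {
                bins[i][j] += (flipped >> j) & 1;
            }
        }
    }

    double half = samples / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            double e = (bins[i][j] - half) / half;
            sum += e * e;
        }
    }
    return sum / (32 * 32);
}
```

Under this measure `identity32` scores 1.0, since every table entry is as far from ideal as possible. Replacing the random inputs with a loop over all 2^32 values turns the estimate into the exact, hour-long test.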

What did I find?

Once I got the bias issue sorted out, and after hours and hours of running, followed up with some manual tweaking on my part, the prospector stumbled across this little gem:

// DO use this one!
uint32_t
prospector32(uint32_t x)
{
    x ^= x >> 15;
    x *= 0x2c1b3c6dU;
    x ^= x >> 12;
    x *= 0x297a2d39U;
    x ^= x >> 15;
    return x;
}

According to a full (i.e. not estimated) bias evaluation, this function beats the snot out of most of the 32-bit hash functions I could find. It even comes out ahead of this well-known hash function that I believe originates from the H2 SQL Database. (Update: Thomas Mueller has confirmed that, indeed, this is his hash function.)

uint32_t
hash32(uint32_t x)
{
    x = ((x >> 16) ^ x) * 0x45d9f3bU;
    x = ((x >> 16) ^ x) * 0x45d9f3bU;
    x = (x >> 16) ^ x;
    return x;
}

It’s still an excellent hash function, just slightly more biased than mine.

Very briefly, prospector32() was the best 32-bit hash function I could find, and I thought I had a major breakthrough. Then I noticed the finalizer function for the 32-bit variant of MurmurHash3. It’s also a 32-bit hash function:

uint32_t
murmurhash32_mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bU;
    x ^= x >> 13;
    x *= 0xc2b2ae35U;
    x ^= x >> 16;
    return x;
}

This one is just barely less biased than mine. So I still haven’t discovered the best 32-bit hash function, only the second best one. :-)

A pattern emerges

If you’re paying close enough attention, you may have noticed that all three functions above have the same structure. The prospector had stumbled upon it all on its own without knowledge of the existing functions. It may not be so obvious for the second function, but here it is refactored:

uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x;
}

I hadn’t noticed this until after the prospector had come across it on its own. The pattern for all three is XOR-right-shift, multiply, XOR-right-shift, multiply, XOR-right-shift. There’s something particularly useful about this multiply-xorshift construction. The XOR-right-shift diffuses bits rightward and the multiply diffuses bits leftward. I like to think it’s “sloshing” the bits right, left, right, left.

It seems that multiplication is particularly good at diffusion, so it makes perfect sense to exploit it in non-cryptographic hash functions, especially since modern CPUs are so fast at it. Despite this, it’s not used much in cryptography due to issues with completing it in constant time.

I like to think of this construction in terms of a five-tuple. For the three functions it’s the following:

(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()

The prospector actually found lots of decent functions following this pattern, especially where the middle shift is smaller than the outer shift. Thinking of it in terms of this tuple, I specifically directed it to try different tuple constants. That’s what I meant by “tweaking.” Eventually my new function popped out with its really low bias.
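
Seen this way, all three functions are one routine with different parameters. A small sketch, where the struct and function names are hypothetical and chosen only for this example:

```c
#include <stdint.h>

// One five-tuple: shift, multiplier, shift, multiplier, shift.
// Shifts are expected to be in 1..31 and multipliers odd, so that
// every step stays reversible.
struct xmxmx32 {
    int      shift0;
    uint32_t mul0;
    int      shift1;
    uint32_t mul1;
    int      shift2;
};

uint32_t xmxmx32_hash(const struct xmxmx32 *p, uint32_t x)
{
    x ^= x >> p->shift0;
    x *= p->mul0;
    x ^= x >> p->shift1;
    x *= p->mul1;
    x ^= x >> p->shift2;
    return x;
}
```

Initializing the struct with `{15, 0x2c1b3c6dU, 12, 0x297a2d39U, 15}` reproduces `prospector32()`. A search over this family only has to vary five numbers.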

The prospector has a template option (-p) if you want to try it yourself:

$ ./prospector -p xorr,mul,xorr,mul,xorr

If you really have your heart set on certain constants, such as my specific selection of shifts, you can lock those in while randomizing the other constants:

$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15

Or the other way around:

$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr

My function seems a little strange using shifts of 15 bits rather than a nice, round 16 bits. However, changing those constants to 16 increases the bias. Similarly, neither of the two 32-bit constants is a prime number, but nudging those constants to the nearest prime increases the bias. These parameters really do seem to be a local minimum in the bias, and using prime numbers isn’t important.

What about 64-bit integer hash functions?

So far I haven’t been able to improve on 64-bit hash functions. The main function to beat is SplittableRandom / SplitMix64:

uint64_t
splittable64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9U;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebU;
    x ^= x >> 31;
    return x;
}

Here’s its inverse since it’s sometimes useful:

uint64_t
splittable64_r(uint64_t x)
{
    x ^= x >> 31 ^ x >> 62;
    x *= 0x319642b2d24d8ec3U;
    x ^= x >> 27 ^ x >> 54;
    x *= 0x96de1b173f119089U;
    x ^= x >> 30 ^ x >> 60;
    return x;
}
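
The doubled shifts in the inverse are not arbitrary. Writing `x ^= x >> k` as (I + S) applied to x, where S is the shift, the inverse is I + S + S^2 + ... and the series ends once the accumulated shift reaches the word size. A sketch of the general case, with a hypothetical function name:

```c
#include <stdint.h>

// Undoes y = x ^ (x >> k) for 1 <= k <= 63 by XORing together
// y >> k, y >> 2k, y >> 3k, ... until the shift leaves the word.
uint64_t unxorshift_right64(uint64_t y, int k)
{
    uint64_t x = y;
    for (int s = k; s < 64; s += k) {
        x ^= y >> s;
    }
    return x;
}
```

For k = 31, 27, and 30 the loop runs exactly twice, which matches the two-term expressions in `splittable64_r()`. For k = 32 it runs once, so that step is its own inverse.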

I also came across this function:

uint64_t
hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= 0xd6e8feb86659fd93U;
    x ^= x >> 32;
    x *= 0xd6e8feb86659fd93U;
    x ^= x >> 32;
    return x;
}

Again, these follow the same construction as before. There really is something special about it, and many other people have noticed, too.

Both functions have about the same bias. (Remember, I can only estimate the bias for 64-bit hash functions.) The prospector has found lots of functions with about the same bias, but nothing provably better. Until it does, I have no new 64-bit integer hash functions to offer.

Beyond random search

Right now the prospector does a completely random, unstructured search hoping to stumble upon something good by chance. Perhaps it would be worth using a genetic algorithm to breed those 5-tuples towards optimum? Others have had success in this area with simulated annealing.

There’s probably more to exploit from the multiply-xorshift construction that keeps popping up. If anything, the prospector is searching too broadly, looking at constructions that could never really compete no matter what the constants. In addition to everything above, I’ve been looking for good 32-bit hash functions that don’t use any 32-bit constants, but I’m really not finding any with a competitively low bias.

Update after one week

About one week after publishing this article I found an even better hash function. I believe this is the least biased 32-bit integer hash function of this form ever devised. It’s even less biased than the MurmurHash3 finalizer.

// exact bias: 0.17353355999581582
uint32_t
lowbias32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// inverse
uint32_t
lowbias32_r(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x43021123U;
    x ^= x >> 15 ^ x >> 30;
    x *= 0x1d69e2a5U;
    x ^= x >> 16;
    return x;
}

If you’re willing to use an additional round of multiply-xorshift, this next function actually reaches the theoretical bias limit (bias = ~0.021) as exhibited by a perfect integer hash function:

// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= 0xed5ad4bbU;
    x ^= x >> 11;
    x *= 0xac4c1b51U;
    x ^= x >> 15;
    x *= 0x31848babU;
    x ^= x >> 14;
    return x;
}

~~It’s statistically indistinguishable from a random permutation of all 32-bit integers.~~ (Update 2025: Peter Schmidt-Nielsen has provided a second-order characteristic test that quickly identifies statistically significant biases in triple32.)

Update, February 2020

Some people have been experimenting with using my hash functions in GLSL shaders, and the results are looking good.

Inspiration from Data-dependent Rotations

2018-02-07T23:59:59Z

This article is an expanded email I wrote in response to a question from Frank Muller. He asked how I arrived at my solution to a branchless UTF-8 decoder:

I mean, when you started, I’m pretty the initial solution was using branches, right? Then, you’ve applied some techniques to eliminate them.

A bottom-up approach that begins with branches and then proceeds to eliminate them one at a time sounds like a plausible story. However, this story is the inverse of how it actually played out. It began when I noticed a branchless decoder could probably be done, then I put together the pieces one at a time without introducing any branches. But what sparked that initial idea?

The two prior posts reveal my train of thought at the time: a look at the Blowfish cipher and a 64-bit PRNG shootout. My layman’s study of Blowfish was actually part of an examination of a number of different block ciphers. For example, I also read the NSA’s Speck and Simon paper and then implemented the 128/128 variant of Speck — a 128-bit key and 128-bit block. I didn’t take the time to write an article about it, but note how the entire cipher — key schedule, encryption, and decryption — is just 40 lines of code:

struct speck {
    uint64_t k[32];
};

void
speck_init(struct speck *ctx, uint64_t x, uint64_t y)
{
    ctx->k[0] = y;
    for (uint64_t i = 0; i < 31; i++) {
        x = (x >> 8) | (x << 56);
        x += y;
        x ^= i;
        y = (y << 3) | (y >> 61);
        y ^= x;
        ctx->k[i + 1] = y;
    }
}

void
speck_encrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
    for (int i = 0; i < 32; i++) {
        *x = (*x >> 8) | (*x << 56);
        *x += *y;
        *x ^= ctx->k[i];
        *y = (*y << 3) | (*y >> 61);
        *y ^= *x;
    }
}

static void
speck_decrypt(const struct speck *ctx, uint64_t *x, uint64_t *y)
{
    for (int i = 0; i < 32; i++) {
        *y ^= *x;
        *y = (*y >> 3) | (*y << 61);
        *x ^= ctx->k[31 - i];
        *x -= *y;
        *x = (*x << 8) | (*x >> 56);
    }
}

Isn’t that just beautiful? It’s so tiny and fast. Other than the not-very-arbitrary selection of 32 rounds, and the use of 3-bit and 8-bit rotations, there are no magic values. One could fairly reasonably commit this cipher to memory if necessary, similar to the late RC4. Speck is probably my favorite block cipher right now, except that I couldn’t figure out the key schedule for any of the other key/block size variants.

Another cipher I studied, though in less depth, was RC5 (1994), a block cipher by (obviously) Ron Rivest. The most novel part of RC5 is its use of data dependent rotations. This was a very deliberate decision, and the paper makes this clear:

RC5 should highlight the use of data-dependent rotations, and encourage the assessment of the cryptographic strength data-dependent rotations can provide.

What’s a data-dependent rotation? In the Speck cipher shown above, notice how the right-hand side of all the rotation operations is a constant (3, 8, 56, and 61). Suppose that these operands were not constant, and were instead based on some part of the value of the block:

    int r = *y & 0x0f;
    *x = (*x >> r) | (*x << (-r & 63));  // -r & 63 avoids an undefined 64-bit shift when r == 0

Such “random” rotations “frustrate differential cryptanalysis” according to the paper, increasing the strength of the cipher.

Another algorithm that uses a data-dependent shift is the PCG family of PRNGs. Honestly, the data-dependent “permutation” shift is the defining characteristic of PCG. As a reminder, here’s the simplified PCG from my shootout:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

Notice how the final shift depends on the high order bits of the PRNG state. (This one weird trick from Melissa O’Neill will significantly improve your PRNG. Xorshift experts hate her.)

I think this raises a really interesting question: Why did it take until 2014 for someone to apply a data-dependent shift to a PRNG? Similarly, why are data-dependent rotations not used in many ciphers?

My own theory is that this is because many older instruction set architectures can’t perform data-dependent shift operations efficiently.

Many instruction sets only have a fixed shift (e.g. 1-bit), or can only shift by an immediate (e.g. a constant). In these cases, a data-dependent shift would require a loop. These loops would be a ripe source of side channel attacks in ciphers due to the difficulty of making them operate in constant time. It would also be relatively slow for video game PRNGs, which often needed to run on constrained hardware with limited instruction sets. For example, the 6502 (Atari, Apple II, NES, Commodore 64) and the Z80 (too many to list) can only shift/rotate one bit at a time.
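
To make the cost concrete, here is a minimal sketch of a variable shift on a machine that can only shift one bit at a time. The function name and the `steps` counter are hypothetical, added only to expose how the amount of work tracks the shift amount.

```c
#include <stdint.h>

// 8-bit left shift by n, built only from 1-bit shifts, as a CPU
// without a barrel shifter would do it. *steps receives the number
// of loop iterations, a stand-in for execution time.
uint8_t shl8_by_loop(uint8_t x, unsigned n, unsigned *steps)
{
    *steps = 0;
    while (n > 0) {
        x = (uint8_t)(x << 1);
        n--;
        (*steps)++;
    }
    return x;
}
```

When `n` is derived from secret data, the iteration count, and therefore the running time, leaks it. Padding the loop to a fixed number of iterations closes the leak but always pays the worst case.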

Even on an architecture with an instruction for data-dependent shifts, such as the x86, those shifts will be slower than constant shifts, at least in part due to the additional data dependency.

It turns out there are also some patent issues (ex. 1, 2). Fortunately most of these patents have now expired, and one in particular is set to expire this June. I still like my theory better.

To branchless decoding

So I was thinking about data-dependent shifts, and I had also noticed I could trivially check the length of a UTF-8 code point using a small lookup table, the first step in my decoder. What if I combined these: a data-dependent shift based on a table lookup? This would become the last step in my decoder. The idea for a branchless UTF-8 decoder was essentially born out of connecting these two thoughts, and then filling in the middle.
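
Those two ends, with a simple middle, look roughly like the following. This is a simplified sketch, not the full decoder: it performs no error detection, it assumes at least four readable bytes at `s`, and the function and table names are chosen here for illustration.

```c
#include <stdint.h>

// Decodes one code point from s and stores its encoded length in *len.
// No validation: malformed input produces garbage and a length of 0.
uint32_t utf8_decode_sketch(const unsigned char *s, int *len)
{
    // First step: sequence length from the top 5 bits of the first byte.
    static const unsigned char lengths[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0xxxxxxx
        0, 0, 0, 0, 0, 0, 0, 0,                          // 10xxxxxx
        2, 2, 2, 2,                                      // 110xxxxx
        3, 3,                                            // 1110xxxx
        4,                                               // 11110xxx
        0                                                // 11111xxx
    };
    static const unsigned char masks[5]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const unsigned char shifts[5] = {0, 18, 12, 6, 0};

    int n = lengths[s[0] >> 3];

    // Middle: always assemble as if the sequence were 4 bytes long.
    uint32_t c = (uint32_t)(s[0] & masks[n]) << 18;
    c |= (uint32_t)(s[1] & 0x3f) << 12;
    c |= (uint32_t)(s[2] & 0x3f) <<  6;
    c |= (uint32_t)(s[3] & 0x3f) <<  0;

    // Last step: a data-dependent shift discards the bytes that did
    // not belong to this code point.
    c >>= shifts[n];

    *len = n;
    return c;
}
```

There is no `if` anywhere: the length selects a mask and a shift amount, and the unconditional work on bytes that turn out to be unused is simply shifted away.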

Finding the Best 64-bit Simulation PRNG

2017-09-21T21:25:00Z

August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256**. It has essentially the same performance, but better statistical properties. xoshiro256** is now my preferred PRNG.

I use pseudo-random number generators (PRNGs) a whole lot. They’re an essential component in lots of algorithms and processes.

Monte Carlo simulations, where PRNGs are used to compute numeric estimates for problems that are difficult or impossible to solve analytically.
Monte Carlo tree search AI, where massive numbers of games are played out randomly in search of an optimal move. This is a specific application of the last item.
Genetic algorithms, where a PRNG creates the initial population, and then later guides in mutation and breeding of selected solutions.
Cryptography, where cryptographically-secure PRNGs (CSPRNGs) produce output that is predictable for recipients who know a particular secret, but not for anyone else. This article is only concerned with plain PRNGs.

For the first three “simulation” uses, there are two primary factors that drive the selection of a PRNG. These factors can be at odds with each other:

The PRNG should be very fast. The application should spend its time running the actual algorithms, not generating random numbers.
PRNG output should have robust statistical qualities. Bits should appear to be independent and the output should closely follow the desired distribution. Poor quality output will negatively affect the algorithms using it. Also just as important is how you use it, but this article will focus only on generating bits.

In other situations, such as in cryptography or online gambling, another important property is that an observer can’t learn anything meaningful about the PRNG’s internal state from its output. For the three simulation cases I care about, this is not a concern. Only speed and quality properties matter.

Depending on the programming language, the PRNGs found in various standard libraries may be of dubious quality. They’re slower than they need to be, or have poorer quality than required. In some cases, such as rand() in C, the algorithm isn’t specified, and you can’t rely on it for anything outside of trivial examples. In other cases the algorithm and behavior are specified, but you could easily do better yourself.

My preference is to BYOPRNG: Bring Your Own Pseudo-random Number Generator. You get reliable, identical output everywhere. Also, in the case of C and C++ — and if you do it right — by embedding the PRNG in your project, it will get inlined and unrolled, making it far more efficient than a slow call into a dynamic library.

A fast PRNG is going to be small, making it a great candidate for embedding as, say, a header library. That leaves just one important question, “Can the PRNG be small and have high quality output?” In the 21st century, the answer to this question is an emphatic “yes!”

For the past few years my main go-to for a drop-in PRNG has been xorshift*. The body of the function is 6 lines of C, and its entire state is a 64-bit integer, directly seeded. However, there are a number of choices here, including other variants of Xorshift. How do I know which one is best? The only way to know is to test it, hence my 64-bit PRNG shootout:

64-bit PRNG Shootout

Sure, there are other such shootouts, but they’re all missing something I want to measure. I also want to test in an environment very close to how I’d use these PRNGs myself.

Shootout results

Before getting into the details of the benchmark and each generator, here are the results. These tests were run on an i7-6700 (Skylake) running Linux 4.9.0.

                               Speed (MB/s)
PRNG           FAIL  WEAK  gcc-6.3.0 clang-3.8.1
------------------------------------------------
baseline          X     X      15000       13100
blowfishcbc16     0     1        169         157
blowfishcbc4      0     5        725         676
blowfishctr16     1     3        187         184
blowfishctr4      1     5        890        1000
mt64              1     7       1700        1970
pcg64             0     4       4150        3290
rc4               0     5        366         185
spcg64            0     8       5140        4960
xoroshiro128+     0     6       8100        7720
xorshift128+      0     2       7660        6530
xorshift64*       0     3       4990        5060

The clear winner is xoroshiro128+, with a function body of just 7 lines of C. It’s clearly the fastest, and the output had no observed statistical failures. However, that’s not the whole story. A couple of the other PRNGs have advantages that situationally make them better suited than xoroshiro128+. I’ll go over these in the discussion below.

These two versions of GCC and Clang were chosen because these are the latest available in Debian 9 “Stretch.” It’s easy to build and run the benchmark yourself if you want to try a different version.

Speed benchmark

In the speed benchmark, the PRNG is initialized, a 1-second alarm(1) is set, then the PRNG fills a large volatile buffer of 64-bit unsigned integers again and again as quickly as possible until the alarm fires. The amount of memory written is measured as the PRNG’s speed.

The baseline “PRNG” writes zeros into the buffer. This represents the absolute speed limit that no PRNG can exceed.

The purpose for making the buffer volatile is to force the entire output to actually be “consumed” as far as the compiler is concerned. Otherwise the compiler plays nasty tricks to make the program do as little work as possible. Another way to deal with this would be to write(2) the buffer, but of course I didn’t want to introduce unnecessary I/O into a benchmark.

On Linux, SIGALRM was impressively consistent between runs, meaning it was perfectly suitable for this benchmark. To account for any process scheduling wonkiness, the benchmark was run 8 times and only the fastest time was kept.

The SIGALRM handler sets a volatile global variable that tells the generator to stop. The PRNG call was unrolled 8 times to keep the alarm check from significantly impacting the benchmark. You can see the effect for yourself by changing UNROLL to 1 (i.e. “don’t unroll”) in the code. Unrolling beyond 8 times had no measurable effect on my tests.

Due to the PRNGs being inlined, this unrolling makes the benchmark less realistic, and it shows in the results. Using volatile for the buffer helped to counter this effect and reground the results. This is a fuzzy problem, and there’s not really any way to avoid it, but I will also discuss this below.
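
The shape of that measurement loop can be sketched like this. It is an approximation of the setup described above, not the shootout's source: the function name, buffer size, and seed are assumptions, the 8 calls per stop-flag check are written as a short inner loop instead of a macro-driven unroll, and xorshift64* stands in for whichever generator is under test.

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

static volatile sig_atomic_t running;

static void on_alarm(int sig)
{
    (void)sig;
    running = 0;  // tells the fill loop to stop
}

static uint64_t xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

// Fills a volatile buffer over and over for about one second and
// returns the number of bytes written, i.e. bytes per second.
uint64_t bench_one_second(void)
{
    static volatile uint64_t buffer[1 << 12];
    const size_t n = sizeof(buffer) / sizeof(buffer[0]);
    uint64_t state[1] = {UINT64_C(0x9e3779b97f4a7c15)};  // arbitrary nonzero seed
    uint64_t bytes = 0;
    size_t i = 0;

    running = 1;
    signal(SIGALRM, on_alarm);
    alarm(1);
    while (running) {
        // 8 PRNG calls per check of the stop flag
        for (int u = 0; u < 8; u++) {
            buffer[i + u] = xorshift64star(state);
        }
        i = (i + 8) % n;
        bytes += 8 * sizeof(buffer[0]);
    }
    return bytes;
}
```

Because the buffer is volatile, every store must actually happen, so the compiler cannot discard the generator's output even though nothing reads it.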

Statistical benchmark

To measure the statistical quality of each PRNG — mostly as a sanity check — the raw binary output was run through dieharder 3.31.1:

prng | dieharder -g200 -a -m4

This statistical analysis has no timing characteristics and the results should be the same everywhere. You would only need to re-run it to test with a different version of dieharder, or a different analysis tool.

There’s not much information to glean from this part of the shootout. It mostly confirms that all of these PRNGs would work fine for simulation purposes. The WEAK results are not very significant and are only useful for breaking ties. Even a true RNG will get some WEAK results. For example, the x86 RDRAND instruction (not included in the actual shootout) got 7 WEAK results in my tests.

The FAIL results are more significant, but a single failure doesn’t mean much. A non-failing PRNG should be preferred to an otherwise equal PRNG with a failure.

Individual PRNGs

Admittedly the definition for “64-bit PRNG” is rather vague. My high performance targets are all 64-bit platforms, so the highest PRNG throughput will be built on 64-bit operations (if not wider). The original plan was to focus on PRNGs built from 64-bit operations.

Curiosity got the best of me, so I included some PRNGs that don’t use any 64-bit operations. I just wanted to see how they stacked up.

Blowfish

One of the reasons I wrote a Blowfish implementation was to evaluate its performance and statistical qualities, so naturally I included it in the benchmark. It only uses 32-bit addition and 32-bit XOR. It has a 64-bit block size, so it’s naturally producing a 64-bit integer. There are two different properties that combine to make four variants in the benchmark: number of rounds and block mode.

Blowfish normally uses 16 rounds. This makes it a lot slower than a non-cryptographic PRNG but gives it a security margin. I don’t care about the security margin, so I included a 4-round variant. As expected, it’s about four times faster.

The other feature I tested is the block mode: Cipher Block Chaining (CBC) versus Counter (CTR) mode. In CBC mode it encrypts zeros as plaintext. This just means it’s encrypting its last output. The ciphertext is the PRNG’s output.

In CTR mode the PRNG is encrypting a 64-bit counter. It’s 11% faster than CBC in the 16-round variant and 23% faster in the 4-round variant. The reason is simple, and it’s in part an artifact of unrolling the generation loop in the benchmark.

In CBC mode, each output depends on the previous, but in CTR mode all blocks are independent. Work can begin on the next output before the previous output is complete. The x86 architecture uses out-of-order execution to achieve many of its performance gains: Instructions may be executed in a different order than they appear in the program, though their observable effects must generally be ordered correctly. Breaking dependencies between instructions allows out-of-order execution to be fully exercised. It also gives the compiler more freedom in instruction scheduling, though the volatile accesses cannot be reordered with respect to each other (hence it helping to reground the benchmark).
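
The difference in dependency structure is easy to see in code. In this sketch `toy_encrypt` is a stand-in for a 64-bit block cipher (a key XOR followed by the SplitMix64 mixing steps). It is not Blowfish and not secure, and all names here are hypothetical.

```c
#include <stdint.h>

// Placeholder for a real 64-bit block cipher. NOT secure.
uint64_t toy_encrypt(uint64_t key, uint64_t block)
{
    uint64_t x = block ^ key;
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

struct cbc_prng { uint64_t key, prev; };
struct ctr_prng { uint64_t key, counter; };

// CBC with all-zero plaintext: the cipher input is just the previous
// output, so each call must wait for the one before it.
uint64_t cbc_next(struct cbc_prng *g)
{
    g->prev = toy_encrypt(g->key, 0 ^ g->prev);
    return g->prev;
}

// CTR: the cipher input is a counter, known ahead of time, so several
// blocks can be in flight at once.
uint64_t ctr_next(struct ctr_prng *g)
{
    return toy_encrypt(g->key, g->counter++);
}
```

In an unrolled loop, consecutive `ctr_next()` calls share nothing but the cheap counter increment, while consecutive `cbc_next()` calls form one long serial chain.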

Statistically, the 4-round cipher was not significantly worse than the 16-round cipher. For simulation purposes the 4-round cipher would be perfectly sufficient, though xoroshiro128+ is still more than 9 times faster without sacrificing quality.

On the other hand, CTR mode had a single failure in both the 4-round (dab_filltree2) and 16-round (dab_filltree) variants. At least for Blowfish, is there something that makes CTR mode less suitable than CBC mode as a PRNG?

In the end Blowfish is too slow and too complicated to serve as a simulation PRNG. This was entirely expected, but it’s interesting to see how it stacks up.

Mersenne Twister (MT19937-64)

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. I tested the 64-bit variant, MT19937-64, and there are four problems:

It’s between 1/4 and 1/5 the speed of xoroshiro128+.
It’s got a large state: 2,500 bytes, versus xoroshiro128+’s 16 bytes.
Its implementation is three times bigger than xoroshiro128+, and much more complicated.
It had one statistical failure (dab_filltree2).

Curiously my implementation is 16% faster with Clang than GCC. Since Mersenne Twister isn’t seriously in the running, I didn’t take time to dig into why.

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

Permuted Congruential Generator (PCG)

The Permuted Congruential Generator (PCG) has some really interesting history behind it, particularly with its somewhat unusual paper, controversial for both its excessive length (58 pages) and informal style. It’s in close competition with Xorshift and xoroshiro128+. I was really interested in seeing how it stacked up.

PCG is really just a Linear Congruential Generator (LCG) that doesn’t output the lowest bits (too poor quality), and has an extra permutation step to make up for the LCG’s other weaknesses. I included two variants in my benchmark: the official PCG and a “simplified” PCG (sPCG) with a simple permutation step. sPCG is just the first PCG presented in the paper (34 pages in!).

Here’s essentially what the simplified version looks like:

uint32_t
spcg32(uint64_t s[1])
{
    uint64_t m = 0x9b60933458e17d7d;
    uint64_t a = 0xd737232eeccdf7ed;
    *s = *s * m + a;
    int shift = 29 - (*s >> 61);
    return *s >> shift;
}

The third line with the modular multiplication and addition is the LCG. The bit shift is the permutation. This PCG uses the most significant three bits of the result to determine which 32 bits to output. That’s the novel component of PCG.

The two constants are entirely my own devising. They’re two 64-bit primes generated using Emacs’ M-x calc: 2 64 ^ k r k n k p k p k p.

Heck, that’s so simple that I could easily memorize this and code it from scratch on demand. Key takeaway: This is one way that PCG is situationally better than xoroshiro128+. In a pinch I could use Emacs to generate a couple of primes and code the rest from memory. If you participate in coding competitions, take note.

However, you probably also noticed PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software.

Instead, I doubled up on everything to run two PRNGs in parallel. Despite the doubling in state size, the period doesn’t get any larger since the PRNGs don’t interact with each other. We get something in return, though. Remember what I said about out-of-order execution? Except for the last step combining their results, since the two PRNGs are independent, doubling up shouldn’t quite halve the performance, particularly with the benchmark loop unrolling business.

Here’s my doubled-up version:

uint64_t
spcg64(uint64_t s[2])
{
    uint64_t m  = 0x9b60933458e17d7d;
    uint64_t a0 = 0xd737232eeccdf7ed;
    uint64_t a1 = 0x8b260b70b8e98891;
    uint64_t p0 = s[0];
    uint64_t p1 = s[1];
    s[0] = p0 * m + a0;
    s[1] = p1 * m + a1;
    int r0 = 29 - (p0 >> 61);
    int r1 = 29 - (p1 >> 61);
    uint64_t high = p0 >> r0;
    uint32_t low  = p1 >> r1;
    return (high << 32) | low;
}

The “full” PCG has some extra shifts that make it 25% (GCC) to 50% (Clang) slower than the “simplified” PCG, but it does halve the WEAK results.

In this 64-bit form, both are significantly slower than xoroshiro128+. However, if you find yourself only needing 32 bits at a time (always throwing away the high 32 bits from a 64-bit PRNG), 32-bit PCG is faster than using xoroshiro128+ and throwing away half its output.

RC4

This is another CSPRNG where I was curious how it would stack up. It only uses 8-bit operations, and it generates a 64-bit integer one byte at a time. It’s the slowest after 16-round Blowfish and generally not useful as a simulation PRNG.
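For reference, here is a minimal sketch of RC4 used this way. It is the textbook algorithm, not necessarily the code from the benchmark, and the names (struct rc4, rc4_init, rc4_next64) and the byte order used to assemble the 64-bit result are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct rc4 {
    uint8_t s[256];
    uint8_t i, j;
};

/* Key schedule: permute the 256-byte state under control of the key.
 * The key length must be at least 1. */
void
rc4_init(struct rc4 *k, const void *key, size_t len)
{
    const uint8_t *p = key;
    for (int i = 0; i < 256; i++)
        k->s[i] = (uint8_t)i;
    uint8_t j = 0;
    for (int i = 0; i < 256; i++) {
        j += k->s[i] + p[i % len];  /* wraps modulo 256 */
        uint8_t t = k->s[i];
        k->s[i] = k->s[j];
        k->s[j] = t;
    }
    k->i = k->j = 0;
}

/* Produce one keystream byte. Every operation here is 8 bits wide. */
static uint8_t
rc4_byte(struct rc4 *k)
{
    k->i++;
    k->j += k->s[k->i];
    uint8_t t = k->s[k->i];
    k->s[k->i] = k->s[k->j];
    k->s[k->j] = t;
    return k->s[(uint8_t)(k->s[k->i] + k->s[k->j])];
}

/* Assemble a 64-bit result from eight keystream bytes, first byte
 * highest (an arbitrary choice for this sketch). */
uint64_t
rc4_next64(struct rc4 *k)
{
    uint64_t r = 0;
    for (int n = 0; n < 8; n++)
        r = r << 8 | rc4_byte(k);
    return r;
}
```

The eight dependent byte-sized steps per output are a plausible reason it fares so poorly against generators that do a handful of 64-bit operations.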

xoroshiro128+

xoroshiro128+ is the obvious winner in this benchmark and it seems to be the best 64-bit simulation PRNG available. If you need a fast, quality PRNG, just drop these 11 lines into your C or C++ program:

uint64_t
xoroshiro128plus(uint64_t s[2])
{
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t result = s0 + s1;
    s1 ^= s0;
    s[0] = ((s0 << 55) | (s0 >> 9)) ^ s1 ^ (s1 << 14);
    s[1] = (s1 << 36) | (s1 >> 28);
    return result;
}

There’s one important caveat: That 16-byte state must be well-seeded. Having lots of zero bytes will lead to terrible initial output until the generator mixes it all up. Having all zero bytes will completely break the generator. If you’re going to seed from, say, the unix epoch, then XOR it with 16 static random bytes.
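A minimal sketch of that seeding advice follows. The function name and both constants are placeholders invented for this example; generate your own 16 random bytes. Spreading the same seed across both halves is also just one way to do it.

```c
#include <stdint.h>

/* Fill the 16-byte xoroshiro128+ state from a weak 64-bit seed such as
 * (uint64_t)time(0). The two constants stand in for "16 static random
 * bytes" and are arbitrary placeholders. Because they differ, the two
 * halves can never both be zero, whatever the seed. */
void
xoroshiro128plus_seed(uint64_t s[2], uint64_t seed)
{
    s[0] = seed ^ UINT64_C(0x3a8f05c5a1b2e49d);
    s[1] = seed ^ UINT64_C(0xc6d1f0e47b935a28);
}
```

Seeds that are close together, such as consecutive timestamps, still produce states differing in only a few bits, so this avoids the zero-heavy state but is no substitute for a proper entropy source.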

xorshift128+ and xorshift64*

These generators are closely related and, like I said, xorshift64* was what I used for years. Looks like it’s time to retire it.

uint64_t
xorshift64star(uint64_t s[1])
{
    uint64_t x = s[0];
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    s[0] = x;
    return x * UINT64_C(0x2545f4914f6cdd1d);
}

However, unlike both xoroshiro128+ and xorshift128+, xorshift64* will tolerate weak seeding so long as it’s not literally zero. Zero will also break this generator.

If it weren’t for xoroshiro128+, then xorshift128+ would have been the winner of the benchmark and my new favorite choice.

uint64_t
xorshift128plus(uint64_t s[2])
{
    uint64_t x = s[0];
    uint64_t y = s[1];
    s[0] = y;
    x ^= x << 23;
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
    return s[1] + y;
}

It’s a lot like xoroshiro128+, including the need to be well-seeded, but it’s just slow enough to lose out. There’s no reason to use xorshift128+ instead of xoroshiro128+.

Conclusion

My own takeaway (until I re-evaluate some years in the future):

The best 64-bit simulation PRNG is xoroshiro128+.
“Simplified” PCG can be useful in a pinch.
When only 32-bit integers are necessary, use PCG.

Things can change significantly between platforms, though. Here’s the shootout on an ARM Cortex-A53:

                    Speed (MB/s)
PRNG         gcc-5.4.0   clang-3.8.0
------------------------------------
baseline          2560        2400
blowfishcbc16       36.5        45.4
blowfishcbc4       135         173
blowfishctr16       36.4        45.2
blowfishctr4       133         168
mt64               207         254
pcg64              980         712
rc4                 96.6        44.0
spcg64            1021         948
xoroshiro128+     2560        1570
xorshift128+      2560        1520
xorshift64*       1360        1080

LLVM is not as mature on this platform, but, with GCC, both xoroshiro128+ and xorshift128+ matched the baseline! It seems memory is the bottleneck.

So don’t necessarily take my word for it. You can run this shootout in your own environment — perhaps even tossing in more PRNGs — to find what’s appropriate for your own situation.

Blowpipe: a Blowfish-encrypted, Authenticated Pipe

2017-09-15T23:59:59Z

Blowpipe is a toy crypto tool that creates a Blowfish-encrypted pipe. It doesn’t open any files and instead encrypts and decrypts from standard input to standard output. This pipe can encrypt individual files or even encrypt a network connection (à la netcat).

Most importantly, since Blowpipe is intended to be used as a pipe (duh), it will never output decrypted plaintext that hasn’t been authenticated. That is, it will detect tampering of the encrypted stream and truncate its output, reporting an error, without producing the manipulated data. Some very similar tools that aren’t considered toys lack this important feature, such as aespipe.

Purpose

Blowpipe came about because I wanted to study Blowfish, a 64-bit block cipher designed by Bruce Schneier in 1993. It’s played an important role in the history of cryptography and has withstood cryptanalysis for 24 years. Its major weakness is its small block size, leaving it vulnerable to birthday attacks regardless of any other property of the cipher. Even in 1993 the 64-bit block size was a bit on the small side, but Blowfish was intended as a drop-in replacement for the Data Encryption Standard (DES) and the International Data Encryption Algorithm (IDEA), other 64-bit block ciphers.

The main reason I’m calling this program a toy is that, outside of legacy interfaces, it’s simply not appropriate to deploy a 64-bit block cipher in 2017. Blowpipe shouldn’t be used to encrypt more than a few tens of GBs of data at a time. Otherwise I’m fairly confident in both my message construction and my implementation. One detail is a little uncertain, and I’ll discuss it later when describing message format.
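The “few tens of GBs” figure lines up with the usual birthday-bound rule of thumb: with an n-bit block, trouble starts after roughly 2^(n/2) blocks. The helper below is a hypothetical illustration of that arithmetic, not part of Blowpipe, and the bound is a rough guide rather than a precise security margin.

```c
#include <stdint.h>

/* Bytes processed by the time the block count reaches the birthday
 * bound of 2^(block_bits / 2) blocks. Only meaningful for block sizes
 * up to 64 bits, so that the result fits in a uint64_t. */
uint64_t
birthday_bound_bytes(int block_bits)
{
    uint64_t blocks = UINT64_C(1) << (block_bits / 2);
    return blocks * (uint64_t)(block_bits / 8);
}
```

For a 64-bit block this is 2^32 blocks of 8 bytes each, or 32 GiB.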

A tool that I am confident about is Enchive, though since it’s intended for file encryption, it’s not appropriate for use as a pipe. It doesn’t authenticate until after it has produced most of its output. Enchive does try its best to delete files containing unauthenticated output when authentication fails, but this doesn’t prevent you from consuming this output before it can be deleted, particularly if you pipe the output into another program.

Usage

As you might expect, there are two modes of operation: encryption (-E) and decryption (-D). The simplest usage is encrypting and decrypting a file:

$ blowpipe -E < data.gz > data.gz.enc
$ blowpipe -D < data.gz.enc | gunzip > data.txt

In both cases you will be prompted for a passphrase which can be up to 72 bytes in length. The only verification for the key is the first Message Authentication Code (MAC) in the datastream, so Blowpipe cannot tell the difference between damaged ciphertext and an incorrect key.

In a script it would be smart to check Blowpipe’s exit code after decrypting. The output will be truncated should authentication fail somewhere in the middle. Since Blowpipe isn’t aware of files, it can’t clean up for you.

Another use case is securely transmitting files over a network with netcat. In this example I’ll use a pre-shared key file, keyfile. Rather than prompt for a key, Blowpipe will use the raw bytes of a given file. Here’s how I would create a key file:

$ head -c 32 /dev/urandom > keyfile

First the receiver listens on a socket (bind(2)):

$ nc -lp 2000 | blowpipe -D -k keyfile > data.zip

Then the sender connects (connect(2)) and pipes Blowpipe through:

$ blowpipe -E -k keyfile < data.zip | nc -N hostname 2000

If all went well, Blowpipe will exit with 0 on the receiver side.

Blowpipe doesn’t buffer its output (but see -w). It performs one read(2), encrypts whatever it got, prepends a MAC, and calls write(2) on the result. This means it can comfortably transmit live sensitive data across the network:

$ nc -lp 2000 | blowpipe -D

# dmesg -w | blowpipe -E | nc -N hostname 2000

Kernel messages will appear on the other end as they’re produced by dmesg. Though keep in mind that the size of each line will be known to eavesdroppers. Blowpipe doesn’t pad it with noise or otherwise try to disguise the length. Those lengths may leak useful information.

Blowfish

This whole project started when I wanted to play with Blowfish as a small drop-in library. I wasn’t satisfied with the selection, so I figured it would be a good exercise to write my own. Besides, the specification is both an enjoyable and easy read (and recommended). It justifies the need for a new cipher and explains the various design decisions.

I coded from the specification, including writing a script to generate the subkey initialization tables. Subkeys are initialized to the binary representation of pi (the first ~10,000 decimal digits). After a couple hours of work I hooked up the official test vectors to see how I did, and all the tests passed on the first run. This wasn’t reasonable, so I spent a while longer figuring out how I screwed up my tests. Turns out I absolutely nailed it on my first shot. It’s a really great sign for Blowfish that it’s so easy to implement correctly.

Blowfish’s key schedule produces five subkeys requiring 4,168 bytes of storage. The key schedule is unusually complex: Subkeys are repeatedly encrypted with themselves as they are being computed. This complexity inspired the bcrypt password hashing scheme, which essentially works by iterating the key schedule many times in a loop, then encrypting a constant 24-byte string. My bcrypt implementation wasn’t nearly as successful on my first attempt, and it took hours of debugging in order to match OpenBSD’s outputs.

The encryption and decryption algorithms are nearly identical, as is typical for, and a feature of, Feistel ciphers. There are no branches (preventing some side-channel attacks), and the only operations are 32-bit XOR and 32-bit addition. This makes it ideal for implementation on 32-bit computers.

One tricky point is that encryption and decryption operate on a pair of 32-bit integers (another giveaway that it’s a Feistel cipher). To put the cipher to practical use, these integers have to be serialized into a byte stream. The specification doesn’t choose a byte order, even for mixing the key into the subkeys. The official test vectors are also 32-bit integers, not byte arrays. An implementer could choose little endian, big endian, or even something else.

However, there’s one place in which this decision is formally made: the official test vectors mix the key into the first subkey in big endian byte order. By luck I happened to choose big endian as well, which is why my tests passed on the first try. OpenBSD’s version of bcrypt also uses big endian for all integer encoding steps, further cementing big endian as the standard way to encode Blowfish integers.
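To make the byte-order question concrete, here is a small sketch of big endian serialization for the two 32-bit halves of a block. The helper names load32be and store32be are made up for this example and are not part of the Blowpipe library.

```c
#include <stdint.h>

/* Read a 32-bit integer from four bytes, most significant byte first. */
uint32_t
load32be(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] <<  8 | (uint32_t)p[3];
}

/* Write a 32-bit integer as four bytes, most significant byte first. */
void
store32be(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >> 24);
    p[1] = (unsigned char)(x >> 16);
    p[2] = (unsigned char)(x >>  8);
    p[3] = (unsigned char)(x >>  0);
}
```

An 8-byte block then maps to the pair load32be(block) and load32be(block + 4). Because the shifts operate on values rather than memory, the result is the same regardless of the host’s own byte order.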

Blowfish library

The Blowpipe repository contains a ready-to-use, public domain Blowfish library written in strictly conforming C99. The interface is just three functions:

void blowfish_init(struct blowfish *, const void *key, int len);
void blowfish_encrypt(struct blowfish *, uint32_t *, uint32_t *);
void blowfish_decrypt(struct blowfish *, uint32_t *, uint32_t *);

Technically the key can be up to 72 bytes long, but the last 16 bytes have an incomplete effect on the subkeys, so only the first 56 bytes should matter. Since bcrypt runs the key schedule multiple times, all 72 bytes have full effect.

The library also includes a bcrypt implementation, though it will only produce the raw password hash, not the base-64 encoded form. The main reason for including bcrypt is to support Blowpipe.

Message format

The main goal of Blowpipe was to build a robust, authenticated encryption tool using only Blowfish as a cryptographic primitive.

It uses bcrypt with a moderately-high cost as a key derivation function (KDF). Not terrible, but bcrypt is not a memory-hard KDF, and memory hardness is important for protecting against cheap hardware brute force attacks.
Encryption is Blowfish in counter (CTR) mode. A 64-bit counter is incremented and encrypted, producing a keystream. The plaintext is XORed with this keystream like a stream cipher. This allows the last block to be truncated when output and eliminates some padding issues. Since CTR mode is trivially malleable, the MAC becomes even more important. In CTR mode, blowfish_decrypt() is never called. In fact, Blowpipe never uses it.
The authentication scheme is Blowfish-CBC-MAC with a unique key and encrypt-then-authenticate (something I harmlessly got wrong with Enchive). It essentially encrypts the ciphertext again with a different key in Cipher Block Chaining (CBC) mode, but it only saves the final block. The final block is prepended to the ciphertext as the MAC. On decryption the same block is computed again to ensure that it matches. Only someone who knows the MAC key can compute it.
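The shape of CTR encryption and CBC-MAC can be sketched with any 64-bit block function. In the sketch below, toy_encrypt is an insecure stand-in that is emphatically not Blowfish. The function names, the single-integer key, the byte order, and the zero padding of a short final block are all assumptions for illustration, not Blowpipe’s actual format.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in keyed 64-bit permutation. NOT Blowfish and NOT secure; it
 * only makes the two modes below runnable. */
static uint64_t
toy_encrypt(uint64_t key, uint64_t block)
{
    block ^= key;
    block *= UINT64_C(0x9b60933458e17d7d);  /* odd, so invertible */
    block ^= block >> 32;
    return block;
}

/* CTR mode: encrypt successive counter values and XOR the resulting
 * keystream into the buffer. Running it again with the same key and
 * starting counter decrypts, so no block "decrypt" is ever needed. */
void
ctr_xor(uint64_t key, uint64_t counter, unsigned char *buf, size_t len)
{
    uint64_t ks = 0;
    for (size_t i = 0; i < len; i++) {
        if (i % 8 == 0)
            ks = toy_encrypt(key, counter++);
        buf[i] ^= (unsigned char)(ks >> (56 - 8 * (i % 8)));
    }
}

/* CBC-MAC: chain every block through the cipher, keep only the final
 * block. Pass 0 as mac to start, or a previous result to continue a
 * chain. A short final block is zero padded (an assumption). */
uint64_t
cbc_mac(uint64_t key, uint64_t mac, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        uint64_t block = 0;
        for (size_t n = 0; n < 8; n++) {
            unsigned char b = i + n < len ? buf[i + n] : 0;
            block = block << 8 | b;
        }
        mac = toy_encrypt(key, mac ^ block);
    }
    return mac;
}
```

Encrypt-then-authenticate means calling ctr_xor first and then cbc_mac over the resulting ciphertext with a different key. The receiver recomputes the MAC and compares before running ctr_xor to decrypt.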

Of all three Blowfish uses, I’m least confident about authentication. CBC-MAC is tricky to get right, though I am following the rules: fixed length messages using a different key than encryption.

Wait a minute. Blowpipe is pipe-oriented and can output data without buffering the entire pipe. How can there be fixed-length messages?

The pipe datastream is broken into 64kB chunks. Each chunk is authenticated with its own MAC. Both the MAC and chunk length are written in the chunk header, and the length is authenticated by the MAC. Furthermore, just like the keystream, the MAC is continued from the previous chunk, preventing chunks from being reordered. Blowpipe can output the content of a chunk and discard it once it’s been authenticated. If any chunk fails to authenticate, it aborts.

This also leads to another useful trick: The pipe is terminated with a zero length chunk, preventing an attacker from appending to the datastream. Everything after the zero-length chunk is discarded. Since the length is authenticated by the MAC, the attacker also cannot truncate the pipe since that would require knowledge of the MAC key.

The pipe itself has a 17 byte header: a 16 byte random bcrypt salt and 1 byte for the bcrypt cost. The salt is like an initialization vector (IV) that allows keys to be safely reused in different Blowpipe instances. The cost byte is the only distinguishing byte in the stream. Since even the chunk lengths are encrypted, everything else in the datastream should be indistinguishable from random data.

Portability

Blowpipe runs on POSIX systems and Windows (Mingw-w64 and MSVC). I initially wrote it for POSIX (on Linux) of course, but I took an unusual approach when it came time to port it to Windows. Normally I’d invent a generic OS interface that makes the appropriate host system calls. This time I kept the POSIX interface (read(2), write(2), open(2), etc.) and implemented the tiny subset of POSIX that I needed in terms of Win32. That implementation can be found under w32-compat/. I even dropped in a copy of my own getopt().

One really cool feature of this technique is that, on Windows, Blowpipe will still “open” /dev/urandom. It’s intercepted by my own open(2), which in response to that filename actually calls CryptAcquireContext() and pretends like it’s a file. It’s all hidden behind the file descriptor. That’s the unix way.

I’m considering giving Enchive the same treatment since it would simplify and reduce much of the interface code. In fact, this project has taught me a number of ways that Enchive could be improved. That’s the value of writing “toys” such as Blowpipe.

Introducing the Pokerware Secure Passphrase Generator

2017-07-27T17:49:10Z

I recently developed Pokerware, an offline passphrase generator that operates in the same spirit as Diceware. The primary difference is that it uses a shuffled deck of playing cards as its entropy source rather than dice. Draw some cards and use them to select a uniformly random word from a list. Unless you’re some sort of tabletop gaming nerd, a deck of cards is more readily available than five 6-sided dice, which would typically need to be borrowed from the Monopoly board collecting dust on the shelf, then rolled two at a time.

There are various flavors of two different word lists here:

https://github.com/skeeto/pokerware/releases/tag/1.0

Hardware random number generators are difficult to verify and may not actually be as random as they promise, either intentionally or unintentionally. For the particularly paranoid, Diceware and Pokerware are an easily verifiable alternative for generating secure passphrases for cryptographic purposes. At any time, a deck of 52 playing cards is in one of 52! possible arrangements. That’s more than 225 bits of entropy. If you give your deck a thorough shuffle, it will be in an arrangement that has never been seen before and will never be seen again. Pokerware draws on some of these bits to generate passphrases.

The Pokerware list has 5,304 words (12.4 bits per word), compared to Diceware’s 7,776 words (12.9 bits per word). My goal was to invent a card-drawing scheme that would uniformly select from a list in the same sized ballpark as Diceware. Much smaller and you’d have to memorize more words for the same passphrase strength. Much larger and the words on the list would be more difficult to memorize, since the list would contain longer and less frequently used words. Diceware strikes a nice balance at five dice.
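The entropy figures are just base-2 logarithms: log2(5304) and log2(7776) per word, and log2(52!) for a shuffled deck. Here is a small sketch that reproduces them. The function names are made up for this example, and the hand-rolled logarithm is there only so the snippet builds without linking libm.

```c
/* Base-2 logarithm for x > 0 without libm: normalize x into [1, 2),
 * then extract fractional bits one at a time by repeated squaring. */
double
log2_bits(double x)
{
    double r = 0;
    while (x >= 2) { x /= 2; r += 1; }
    while (x <  1) { x *= 2; r -= 1; }
    for (double bit = 0.5; bit > 1e-12; bit /= 2) {
        x *= x;
        if (x >= 2) { x /= 2; r += bit; }
    }
    return r;
}

/* Entropy in bits of a uniformly shuffled deck of n cards: log2(n!),
 * computed as a sum of logarithms to avoid overflow. */
double
deck_bits(int n)
{
    double r = 0;
    for (int i = 2; i <= n; i++)
        r += log2_bits(i);
    return r;
}
```

log2_bits(5304) comes out near 12.37 and log2_bits(7776) near 12.92, matching the per-word figures above, while deck_bits(52) is a little over 225.5.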

One important difference for me is that I like my Pokerware word lists a lot more than the two official Diceware lists. My lists only have simple, easy-to-remember words (for American English speakers, at least), without any numbers or other short non-words. Pokerware has two official lists, “formal” and “slang,” since my early testers couldn’t agree on which was better. Rather than make a difficult decision, I took the usual route of making no decision at all.

The “formal” list is derived in part from Google’s Ngram Viewer, with my own additional filters and tweaking. It’s called “formal” because the ngrams come from formal publications and represent more formal kinds of speech.

The “slang” list is derived from every reddit comment between December 2005 and May 2017, tamed by the same additional filters. I have this data on hand, so I may as well put it to use. I figured more casually-used words would be easier to remember. Due to my extra filtering, there’s actually a lot of overlap between these lists, so the differences aren’t too significant.

If you have your own word list, perhaps in a different language, you can use the Makefile in the repository to build your own Pokerware lookup table, both plain text and PDF. The PDF is generated using Groff macros.

Passphrase generation instructions

1. Thoroughly shuffle the deck.
2. Draw two cards. Sort them by value, then suit. Suits are in alphabetical order: Clubs, Diamonds, Hearts, Spades.
3. Draw additional cards until you get a card that doesn’t match the face value of either of your initial two cards. Observe its suit.
4. Using your two cards and observed suit, look up a word in the table.
5. Place all cards back in the deck, shuffle, and repeat from step 2 until you have the desired number of words. Each word is worth 12.4 bits of entropy.
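The list size falls out of this procedure: there are C(52, 2) = 1,326 unordered pairs of cards, times 4 suits, for 5,304 equally likely outcomes. Below is a sketch of one way to turn a draw into a table index. The card numbering and the resulting order are assumptions for illustration and need not match the row order of the published lookup tables.

```c
/* Cards are numbered 0..51 as value * 4 + suit, with suits in
 * alphabetical order (0 = Clubs, 1 = Diamonds, 2 = Hearts, 3 = Spades),
 * so numeric order is "by value, then suit". Which face value counts
 * as lowest is left open here. Given two distinct cards and the suit
 * observed in step 3, returns an index in 0..5303. */
int
pokerware_index(int card_a, int card_b, int suit)
{
    int lo = card_a < card_b ? card_a : card_b;
    int hi = card_a < card_b ? card_b : card_a;
    /* Rank of the unordered pair among all C(52, 2) = 1326 pairs. */
    int pair = hi * (hi - 1) / 2 + lo;
    return pair * 4 + suit;
}
```

Skipping cards that match either face value in step 3 keeps the suit uniform: among the remaining candidates every suit has the same number of cards (11 each if the first two values differ, 12 each if they are a pair).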

A word of warning about step 4: If you use software to do the word list lookup, beware that it might save your search/command history — and therefore your passphrase — to a file. For example, the less pager will store search history in ~/.lesshst. It’s easy to prevent that one:

$ LESSHISTFILE=- less pokerware-slang.txt

Example word generation

Suppose in step 2 you draw King of Hearts (KH/K♥) and Queen of Clubs (QC/Q♣).

In step 3 you first draw King of Diamonds (KD/K♦), discarding it because it matches the face value of one of your cards from step 2.

Next you draw Four of Spades (4S/4♠), taking spades as your extra suit.

In order, this gives you Queen of Clubs, King of Hearts, and Spades: QCKHS or Q♣K♥♠. This corresponds to “wizard” in the formal word list and would be the first word in your passphrase.

A deck of cards as an office tool

I now have an excuse to keep a deck of cards out on my desk at work. I’ve been using Diceware — or something approximating it since I’m not so paranoid about hardware RNGs. From now on I’ll deal new passwords from an in-reach deck of cards. Though typically I need to tweak the results to meet outdated character-composition requirements.

Why I've Retired My PGP Keys and What's Replaced It

2017-03-12T21:54:38Z

Update August 2019: I’ve got a PGP key again but only for signing. I use another of my own tools, passphrase2pgp, to manage it.

tl;dr: Enchive (rhymes with “archive”) has replaced my use of GnuPG.

Two weeks ago I tried to encrypt a tax document for archival and noticed my PGP keys had just expired. GnuPG had (correctly) forbidden the action, requiring that I first edit the key and extend the expiration date. Rather than do so, I decided to take this opportunity to retire my PGP keys for good. Over time I’ve come to view PGP as largely a failure — it never reached the critical mass, the tooling has always been problematic, and it’s now a dead end. The only thing it’s been successful at is signing Linux packages, and even there it could be replaced with something simpler and better.

I still have a use for PGP: encrypting sensitive files to myself for long term storage. I’ve also been using it to consistently sign Git tags for software releases. However, very recently this lost its value, though I doubt anyone was verifying these signatures anyway. It’s never been useful for secure email, especially when most people use it incorrectly. I only need to find a replacement for archival encryption.

I could use an encrypted filesystem, but which do I use? I use LUKS to protect my laptop’s entire hard drive in the event of a theft, but for archival I want something a little more universal. Basically I want the following properties:

Sensitive content must not normally be in a decrypted state. PGP solves this by encrypting files individually. The archive filesystem can always be mounted. An encrypted volume would need to be mounted just prior to accessing it, during which everything would be exposed.
I should be able to encrypt files from any machine, even less-trusted ones. With PGP I can load my public key on any machine and encrypt files to myself. It’s like a good kind of ransomware.
It should be easy to back these files up elsewhere, even on less-trusted machines/systems. This isn’t reasonably possible with an encrypted filesystem which would need to be backed up as a huge monolithic block of data. With PGP I can toss encrypted files anywhere.
I don’t want to worry about per-file passphrases. Everything should be encrypted with/to the same key. PGP solves this by encrypting files to a recipient. This requirement prevents most stand-alone crypto tools from qualifying.

I couldn’t find anything that fit the bill, so I did exactly what you’re not supposed to do and rolled my own: Enchive. It was loosely inspired by OpenBSD’s signify. It has the tiny subset of PGP features that I need — using modern algorithms — plus one more feature I’ve always wanted: the ability to generate a keypair from a passphrase. This means I can reliably access my archive keypair anywhere.

On Enchive

Here’s where I’d put the usual disclaimer about not using it for anything serious, blah blah blah. But really, I don’t care if anyone else uses Enchive. It exists just to scratch my own personal itch. If you have any doubts, don’t use it. I’m putting it out there in case anyone else is in the same boat. It would also be nice if any glaring flaws I may have missed were pointed out.

Not expecting it to be available as a nice package, I wanted to make it trivial to build Enchive anywhere I’d need it. Except for including stdint.h in exactly one place to get the correct integers for crypto, it’s written in straight C89. All the crypto libraries are embedded, and there are no external dependencies. There’s even an “amalgamation” build, so make isn’t required: just point your system’s cc at it and you’re done.

Algorithms

For encryption, Enchive uses Curve25519, ChaCha20, and HMAC-SHA256.

Rather than the prime-number-oriented RSA as used in classical PGP (yes, GPG 2 can do better), Curve25519 is used for the asymmetric cryptography role, using the relatively new elliptic curve cryptography. It’s stronger cryptography and the keys are much smaller. It’s a Diffie-Hellman function — an algorithm used to exchange cryptographic keys over a public channel — so files are encrypted by generating an ephemeral keypair and using this ephemeral keypair to perform a key exchange with the master keys. The ephemeral public key is included with the encrypted file and the ephemeral private key is discarded.

I used the “donna” implementation in Enchive. Despite being the hardest to understand (mathematically), this is the easiest to use. It’s literally just one function of two arguments to do everything.

Curve25519 only establishes the shared key, so next is the stream cipher ChaCha20. It’s keyed by the shared key to actually encrypt the data. This algorithm has the same author as Curve25519 (djb), so it’s natural to use these together. It’s really straightforward, so there’s not much to say about it.

For the Message Authentication Code (MAC), I chose HMAC-SHA256. It prevents anyone from modifying the message. Note: This doesn’t prevent anyone who knows the master public key from replacing the file wholesale. That would be solved with a digital signature, but this conflicts with my goal of encrypting files without the need of my secret key. The MAC goes at the end of the file, allowing arbitrarily large files to be encrypted single-pass as a stream.

There’s a little more to it (IV, etc.), and it’s described in detail in the README.

Usage

The first thing you’d do is generate a keypair. By default this is done from /dev/urandom, in which case you should immediately back them up. But if you’re like me, you’ll be using Enchive’s --derive (-d) feature to create it from a passphrase. In that case, the keys are backed up in your brain!

$ enchive keygen --derive
secret key passphrase:
secret key passphrase (repeat):
passphrase (empty for none):
passphrase (repeat):

The first prompt is for the secret key passphrase. This is converted into a Curve25519 keypair using an scrypt-like key derivation algorithm. The process requires 512MB of memory (to foil hardware-based attacks) and takes around 20 seconds.

The second passphrase (or the only one when --derive isn’t used), is the protection key passphrase. The secret key is encrypted with this passphrase to protect it at rest. You’ll need to enter it any time you decrypt a file. The key derivation step is less aggressive for this key, but you could also crank it up if you like.

At the end of this process you’ll have two new files under $XDG_CONFIG_HOME/enchive: enchive.pub (32 bytes) and enchive.sec (64 bytes). The first you can distribute anywhere you’d like to encrypt files; it’s not particularly sensitive. The second is needed to decrypt files.

To encrypt a file for archival:

$ enchive archive sensitive.zip

No prompt for passphrase. This will create sensitive.zip.enchive.

To decrypt later:

$ enchive extract sensitive.zip.enchive
passphrase:

If you’ve got many files to decrypt, entering your passphrase over and over would get tiresome, so Enchive includes a key agent that keeps the protection key in memory for a period of time (15 minutes by default). Enable it with the --agent flag (it may be enabled by default someday).

$ enchive --agent extract sensitive.zip.enchive

Unlike ssh-agent and gpg-agent, there’s no need to start the agent ahead of time. It’s started on demand as needed and terminates after the timeout. It’s completely painless.

Both archive and extract operate from stdin to stdout when no file is given.

Feature complete

As far as I’m concerned, Enchive is feature complete. It does everything I need, I don’t want it to do anything more, and at least two of us have already started putting it to use. The interface and file formats won’t change unless someone finds a rather significant flaw. There is some wiggle room to replace the algorithms in the future should Enchive have that sort of longevity.

The Physical Analog for Encryption is the Hyperdrive

2012-08-06T00:00:00Z

I was recently watching GetDaved play through X-Wing Alliance, a game I myself played in college. I have a lot of nostalgia for it, especially because TIE Fighter was one of the first games I ever invested a lot of time into playing. Just hearing the sounds and music brings back relaxing memories.

In one of the early missions the player travels through hyperspace (which ain’t like dusting crops) to a storage area located in deep space. It’s a family business and the player is out there to take inventory of storage containers. Like when I saw the wormhole minefield in Deep Space 9, it got me thinking, “Why?” Why keep all these storage containers in deep space? There’s no defense or security out there to stop someone from stealing containers. It seems like it would be better to store those at the home base where they can be protected.

Storing items at random locations in deep space is actually very secure — more so than any lock! Space is huge. Even with faster-than-light travel, searching a galaxy for a storage location would be impractical. It would be as impractical as using brute-force to find an encryption key — another huge search space. Also, if the storage location has been in use for X years, you’d need to come within X light-years of it, at least, in order to find it, since even gravity itself is limited by the speed of light.

Physical locks are usually described as the physical analogy of cryptography. Honestly, it’s not a very good analogy. The brute-force method for bypassing a lock isn’t to keep trying different keys or combinations until it works. No, it’s to just smash something (a window, the lock) or pick the lock. When translated back into the crypto world that’s like breaking a cipher, which isn’t a practical attack in modern cryptography.

No, the physical analogy for cryptography is deep space storage. The only practical way to access deep space items is to learn the coordinates of the storage location, which is the equivalent of the encryption key. If the coordinates are lost or forgotten, the items are as good as destroyed, just like data.

There are actually some advantages of physical “encryption.” Ciphertext can be decrypted offline without being detected. It’s not possible to visit deep space storage without having a physical presence, which is certainly more detectable than offline decryption. There’s also the advantage that it’s somewhat easier to tell when the key (location) generation algorithm is busted or you’re just bad at picking passphrases: someone else’s stuff will already be there. A literal collision.

Publishing My Private Keys

2012-06-24T00:00:00Z

Update March 2017: I no longer use PGP. Also, there’s a bug in GnuPG that silently discards these security settings, and it’s unlikely to ever get fixed. You’ll need to find/build an old version of GnuPG if you want to properly protect your secret keys.

Update August 2019: I’ve got a PGP key again, but I’m using my own tool, passphrase2pgp, to manage it. This tool allows for a particular workflow that GnuPG has never and will never provide. It doesn’t rely on S2K as described below.

One of the items in my dotfiles repository is my PGP keys, both private and public. I believe this is a unique approach that hasn’t been done before — a public experiment. It may seem dangerous, but I’ve given it careful thought and I’m only using the tools already available from GnuPG. It ensures my keys are well backed-up (via the Torvalds method) and available wherever I should need them.

In your GnuPG directory there are two core files: secring.gpg and pubring.gpg. The first contains your secret keys and the second contains public keys. secring.gpg is not itself encrypted. You can (should) have different passphrases for each key, after all. These files (or any PGP file) can be inspected with --list-packets. Notice it won’t prompt for a passphrase in order to get this data,

$ gpg --list-packets ~/.gnupg/secring.gpg
:secret key packet:
    version 4, algo 1, created 1298734547, expires 0
    skey[0]: [2048 bits]
    skey[1]: [17 bits]
    iter+salt S2K, algo: 9, SHA1 protection, hash: 10, salt: ...
    protect count: 10485760 (212)
    protect IV:  a6 61 4a 95 44 1e 7e 90 88 c3 01 70 8d 56 2e 11
    encrypted stuff follows
:user ID packet: "Christopher Wellons <...>"
:signature packet: algo 1, keyid 613382C548B2B841
... and so on ...

Each key is encrypted individually within this file with a passphrase. If you try to use the key, GPG will attempt to decrypt it by asking for the passphrase. If someone were to somehow gain access to your secring.gpg, they’d still need to get your passphrase, so pick a strong one. The official documentation advises you to keep your secring.gpg well-guarded and only rely on the passphrase as a cautionary measure. I’m ignoring that part.

If you’re using GPG’s defaults, your secret key is encrypted with CAST5, a symmetric block cipher. The encryption key is your passphrase salted (mixed with a non-secret random number) and hashed with SHA-1 65,536 times. Using the hash function over and over is called key stretching. It greatly increases the amount of required work for a brute-force attack, making your passphrase more effective. All of these settings can be adjusted to better protect the secret key at the cost of less portability. Since I’ve chosen to publish my secring.gpg in my dotfiles repository I cranked up the settings as far as I can.
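To make the key stretching idea concrete, here is a minimal Python sketch of an iterated and salted string-to-key function in the style of RFC 4880. The function name `s2k_iterated_salted`, the sample passphrase, and the sample salt are assumptions for illustration, and this is not GnuPG's code. One detail worth knowing: in the RFC the count is the number of bytes fed to the hash, not the number of separate hash invocations, and the sketch follows that reading.

```python
import hashlib

def s2k_iterated_salted(passphrase: bytes, salt: bytes, count: int,
                        digest: str = "sha1", key_len: int = 16) -> bytes:
    """Derive key_len bytes from a passphrase, RFC 4880 style (sketch)."""
    unit = salt + passphrase
    # The count is a number of bytes to hash; at least one full unit is hashed.
    total = max(count, len(unit))
    key = b""
    context = 0
    while len(key) < key_len:
        h = hashlib.new(digest)
        # Additional hash contexts are preloaded with zero bytes so that each
        # one yields different output when the key is longer than one digest.
        h.update(b"\x00" * context)
        full, rest = divmod(total, len(unit))
        for _ in range(full):
            h.update(unit)
        h.update(unit[:rest])
        key += h.digest()
        context += 1
    return key[:key_len]

salt = bytes(range(8))  # hypothetical 8-byte salt; a real one is random
cast5_key = s2k_iterated_salted(b"hunter2", salt, 65536)                  # 128-bit key
aes256_key = s2k_iterated_salted(b"hunter2", salt, 65536, "sha512", 32)   # 256-bit key
```

Raising `count` makes every passphrase guess proportionally more expensive, which is the whole point of the stretching.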

I changed the cipher to AES256, which is more modern, more trusted, and more widely used than CAST5. For the passphrase digest, I selected SHA-512. There are better passphrase digest algorithms out there but this is the longest, slowest one that GPG offers. The PGP spec supports between 1024 and 65,011,712 digest iterations, so I picked one of the largest. 65 million iterations takes my laptop over a second to process — absolutely brutal for someone attempting a brute-force attack. Here’s the command to change to this configuration on an existing key,

gpg --s2k-cipher-algo AES256 --s2k-digest-algo SHA512 --s2k-mode 3 \
    --s2k-count 65000000 --edit-key 

When the edit key prompt comes up, enter passwd to change your passphrase. You can enter the same passphrase again and it will re-use it with the new configuration.
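The odd-looking limits of 1024 and 65,011,712 fall out of how the count is stored: RFC 4880 packs it into a single byte, a bit like a tiny floating point number. The sketch below decodes that byte. The function names are made up for illustration, and the round-up behavior of the encoder is an assumption about what a reasonable implementation would do, not a statement about GnuPG internals.

```python
def s2k_decode_count(c: int) -> int:
    """Expand a one-byte coded S2K count (RFC 4880, section 3.7.1.3)."""
    if not 0 <= c <= 255:
        raise ValueError("coded count must fit in one byte")
    # Low 4 bits are a mantissa, high 4 bits an exponent.
    return (16 + (c & 15)) << ((c >> 4) + 6)

def s2k_encode_count(count: int) -> int:
    """Smallest coded byte whose decoded count is at least `count`."""
    for c in range(256):
        if s2k_decode_count(c) >= count:
            return c
    return 255  # requests above the maximum are clamped

print(s2k_decode_count(0))    # 1024, the smallest count
print(s2k_decode_count(255))  # 65011712, the largest count
print(s2k_decode_count(212))  # 10485760, the "(212)" in the listing above
print(s2k_decode_count(s2k_encode_count(65000000)))  # 65011712
```

So a request for 65000000 can only be represented by the next available step, which happens to be the maximum.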

I’m feeling quite secure with my secret key, despite publishing my secring.gpg. Before now, I was much more at risk of losing it to disk failure than having it exposed. I challenge anyone who doubts my security to crack my secret key. I’d rather learn that I’m wrong sooner than later!

With this established in my dotfiles repository, I can more easily include private dotfiles. Rather than use a symmetric cipher with an individual passphrase on each file, I encrypt the private dotfiles to myself. All my private dotfiles are managed with one key: my PGP key. This also plays better with Emacs. While it supports transparent encryption, it doesn’t even attempt to manage your passphrase (with good reason). If the file is encrypted with a symmetric cipher, Emacs will prompt for a passphrase on each save. If I encrypt them with my public key, I only need the passphrase when I first open the file.

How it works right now is any dotfile that ends with .priv.pgp will be decrypted into place — not symlinked, unfortunately, since this is impossible. The install script has a -p switch to disable private dotfiles, such as when I’m using an untrusted computer. gpg-agent ensures that I only need to enter my passphrase once during the install process no matter how many private dotfiles there are.

Versioning Personal Configuration Dotfiles

2012-06-23T00:00:00Z

For almost two months now I’ve been versioning all my personal dotfiles in Git. Just as when I did the same with Emacs, it’s been extremely liberating and I wish I had been doing this for years. Currently it covers 11 different applications including my web browser, shell, window manager, and cryptographic keys, giving me a unified experience across all of my machines — which, between home, work, and virtual computers is about half a dozen.

Like anything, the biggest problem with not versioning these files is introducing changes. If I add an interesting tweak to a dotfile, I won’t see that change on my other machines until I either copy it over or I enter it manually again. Because I’d worry about clobbering other unpropagated changes, it was usually the latter. Only changes I could commit to memory would propagate. Any tweak that wasn’t easy to duplicate manually I couldn’t rely on, so I was discouraged from customizing too much and relied mostly on defaults. This is bad!

Source control solves almost all of this trivially. If I notice a pattern in my habits or devise an interesting configuration, I can immediately make the change, commit it, and push it. Later, when I’m on another computer and I notice it missing, I just do a pull without needing to worry about clobbering any local changes. When moving onto a new computer/install, all I need to do is clone the repository and I’ve got every configuration I have without having to snoop around the last computer I used figuring out what to copy over.

Most of the applications I prefer have tidy, manually-editable dotfiles that version well, so I would be able to capture almost my entire environment. One near-exception was Firefox. By itself, it doesn’t play well, but since I use Pentadactyl I’m able to configure it cleanly like a proper application.

The last straw that triggered my dotfiles repository was managing my Bash aliases. It had gotten just long enough that I was tired of manually synchronizing them. It was finally time to invest some time into nipping this in the bud once and for all. Unsure what approach to take, I looked around to see what other people were doing. There are two basic approaches: version your entire home directory or symbolically link your dotfiles into place from a stand-alone repository.

The first approach is straightforward but has a number of issues that make it a poor choice. You don’t need an install script or anything special, you just use your home directory.

cd
git init
git add .bashrc .gitconfig ...

The first problem is that most of the files Git sees you do not want to version. These are all going to show up in the status listing and, because there’s no pattern to them, there’s really no way to filter them out with a .gitignore. Any other clones you have in your home directory may also confuse Git, looking like submodules. You’ll have to dodge this extra stuff all the time when working in the repository.

The second problem is that Git has only one .git directory, in the repository root. If there’s no .git in the current directory, it will keep searching upwards until it finds one … which will inevitably be your dotfiles repository. This will eventually lead to annoying mistakes where you accidentally commit work to your dotfiles repository for a while until you notice you forgot a git init. A possible workaround is to keep the .git directory out of your home directory and use the environment variable GIT_DIR to tell Git where it is when you’re working on it. That sounds like a pain to me.

The other approach is to have your dotfiles repository cloned on its own, then use symlinks to put the configuration files into place. You need to write an install script to do this. However, not all configuration files are sitting directly in your home directory. Some have their own directory. Modern applications have moved into a directory under ~/.config/. Your script needs to handle these.

Why symlinks rather than just copying the file into place? Well, if you make any changes to the installed files, Git won’t see them and you risk losing those changes.

Why symlinks rather than hard links? Symlinks deal with the atomic replacement issue better. Conscientious applications are very careful about how they write your data to disk. Unless it’s some kind of database, files are never edited in-place. The application rewrites the entire file at once. If the application is stupid and overwrites the file directly, there’s a brief instant where your data is not on disk at all! First, it truncates the original file, deleting your data, then it rewrites the data, and, if it’s not too stupid, calls fsync() to force the write to the hardware. It’s stupid, but it will work with symlinks.

The conscientious application will write the data to a temporary file, call fsync(), then atomically rename() the new file over top the original file. If there’s any failure along the way, some intact version of the data will be on the disk. The problem is that this will replace your symlink and changes won’t be captured by the repository. Such an incident will be obvious with symlinks, since the file will no longer be a symlink. Hard links are much less obvious.
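The write, fsync, rename sequence is short enough to sketch in Python. The helper names below are hypothetical, and this is only an illustration of the pattern on a POSIX system, not code from any particular application. The demo function shows the side effect described above: after an atomic write to a symlinked dotfile, the symlink is gone and the file in the repository never saw the change.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temporary file, fsync it, then rename it over path."""
    # abspath does not resolve symlinks, so the temp file lands next to path.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the new contents to the disk
        os.replace(tmp, path)     # atomic rename() on POSIX
    except BaseException:
        os.unlink(tmp)            # do not leave the temp file behind
        raise

def demo_symlink_is_replaced() -> bool:
    """Return True if atomic_write turned a symlink into a regular file."""
    with tempfile.TemporaryDirectory() as d:
        repo_file = os.path.join(d, "_bashrc")  # stands in for the repository copy
        installed = os.path.join(d, ".bashrc")  # stands in for the installed symlink
        with open(repo_file, "wb") as f:
            f.write(b"old\n")
        os.symlink(repo_file, installed)
        atomic_write(installed, b"new\n")
        with open(repo_file, "rb") as f:
            repo_unchanged = f.read() == b"old\n"
        return repo_unchanged and not os.path.islink(installed)
```

An application that wants to preserve the symlink would have to resolve it first (for example with `os.path.realpath`) and do the atomic replacement on the target instead.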

Smart applications, like Emacs, also know not to clobber your symlinks and will handle these writes properly, leaving the symlink intact. With hard links, there is no way for the application to know it needs to treat a file specially.

I figured that I could use someone else’s install script, so I wouldn’t have to worry about getting this right. Since Ruby is so popular with Git, many people are using Rake for this task. However, I want to be able to maintain the install script myself and I don’t know Rake. I also don’t want to depend on anything unusual to install my dotfiles. So that was out.

Second, I don’t want to have to specifically list the files to install, or not install, in the script. Don’t put the same information in two places when one will do. This script should be able to tell on its own what files to install.

Third, I didn’t want my dotfiles to actually be dotfiles in my repository. It makes them hard to see and manage, since they’re hidden. They’re much easier to handle when the dot is replaced with an underscore.

So I wrote my own install script which installs any file beginning with an underscore. I’ve since added support for “private” dotfiles along the way. These are dotfiles that contain sensitive information and are encrypted in the repository, allowing me to continue publishing it safely.
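The core of such a script is small. The following Python sketch is not my actual install script, just an illustration of the underscore convention; the function name `install_dotfiles` and its clobbering policy are assumptions, and it ignores private dotfiles and anything nested under directories like ~/.config/.

```python
import os

def install_dotfiles(repo: str, home: str) -> list:
    """Symlink every top-level _name in repo to home/.name (sketch)."""
    installed = []
    repo = os.path.abspath(repo)
    for name in sorted(os.listdir(repo)):
        if not name.startswith("_"):
            continue  # only underscore-prefixed entries are dotfiles
        source = os.path.join(repo, name)
        dest = os.path.join(home, "." + name[1:])
        if os.path.islink(dest):
            os.unlink(dest)  # refresh a link left by an earlier install
        elif os.path.exists(dest):
            continue  # never clobber a real file; leave it for the user
        os.symlink(source, dest)
        installed.append(dest)
    return installed
```

Because the script discovers files by name, adding a new dotfile to the repository requires no change to the script itself.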

If you’d like to create your own dotfiles repository, my dotfile repository may not be useful beyond standing as an example but my install script may be directly reusable for you.

There’s a lot to talk about, so I’ll be making a few more posts about this.

SSH and GPG Agents

2012-06-08T00:00:00Z

If you’re using SSH or GPG with any sort of frequency, you should definitely be using their accompanying *-agent programs. The agents allow you to gain a whole lot of convenience without compromising your security. Many people seem to be unaware these tools exist, so here’s an overview along with some tips on how to use them effectively.

Let’s start from the top.

Both SSH and GPG involve the use of asymmetric encryption, and the private key is protected by a user-entered passphrase. The private key is generally never written to the filesystem in plaintext. In the case of GPG, these keys are the primary focus of the application. For SSH, they’re a useful tool to make accessing remote machines less tedious. (The SSH server is authenticated by a public key, too, but this is unrelated to agents.)

For those who are unaware, rather than enter a password when logging into a remote machine, you can identify yourself by a public key. Generating a key is simple.

ssh-keygen

You’ll almost certainly want to accept the default location for the key (~/.ssh/id_rsa) because this is where SSH will look for it. Make sure you enter a passphrase, which will encrypt the private key. The reason this is important is because, without it, anyone who gains access to your id_rsa file will be able to access any remote systems that have been told to trust your public key. By having a passphrase, this person needs not only the id_rsa file, but also the passphrase (two-factor authentication), so you probably want to pick a long, strong one. This may sound inconvenient, but ssh-agent will help you.

The key generation process will create two files: id_rsa (private key) and id_rsa.pub (public key). The latter is what you give to remote systems.

Telling a remote system about your key is simple,

ssh-copy-id

This will copy your id_rsa.pub to the remote system, prompting you for the password on the remote system (not the passphrase you just entered), adding it to the file ~/.ssh/authorized_keys. From this point on, all logins will use your new keypair rather than prompt you for a password. Since you put a passphrase on your key, this may seem pointless — it seems you still need to type in a password for every connection. Bear with me here!

As a side note, you should have a unique SSH keypair for each site, so you’ll have several of them. This way you can revoke access to a particular site without affecting the others.

For GPG — the GNU Privacy Guard, the free software PGP implementation — your keys are stored under ~/.gnupg/ in a database. Generating a key is also a simple command,

gpg --gen-key

This is a slightly more complicated process, which I won’t get into here. In contrast to SSH, you’ll generally have only one keypair per identity (i.e. you only have one).

So you’ve got these keys encrypted by passphrases. If they’re going to be any use then they’ll be long, annoying things that are a pain to type in. If that was the end of the story this would be really inconvenient, enough to make the use of passphrases too costly for many people to bother. Fortunately, we have agents to help.

An agent is a daemon process that can hold onto your passphrase (gpg-agent) or your private key (ssh-agent) so that you only need to enter your passphrase once within some period of time (possibly for the entire life of the agent process), rather than type it over and over again as it’s needed. The agents are very careful about how they hold on to this sensitive information, such as avoiding having it written to swap. You can also configure how long you want them to hold onto your passphrase/key before purging it from memory.

The ssh and gpg programs need to know where to find the agents. This is done through environment variables. For ssh-agent, the process ID is stored in SSH_AGENT_PID and the location of the Unix socket for communication is in SSH_AUTH_SOCK. gpg-agent stuffs everything into one variable, GPG_AGENT_INFO (which is a pain if you want to use this information in a script). When the main program is invoked and it needs to use the private key, it will use these variables and get in touch with the agent to see if it can supply the needed information without bothering the user.
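For what it's worth, picking GPG_AGENT_INFO apart is not hard once you know the layout. As far as I know the value has the form socket-path:pid:protocol-version, and the sketch below assumes exactly that; the type and function names are invented for the example.

```python
from typing import NamedTuple, Optional

class GpgAgentInfo(NamedTuple):
    socket_path: str
    pid: int
    protocol: int

def parse_gpg_agent_info(value: str) -> Optional[GpgAgentInfo]:
    """Split 'socket:pid:protocol'; return None if the value is malformed."""
    # Split from the right, since the socket path itself may contain colons.
    parts = value.rsplit(":", 2)
    if len(parts) != 3 or not parts[0]:
        return None
    try:
        pid, protocol = int(parts[1]), int(parts[2])
    except ValueError:
        return None
    if pid <= 0:
        return None
    return GpgAgentInfo(parts[0], pid, protocol)

# Hypothetical value, shaped like what gpg-agent --daemon prints.
info = parse_gpg_agent_info("/tmp/gpg-abc123/S.gpg-agent:4321:1")
```

A script would read the real value from `os.environ.get("GPG_AGENT_INFO", "")` and treat `None` as "no agent known".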

Remember, a process can’t change the environment of its parent process so you need to set this information in the agent’s parent shell somehow. There are two methods to set these up: eval and exec.

When you start the agent, it forks off its daemon process and prints the variable information to stdout. This can be evaled directly into the current environment. You could drop these lines directly in your .bashrc so that the agents are always there. (Though they won’t exit with your shell, lingering around uselessly! More on this ahead.)

eval $(ssh-agent)
eval $(gpg-agent --daemon)

For the exec method, you replace your current shell with a new one with a modified environment. To do this, you ask the agent to exec into a shell, with the variables set, rather than return control.

exec ssh-agent bash
exec gpg-agent --daemon bash

As a cool trick, you can chain these together. ssh-agent becomes gpg-agent which then becomes bash.

exec ssh-agent gpg-agent --daemon bash

Note that gpg-agent is capable of being an ssh-agent as well by using the --enable-ssh-support option, so you don’t need to launch an ssh-agent. Unfortunately, I don’t like to use this because gpg-agent gets a little too personal with the SSH key, storing its own copy with its own passphrase again.

On the other hand, gpg-agent is much more advanced than OpenSSH’s ssh-agent. When you want to have ssh-agent manage a key, you need to first tell it about the key with ssh-add. With no arguments, it will use ~/.ssh/id_rsa. If you forget to do this, ssh will ask for your passphrase directly, in your terminal, not allowing ssh-agent to hold onto it. By comparison, gpg will always ask gpg-agent to retrieve your passphrase when it’s needed (if the agent is available), so it will cache your passphrase on demand. No need to explicitly register with the agent. Even better, it will try its best to use a “PIN entry” program to read your passphrase, which helps protect against some kinds of keyloggers by preventing other processes from seeing your keystrokes.

Well, this is all fine and dandy except when you’ve already got an agent running. Say you’re launching a new terminal emulator window from an existing one, creating a new shell. Unfortunately, even though you have agents running and they’re listed in your environment (from the origin shell), they’ll still spawn new agents! This is really lousy behavior, in my opinion. There’s no --inherit option to tell them to silently pass along the information of the existing agent if it appears to be valid. This causes two problems. First, you’ll need to enter your passphrases again for the new agent. Second, these new agents will linger around after the spawning shell has exited, hogging important non-swappable memory.

The direct workaround is to, in your shell init script, check for these variables yourself and check that they’re valid (the agent process is still running) before trying to spawn any agents. This is tedious, error-prone, and makes each user do a lot of work that could have been done in one place by one person instead.
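To give a feel for the bookkeeping involved, here is a rough Python sketch of that validity check for ssh-agent on a POSIX system. The function names are hypothetical, and the check is deliberately shallow: it confirms the socket path exists and the recorded process is alive, not that the process is really an agent.

```python
import os

def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 tests for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def ssh_agent_usable(env) -> bool:
    """Check whether the agent named in the environment still looks alive."""
    sock = env.get("SSH_AUTH_SOCK")
    pid = env.get("SSH_AGENT_PID")
    if not sock or not pid or not pid.isdigit() or int(pid) <= 0:
        return False
    # Cheap check that the socket path is still there; it does not prove
    # that an agent is actually listening on it.
    return os.path.exists(sock) and pid_is_running(int(pid))

if not ssh_agent_usable(os.environ):
    pass  # this is where a shell init script would spawn a fresh agent
```

Every user repeating some variation of this in their own init script is exactly the duplicated effort being complained about.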

There’s still the problem of when you launch a new shell that doesn’t inherit the variables (e.g. a remote login), so there’s no way for it to be aware of the existing agents. To fix this, you’d need to write the agent information to a file. The shell init script checks this file for an existing agent before spawning one. This is even more complicated, more error-prone, and subject to race conditions. Why make every user go through this process?!

Fortunately someone’s done all this work so you don’t have to! There’s an awesome little tool called Keychain which can be used to launch the agents for you. It stores the agent information in a file so that you only ever launch one instance of the agent, and the agents will be shared across every shell. It does have an --inherit option — the default behavior, so you don’t even need to ask nicely. Instead of running the *-agents directly, you just put this in your .bashrc,

eval $(keychain --eval --quiet)

So simple and it just works! I was so happy when I found this. This is the magic word that makes using agents a breeze, so I can’t recommend it enough.

Avoid Zip Archives

2009-03-22T00:00:00Z

In a previous post about the LZMA compression algorithm, I made a negative comment about zip archives and moved on. I would like to go into more detail about it now.

A zip archive serves three functions all-in-one: compression, archive, and encryption. On a unix-like system, these functions would normally be provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The unix philosophy says to "write programs that do one thing and do it well".

So in the case of zip archives, we are doing three things poorly when, instead, we should be using three separate tools that each do one thing well.

When we use three different tools, our encrypted archive is a lot like an onion. On the outside we have encryption. After we peel that off by decrypting it, we have compression, and after removing that layer, finally the archive. This is reflected in the filename: .tar.gz.gpg. As a side note, if GPG didn't already support it, we could add base-64 encoding if needed as another layer on the onion: .tar.gz.gpg.b64.

By using separate tools, we can also swap different tools in and out without breaking any spec. Previously I mentioned using LZMA, which could be used in place of gzip or bzip2. Instead of .tar.gz.gpg you can have .tar.lzma.gpg. Or you can swap out GPG for encryption and use, say, CipherSaber as .tar.lzma.cs2. If we use a single one-size-fits-all format, we are limited by the spec.

Compression

Both zip and gzip basically use the same compression algorithm. The zip spec actually allows for a variety of other compression algorithms, but you cannot rely on other tools to support them.

Zip archives are also inside out. Instead of solid compression, which is what happens in tarballs, each file is compressed individually. Redundancy between different files cannot be exploited. The equivalent would be an inside out tarball: .gz.tar. This would be produced by first individually gzipping each file in a directory tree, then archiving them with tar. This results in larger archive sizes.
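The size difference is easy to demonstrate with a small Python sketch. It uses zlib (the same DEFLATE family as zip and gzip) on made-up data, and it ignores the per-entry headers that real tar and zip files add, so treat the numbers as illustrative only.

```python
import random
import zlib

def per_file_size(files) -> int:
    """Total size when each file is compressed on its own (zip style)."""
    return sum(len(zlib.compress(data)) for data in files)

def solid_size(files) -> int:
    """Size when the files are concatenated first (tarball style)."""
    return len(zlib.compress(b"".join(files)))

# Ten hypothetical files sharing the same 2000 bytes of incompressible data.
rng = random.Random(0)
shared = bytes(rng.getrandbits(8) for _ in range(2000))
files = [shared + b"file %d\n" % i for i in range(10)]

print(per_file_size(files), solid_size(files))
```

Each file looks like noise by itself, so compressing them individually saves nothing, while the solid stream only has to store the shared data once.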

However, there is an advantage to inside out archives: random access. We can access a file in the middle of the archive without having to take the whole thing apart. In general use, this sort of thing isn't really needed, and solid compression would be more useful.
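Random access is simple to see with Python's standard zipfile module. This sketch builds a small archive in memory (the member names and contents are invented) and then pulls out one member from the middle; the central directory at the end of the archive tells the reader where that member starts, so the other entries are never decompressed.

```python
import io
import zipfile

# Build a hypothetical archive with 100 small members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
    for i in range(100):
        archive.writestr("file%03d.txt" % i, ("contents %d\n" % i) * 50)

# Read a single member from the middle without touching the rest.
with zipfile.ZipFile(buf) as archive:
    middle = archive.read("file050.txt")
```

Doing the same with a .tar.gz means decompressing the stream from the beginning until the wanted member shows up.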

Encryption

Encryption is where zip has been awful in the past. The original spec's encryption algorithm had serious flaws and no one should even consider using it today.

Since then, AES encryption has been worked into the standard and implemented differently by different tools. Unless the same zip tool is used on each end, you can't be sure AES encryption will work.

By placing encryption as part of the file spec, each tool has to implement its own encryption, probably leaving out considerations like using secure memory. These tools are concentrating on archiving and compression, and so encryption will likely not be given a solid effort.

In the implementations I know of, the archive index isn't encrypted, so someone could open it up and see lots of file metadata, including filenames.

When you encrypt a tarball with GnuPG, you have all the flexibility of PGP available. Asymmetric encryption, web of trust, multiple strong encryption algorithms, digital signatures, strong key management, etc. It would be unreasonable for an archive format to have this kind of thing built in.

Conclusion

You are almost always better off using a tarball rather than a zip archive. Unfortunately the receiver of an archive will often be unable to open anything else, so you may have no choice.

Controlling a Minefield

2008-12-16T00:00:00Z

Some time ago I was watching through the entire series of Deep Space 9. It was a Star Trek television show about a space station that rests next to a wormhole that connects to the other side of the galaxy (the Gamma quadrant).

The Gamma quadrant is ruled by a group called the Dominion, and they are looking to conquer the Federation side of the galaxy (the Alpha quadrant). At one point during the series, the Federation needs to temporarily disable the wormhole to prevent Dominion ships from crossing through. They do this by mining the wormhole with identical, cloaked, self-replicating mines.

If a mine is destroyed, the neighboring mines will replicate a replacement. The minefield repairs itself. This makes removing the minefield within a reasonable amount of time difficult to impossible. If even a single mine is left behind, it can replicate the entire minefield again.

The most interesting question here is this:

When the Federation returns and wants to remove the minefield, how would they do it? What would stop the Dominion from doing the same thing?

The first thing that comes to mind is having a kill signal, but what would this signal be? It could simply be a plain "kill" command, but the Dominion could also broadcast such a signal to disable the minefield. Consider that the Dominion could capture a single mine and study everything about its workings. The minefield itself could therefore hold no secrets whatsoever. This leaves out any possibility of a secret kill command stored in the mines.

Here's what I would do, assuming that humans or aliens have not yet discovered some giant breakthrough in factoring in the Star Trek universe. I would randomly generate two very large prime numbers. Today, two 1024-bit primes should be more than enough, but in 350 years even larger numbers would probably be necessary. Then, I multiply these two numbers together and store the product in the mine software. To disable the minefield, I simply broadcast these two numbers into the minefield. The mines would be programmed to take the product of any pair of numbers they receive. If the product matches the internal number, the mine shuts down.

Voila! A method for shutting down the minefield. The enemy can know everything about every single mine's construction, including the software and data stored on every mine, but will be unable to disable the minefield without factoring a very large composite number, which would presumably be difficult or impossible (within a reasonable amount of time).
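The check each mine performs could look something like this Python sketch. The two Mersenne primes are stand-ins chosen only because they are easy to write down; real primes would be random and much larger. Rejecting factors of 1 is my own addition here: without it, anyone could broadcast 1 and the stored number itself.

```python
# Toy secret: two well-known Mersenne primes. Far too small and too
# structured for real use; purely illustrative.
SECRET_P = 2**61 - 1
SECRET_Q = 2**89 - 1

# This product is the only thing stored in the mines.
STORED_PRODUCT = SECRET_P * SECRET_Q

def accepts_kill_signal(a: int, b: int, stored: int = STORED_PRODUCT) -> bool:
    """Return True if the broadcast pair is a nontrivial factorization."""
    # Reject the trivial pair (1, n), which multiplies to n without factoring.
    if a <= 1 or b <= 1:
        return False
    return a * b == stored
```

Capturing a mine reveals `STORED_PRODUCT` and this function, but not the two factors that satisfy it.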

Another possibility would be using a hash. Come up with a strong passphrase, then use a hashing algorithm like SHA-1 or MD5, or whatever is available and appropriate in 350 years, to hash the passphrase. Store the hash in the mines. When you want to disable the minefield, broadcast the passphrase. These mines will hash the broadcast and compare it to the stored hash. It's really the same solution as before: a one-way function. This is also similar to how passwords are stored inside a computer today.
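A sketch of the hash variant in Python follows. I have swapped in SHA-256 rather than the SHA-1 or MD5 named above, since both of those have aged badly; the passphrase and function names are made up for the example.

```python
import hashlib
import hmac

def make_stored_hash(passphrase: bytes) -> bytes:
    """Computed once by the minelayer; only the hash goes into the mines."""
    return hashlib.sha256(passphrase).digest()

def accepts_passphrase(broadcast: bytes, stored_hash: bytes) -> bool:
    """Hash whatever was broadcast and compare it to the stored hash."""
    candidate = hashlib.sha256(broadcast).digest()
    return hmac.compare_digest(candidate, stored_hash)

# Hypothetical passphrase; a real one would be long and random.
stored = make_stored_hash(b"correct horse battery staple")
```

Real password storage adds a salt and key stretching on top of this, but for a single long random passphrase the bare one-way function captures the idea.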

If we wanted more commands, like "don't blow up any ships for a while" or "increase minefield density", we could generate more composites corresponding to each command. However, once a command is issued, the secret (the two prime numbers) is out, and it cannot be used again. In this case, I would go into the realm of public key cryptography.

I would issue a command, along with a timestamp, and maybe even a nonce that could double as a global identifier for the command, and sign the whole deal using my private key. On each mine I would store the public key. When a command is received, the mines would check the signature before executing the command. I could then issue repeat commands, as the timestamps would change each time. An adversary learns nothing when a command is issued, because the timestamps would make any replay attacks useless.
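Here is a toy Python sketch of signed, timestamped commands. It uses textbook RSA with a key built from two Mersenne primes and no padding, which is hopelessly insecure and only meant to show the flow; a real system would use a vetted signature library. All names are invented for the example, and it tracks only a timestamp, leaving out the optional nonce. Note that `pow` with a negative exponent needs Python 3.8 or later.

```python
import hashlib

# Toy RSA key: far too small and too structured for real use.
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q                          # public modulus, stored in every mine
E = 65537                          # public exponent, stored in every mine
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, never leaves home

def encode(command: str, timestamp: int) -> bytes:
    return ("%d:%s" % (timestamp, command)).encode("utf-8")

def digest_int(message: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(command: str, timestamp: int) -> int:
    """Done by the minelayer, who alone holds D."""
    return pow(digest_int(encode(command, timestamp), N), D, N)

class Mine:
    def __init__(self, n: int, e: int):
        self.n = n
        self.e = e
        self.last_timestamp = 0

    def receive(self, command: str, timestamp: int, signature: int) -> bool:
        """Return True if the command is authentic and not a replay."""
        expected = digest_int(encode(command, timestamp), self.n)
        if pow(signature, self.e, self.n) != expected:
            return False  # bad signature
        if timestamp <= self.last_timestamp:
            return False  # replayed or stale command
        self.last_timestamp = timestamp
        return True
```

The mine holds only `N` and `E`, so capturing one gives an adversary the ability to verify commands but not to create them, and a recorded broadcast is rejected the second time because its timestamp is no longer newer than the last one seen.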

Minefields just like this exist today all over the Internet, as botnets. Thousands of computers all around the world become infected with malware and come under the control of a single individual or group. Individual machines in the botnet could be taken out, but removing the entire botnet is difficult as it grows and repairs itself. Any security researcher could disassemble the botnet malware and learn everything about it, so the malware can store no secrets. How does a malicious person control the botnet, then, without someone else taking control? Public key cryptography, just as described above.