Keyringless GnuPG

This article was discussed on Hacker News.

My favorite music player is Audacious. It follows the Winamp Classic tradition of not trying to manage my music library. Instead it waits patiently for me to throw files and directories at it. These selections will be informally grouped into transient, disposable playlists of whatever I fancy that day.

This matters to me because my music collection is the result of around 25 years of hoarding music files from various sources including CD rips, Napster P2P sharing, and, most recently, YouTube downloads. It’s not well-organized, but it’s organized well enough. Each album has its own directory, and related albums are sometimes grouped together under a directory for a particular artist.

Over the years I’ve tried various music players, and some have either wanted to manage this library or hide the underlying file-organized nature of my collection. Both situations are annoying because I really don’t want or need that abstraction. I’m going just fine thinking of my music library in terms of files, thank you very much. Same goes for ebooks.

GnuPG is like a media player that wants to manage your whole music library. Rather than MP3s, it’s crypto keys on a keyring. Nearly every operation requires keys that have been imported into the keyring. Until GnuPG 2.2.8 (June 2018), which added the --show-keys command, you couldn’t even be sure what you were importing until after it was already imported. Hopefully it wasn’t garbage.

GnuPG does have a pretty good excuse. It’s oriented around the Web of Trust model, and it can’t follow this model effectively without having all the keys at once. However, even if you don’t buy into the Web of Trust, the GnuPG interface still requires you to play by its rules. Sometimes I’ve got a message, a signature, and a public key and I just want to verify that they’re all consistent with each other, damnit.

$ gpg --import foo.asc
gpg: key 1A719EF63AEB2CFE: public key "foo" imported
gpg: Total number processed: 1
gpg:               imported: 1
$ gpg --verify --trust-model always message.txt.sig message.txt
gpg: Signature made Fri 09 Aug 2019 05:44:43 PM EDT
gpg:                using EDDSA key ...1A719EF63AEB2CFE
gpg: Good signature from "foo" [unknown]
gpg: WARNING: Using untrusted key!
$ gpg --batch --yes --delete-key 1A719EF63AEB2CFE

Three commands and seven lines of output when one of each would do. Plus there’s a false warning: Wouldn’t an “always” trust model mean that this key is indeed trusted?

Signify

Compare this to OpenBSD’s signify. There’s no keyring, and it’s up to the user — or the program shelling out to signify — to supply the appropriate key. It’s like the music player that just plays whatever I give it. Here’s a simplified usage overview:

signify -G [-c comment] -p pubkey -s seckey
signify -S [-x sigfile] -s seckey -m message
signify -V [-x sigfile] -p pubkey -m message

When generating a new keypair (-G), the user must choose the destination files for the public and secret keys. When signing a message (a file), the user must supply the secret key and the message. When verifying a file, the user must supply the public key and the message. This is a popular enough model that other, compatible implementations with the same interface have been developed.

Signify is deliberately incompatible with OpenPGP and uses its own simpler, and less featureful, format. Wouldn’t it be nice to have a similar interface to verify OpenPGP signatures?

SimpleGPG

Well, I thought so. So I put together a shell script that wraps GnuPG and provides such an interface:

SimpleGPG

The interface is nearly identical to signify, and the GnuPG keyring is hidden away as if it didn’t exist. The main difference is that the keys and signatures produced and consumed by this tool are fully compatible with OpenPGP. You could use this script without requiring anyone else to adopt something new or different.

To avoid touching your real keyring, the script creates a temporary keyring directory each time it’s run. The GnuPG option --homedir instructs it to use this temporary keyring and ignore the usual one. The temporary keyring is destroyed when the script exits. This is kind of clunky, but there’s no way around it.

Verification looks roughly like this in the script:

$ tmp=$(mktemp -d simplegpg-XXXXXX)
$ gpg --homedir $tmp --import foo.asc
$ gpg --homedir $tmp --verify message.txt.sig message.txt
$ rm -rf $tmp

Generating a key is trivial, and there’s only a prompt for the protection passphrase. Like signify, it will generate an Ed25519 key and all outputs are ASCII-armored.

$ simplegpg -G -p keyname.asc -s keyname.pgp
passphrase:
passphrase (confirm):

Since signify doesn’t have a concept of a user ID for a key, just an “untrusted comment”, the user ID is not emphasized here. The default user ID will be “simplegpg key”, so, if you plan to share the key with regular GnuPG users who will need to import it into a keyring, you probably want to use -c to give it a more informative name.

Unfortunately, due to GnuPG’s very limited, keyring-oriented interface, key generation is about three times slower than it should be. That’s because the passphrase is run through the String-to-Key (S2K) algorithm three times:

  1. Immediately after the key is generated, the passphrase is converted to a key, the key is encrypted, and it’s put onto the temporary keyring.

  2. When exporting, the key passphrase is again run through the S2K to get the protection key to decrypt it.

  3. The export format uses a slightly different S2K algorithm, so this export S2K is now used to create yet another protection key.

Technically the second could be avoided since gpg-agent, which is always required, could be holding the secret key material. As far as I can tell, gpg-agent simply does not learn freshly-generated keys. I do not know why this is the case.

This is related to another issue. If you’re accustomed to GnuPG, you may notice that the passphrase prompt didn’t come from pinentry, a program specialized for passphrase prompts. GnuPG normally uses it for this. Instead, the script handles the passphrase prompt and passes the passphrase to GnuPG (via a file descriptor). This would not be necessary if gpg-agent did its job. Without this part of the script, users are prompted three times, via pinentry, for their passphrase when generating a key.
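
The mechanism is simple, even if the need for it is silly. Here’s a rough sketch in Go of the same trick (the script itself is shell, and its exact gpg invocation may differ; the flags and file name here are illustrative): write the passphrase into a pipe, hand the read end to GnuPG as an inherited file descriptor, and point --passphrase-fd at it.

package main

import (
	"os"
	"os/exec"
)

func main() {
	// Put the passphrase in a pipe so the child can read it from fd 3.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	w.WriteString("hunter2\n") // the passphrase
	w.Close()

	// Loopback pinentry mode tells GnuPG not to launch pinentry.
	cmd := exec.Command("gpg", "--batch", "--pinentry-mode", "loopback",
		"--passphrase-fd", "3", "--import", "keyname.pgp")
	cmd.ExtraFiles = []*os.File{r} // becomes fd 3 in the child process
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}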

When signing messages, the passphrase prompt comes from pinentry since it’s initiated by GnuPG.

$ simplegpg -S -s keyname.pgp -m message.txt
passphrase:

This will produce message.txt.sig with an OpenPGP detached signature.

The passphrase prompt is for --import, not --detach-sign. As with key generation, the S2K is run more than necessary: twice instead of once. First to generate the decryption key, then a second time to generate a different encryption key for the keyring since the export format and keyring use different algorithms. Ugh.

But at least gpg-agent does its job this time, so only one passphrase prompt is necessary. In general, a downside of these temporary keyrings is that gpg-agent sees a brand new key on every run, so you will need to enter your passphrase once for each message signed. Just like signify.

Verification, of course, requires no prompting and no S2K.

$ simplegpg -V -p keyname.asc -m message.txt

That’s all there is to keyringless OpenPGP signatures. Since I’m not interested in the Web of Trust or keyservers, I wish GnuPG was more friendly to this model of operation.

passphrase2pgp

I mentioned that SimpleGPG is fully compatible with other OpenPGP systems. This includes my own passphrase2pgp, where your secret key is stored only in your brain. No need for a secret key file. In the time since I first wrote about it, passphrase2pgp has gained the ability to produce signatures itself!

I’ve got my environment set up — $REALNAME, $EMAIL, and $KEYID per the README — so I don’t need to supply a user ID argument, nor will I be prompted to confirm my passphrase since it’s checked against a known fingerprint. Generating the public key, for sharing, looks like this:

$ passphrase2pgp -K --armor --public >keyname.asc

Or just:

$ passphrase2pgp -ap >keyname.asc

As with signify and SimpleGPG, to sign a message I’m prompted for my passphrase. It takes longer since the “S2K” here is much stronger by necessity. The passphrase is used to generate the secret key, then from that the signature on the message:

$ passphrase2pgp -S message.txt

For the SimpleGPG user on the other side it all looks the same as before:

$ simplegpg -V -p keyname.asc -m message.txt

I’m probably going to start signing my open source software releases, and this is how I intend to do it.

The Long Key ID Collider

Over the last couple weeks I’ve spent a lot more time working with OpenPGP keys. It’s a consequence of polishing my passphrase-derived PGP key generator. I’ve tightened up the internals, and it’s enabled me to explore the corners of the format, try interesting things, and observe how various OpenPGP implementations respond to weird inputs.

For one particularly cool trick, take a look at these two (private) keys I generated yesterday. Here’s the first:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAjJgvdh3N2pegXPEuMe25nJ3gI7k8gEgQvCor
AExppm4AAQC0TNsuIRHkxaGjLNN6hQowRMxLXAMrkZfMcp1DTG8GBg1TzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAGq0APwOtCFVCxpv
d/gzKUg0SkdygmriV1UmrQ+KYx9dhzC6xwEAqwDGsSgSbCqPdkwqi/tOn+MwZ5N9
jYxy48PZGZ2V3ws=
=bBGR
-----END PGP PRIVATE KEY BLOCK-----

And the second:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAzjSPKjpOuJoLP6G0z7pptx4sBNiqmgEI0xiH
Z4Xb16kAAP0Qyon06UB2/gOeV/KjAjCi91MeoUd7lsA5yn82RR5bOxAkzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAEv4AQDLRqx10v3M
bwVnJ8BDASAOzrPw+Rz1tKbjG9r45iE7NQEAhm9QVtFd8SN337kIWcq8wXA6j1tY
+UeEsjg+SHzkqA4=
=QnLn
-----END PGP PRIVATE KEY BLOCK-----

Concatenate these and then import them into GnuPG to have a look at them. To avoid littering in your actual keyring, especially with private keys, use the --homedir option to set up a temporary keyring. I’m going to omit that option in the examples.

$ gpg --import < keys.asc
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: Total number processed: 2
gpg:               imported: 2
gpg:       secret keys read: 2
gpg:   secret keys imported: 2

The user ID is “nullprogram.com” since I made these and that’s me taking credit. “992A5EEE1D1049FA” is called the long key ID: a 64-bit value that identifies the key. It’s the lowest 64 bits of the full key ID, a 160-bit SHA-1 hash. In the old days everyone used a short key ID to identify keys, which was the lowest 32 bits of the key ID. For these keys, that would be “1D1049FA”. However, this was deemed way too short, and everyone has since switched to long key IDs, or even the full 160-bit key ID.

The key ID is nothing more than a SHA-1 hash of the key creation date — unsigned 32-bit unix epoch seconds — and the public key material. So secret keys have the same key ID as their associated public key. This makes sense since they’re a key pair and they go together.
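
More precisely, the full key ID (the fingerprint) is a SHA-1 hash over the v4 Public-Key packet body, prefixed with 0x99 and a two-byte length (RFC 4880 §12.2). Here’s a minimal sketch of that layout for an Ed25519 key. It’s my own illustration, not passphrase2pgp’s code, and the zeroed key and date are placeholders.

package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

func main() {
	// Placeholder inputs: a 32-byte Ed25519 public key and a creation
	// date in unsigned unix epoch seconds. Both feed the fingerprint.
	var pubkey [32]byte
	created := uint32(0)

	// The v4 Public-Key packet body: version, creation time, algorithm,
	// curve OID, and the public key as an MPI.
	body := []byte{0x04} // version 4
	var t [4]byte
	binary.BigEndian.PutUint32(t[:], created)
	body = append(body, t[:]...) // creation time
	body = append(body, 22)      // algorithm 22: EdDSA
	body = append(body, 9, 0x2b, 0x06, 0x01, 0x04, 0x01,
		0xda, 0x47, 0x0f, 0x01) // Ed25519 curve OID
	body = append(body, 0x01, 0x07, 0x40) // MPI header: 263 bits, 0x40 prefix
	body = append(body, pubkey[:]...)

	// The fingerprint is SHA-1 over 0x99, a two-byte length, and the body.
	h := sha1.New()
	h.Write([]byte{0x99, byte(len(body) >> 8), byte(len(body))})
	h.Write(body)
	fpr := h.Sum(nil)

	fmt.Printf("full key ID:  %X\n", fpr)
	fmt.Printf("long key ID:  %X\n", fpr[12:20]) // lowest 64 bits
	fmt.Printf("short key ID: %X\n", fpr[16:20]) // lowest 32 bits
}

Note that the creation date is hashed right along with the key material, so changing the date changes every one of these IDs.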

Look closely and you’ll notice that both keypairs have the same long key ID. If you hadn’t already guessed from the title of this article, these are two different keys with the same long key ID. In other words, I’ve created a long key ID collision. The GnuPG --list-keys command prints the full key ID since it’s so important:

$ gpg --list-keys
---------------------
pub   ed25519 2019-07-22 [SCA]
      A422F8B0E1BF89802521ECB2992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

pub   ed25519 2019-07-22 [SCA]
      F43BC80C4FC2603904E7BE02992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

I was only targeting the lower 64 bits, but I actually managed to collide the lowest 68 bits by chance. So a long key ID still isn’t enough to truly identify any particular key.

This isn’t news, of course. Nor am I the first person to create a long key ID collision. In 2013, David Leon Gil published a long key ID collision for two 4096-bit RSA public keys. However, that is the only other example I was able to find. He did not include the private keys and did not elaborate on how he did it. I know he did generate viable keys, not just garbage for the public key portions, since they’re both self-signed.

Creating these keys was trickier than I had anticipated, and there’s an old, clever trick that makes it work. Building atop the work I did for passphrase2pgp, I created a standalone tool that will create a long key ID collision and print the two keypairs to standard output:

Example usage:

$ go get -u github.com/skeeto/pgpcollider
$ pgpcollider --verbose > keys.asc

This can take up to a day to complete when run like this. The tool can optionally coordinate many machines — see the --server / -S and --client / -C options — to work together, greatly reducing the total time. It took around 4 hours to create the keys above on a single machine, generating around 1 billion extra keys in the process. As discussed below, I actually got lucky that it only took 1 billion. If you modify the program to do short key ID collisions, it only takes a few seconds.

The rest of this article is about how it works.

Birthday Attacks

An important detail is that this technique doesn’t target any specific key ID. Cloning someone’s long key ID is still very expensive. No, this is a birthday attack. To find a collision in a space of 2^64, on average I only need to generate 2^32 samples — the square root of that space. That’s perfectly feasible on a regular desktop computer. To collide long key IDs, I need only generate about 4 billion IDs and efficiently do membership tests on that set as I go.
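
For a rough feel of the odds, here’s a quick back-of-the-envelope check (my own sketch, not part of pgpcollider) using the standard birthday approximation, p ≈ 1 − exp(−n²/2N):

package main

import (
	"fmt"
	"math"
)

func main() {
	// Probability of at least one collision among n random 64-bit long
	// key IDs, using the usual birthday approximation.
	N := math.Pow(2, 64)
	for _, n := range []float64{1e9, 2e9, 4e9, 8e9} {
		p := 1 - math.Exp(-n*n/(2*N))
		fmt.Printf("%.0e keys: %.1f%% chance of a collision\n", n, p*100)
	}
}

Around 4 billion keys gives roughly a one-in-three chance, and the 50% mark lands near 5 billion, which is why finding a collision after only a billion extra keys counts as getting lucky.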

Efficiently testing membership on that growing set is easier said than done. Naively, it might look like this (pseudo-code):

seen := map of long key IDs to keys
loop forever {
    key := generateKey()
    longID := key.ID[12:20]
    if longID in seen {
        output seen[longID]
        output key
        break
    } else {
        seen[longID] = key
    }
}

Consider the size of that map. Each long ID is 8 bytes, and we expect to store around 2^32 of them. That’s at minimum 32 GB of storage just to track all the long IDs. The map itself is going to have some overhead, too. Since these are literally random lookups, this all mostly needs to be in RAM or else lookups are going to be very slow and impractical.

And I haven’t even counted the keys yet. As a saving grace, these are Ed25519 keys, so that’s 32 bytes for the public key and 32 bytes for the private key, which I’ll need if I want to make a self-signature. (The signature itself will be larger than the secret key.) That’s around 256GB more storage, though at least this can be stored on the hard drive. However, to address these from the map I’d need at least 38 bits, plus some more in case it goes over. Just call it another 8 bytes.

So that’s, at a bare minimum, 64GB of RAM plus 256GB of other storage. Since nothing is ideal, we’ll need more than this. This is all still feasible, but will require expensive hardware. We can do a lot better.

Keys from seeds

The first thing you might notice is that we can jettison that 256GB of storage by being a little more clever about how we generate keys. Since we don’t actually care about the security of these keys, we can generate each key from a seed much smaller than the key itself. Instead of using 8 bytes to reference a key in storage, just use those 8 bytes to store the seed used to make the key.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    key := generateKey(seed)
    longID := key.ID[12:20]
    if longID in seen {
        output generateKey(seen[longID])
        output key
        break
    } else {
        seen[longID] = seed
    }
}

I’m incrementing a counter to generate the seeds because I don’t want the birthday paradox to apply to the seeds themselves: each really must be unique. I’m using SplitMix64 as the PRNG to expand each 64-bit seed, since I previously found it’s the fastest option for Go, and it thoroughly scrambles sequential inputs, so a simple incrementing counter is a perfectly fine source of seeds.
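
For reference, the SplitMix64 step is just a couple of multiplications and xor-shifts. This is the published mixing function (not necessarily the exact code in pgpcollider), and it’s what makes consecutive counter values come out looking independent:

package main

import "fmt"

// The SplitMix64 mixing function: even sequential inputs produce
// thoroughly scrambled, independent-looking outputs.
func splitmix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

func main() {
	for seed := uint64(0); seed < 4; seed++ {
		fmt.Printf("%016x\n", splitmix64(seed))
	}
}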

Ultimately, this still uses utterly excessive amounts of memory. Wouldn’t it be crazy if we could somehow get this 64GB map down to just a few MBs of RAM? Well, we can!

Rainbow tables

For decades, password crackers have faced a similar problem. They want to precompute the hashes for billions of popular passwords so that they can efficiently reverse those password hashes later. However, storing all those hashes would be unnecessarily expensive, or even infeasible.

So they don’t. Instead they use rainbow tables. Password hashes are chained together into a hash chain, where a password hash leads to a new password, then to a new hash, and so on. Then they store only the beginning and the end of each chain.

To look up a hash in the rainbow table, run the hash chain algorithm starting from the target hash and, for each hash, check if it matches the end of one of the chains. If so, recompute that chain and note the step just before the target hash value. That’s the corresponding password.

For example, suppose the password “foo” hashes to 9bfe98eb, and we have a reduction function that maps a hash to some password. In this case, it maps 9bfe98eb to “bar”. A trivial reduction function could just be an index into a list of passwords. A hash chain starting from “foo” might look like this:

foo -> 9bfe98eb -> bar -> 27af0841 -> baz -> d9d4bbcb

In reality a chain would be a lot longer. Another chain starting from “apple” might look like this:

apple -> 7bbc06bc -> candle -> 82a46a63 -> dog -> 98c85d0a

We only store the tuples (foo, d9d4bbcb) and (apple, 98c85d0a) in our database. If the chains had been one million hashes long, we’d still only store those two tuples. That’s literally a 1:1000000 compression ratio!

Later on we’re faced with reversing the hash 27af0841, which isn’t listed directly in the database. So we run the chain forward from that hash until we either hit the maximum chain length (i.e. the password is not in the table) or recognize a hash:

27af0841 -> baz -> d9d4bbcb

That d9d4bbcb hash is listed as being in the “foo” hash chain. So we regenerate that hash chain to discover that “bar” leads to 27af0841. Password cracked!

Collider rainbow table

My collider works very similarly. A hash chain works like this: Start with a 64-bit seed as before, generate a key, get the long key ID, then use the long key ID as the seed for the next key.

There’s one big difference. In the rainbow table the purpose is to run the hash function backwards by looking at the previous step in the chain. For the collider, I want to know whether any two hash chains collide. So long as each chain started from a unique seed, a collision means we’ve found two different seeds that lead to the same long key ID.

Alternatively, it could be two different seeds that lead to the same key, which wouldn’t be useful, but that’s trivial to avoid.

A simple and efficient way to check if two chains contain the same sequence is to stop them at the same place in that sequence. Rather than run the hash chains for some fixed number of steps, they stop when they reach a distinguishing point. In my collider a distinguishing point is where the long key ID ends with at least N 0 bits, where N determines the average chain length. I chose 17 bits.

func computeChain(seed) {
    loop forever {
        key := generateKey(seed)
        longID := key.ID[12:20]
        if distinguished(longID) {
            return longID
        }
        seed = longID
    }
}
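
The distinguished() test itself is just a mask on the low bits. A minimal sketch in Go, treating the long key ID as a 64-bit integer and assuming the 17-bit threshold mentioned above:

// A long key ID is a distinguishing point when its lowest 17 bits are
// all zero, which makes the average chain about 2^17 keys long.
func distinguished(longID uint64) bool {
	return longID&(1<<17-1) == 0
}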

If two different hash chains end on the same distinguishing point, they’re guaranteed to have collided somewhere in the middle.

To determine where two chains collided, regenerate each chain and find the first long key ID that they have in common. The keys generated just before that point are the colliding keys.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    longID := computeChain(seed)
    if longID in seen {
        output findCollision(seed, seen[longID])
        break
    } else {
        seen[longID] = seed
    }
}
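
The findCollision step replays both chains. Roughly, in the same pseudo-code style (the real tool must also discard the useless case where the two chains merged on the very same key):

func findCollision(seedA, seedB) {
    // Replay chain A, remembering which seed produced each long key ID.
    walked := map of long key IDs to seeds
    seed := seedA
    loop forever {
        longID := generateKey(seed).ID[12:20]
        walked[longID] = seed
        if distinguished(longID) {
            break
        }
        seed = longID
    }

    // Replay chain B until it produces a long key ID that chain A also
    // produced, but from a different seed: those are the colliding keys.
    seed = seedB
    loop forever {
        longID := generateKey(seed).ID[12:20]
        if longID in walked and walked[longID] != seed {
            return generateKey(walked[longID]), generateKey(seed)
        }
        seed = longID
    }
}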

Hash chain computation is embarrassingly parallel, so the load can be spread efficiently across CPU cores. With these rainbow(-like) tables, my tool can generate and track billions of keys in mere megabytes of memory. The additional computational cost is the time it takes to generate a couple more chains than otherwise necessary.

Predictable, Passphrase-Derived PGP Keys

tl;dr: passphrase2pgp.

One of my long-term concerns has been losing my core cryptographic keys, or just not having access to them when I need them. I keep my important data backed up, and if that data is private then I store it encrypted. My keys are private, but how am I supposed to encrypt them? The chicken or the egg?

The OpenPGP solution is to (optionally) encrypt secret keys using a key derived from a passphrase. GnuPG prompts the user for this passphrase when generating keys and when using secret keys. This protects the keys at rest, and, with some caution, they can be included as part of regular backups. The OpenPGP specification, RFC 4880, has many options for deriving a key from this passphrase, called String-to-Key, or S2K, algorithms. None of the options are great.

In 2012, I selected the strongest S2K configuration and, along with a very strong passphrase, put my GnuPG keyring on the internet as part of my public dotfiles repository. It was a kind of super-backup that would guarantee their availability anywhere I’d need them.

My timing was bad because, with the release of GnuPG 2.1 in 2014, GnuPG fundamentally changed its secret keyring format. S2K options are now (quietly!) ignored when deriving the protection keys. Instead it auto-calibrates to much weaker settings. With this new version of GnuPG, I could no longer update the keyring in my dotfiles repository without significantly downgrading its protection.

By 2017 I was pretty irritated with the whole situation. I let my OpenPGP keys expire, and then I wrote my own tool to replace the only feature of GnuPG I was actively using: encrypting my backups with asymmetric encryption. One of its core features is that the asymmetric keypair can be derived from a passphrase using a memory-hard key derivation function (KDF). Attackers must commit a significant quantity of memory (expensive) when attempting to crack the passphrase, making the passphrase that much more effective.

Since the asymmetric keys themselves, not just the keys protecting them, are derived from a passphrase, I never need to back them up! They’re also always available whenever I need them. My keys are essentially stored entirely in my brain as if I was a character in a William Gibson story.

Tackling OpenPGP key generation

At the time I had expressed my interest in having this feature for OpenPGP keys. It’s something I’ve wanted for a long time. I first took a crack at it in 2013 (now the old-version branch) for generating RSA keys. RSA isn’t that complicated but it’s very easy to screw up. Since I was rolling it from scratch, I didn’t really trust myself not to subtly get it wrong. Plus I never figured out how to self-sign the key. GnuPG doesn’t accept secret keys that aren’t self-signed, so it was never useful.

I took another crack at it in 2018 with a much more brute force approach. When a program needs to generate keys, it will either read from /dev/u?random or, on more modern systems, call getentropy(3). These are all ultimately system calls, and I know how to intercept those with Ptrace. If I want to control key generation for any program, not just GnuPG, I could intercept these inputs and replace them with the output of a CSPRNG keyed by a passphrase.

Keyed: Linux Entropy Interception

In practice this doesn’t work at all. Real programs like GnuPG and OpenSSH’s ssh-keygen don’t rely solely on these entropy inputs. They also grab entropy from other places, like getpid(2), gettimeofday(2), and even extract their own scheduler and execution time noise. Without modifying these programs I couldn’t realistically control their key generation.

Besides, even if it did work, it would still be fragile and unreliable since these programs could always change how they use the inputs. So, ultimately, it was more of an experiment than something practical.

passphrase2pgp

For regular readers, it’s probably obvious that I recently learned Go. While searching for good project ideas for cutting my teeth, I noticed that Go’s “extended” standard library has a lot of useful cryptographic support, so the idea of generating the keys myself seemed worth revisiting.

Something else also happened since my previous attempt: The OpenPGP ecosystem now has widespread support for elliptic curve cryptography. So instead of RSA, I could generate a Curve25519 keypair, which, by design, is basically impossible to screw up. Not only would I be generating keys on my own terms, I’d be doing it in style, baby.

There are two different ways to use Curve25519:

  1. Digital signatures: Ed25519 (EdDSA)
  2. Diffie–Hellman (encryption): X25519 (ECDH)

In GnuPG terms, the first would be a “sign only” key and the second is an “encrypt only” key. But can’t you usually do both after you generate a new OpenPGP key? If you’ve used GnuPG, you’ve probably seen the terms “primary key” and “subkey”, but you probably haven’t had to think about them since it’s all usually automated.

The primary key is the one associated directly with your identity. It’s always a signature key. The OpenPGP specification says this is a signature key only by convention, but, practically speaking, it really must be, since signatures are what hold everything together. Like packaging tape.

If you want to use encryption, independently generate an encryption key, then sign that key with the primary key, binding that key as a subkey to the primary key. This all happens automatically with GnuPG.

Fun fact: Two different primary keys can have the same subkey. Anyone could even bind any of your subkeys to their primary key! They only need to sign the public key! Though, of course, they couldn’t actually use your key since they’d lack the secret key. It would just be really confusing, and could, perhaps in certain situations, even cause some OpenPGP clients to malfunction. (Note to self: This demands investigation!)

It’s also possible to have signature subkeys. What good is that? Paranoid folks will keep their primary key only on a secure, air-gapped system, then use only subkeys on regular systems. The subkeys can be revoked and replaced independently of the primary key if something were to go wrong.

In Go, generating an X25519 key pair is this simple (yes, it actually takes array pointers, which is rather weird):

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	var seckey, pubkey [32]byte
	rand.Read(seckey[:]) // FIXME: check for error
	seckey[0] &= 248
	seckey[31] &= 127
	seckey[31] |= 64
	curve25519.ScalarBaseMult(&pubkey, &seckey)
	fmt.Printf("pub %x\n", pubkey[:])
	fmt.Printf("sec %x\n", seckey[:])
}

The three bitwise operations are optional since ScalarBaseMult() clamps the scalar internally, but they ensure that the secret key is in its canonical form. The actual Diffie–Hellman exchange requires just one more function call: curve25519.ScalarMult().
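
For example, here’s a complete toy exchange between two parties (my own sketch, not code from passphrase2pgp), where both sides arrive at the same shared secret:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	// Each party generates a keypair exactly as above.
	var secA, pubA, secB, pubB [32]byte
	rand.Read(secA[:]) // FIXME: check for error
	rand.Read(secB[:]) // FIXME: check for error
	curve25519.ScalarBaseMult(&pubA, &secA)
	curve25519.ScalarBaseMult(&pubB, &secB)

	// Each side combines its own secret key with the other side's
	// public key and arrives at the same shared secret.
	var sharedA, sharedB [32]byte
	curve25519.ScalarMult(&sharedA, &secA, &pubB)
	curve25519.ScalarMult(&sharedB, &secB, &pubA)
	fmt.Printf("%x\n%x\n", sharedA[:], sharedB[:])
}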

For Ed25519, the API is higher-level:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/ed25519"
)

func main() {
	seed := make([]byte, ed25519.SeedSize)
	rand.Read(seed) // FIXME: check for error
	key := ed25519.NewKeyFromSeed(seed)
	fmt.Printf("pub %x\n", key[32:])
	fmt.Printf("sec %x\n", key[:32])
}

Signing a message with this key is just one function call: ed25519.Sign().
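
Verification is a single call as well. A small sketch continuing from the program above (an illustration, not passphrase2pgp itself):

// Sign a message with the key generated above, then verify it.
message := []byte("hello, world")
sig := ed25519.Sign(key, message)
ok := ed25519.Verify(key.Public().(ed25519.PublicKey), message, sig)
fmt.Println(ok) // true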

Unfortunately that’s the easy part. The other 400 lines of the real program are concerned only with encoding these values in the complex OpenPGP format. That’s the hard part. GnuPG’s --list-packets option was really useful for debugging this part.

OpenPGP specification

(Feel free to skip this section if the OpenPGP wire format isn’t interesting to you.)

Following the specification was a real challenge, especially since many of the details for Curve25519 only appear in still incomplete (and still erroneous) updates to the specification. I certainly don’t envy the people who have to parse arbitrary OpenPGP packets. It’s finicky and has arbitrary parts that don’t seem to serve any purpose, such as redundant prefix and suffix bytes on signature inputs. Fortunately I only had to worry about the subset that represents an unencrypted secret key export.

OpenPGP data is broken up into packets. Each packet begins with a tag identifying its type, followed by a length, which is itself encoded in a variable number of bytes. All the packets produced by passphrase2pgp are short, so I could pretend lengths were all a single byte long.
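
For the new-format headers these short packets use, the encoding is tiny: one tag byte with the top two bits set, then a single length byte for bodies up to 191 bytes. A sketch of just the header (my own illustration, not passphrase2pgp’s code):

package main

import "fmt"

// A new-format OpenPGP packet header for a short packet: 0xC0 ORed
// with the packet type, then a one-byte length, valid for bodies up
// to 191 bytes.
func packet(tag byte, body []byte) []byte {
	return append([]byte{0xC0 | tag, byte(len(body))}, body...)
}

func main() {
	// For example, a User ID packet (tag 13) is just the header plus
	// the UTF-8 string.
	fmt.Printf("% x\n", packet(13, []byte("nullprogram.com")))
}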

For a secret key export with one subkey, we need the following packets in this order:

  1. Secret-Key: Public-Key packet with secret key appended
  2. User ID: just a length-prefixed, UTF-8 string
  3. Signature: binds Public-Key packet (1) and User ID packet (2)
  4. Secret-Subkey: Public-Subkey packet with secret subkey appended
  5. Signature: binds Public-Key packet (1) and Public-Subkey packet (4)

A Public-Key packet contains the creation date, key type, and public key data. A Secret-Key packet is the same, but with the secret key literally appended on the end and a different tag. The Key ID is (essentially) a SHA-1 hash of the Public-Key packet, meaning the creation date is part of the Key ID. That’s important for later.

I had wondered if the SHAttered attack could be used to create two different keys with the same full Key ID. However, there’s no slack space anywhere in the input, so I doubt it.

User IDs are usually an RFC 2822 name and email address, but that’s only convention. It can literally be an empty string, though that wouldn’t be useful. OpenPGP clients that require anything more than an empty string, such as GnuPG during key generation, are adding artificial restrictions.

The first Signature packet indicates the signature date, the signature issuer’s Key ID, and then optional metadata about how the primary key is to be used and the capabilities of the key owner’s client. The signature itself is formed by appending the Public-Key packet portion of the Secret-Key packet, the User ID packet, and the previously described contents of the signature packet. The concatenation is hashed, the hash is signed, and the signature is appended to the packet. Since the options are included in the signature, they can’t be changed by another person.

In theory the signature is redundant. A client could accept the Secret-Key packet and User ID packet and consider the key imported. It would then create its own self-signature since it has everything it needs. However, my primary target for passphrase2pgp is GnuPG, and it will not accept secret keys that are not self-signed.

The Secret-Subkey packet is exactly the same as the Secret-Key packet except that it uses a different tag to indicate it’s a subkey.

The second Signature packet is constructed the same as the previous signature packet. However, it signs the concatenation of the Public-Key and Public-Subkey packets, binding the subkey to that primary key. This key may similarly have its own options.

To create a public key export from this input, a client need only chop off the secret keys and fix up the packet tags and lengths. The signatures remain untouched since they didn’t include the secret keys. That’s essentially what other people will receive about your key.

If someone else were to create a Signature packet binding your Public-Subkey packet with their Public-Key packet, they could set their own options on their version of the key. So my question is: Do clients properly track this separate set of options and separate owner for the same key? If not, they have a problem!

The format may not sound so complex from this description, but there are a ton of little details that all need to be correct. To make matters worse, the feedback is usually just a binary “valid” or “invalid”. The world could use an OpenPGP format debugger.

Usage

There is one required argument: either --uid (-u) or --load (-l). The former specifies a User ID since a key with an empty User ID is pretty useless. It’s my own artificial restriction on the User ID. The latter loads a previously-generated key which will come with a User ID.

To generate a key for use in GnuPG, just pipe the output straight into GnuPG:

$ passphrase2pgp --uid "Foo <[email protected]>" | gpg --import

You will be prompted for a passphrase. That passphrase is run through Argon2id, a memory-hard KDF, with the User ID as the salt. Deriving the key requires 8 passes over 1GB of state, which takes my current computers around 8 seconds. With the --paranoid (-x) option enabled, that becomes 16 passes over 2GB (perhaps not paranoid enough?). The output is 64 bytes: 32 bytes to seed the primary key and 32 bytes to seed the subkey.
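
In library terms the derivation looks roughly like this. It’s a sketch of the scheme described above, not passphrase2pgp’s actual code, and details such as the thread count and the example User ID are assumptions:

package main

import (
	"fmt"

	"golang.org/x/crypto/argon2"
	"golang.org/x/crypto/ed25519"
)

func main() {
	passphrase := []byte("correct horse battery staple")
	uid := []byte("Foo <foo@example.com>") // the User ID is the salt

	// 8 passes over 1GB (argon2 counts memory in KiB), 64 bytes out.
	derived := argon2.IDKey(passphrase, uid, 8, 1<<20, 4, 64)

	primary := ed25519.NewKeyFromSeed(derived[:32]) // seeds the sign key
	subseed := derived[32:]                         // seeds the encryption subkey
	fmt.Printf("%x\n%x\n", primary.Public(), subseed)
}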

Despite the aggressive KDF settings, you will still need to choose a strong passphrase. Anyone who has your public key can mount an offline attack. A 10-word Diceware or Pokerware passphrase is more than sufficient (~128 bits) while also being quite reasonable to memorize.

Since the User ID is the salt, an attacker couldn’t build a single rainbow table to attack passphrases for different people. (Though your passphrase really should be strong enough that this won’t matter!) The cost is that you’ll need to use exactly the same User ID again to reproduce the key. In theory you could change the User ID afterward to whatever you like without affecting the Key ID, though it will require a new self-signature.

The keys are not encrypted (no S2K), and there are few options you can choose when generating the keys. If you want to change any of this, use GnuPG’s --edit-key tool after importing. For example, to set a protection passphrase:

$ gpg --edit-key Foo
gpg> passwd

There’s a lot that can be configured from this interface.

If you just need the public key to publish or share, the --public (-p) option will suppress the private parts and output only a public key. It works well in combination with ASCII armor, --armor (-a). For example, to put your public key on the clipboard:

$ passphrase2pgp -u '...' -ap | xclip

The tool can create detached signatures (--sign, -S) entirely on its own, too, so you don’t need to import the keys into GnuPG just to make signatures:

$ passphrase2pgp --sign --uid '...' program.exe

This would create a file named program.exe.sig with the detached signature, ready to be verified by another OpenPGP implementation. In fact, you can hook it directly up to Git for signing your tags and commits without GnuPG:

$ git config --global gpg.program passphrase2pgp

This only works for signing, and it cannot verify (verify-tag or verify-commit).

It’s pretty tedious to enter the --uid option all the time, so, if omitted, passphrase2pgp will infer the User ID from the environment variables REALNAME and EMAIL. Combined with the KEYID environment variable (see the README for details), you can easily get away with never storing your keys: only generate them on demand when needed.

That’s how I intend to use passphrase2pgp. When I want to sign a file, I’ll only need one option, one passphrase prompt, and a few seconds of patience:

$ passphrase2pgp -S path/to/file

January 1, 1970

The first time you run the tool you might notice one offensive aspect of its output: Your key will be dated January 1, 1970 — i.e. unix epoch zero. This predates PGP itself by more than two decades, so it might alarm people who receive your key.

Why do this? As I noted before, the creation date is part of the Key ID. Use a different date, and, as far as OpenPGP is concerned, you have a different key. Since users probably don’t want to remember a specific datetime, at seconds resolution, in addition to their passphrase, passphrase2pgp uses the same hard-coded date by default. A date of January 1, 1970 is like NULL in a database: no data.

If you don’t like this, you can override it with the --time (-t) or --now (-n) options, but it’s up to you to remain consistent.

Vanity Keys

If you’re interested in vanity keys — e.g. where the Key ID spells out words or looks unusual — it wouldn’t take much work to hack up the passphrase2pgp source into generating your preferred vanity keys. It would easily beat anything else I could find online.

Reconsidering limited OpenPGP

Initially my intention was to never output an encryption subkey, making passphrase2pgp useful only for signatures. By default it still produces only a sign key, but you can get an encryption subkey with the --subkey (-s) option. I figured it might be useful to generate an encryption key, even if it’s not output by default. Users can always ask for it later if they have a need for it.

Why only a signing key? Nobody should be using OpenPGP for encryption anymore. Use better tools instead and retire the 20th century cryptography. If you don’t have an encryption subkey, nobody can send you OpenPGP-encrypted messages.

In contrast, OpenPGP signatures are still kind of useful and lack a practical alternative. The Web of Trust failed to reach critical mass, but that doesn’t seem to matter much in practice. Important OpenPGP keys can be bootstrapped off TLS by strategically publishing them on HTTPS servers. Keybase.io has done interesting things in this area.

Further, GitHub officially supports OpenPGP signatures, and I believe GitLab does too. This is another way to establish trust for a key. IMHO, there’s generally too much emphasis on binding a person’s legal identity to their OpenPGP key (e.g. the idea behind key-signing parties). I suppose that’s useful for holding a person legally accountable if they do something wrong. I’d prefer to trust a key that has an established history of valuable community contributions, even if that history was built under a pseudonym.

So sometime in the future I may again advertise an OpenPGP public key. If I do, those keys would certainly be generated with passphrase2pgp. I may not even store the secret keys on a keyring, and instead generate them on the fly only when I occasionally need them.

Go Slices are Fat Pointers

This article was discussed on Hacker News.

One of the frequent challenges in C is that pointers are nothing but a memory address. A callee who is passed a pointer doesn’t truly know anything other than the type of object being pointed at, which says some things about alignment and how that pointer can be used… maybe. If it’s a pointer to void (void *) then not even that much is known.

The number of consecutive elements being pointed at is also not known. It could be as few as zero, so dereferencing would be illegal. This can be true even when the pointer is not null. Pointers can go one past the end of an array, at which point it points to zero elements. For example:

void foo(int *);

void bar(void)
{
    int array[4];
    foo(array + 4);  // pointer one past the end
}

In some situations, the number of elements is known, at least to the programmer. For example, the function might have a contract that says it must be passed at least N elements, or exactly N elements. This could be communicated through documentation.

/** Foo accepts 4 int values. */
void foo(int *);

Or it could be implied by the function’s prototype. Despite the following function appearing to accept an array, that’s actually a pointer, and the “4” isn’t relevant to the prototype.

void foo(int[4]);

C99 introduced a feature to make this a formal part of the prototype, though, unfortunately, I’ve never seen a compiler actually use this information.

void foo(int[static 4]);  // >= 4 elements, cannot be null

Another common pattern is for the callee to accept a count parameter. For example, the POSIX write() function:

ssize_t write(int fd, const void *buf, size_t count);

The necessary information describing the buffer is split across two arguments. That can become tedious, and it’s also a source of serious bugs if the two parameters aren’t in agreement (buffer overflow, information disclosure, etc.). Wouldn’t it be nice if this information was packed into the pointer itself? That’s essentially the definition of a fat pointer.

Fat pointers via bit hacks

If we assume some things about the target platform, we can encode fat pointers inside a plain pointer with some dirty pointer tricks, exploiting unused bits in the pointer value. For example, currently on x86-64, only the lower 48 bits of a pointer are actually used. The other 16 bits could carefully be used for other information, like communicating the number of elements or bytes:

// NOTE: x86-64 only!
unsigned char buf[1000];
uintptr_t addr = (uintptr_t)buf & 0xffffffffffff;
uintptr_t pack = (sizeof(buf) << 48) | addr;
void *fatptr = (void *)pack;

The other side can unpack this to get the components back out. Obviously 16 bits for the count will often be insufficient, so this would more likely be used for baggy bounds checks.

Further, if we know something about the alignment — say, that it’s 16-byte aligned — then we can also encode information in the least significant bits, such as a type tag.

Fat pointers via a struct

That’s all fragile, non-portable, and rather limited. A more robust approach is to lift pointers up into a richer, heavier type, like a structure.

struct fatptr {
    void *ptr;
    size_t len;
};

Functions accepting these fat pointers no longer need to accept a count parameter, and they’d generally accept the fat pointer by value.

ssize_t fatptr_write(int fd, struct fatptr);

In typical C implementations, the structure fields would be passed practically, if not exactly, the same way as the individual parameters would have been passed, so it’s really no less efficient.

To help keep this straight, we might employ some macros:

#define COUNTOF(array) \
    (sizeof(array) / sizeof(array[0]))

#define FATPTR(ptr, count) \
    (struct fatptr){ptr, count}

#define ARRAYPTR(array) \
    FATPTR(array, COUNTOF(array))

/* ... */

unsigned char buf[40];
fatptr_write(fd, ARRAYPTR(buf));

There are obvious disadvantages of this approach, like type confusion due to that void pointer, the inability to use const, and just being weird for C. I wouldn’t use it in a real program, but bear with me.

Before I move on, I want to add one more field to that fat pointer struct: capacity.

struct fatptr {
    void *ptr;
    size_t len;
    size_t cap;
};

This communicates not how many elements are present (len), but the total space in the buffer. This lets callees know how much room is left for, say, appending new elements.

// Fill the remainder of an int buffer with a value.
void
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
}

Since the callee modifies the fat pointer, it should be returned:

struct fatptr
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
    ptr.len = ptr.cap;
    return ptr;
}

Congratulations, you’ve got slices! Except that in Go they’re a proper part of the language, so they don’t rely on hazardous hacks or tedious bookkeeping. The fatptr_write() function above is nearly functionally equivalent to the Writer.Write() method in Go, which accepts a slice:

type Writer interface {
	Write(p []byte) (n int, err error)
}

The buf and count parameters are packed together as a slice, and the fd parameter is instead the receiver (the object being acted upon by the method).

Go slices

Go famously has pointers, including internal pointers, but not pointer arithmetic. You can take the address of (nearly) anything, but you can’t make that pointer point at anything else, even if you took the address of an array element. Pointer arithmetic would undermine Go’s type safety, so it can only be done through special mechanisms in the unsafe package.

But pointer arithmetic is really useful! It’s handy to take an address of an array element, pass it to a function, and allow that function to modify a slice (wink, wink) of the array. Slices are pointers that support exactly this sort of pointer arithmetic, but safely. Unlike the & operator which creates a simple pointer, the slice operator derives a fat pointer.

func fill([]int, int) []int

var array [8]int

// len == 0, cap == 8, like &array[0]
fill(array[:0], 1)
// array is [1, 1, 1, 1, 1, 1, 1, 1]

// len == 0, cap == 4, like &array[4]
fill(array[4:4], 2)
// array is [1, 1, 1, 1, 2, 2, 2, 2]

The fill function could take a slice of the slice, effectively moving the pointer around with pointer arithmetic, but without violating memory safety due to the additional “fat pointer” information. In other words, fat pointers can be derived from other fat pointers.
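
You can even peek at the fat pointer directly. This sketch uses the reflect and unsafe packages purely for illustration (not something to do in production code):

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	var array [8]int
	s := array[2:4]

	// The slice header really is the fat pointer struct from earlier:
	// a data pointer, a length, and a capacity.
	h := (*reflect.SliceHeader)(unsafe.Pointer(&s))
	fmt.Println(h.Len, h.Cap)                                 // 2 6
	fmt.Println(h.Data == uintptr(unsafe.Pointer(&array[2]))) // true
}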

Slices aren’t as universal as pointers, at least at the moment. You can take the address of any variable using &, but you can’t take a slice of any variable, even if it would be logically sound.

var foo int

// attempt to make len = 1, cap = 1 slice backed by foo
var fooslice []int = foo[:]   // compile-time error!

That wouldn’t be very useful anyway. However, if you really wanted to do this, the unsafe package can accomplish it. I believe the resulting slice would be perfectly safe to use:

// Convert to one-element array, then slice
fooslice = (*[1]int)(unsafe.Pointer(&foo))[:]

Update: Chris Siebenmann speculated about why this requires unsafe.

Of course, slices are super flexible and have many more uses that look less like fat pointers, but this is still how I tend to reason about slices when I write Go.

UTF-8 String Indexing Strategies

This article was discussed on Hacker News.

When designing or, in some cases, implementing a programming language with built-in support for Unicode strings, an important decision must be made about how to represent or encode those strings in memory. Not all representations are equal, and there are trade-offs between different choices.

One issue to consider is that strings typically support random-access indexing of code points, and users expect that indexing to behave like constant time (O(1)). However, not all string representations actually support this well. Strings using a variable-length encoding, such as UTF-8 or UTF-16, have O(n) time complexity indexing, ignoring special cases (discussed below). The most obvious choice to achieve O(1) time complexity — an array of 32-bit values, as in UCS-4 — makes very inefficient use of memory, especially with typical strings.

Despite this, UTF-8 is still chosen in a number of programming languages, or at least in their implementations. In this article I’ll discuss three examples — Emacs Lisp, Julia, and Go — and how each takes a slightly different approach.

Emacs Lisp

Emacs Lisp has two different types of strings that generally can be used interchangeably: unibyte and multibyte. In fact, the difference between them is so subtle that I bet that most people writing Emacs Lisp don’t even realize there are two kinds of strings.

Emacs Lisp uses UTF-8 internally to encode all “multibyte” strings and buffers. To fully support arbitrary sequences of bytes in the files being edited, Emacs uses its own extension of Unicode to precisely and unambiguously represent raw bytes intermixed with text. Any arbitrary sequence of bytes can be decoded into Emacs’ internal representation, then losslessly re-encoded back into the exact same sequence of bytes.

Unibyte strings and buffers are really just byte-strings. In practice, they’re essentially ISO/IEC 8859-1, a.k.a. Latin-1. It’s a Unicode string where all code points are below 256. Emacs prefers the smallest and simplest string representation when possible, similar to CPython 3.3+.

(multibyte-string-p "hello")
;; => nil

(multibyte-string-p "π ≈ 3.14")
;; => t

Emacs Lisp strings are mutable, and therein lies the kicker: As soon as you insert a code point above 255, Emacs quietly converts the string to multibyte.

(defvar fish "fish")

(multibyte-string-p fish)
;; => nil

(setf (aref fish 2) ?ŝ
      (aref fish 3) ?o)

fish
;; => "fiŝo"

(multibyte-string-p fish)
;; => t

Constant time indexing into unibyte strings is straightforward, and Emacs does the obvious thing. It helps that most strings in Emacs are probably unibyte, even when the user isn’t working in English.

Most buffers are multibyte, even if those buffers are generally just ASCII. Since Emacs uses gap buffers, this generally isn’t a problem: nearly all accesses are tightly clustered around the point, so the O(n) indexing cost rarely matters in practice.

That leaves multibyte strings. Consider these idioms for iterating across a string in Emacs Lisp:

(dotimes (i (length string))
  (let ((c (aref string i)))
    ...))

(cl-loop for c being the elements of string
         ...)

The latter expands into essentially the same as the former: An incrementing index that uses aref to index to that code point. So is iterating over a multibyte string — a common operation — an O(n^2) operation?

The good news is that, at least in this case, no! It’s essentially just as efficient as iterating over a unibyte string. Before going over why, consider this little puzzle. Here’s a little string comparison function that compares two strings a code point at a time, returning their first difference:

(defun compare (string-a string-b)
  (cl-loop for a being the elements of string-a
           for b being the elements of string-b
           unless (eql a b)
           return (cons a b)))

Let’s examine benchmarks with some long strings (100,000 code points):

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (compare a b)))
;; => (0.012568031 0 0.0)

Using two zeroed unibyte strings, it takes about 13ms. How about changing the last code point in one of them to 256, converting it to a multibyte string:

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (setf (aref a (1- (length a))) 256)
      (compare a b)))
;; => (0.012680513 0 0.0)

Same running time, so that multibyte string cost nothing more to iterate across. Let’s try making them both multibyte:

(benchmark-run
    (let ((a (make-string 100000 0))
          (b (make-string 100000 0)))
      (setf (aref a (1- (length a))) 256
            (aref b (1- (length b))) 256)
      (compare a b)))
;; => (2.327959762 0 0.0)

That took 2.3 seconds: about 2000x longer to run! Iterating over two multibyte strings concurrently seems to have broken an optimization. Can you reason about what’s happened?

To avoid the O(n) cost on this common indexing operation, Emacs keeps a “bookmark” for the last indexing location into a multibyte string. If the next access is nearby, it can start looking from this bookmark, forwards or backwards. Like a gap buffer, this gives a big advantage to clustered accesses, including iteration.

However, this string bookmark is global, one per Emacs instance, not one per string. In the last benchmark, the two multibyte strings are constantly fighting over a single string bookmark, and indexing in the comparison function is reduced to O(n^2) time complexity.

So, Emacs pretends it has constant time access into its UTF-8 text data, but it’s only faking it with some simple optimizations. This usually works out just fine.

Julia

Another approach is to not pretend at all, and to make this limitation of UTF-8 explicit in the interface. Julia took this approach, and it was one of my complaints about the language. I don’t think this is necessarily a bad choice, but I do still think it’s inappropriate considering Julia’s target audience (i.e. Matlab users).

Julia strings are explicitly byte strings containing valid UTF-8 data. All indexing occurs on bytes, which is trivially constant time, and always decodes the multibyte code point starting at that byte. But it is an error to index to a byte that doesn’t begin a code point. That error is also trivially checked in constant time.

s = "π"

s[1]
# => 'π'

s[2]
# ERROR: UnicodeError: invalid character index
#  in getindex at ./strings/basic.jl:37

Slices are still over bytes, but they “round up” to the end of the current code point:

s[1:1]
# => "π"

Iterating over a string requires helper functions which keep an internal “bookmark” so that each access is constant time:

for i in eachindex(string)
    c = string[i]
    # ...
end

So Julia doesn’t pretend, it makes the problem explicit.

Go

Go is very similar to Julia, but takes an even more explicit view of strings. All strings are byte strings and there are no restrictions on their contents. Conventionally strings contain UTF-8 encoded text, but this is not strictly required. There’s a unicode/utf8 package for working with strings containing UTF-8 data.
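
A quick illustration of that package, and of why byte indexing and code point counting are different questions:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "π ≈ 3.14"
	fmt.Println(len(s))                    // 11: bytes, not code points
	fmt.Println(utf8.RuneCountInString(s)) // 8: code points

	// Decode the first code point and the number of bytes it occupies.
	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("U+%04x %d\n", r, size) // U+03c0 2
}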

Beyond convention, the range clause also assumes the string contains UTF-8 data, and it’s not an error if it does not. Bytes not containing valid UTF-8 data appear as a REPLACEMENT CHARACTER (U+FFFD).

func main() {
    s := "π\xff"
    for _, r := range s {
        fmt.Printf("U+%04x\n", r)
    }
}

// U+03c0
// U+fffd

A further case of the language favoring UTF-8 is that casting a string to []rune decodes strings into code points, like UCS-4, again using REPLACEMENT CHARACTER:

func main() {
    s := "π\xff"
    r := []rune(s)
    fmt.Printf("U+%04x\n", r[0])
    fmt.Printf("U+%04x\n", r[1])
}

// U+03c0
// U+fffd

So, like Julia, there’s no pretending, and the programmer explicitly must consider the problem.

Preferences

All-in-all I probably prefer how Julia and Go are explicit with UTF-8’s limitations, rather than Emacs Lisp’s attempt to cover it up with an internal optimization. Since the abstraction is leaky, it may as well be made explicit.

Looking for Entropy in All the Wrong Places

Imagine we’re writing a C program and we need some random numbers. Maybe it’s for a game, or for a Monte Carlo simulation, or for cryptography. The standard library has a rand() function for some of these purposes.

int r = rand();

There are some problems with this. Typically the implementation is a rather poor PRNG, and we can do much better. It’s a poor choice for Monte Carlo simulations, and outright dangerous for cryptography. Furthermore, it’s usually a dynamic function call, which has a high overhead compared to how little the function actually does. In glibc, it’s also synchronized, adding even more overhead.

But, more importantly, this function returns the same sequences of values each time the program runs. If we want different numbers each time the program runs, it needs to be seeded — but seeded with what? Regardless of what PRNG we ultimately use, we need inputs unique to this particular execution.

The right places

On any modern unix-like system, the classical approach is to open /dev/urandom and read some bytes. It’s not part of POSIX but it is a de facto standard. These random bits are seeded from the physical world by the operating system, making them highly unpredictable and uncorrelated. They’re suitable for keying a CSPRNG and, from there, generating all the secure random bits you will ever need (perhaps with fast-key-erasure). Why not /dev/random? Because on Linux it’s pointlessly superstitious, which has basically ruined that path for everyone.

/* Returns zero on failure. */
int
getbits(void *buf, size_t len)
{
    int result = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        result = fread(buf, len, 1, f);
        fclose(f);
    }
    return result;
}

int
main(void)
{
    unsigned seed;
    if (getbits(&seed, sizeof(seed))) {
        srand(seed);
    } else {
        die();
    }

    /* ... */
}

Note how there are two different places getbits() could fail, with multiple potential causes.

The need to create a file descriptor is a serious issue for libraries. Libraries that quietly create and close file descriptors can interfere with the main program, especially if it’s asynchronous. The main program might rely on file descriptors being consecutive, predictable, or monotonic. File descriptors are also a limited resource, so the library may exhaust a file descriptor slot needed by the main program. For a network service, a remote attacker could perhaps open enough sockets to deny a file descriptor to getbits(), blocking the program from gathering entropy.

/dev/urandom is simple, but it’s not an ideal API.

getentropy(2)

Wouldn’t it be nicer if our program could just directly ask the operating system to fill a buffer with random bits? That’s what the OpenBSD folks thought, so they introduced a getentropy(2) system call. When called correctly it cannot fail!

int getentropy(void *buf, size_t buflen);

Other operating systems followed suit, including Linux, though on Linux getentropy(2) is a library function implemented using getrandom(2), the actual system call. It’s been in the Linux kernel since version 3.17 (October 2014), but the libc wrapper didn’t appear in glibc until version 2.25 (February 2017). So, as of this writing, there are many systems where it’s still not practical to use even if the kernel is new enough.

For now, on Linux you may still want to check for an ENOSYS result and have a fallback strategy in place. Some systems are still running kernels that are 5 years old, or older.
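Here’s a minimal sketch of such a fallback, assuming a libc that declares getentropy() in unistd.h. The fillrandom() wrapper is hypothetical, and getbits() is the /dev/urandom reader from above.

#include <errno.h>
#include <unistd.h>

/* Returns zero on failure, like getbits(). */
int
fillrandom(void *buf, size_t len)
{
    if (!getentropy(buf, len)) {
        return 1;
    }
    if (errno == ENOSYS) {
        return getbits(buf, len);  /* old kernel: fall back to /dev/urandom */
    }
    return 0;
}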

OpenBSD also has another trick up its trick-filled sleeves: the .openbsd.randomdata section. Just as the .bss section is filled with zeros, the .openbsd.randomdata section is filled with securely-generated random bits. You could put your PRNG state in this section and it will be seeded as part of loading the program. Cool!
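For instance, I believe the state could be placed there with a section attribute, something like this untested, OpenBSD-specific sketch:

#include <stdint.h>

/* OpenBSD only: the loader fills this with random bytes before main() runs. */
static uint64_t prng_state[4]
    __attribute__((section(".openbsd.randomdata")));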

RtlGenRandom()

Windows doesn’t have /dev/urandom. Instead it has CryptGenRandom() from the legacy CryptoAPI, which first requires acquiring a provider context with CryptAcquireContext().

Though in typical Win32 fashion, the API is ugly, overly-complicated, and has multiple possible failure points. It’s essentially impossible to use without referencing documentation. Ugh.

However, Windows XP and later have RtlGenRandom(), which has a much more reasonable interface. Looks an awful lot like getentropy(2), eh?

BOOLEAN RtlGenRandom(
  PVOID RandomBuffer,
  ULONG RandomBufferLength
);

The problem is that it’s not quite an official API, and no promises are made about it. In practice, so much software now depends on it that the API is unlikely to ever break. Despite the prototype above, this function is actually named SystemFunction036(), and you have to supply your own prototype. Here’s my little drop-in snippet that nearly turns it into getentropy(2):

#ifdef _WIN32
#  define WIN32_LEAN_AND_MEAN
#  include <windows.h>
#  pragma comment(lib, "advapi32.lib")
   BOOLEAN NTAPI SystemFunction036(PVOID, ULONG);
#  define getentropy(buf, len) (SystemFunction036(buf, len) ? 0 : -1)
#endif

It works in Wine, too, where, at least in my version, it reads from /dev/urandom.
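With that snippet in place, seeding on Windows looks just like it does anywhere else with getentropy():

    unsigned seed;
    if (!getentropy(&seed, sizeof(seed))) {
        srand(seed);
    }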

The wrong places

That’s all well and good, but suppose we’re masochists. We want our program to be maximally portable so we’re sticking strictly to functionality found in the standard C library. That means no getentropy(2) and no RtlGenRandom(). We can still try to open /dev/urandom, but it might fail, or it might not actually be useful, so we’ll want a backup.

The usual approach found in a thousand tutorials is time(3):

srand(time(NULL));

It would be better to use an integer hash function to mix up the result from time(0) before using it as a seed. Otherwise two programs started close in time may have similar initial sequences.

srand(triple32(time(NULL)));

The more pressing issue is that time(3) has a resolution of one second. If the program is run twice within the same second, they’ll both have the same sequence of numbers. It would be better to use a higher resolution clock, but standard C (prior to C11’s timespec_get()) doesn’t provide a wall-clock time with better than one second resolution. That normally requires calling into POSIX or Win32.

So, we need to find some other sources of entropy unique to each execution of the program.

Quick and dirty “string” hash function

Before we get into that, we need a way to mix these different sources together. Here’s a small, 32-bit “string” hash function. The loop is the same algorithm as Java’s hashCode(), and I appended my own integer hash as a finalizer for much better diffusion.

uint32_t
hash32s(const void *buf, size_t len, uint32_t h)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    h ^= h >> 17;
    h *= UINT32_C(0xed5ad4bb);
    h ^= h >> 11;
    h *= UINT32_C(0xac4c1b51);
    h ^= h >> 15;
    h *= UINT32_C(0x31848bab);
    h ^= h >> 14;
    return h;
}

It accepts a starting hash value, which is essentially a “context” for the digest that allows different inputs to be appended together. The finalizer acts as an implicit “stop” symbol in between inputs.
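For example, two inputs can be appended to the same digest by feeding the result of one call into the next (the strings here are just placeholders):

    uint32_t h = 0;
    h = hash32s("foo", 3, h);  /* first input */
    h = hash32s("bar", 3, h);  /* appended after an implicit "stop" */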

I used fixed-width integers, but it could be written nearly as concisely using only unsigned long and some masking to truncate to 32-bits. I leave this as an exercise to the reader.

Some of the values to be mixed in will be pointers themselves. These could instead be cast to integers and passed through an integer hash function, but using string hash avoids various caveats. Besides, one of the inputs will be a string, so we’ll need this function anyway.

Randomized pointers (ASLR, random stack gap, etc.)

Attackers can use predictability to their advantage, so modern systems use unpredictability to improve security. Memory addresses for various objects and executable code are randomized since some attacks require an attacker to know their addresses. We can skim entropy from these pointers to seed our PRNG.

Address Space Layout Randomization (ASLR) is when executable code and its associated data is loaded to a random offset by the loader. Code designed for this is called Position Independent Code (PIC). This has long been used when loading dynamic libraries so that all of the libraries on a system don’t have to coordinate with each other to avoid overlapping.

To improve security, it has more recently been extended to programs themselves. On both modern unix-like systems and Windows, position-independent executables (PIE) are now the default.

To skim entropy from ASLR, we just need the address of one of our functions. All the functions in our program will have the same relative offset, so there’s no reason to use more than one. An obvious choice is main():

    uint32_t h = 0;  /* initial hash value */
    int (*mainptr)() = main;
    h = hash32s(&mainptr, sizeof(mainptr), h);

Notice I had to store the address of main() in a variable, and then treat the pointer itself as a buffer for the hash function? It’s not hashing the machine code behind main, just its address. The symbol main doesn’t store an address, so it can’t be given to the hash function to represent its address. This is analogous to an array versus a pointer.

On a typical x86-64 Linux system, and when this is a PIE, that’s about 3 bytes worth of entropy. On 32-bit systems, virtual memory is so tight that it’s worth a lot less. We might want more entropy than that, and we want to cover the case where the program isn’t compiled as a PIE.

On unix-like systems, programs are typically dynamically linked against the C library, libc. Each shared object gets its own ASLR offset, so we can skim more entropy from each shared object by picking a function or variable from each. Let’s do malloc(3) for libc ASLR:

    void *(*mallocptr)() = malloc;
    h = hash32s(&mallocptr, sizeof(mallocptr), h);

Allocators themselves often randomize the addresses they return so that data objects are stored at unpredictable addresses. In particular, glibc uses different strategies for small (brk(2)) versus big (mmap(2)) allocations. That’s two different sources of entropy:

    void *small = malloc(1);        /* 1 byte */
    h = hash32s(&small, sizeof(small), h);
    free(small);

    void *big = malloc(1UL << 20);  /* 1 MB */
    h = hash32s(&big, sizeof(big), h);
    free(big);

Finally the stack itself is often mapped at a random address, or at least started with a random gap, so that local variable addresses are also randomized.

    void *ptr = &ptr;
    h = hash32s(&ptr, sizeof(ptr), h);

Time sources

We haven’t used time(3) yet! Let’s still do that, using the full width of time_t this time around:

    time_t t = time(0);
    h = hash32s(&t, sizeof(t), h);

We do have another time source to consider: clock(3). It returns an approximation of the processor time used by the program. There’s a tiny bit of noise and inconsistency between repeated calls. We can use this to extract a little bit of entropy over many repeated calls.

Naively we might try to use it like this:

    /* Note: don't use this */
    for (int i = 0; i < 1000; i++) {
        clock_t c = clock();
        h = hash32s(&c, sizeof(c), h);
    }

The problem is that the resolution of clock() is typically coarse enough that modern computers can execute many instructions between ticks. On Windows, where CLOCKS_PER_SEC is low, that entire loop will typically complete before the result from clock() increments even once. With that arrangement we’re hardly getting anything from it! So here’s a better version:

    for (int i = 0; i < 1000; i++) {
        unsigned long counter = 0;
        clock_t start = clock();
        while (clock() == start)
            counter++;
        h = hash32s(&start, sizeof(start), h);
        h = hash32s(&counter, sizeof(counter), h);
    }

The counter makes the resolution of the clock no longer important. If it’s low resolution, then we’ll get lots of noise from the counter. If it’s high resolution, then we get noise from the clock value itself. Running the hash function an extra time between overall clock(3) samples also helps with noise.

A legitimate use of tmpnam(3)

We’ve got one more source of entropy available: tmpnam(3). This function generates a unique, temporary file name. It’s dangerous to use as intended because it doesn’t actually create the file. There’s a race between generating the name for the file and actually creating it.

Fortunately we don’t actually care about the name as a filename. We’re using this to sample entropy not directly available to us. In an attempt to get a unique name, the standard C library draws on its own sources of entropy.

    char buf[L_tmpnam] = {0};
    tmpnam(buf);
    h = hash32s(buf, sizeof(buf), h);

The rather unfortunate downside is that many modern systems produce a linker warning when they see tmpnam(3) being linked, even though in this case it’s completely harmless.

So what goes into a temporary filename? It depends on the implementation.

glibc and musl

Both get a high resolution timestamp and generate the filename directly from the timestamp (no hashing, etc.). Unfortunately glibc does a very poor job of also mixing getpid(2) into the timestamp before using it, and probably makes things worse by doing so.

On these platforms, this is a way to sample a high resolution timestamp without calling anything non-standard.

dietlibc

In the latest release as of this writing it uses rand(3), which makes this useless. It’s also a bug since the C library isn’t allowed to affect the state of rand(3) outside of rand(3) and srand(3). I submitted a bug report and this has since been fixed.

In the next release it will use a generator seeded by the ELF AT_RANDOM value if available, or ASLR otherwise. This makes it moderately useful.

libiberty

Generated from getpid(2) alone, with a counter to handle multiple calls. It’s basically a way to sample the process ID without actually calling getpid(2).

BSD libc / Bionic (Android)

Actually gathers real entropy from the operating system (via arc4random(3)), which means we’re getting a lot of mileage out of this one.

uclibc

Its implementation is obviously forked from glibc. However, it first tries to read entropy from /dev/urandom, and only if that fails does it fall back to glibc’s original high resolution clock XOR getpid(2) method (still not hashing it).

Finishing touches

Finally, still use /dev/urandom if it’s available. This doesn’t require us to trust that the output is anything useful since it’s just being mixed into the other inputs.

    char rnd[4];
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(rnd, sizeof(rnd), 1, f))
            h = hash32s(rnd, sizeof(rnd), h);
        fclose(f);
    }

When we’re all done gathering entropy, set the seed from the result.

    srand(h);   /* or whatever you're seeding */

That’s bound to find some entropy on just about any host. Though definitely don’t rely on the results for cryptography.
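For reference, here’s one way to pull the fragments above together into a single seeding function. It’s nothing more than the earlier pieces concatenated, and it assumes hash32s() plus the usual headers (stdio.h, stdlib.h, time.h, stdint.h) are in scope.

uint32_t
gather_entropy(void)
{
    uint32_t h = 0;

    /* ASLR: program image and libc */
    uint32_t (*self)(void) = gather_entropy;
    h = hash32s(&self, sizeof(self), h);
    void *(*mallocptr)() = malloc;
    h = hash32s(&mallocptr, sizeof(mallocptr), h);

    /* allocator and stack randomization */
    void *small = malloc(1);
    h = hash32s(&small, sizeof(small), h);
    free(small);
    void *big = malloc(1UL << 20);
    h = hash32s(&big, sizeof(big), h);
    free(big);
    void *sp = &sp;
    h = hash32s(&sp, sizeof(sp), h);

    /* time sources */
    time_t t = time(0);
    h = hash32s(&t, sizeof(t), h);
    for (int i = 0; i < 1000; i++) {
        unsigned long counter = 0;
        clock_t start = clock();
        while (clock() == start)
            counter++;
        h = hash32s(&start, sizeof(start), h);
        h = hash32s(&counter, sizeof(counter), h);
    }

    /* whatever entropy the C library hides in temporary names */
    char tmp[L_tmpnam] = {0};
    tmpnam(tmp);
    h = hash32s(tmp, sizeof(tmp), h);

    /* /dev/urandom, if it happens to be available */
    char rnd[4];
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(rnd, sizeof(rnd), 1, f))
            h = hash32s(rnd, sizeof(rnd), h);
        fclose(f);
    }

    return h;
}

Seeding is then just srand(gather_entropy()).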

Lua

I recently tackled this problem in Lua. It has a no-batteries-included design, demanding very little of its host platform: nothing more than an ANSI C implementation. Because of this, a Lua program has even fewer options for gathering entropy than C. But it’s still not impossible!

To further complicate things, Lua code is often run in a sandbox with some features removed. For example, Lua has os.time() and os.clock() wrapping the C equivalents, allowing for the same sorts of entropy sampling. When run in a sandbox, os might not be available. Similarly, io might not be available for accessing /dev/urandom.

Have you ever printed a table, though? Or a function? It evaluates to a string containing the object’s address.

$ lua -e 'print(math)'
table: 0x559577668a30
$ lua -e 'print(math)'
table: 0x55e4a3679a30

Since the raw pointer values are leaked to Lua, we can skim allocator entropy like before. Here’s the same hash function in Lua 5.3:

local function hash32s(buf, h)
    for i = 1, #buf do
        h = h * 31 + buf:byte(i)
    end
    h = h & 0xffffffff
    h = h ~ (h >> 17)
    h = h * 0xed5ad4bb
    h = h & 0xffffffff
    h = h ~ (h >> 11)
    h = h * 0xac4c1b51
    h = h & 0xffffffff
    h = h ~ (h >> 15)
    h = h * 0x31848bab
    h = h & 0xffffffff
    h = h ~ (h >> 14)
    return h
end

Now hash a bunch of pointers in the global environment:

local h = hash32s({}, 0)  -- hash a new table
for varname, value in pairs(_G) do
    h = hash32s(varname, h)
    h = hash32s(tostring(value), h)
    if type(value) == 'table' then
        for k, v in pairs(value) do
            h = hash32s(tostring(k), h)
            h = hash32s(tostring(v), h)
        end
    end
end

math.randomseed(h)

Unfortunately this doesn’t actually work well on one platform I tested: Cygwin. Cygwin has few security features: notably it lacks ASLR and has a largely deterministic allocator.

When to use it

In practice it’s not really necessary to use these sorts of tricks for gathering entropy from odd places. It’s something that comes up more in coding challenges and exercises than in real programs. I’m probably already making platform-specific calls in any program substantial enough to need it anyway.

On a few occasions I have thought about these things when debugging. ASLR makes return pointers on the stack slightly randomized on each run, which can change the behavior of some kinds of bugs. Allocator and stack randomization does similar things to most of your pointers. GDB tries to disable some of these features during debugging, but it doesn’t get everything.

Fibers: the Most Elegant Windows API

This article was discussed on Hacker News.

The Windows API — a.k.a. Win32 — is notorious for being clunky, ugly, and lacking good taste. Microsoft has done a pretty commendable job with backwards compatibility, but the trade-off is that the API is filled to the brim with historical cruft. Every hasty, poor design over the decades is carried forward forever, and, in many cases, even built upon, which essentially doubles down on past mistakes. POSIX certainly has its own ugly corners, but those are the exceptions. In the Windows API, elegance is the exception.

That’s why, when I recently revisited the Fibers API, I was pleasantly surprised. It’s one of the exceptions — much cleaner than the optional, deprecated, and now obsolete POSIX equivalent. It’s not quite an apples-to-apples comparison since the POSIX version is slightly more powerful, and more complicated as a result. I’ll cover the difference in this article.

For the last part of this article, I’ll walk through an async/await framework built on top of fibers. The framework allows coroutines in C programs to await on arbitrary kernel objects.

Fiber Async/await Demo

Fibers

Windows fibers are really just stackful, symmetric coroutines. From a different point of view, they’re cooperatively scheduled threads, which is the source of the analogous name, fibers. They’re symmetric because all fibers are equal, and no fiber is the “main” fiber. If any fiber returns from its start routine, the program exits. (Older versions of Wine will crash when this happens, but it was recently fixed.) It’s equivalent to the process’ main thread returning from main(). The initial fiber is free to create a second fiber, yield to it, then the second fiber destroys the first.

For now I’m going to focus on the core set of fiber functions. There are some additional capabilities I’m going to ignore, including support for fiber local storage. The important functions are just these five:

void *CreateFiber(size_t stack_size, void (*proc)(void *), void *arg);
void  SwitchToFiber(void *fiber);
bool  ConvertFiberToThread(void);
void *ConvertThreadToFiber(void *arg);
void  DeleteFiber(void *fiber);

To emphasize its simplicity, I’ve shown them here with more standard prototypes than seen in their formal documentation. That documentation uses the clunky Windows API typedefs still burdened with its 16-bit heritage — e.g. LPVOID being a “long pointer” from the segmented memory of the 8086:
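/* Roughly as documented, minus the calling-convention macros: */
LPVOID CreateFiber(SIZE_T dwStackSize, LPFIBER_START_ROUTINE lpStartAddress,
                   LPVOID lpParameter);
VOID   SwitchToFiber(LPVOID lpFiber);
BOOL   ConvertFiberToThread(void);
LPVOID ConvertThreadToFiber(LPVOID lpParameter);
VOID   DeleteFiber(LPVOID lpFiber);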

Fibers are represented using opaque, void pointers. Maybe that’s a little too simple since it’s easy to misuse in C, but I like it. The return values for CreateFiber() and ConvertThreadToFiber() are void pointers since these both create fibers.

The fiber start routine returns nothing and takes a void “user pointer”. That’s nearly what I’d expect, except that it would probably make more sense for a fiber to return int, which is more in line with main / WinMain / mainCRTStartup / WinMainCRTStartup. As I said, when any fiber returns from its start routine, it’s like returning from the main function, so it should probably have returned an integer.

A fiber may delete itself, which is the same as exiting the thread. However, a fiber cannot yield (e.g. SwitchToFiber()) to itself. That’s undefined behavior.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void
coup(void *king)
{
    puts("Long live the king!");
    DeleteFiber(king);
    ConvertFiberToThread(); /* seize the main thread */
    /* ... */
}

int
main(void)
{
    void *king = ConvertThreadToFiber(0);
    void *pretender = CreateFiber(0, coup, king);
    SwitchToFiber(pretender);
    abort(); /* unreachable */
}

Only fibers can yield to fibers, but when the program starts up, there are no fibers. At least one thread must first convert itself into a fiber using ConvertThreadToFiber(), which returns the fiber object that represents itself. It takes one argument analogous to the last argument of CreateFiber(), except that there’s no start routine to accept it. The process is reversed with ConvertFiberToThread().

Fibers don’t belong to any particular thread and can be scheduled on any thread if properly synchronized. Obviously one should never yield to the same fiber in two different threads at the same time.

Contrast with POSIX

The POSIX equivalent was the ucontext context-switching API. It’s also stackful and symmetric, but it has just three important functions: getcontext(3), makecontext(3), and swapcontext(3).

int  getcontext(ucontext_t *ucp);
void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
int  swapcontext(ucontext_t *oucp, const ucontext_t *ucp);

These are roughly equivalent to GetCurrentFiber(), CreateFiber(), and SwitchToFiber(). There is no need for ConvertFiberToThread() since threads can context switch without preparation. There’s also no DeleteFiber() because the resources are managed by the program itself. That’s where POSIX contexts are a little bit more powerful.

The first argument to CreateFiber() is the desired stack size, with zero indicating the default stack size. The stack is allocated and freed by the operating system. The downside is that the caller doesn’t have a choice in managing the lifetime of this stack and how it’s allocated. If you’re frequently creating and destroying coroutines, those stacks are constantly being allocated and freed.

In makecontext(3), the caller allocates and supplies the stack. Freeing that stack is equivalent to destroying the context. A program that frequently creates and destroys contexts can maintain a stack pool or otherwise more efficiently manage their allocation. This makes it more powerful, but it also makes it a little more complicated. It would be hard to remember how to do all this without a careful reading of the documentation:

/* Create a context */
ucontext_t ctx;
ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
ctx.uc_stack.ss_size = SIGSTKSZ;
ctx.uc_link = 0;
getcontext(&ctx);
makecontext(&ctx, proc, 0);

/* Destroy a context */
free(ctx.uc_stack.ss_sp);

Note how makecontext(3) is variadic (...), passing its arguments on to the start routine of the context. This seems like it might be better than a user pointer. Unfortunately it’s not, since those arguments are strictly limited to integers.
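To make the contrast concrete, here’s a tiny, self-contained sketch of symmetric switching with swapcontext(3), using the initialization order from the man page examples:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

static void
proc(void)
{
    puts("in coroutine");
    swapcontext(&co_ctx, &main_ctx);  /* yield back to main */
    puts("coroutine resumed");
}                                     /* returning follows uc_link to main */

int
main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = malloc(SIGSTKSZ);
    co_ctx.uc_stack.ss_size = SIGSTKSZ;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, proc, 0);

    swapcontext(&main_ctx, &co_ctx);  /* run the coroutine */
    puts("back in main");
    swapcontext(&main_ctx, &co_ctx);  /* resume it until it finishes */
    free(co_ctx.uc_stack.ss_sp);
    puts("done");
}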

Ultimately I like the fiber API better. The first time I tried it out, I could guess my way through it without looking closely at the documentation.

Async / await with fibers

Why was I looking at the Fiber API? I’ve known about coroutines for years but I didn’t understand how they could be useful. Sure, the function can yield, but what other coroutine should it yield to? It wasn’t until I was recently bit by the async/await bug that I finally saw a “killer feature” that justified their use. Generators come pretty close, though.

Windows fibers are a coroutine primitive suitable for async/await in C programs, where it can also be useful. To prove that it’s possible, I built async/await on top of fibers in 95 lines of code.

The alternatives are to use a third-party coroutine library or to do it myself with some assembly programming. However, having it built into the operating system is quite convenient! It’s unfortunate that it’s limited to Windows. Ironically, though, everything I wrote for this article, including the async/await demonstration, was originally written on Linux using Mingw-w64 and tested using Wine. Only after I was done did I even try it on Windows.

Before diving into how it works, there’s a general concept about the Windows API that must be understood: All kernel objects can be in either a signaled or unsignaled state. The API provides functions that block on a kernel object until it is signaled. The two important ones are WaitForSingleObject() and WaitForMultipleObjects(). The latter behaves very much like poll(2) in POSIX.

Usually the signal is tied to some useful event, like a process or thread exiting, the completion of an I/O operation (i.e. asynchronous overlapped I/O), a semaphore being incremented, etc. It’s a generic way to wait for some event. However, instead of blocking the thread, wouldn’t it be nice to await on the kernel object? In my aio library for Emacs, the fundamental “wait” object was a promise. For this API it’s a kernel object handle.

So, the await function will take a kernel object, register it with the scheduler, then yield to the scheduler. The scheduler — which is a global variable, so there’s only one scheduler per process — looks like this:

struct {
    void *main_fiber;
    HANDLE handles[MAXIMUM_WAIT_OBJECTS];
    void *fibers[MAXIMUM_WAIT_OBJECTS];
    void *dead_fiber;
    int count;
} async_loop;

While fibers are symmetric, coroutines in my async/await implementation are not. One fiber is the scheduler, main_fiber, and the other fibers always yield to it.

There is an array of kernel object handles, handles, and an array of fibers. The elements in these arrays are paired with each other, but it’s convenient to store them separately, as I’ll show soon. fibers[0] is waiting on handles[0], and so on.

The array is a fixed size, MAXIMUM_WAIT_OBJECTS (64), because there’s a hard limit on the number of fibers that can wait at once. This pathetically small limitation is an unfortunate, hard-coded restriction of the Windows API. It kills most practical uses of my little library. Fortunately there’s no limit on the number of handles we might want to wait on, just the number of co-existing fibers.

When a fiber is about to return from its start routine, it yields one last time and registers itself on the dead_fiber member. The scheduler will delete this fiber as soon as it’s given control. Fibers never truly return since that would terminate the program.

With this, the await function, async_await(), is pretty simple. It registers the handle with the scheduler, then yields to the scheduler fiber.

void
async_await(HANDLE h)
{
    async_loop.handles[async_loop.count] = h;
    async_loop.fibers[async_loop.count] = GetCurrentFiber();
    async_loop.count++;
    SwitchToFiber(async_loop.main_fiber);
}

Caveat: The scheduler destroys this handle with CloseHandle() after it signals, so don’t try to reuse it. This made my demonstration simpler, but it might be better to not do this.

A fiber can exit at any time. Such an exit is inserted implicitly before a fiber actually returns:

void
async_exit(void)
{
    async_loop.dead_fiber = GetCurrentFiber();
    SwitchToFiber(async_loop.main_fiber);
}

The start routine given to async_start() is actually wrapped in the real start routine. This is how async_exit() is injected:

struct fiber_wrapper {
    void (*func)(void *);
    void *arg;
};

static void
fiber_wrapper(void *arg)
{
    struct fiber_wrapper *fw = arg;
    fw->func(fw->arg);
    async_exit();
}

int
async_start(void (*func)(void *), void *arg)
{
    if (async_loop.count == MAXIMUM_WAIT_OBJECTS) {
        return 0;
    } else {
        struct fiber_wrapper fw = {func, arg};
        SwitchToFiber(CreateFiber(0, fiber_wrapper, &fw));
        return 1;
    }
}

The library provides a single awaitable function, async_sleep(). It creates a “waitable timer” object, starts the countdown, and returns it. (Notice how SetWaitableTimer() is a typically-ugly Win32 function with excessive parameters.)

HANDLE
async_sleep(double seconds)
{
    HANDLE promise = CreateWaitableTimer(0, 0, 0);
    LARGE_INTEGER t;
    t.QuadPart = (long long)(seconds * -10000000.0);
    SetWaitableTimer(promise, &t, 0, 0, 0, 0);
    return promise;
}

A more realistic example would be overlapped I/O. For example, you’d open a file (CreateFile()) in overlapped mode, then when you, say, read from that file (ReadFile()) you create an event object (CreateEvent()), populate an overlapped I/O structure with the event, offset, and length, then finally await on the event object. The fiber will be resumed when the operation is complete.
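For illustration, here’s a rough, untested sketch of that sequence using the async_await() above. Error handling is omitted, and "input.dat" is just a stand-in file name.

HANDLE file = CreateFile("input.dat", GENERIC_READ, FILE_SHARE_READ, 0,
                         OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
char buf[4096];
OVERLAPPED ov = {0};
ov.hEvent = CreateEvent(0, 1, 0, 0);  /* manual-reset, initially unsignaled */
ov.Offset = 0;                        /* read from the start of the file */
ReadFile(file, buf, sizeof(buf), 0, &ov);

async_await(ov.hEvent);               /* fiber sleeps until the read completes */

DWORD len;
GetOverlappedResult(file, &ov, &len, FALSE);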

Side note: Unfortunately overlapped I/O doesn’t work correctly for files, and many operations can’t be done asynchronously, like opening files. When it comes to files, you’re better off using dedicated threads as libuv does instead of overlapped I/O. You can still await on these operations. You’d just await on the signal from the thread doing synchronous I/O, not from overlapped I/O.

The most complex part is the scheduler, and it’s really not complex at all:

void
async_run(void)
{
    while (async_loop.count) {
        /* Wait for next event */
        DWORD nhandles = async_loop.count;
        HANDLE *handles = async_loop.handles;
        DWORD r = WaitForMultipleObjects(nhandles, handles, 0, INFINITE);

        /* Remove event and fiber from waiting array */
        void *fiber = async_loop.fibers[r];
        CloseHandle(async_loop.handles[r]);
        async_loop.handles[r] = async_loop.handles[nhandles - 1];
        async_loop.fibers[r] = async_loop.fibers[nhandles - 1];
        async_loop.count--;

        /* Run the fiber */
        SwitchToFiber(fiber);

        /* Destroy the fiber if it exited */
        if (async_loop.dead_fiber) {
            DeleteFiber(async_loop.dead_fiber);
            async_loop.dead_fiber = 0;
        }
    }
}

This is why the handles are in their own array. The array can be passed directly to WaitForMultipleObjects(). The return value indicates which handle was signaled. The handle is closed, the entry removed from the scheduler, and then the fiber is resumed.

That WaitForMultipleObjects() is what limits the number of fibers. It’s not possible to wait on more than 64 handles at once! This is hard-coded into the API. How? A return value of 64 is an error code, and changing this would break the API. Remember what I said about being locked into bad design decisions of the past?

To be fair, WaitForMultipleObjects() was a doomed API anyway, just like select(2) and poll(2) in POSIX. It scales very poorly since the entire array of objects being waited on must be traversed on each call. That’s terribly inefficient when waiting on large numbers of objects. This sort of problem is solved by interfaces like kqueue (BSD), epoll (Linux), and IOCP (Windows). Unfortunately IOCP doesn’t really fit this particular problem well — awaiting on kernel objects — so I couldn’t use it.

When the awaiting fiber count is zero and the scheduler has control, all fibers must have completed and there’s nothing left to do. However, the caller can schedule more fibers and then restart the scheduler if desired.

That’s all there is to it. Have a look at demo.c to see how the API looks in some trivial examples. On Linux you can see it in action with make check. On Windows, you just need to compile it, then run it like a normal program. If there was a better function than WaitForMultipleObjects() in the Windows API, I would have considered turning this demonstration into a real library.

Endlessh: an SSH Tarpit

This article was discussed on Hacker News, on reddit (also), and featured in BSD Now 294.

I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.

The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:

Endlessh: an SSH tarpit

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

$ make
$ ./endlessh &
$ ssh -p2222 localhost

Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.

My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.

How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:

SSH-protoversion-softwareversion SP comments CR LF

The RFC also notes:

The server MAY send other lines of data before sending the version string.

There is no limit on the number of lines, just that these lines must not begin with “SSH-” since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.

This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.

Implementation strategies

Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)

Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.

Processes

The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):

for (;;) {
    ssize_t r;
    char line[256];

    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}
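Both DELAY and generate_line() are placeholders. A hypothetical generate_line(), which is not Endlessh’s actual implementation, only has to respect the RFC’s constraints: at most 255 bytes including CRLF, and never beginning with “SSH-”. Something like:

static void
generate_line(char *line)
{
    int len = 4 + rand() % 32;
    for (int i = 0; i < len; i++) {
        line[i] = ' ' + rand() % 95;  /* random printable ASCII */
    }
    if (!memcmp(line, "SSH-", 4)) {
        line[0] = 'X';                /* never start with the banner prefix */
    }
    line[len++] = '\r';
    line[len++] = '\n';
    line[len] = 0;
}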

A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.

Threads

A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.

Poll

For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.

Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).

When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
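Sketched out, the idea is that the poll(2) timeout does the sleeping. Here accept_fd and nextdue_ms(), the milliseconds until the earliest trapped client is owed another line, are hypothetical stand-ins rather than Endlessh’s real variables:

struct pollfd fds[] = {
    {.fd = accept_fd, .events = POLLIN},
};
int r = poll(fds, 1, nextdue_ms());
if (r > 0) {
    /* a new client is waiting to be accepted */
} else if (r == 0) {
    /* timeout expired: some trapped client is due for its next line */
}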

There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.

I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that the overdue sockets must be registered and unregistered each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. However, most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.

One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I’d like, but otherwise it’s pretty harmless. Unfortunately, even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.

Raw sockets

With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.

Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.

Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.

If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.

I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.

asyncio and other tarpits

Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near-perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.

Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:

import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:

import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually timeout on its own in the default settings (--max-time/-m does work correctly, though).

Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.
