nullprogram.com/blog/2019/07/22/
Over the last couple of weeks I’ve spent a lot more time working with
OpenPGP keys. It’s a consequence of polishing my passphrase-derived
PGP key generator. I’ve tightened up the internals, and it’s
enabled me to explore the corners of the format, try interesting
things, and observe how various OpenPGP implementations respond to
weird inputs.
For one particularly cool trick, take a look at these two (private)
keys I generated yesterday. Here’s the first:
-----BEGIN PGP PRIVATE KEY BLOCK-----
xVgEXTU3gxYJKwYBBAHaRw8BAQdAjJgvdh3N2pegXPEuMe25nJ3gI7k8gEgQvCor
AExppm4AAQC0TNsuIRHkxaGjLNN6hQowRMxLXAMrkZfMcp1DTG8GBg1TzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAGq0APwOtCFVCxpv
d/gzKUg0SkdygmriV1UmrQ+KYx9dhzC6xwEAqwDGsSgSbCqPdkwqi/tOn+MwZ5N9
jYxy48PZGZ2V3ws=
=bBGR
-----END PGP PRIVATE KEY BLOCK-----
And the second:
-----BEGIN PGP PRIVATE KEY BLOCK-----
xVgEXTU3gxYJKwYBBAHaRw8BAQdAzjSPKjpOuJoLP6G0z7pptx4sBNiqmgEI0xiH
Z4Xb16kAAP0Qyon06UB2/gOeV/KjAjCi91MeoUd7lsA5yn82RR5bOxAkzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAEv4AQDLRqx10v3M
bwVnJ8BDASAOzrPw+Rz1tKbjG9r45iE7NQEAhm9QVtFd8SN337kIWcq8wXA6j1tY
+UeEsjg+SHzkqA4=
=QnLn
-----END PGP PRIVATE KEY BLOCK-----
Concatenate these and then import them into GnuPG to have a look at
them. To avoid littering in your actual keyring, especially with
private keys, use the --homedir
option to set up a temporary
keyring. I’m going to omit that option in the examples.
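With that option, the import would look like this instead:

$ mkdir -m 700 /tmp/gpg-scratch
$ gpg --homedir /tmp/gpg-scratch --import < keys.asc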
$ gpg --import < keys.asc
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: Total number processed: 2
gpg:               imported: 2
gpg:       secret keys read: 2
gpg:   secret keys imported: 2
The user ID is “nullprogram.com” since I made these and that’s me
taking credit. “992A5EEE1D1049FA” is called the long key ID: a
64-bit value that identifies the key. It’s the lowest 64 bits of the
full key ID (the fingerprint), a 160-bit SHA-1 hash. In the old days
everyone used a short key ID to identify keys, which was the lowest
32 bits of the key ID. For these keys, that would be “1D1049FA”.
However, this was deemed way too short, and everyone has since
switched to long key IDs, or even the full 160-bit key ID.
The key ID is nothing more than a SHA-1 hash of the key creation date —
unsigned 32-bit unix epoch seconds — and the public key material. So
secret keys have the same key ID as their associated public key. This
makes sense since they’re a key pair and they go together.
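To make this concrete, here is a minimal Go sketch of the V4
fingerprint computation for an Ed25519 key like the ones above. The
packet framing constants come from the OpenPGP spec, and the creation
timestamp is the one embedded in the keys above; the stand-in public
key is mine. The long key ID is just the last 8 bytes of the result.

package main

import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"
)

// Compute the V4 fingerprint: a SHA-1 hash over the public key
// packet, which contains the creation time and the key material.
func fingerprint(created uint32, pub [32]byte) [20]byte {
    var body []byte
    body = append(body, 0x04)                           // key version 4
    body = binary.BigEndian.AppendUint32(body, created) // creation time
    body = append(body, 22)                             // algorithm: EdDSA
    body = append(body, 0x09,                           // Ed25519 curve OID
        0x2b, 0x06, 0x01, 0x04, 0x01, 0xda, 0x47, 0x0f, 0x01)
    body = binary.BigEndian.AppendUint16(body, 263)     // MPI bit length
    body = append(body, 0x40)                           // native key prefix
    body = append(body, pub[:]...)

    h := sha1.New()
    h.Write([]byte{0x99}) // fingerprint prefix octet
    h.Write(binary.BigEndian.AppendUint16(nil, uint16(len(body))))
    h.Write(body)
    var fp [20]byte
    copy(fp[:], h.Sum(nil))
    return fp
}

func main() {
    var pub [32]byte // stand-in; fill in a real Ed25519 public key
    fp := fingerprint(0x5d353783, pub) // 2019-07-22, as in the keys above
    fmt.Printf("long key ID: %X\n", fp[12:])
}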
Look closely and you’ll notice that both keypairs have the same long
key ID. If you hadn’t already guessed from the title of this article,
these are two different keys with the same long key ID. In other
words, I’ve created a long key ID collision. The GnuPG --list-keys
command prints the full key ID since it’s so important:
$ gpg --list-keys
---------------------
pub   ed25519 2019-07-22 [SCA]
      A422F8B0E1BF89802521ECB2992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

pub   ed25519 2019-07-22 [SCA]
      F43BC80C4FC2603904E7BE02992A5EEE1D1049FA
uid           [ unknown] nullprogram.com
I was only targeting the lower 64 bits, but by chance I actually
managed to collide the lowest 68 bits: both fingerprints end in the
same 17 hex digits, “2992A5EEE1D1049FA”. So a long key ID still
isn’t enough to truly identify any particular key.
This isn’t news, of course. Nor am I the first person to create a long
key ID collision. In 2013, David Leon Gil published a long key ID
collision for two 4096-bit RSA public keys. However, that is the
only other example I was able to find. He did not include the private
keys and did not elaborate on how he did it. I know he did generate
viable keys, not just garbage for the public key portions, since they’re
both self-signed.
Creating these keys was trickier than I had anticipated, and there’s an
old, clever trick that makes it work. Building atop the work I did for
passphrase2pgp, I created a standalone tool, pgpcollider, that will
create a long key ID collision and print the two keypairs to
standard output. Example usage:
$ go get -u github.com/skeeto/pgpcollider
$ pgpcollider --verbose > keys.asc
This can take up to a day to complete when run like this. The tool
can optionally coordinate many machines (see the --server / -S and
--client / -C options) to work together, greatly reducing the total
time. It took around 4 hours to create the keys above on a single
machine, generating around 1 billion keys in the process. As
discussed below, I actually got lucky that it only took 1 billion.
If you modify the program to do short key ID collisions, it only
takes a few seconds.
The rest of this article is about how it works.
Birthday Attacks
An important detail is that this technique doesn’t target any specific
key ID. Cloning someone’s long key ID is still very expensive. No,
this is a birthday attack. To find a collision in a space of
2^64, on average I only need to generate 2^32 samples — the square root
of that space. That’s perfectly feasible on a regular desktop computer.
To collide long key IDs, I need only generate about 4 billion IDs and
efficiently do membership tests on that set as I go.
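That square root comes from the standard birthday-bound
approximation:

    P(collision) ≈ 1 - e^(-k^2 / 2N)

With N = 2^64 possible long key IDs and k = 2^32 generated keys,
that works out to 1 - e^(-1/2), about 39%, and the probability
climbs rapidly from there.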
Efficient membership testing at that scale is easier said than done.
Naively, it might look like this (pseudo-code):
seen := map of long key IDs to keys
loop forever {
    key := generateKey()
    longID := key.ID[12:20]
    if longID in seen {
        output seen[longID]
        output key
        break
    } else {
        seen[longID] = key
    }
}
Consider the size of that map. Each long ID is 8 bytes, and we expect
to store around 2^32 of them. That’s at minimum 32 GB of storage
just to track all the long IDs. The map itself is going to have some
overhead, too. Since these are literally random lookups, this all
mostly needs to be in RAM or else lookups are going to be very slow
and impractical.
And I haven’t even counted the keys yet. As a saving grace, these are
Ed25519 keys, so that’s 32 bytes for the public key and 32 bytes for the
private key, which I’ll need if I want to make a self-signature. (The
signature itself will be larger than the secret key.) That’s around
256GB more storage, though at least this can be stored on the hard
drive. However, to address these from the map I’d need at least 38 bits,
plus some more in case it goes over. Just call it another 8 bytes.
So that’s, at a bare minimum, 64GB of RAM plus 256GB of other storage.
Since nothing is ideal, we’ll need more than this. This is all still
feasible, but will require expensive hardware. We can do a lot better.
Keys from seeds
The first thing you might notice is that we can jettison that 256GB of
storage by being a little more clever about how we generate keys. Since
we don’t actually care about the security of these keys, we can generate
each key from a seed much smaller than the key itself. Instead of using
8 bytes to reference a key in storage, just use those 8 bytes to store
the seed used to make the key.
counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    key := generateKey(seed)
    longID := key.ID[12:20]
    if longID in seen {
        output generateKey(seen[longID])
        output key
        break
    } else {
        seen[longID] = seed
    }
}
I’m incrementing a counter to generate the seeds because I don’t
want the birthday paradox to apply to my seeds: each really must be
unique. I’m using SplitMix64 for the PRNG since I learned it’s the
fastest for Go, and since it thoroughly scrambles sequential inputs,
a simple increment to generate seeds is perfectly fine.
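Concretely, the seed expansion might look something like this in Go.
This is a sketch of the idea, not necessarily passphrase2pgp’s exact
scheme; generateKey corresponds to the function in the pseudocode:

import (
    "crypto/ed25519"
    "encoding/binary"
)

// One step of SplitMix64: bump the state by the golden-gamma
// constant, then scramble it. Even sequential states come out
// thoroughly mixed, which is why counter seeds are safe.
func splitmix64(state *uint64) uint64 {
    *state += 0x9e3779b97f4a7c15
    z := *state
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb
    return z ^ (z >> 31)
}

// Expand a 64-bit seed into a deterministic Ed25519 key.
func generateKey(seed uint64) ed25519.PrivateKey {
    var raw [32]byte
    for i := 0; i < 32; i += 8 {
        binary.LittleEndian.PutUint64(raw[i:], splitmix64(&seed))
    }
    return ed25519.NewKeyFromSeed(raw[:])
}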
Ultimately, this approach still uses utterly excessive amounts of memory.
Wouldn’t it be crazy if we could somehow get this 64GB map down to just
a few MBs of RAM? Well, we can!
Rainbow tables
For decades, password crackers have faced a similar problem. They want
to precompute the hashes for billions of popular passwords so that they
can efficiently reverse those password hashes later. However, storing
all those hashes would be unnecessarily expensive, or even infeasible.
So they don’t. Instead they use rainbow tables. Password hashes
are chained together into a hash chain: a password hash leads to a
new password, which leads to a new hash, and so on. Then they store
only the beginning and the end of each chain.
To look up a hash in the rainbow table, run the hash chain algorithm
starting from the target hash and, for each hash, check if it
matches the end of one of the chains. If so, recompute that chain
and note the step just before the target hash value. That’s the
corresponding password.
For example, suppose the password “foo” hashes to 9bfe98eb, and we
have a reduction function that maps a hash to some password. In this
case, it maps 9bfe98eb to “bar”. A trivial reduction function could
just be an index into a list of passwords. A hash chain starting from
“foo” might look like this:
foo -> 9bfe98eb -> bar -> 27af0841 -> baz -> d9d4bbcb
In reality a chain would be a lot longer. Another chain starting from
“apple” might look like this:
apple -> 7bbc06bc -> candle -> 82a46a63 -> dog -> 98c85d0a
We only store the tuples (foo, d9d4bbcb) and (apple, 98c85d0a) in
our database. If the chains had been one million hashes long, we’d
still only store those two tuples. That’s literally a 1:1000000
compression ratio!
Later on we’re faced with reversing the hash 27af0841, which isn’t
listed directly in the database. So we run the chain forward from
that hash until either we hit the maximum chain length (i.e. the
password is not in the table), or we recognize a hash:
27af0841 -> baz -> d9d4bbcb
That d9d4bbcb hash is listed as the end of the “foo” hash chain. So
we regenerate that chain to discover that “bar” leads to 27af0841.
Password cracked!
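Here is roughly what that lookup looks like in Go. It’s a sketch:
hash and reduce are stand-ins for the password hash and reduction
function, and table maps each chain’s final hash to its starting
password.

// Walk the chain forward from the target hash. If we land on a
// stored chain end, replay that chain from its start to recover
// the password immediately preceding the target hash.
func lookup(target uint32, table map[uint32]string, maxLen int,
    hash func(string) uint32, reduce func(uint32) string) (string, bool) {
    h := target
    for i := 0; i < maxLen; i++ {
        if start, ok := table[h]; ok {
            pw := start
            for j := 0; j < maxLen; j++ {
                if hash(pw) == target {
                    return pw, true // password cracked
                }
                pw = reduce(hash(pw))
            }
            return "", false // false alarm: chains merged
        }
        h = hash(reduce(h)) // step: hash -> password -> next hash
    }
    return "", false // target is not covered by the table
}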
Collider rainbow table
My collider works very similarly. A hash chain works like this: Start
with a 64-bit seed as before, generate a key, get the long key ID,
then use the long key ID as the seed for the next key.
There’s one big difference. In the rainbow table the purpose is to run
the hash function backwards by looking at the previous step in the
chain. For the collider, I want to know if any of the hash chains
collide. If two chains do, then so long as each started from a
unique seed, we’ve found two different seeds that lead to the same
long key ID. Alternatively, it could be two different seeds that
lead to the same key, which wouldn’t be useful, but that’s trivial
to avoid.
A simple and efficient way to check if two chains contain the same
sequence is to stop them at the same place in that sequence. Rather than
run the hash chains for some fixed number of steps, they stop when
they reach a distinguishing point. In my collider, a distinguishing
point is a long key ID that ends with at least N zero bits, where N
determines the average chain length. I chose 17 bits.
func computeChain(seed) {
    loop forever {
        key := generateKey(seed)
        longID := key.ID[12:20]
        if distinguished(longID) {
            return longID
        }
        seed = longID
    }
}
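With N = 17, chains terminate after about 2^17 (roughly 131,000)
keys on average. Assuming the long key ID is handled as a uint64,
the predicate itself is a one-liner:

// A long key ID is a distinguishing point when its lowest
// 17 bits are all zero.
func distinguished(longID uint64) bool {
    return longID&(1<<17-1) == 0
}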
If two different hash chains end on the same distinguishing point,
they’re guaranteed to have collided somewhere in the middle.
To determine where two chains collided, regenerate each chain and
find the first long key ID they have in common. The step just before
it in each chain yields the colliding keys.
counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    longID := computeChain(seed)
    if longID in seen {
        output findCollision(seed, seen[longID])
        break
    } else {
        seen[longID] = seed
    }
}
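Here is a sketch of one way to implement findCollision, reusing the
seed-based generateKey from earlier along with a hypothetical
longKeyID helper that extracts the low 64 bits of a key’s
fingerprint:

// Replay both chains. Record every step of chain A, then walk
// chain B until it produces a long key ID already seen in A.
// The seeds at those two steps generate the colliding keys.
func findCollision(seedA, seedB uint64) (uint64, uint64) {
    seen := make(map[uint64]uint64) // long key ID -> seed in chain A
    s := seedA
    for {
        id := longKeyID(generateKey(s))
        seen[id] = s
        if distinguished(id) {
            break
        }
        s = id
    }
    s = seedB
    for {
        id := longKeyID(generateKey(s))
        if a, ok := seen[id]; ok {
            // Check that the two keys actually differ, in case
            // two seeds produced the very same key.
            return a, s
        }
        s = id
    }
}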
Hash chain computation is embarrassingly parallel, so the load can
be spread efficiently across CPU cores. With these rainbow(-like)
tables, my tool can generate and track billions of keys in mere
megabytes of memory. The additional computational cost is the time
it takes to compute a couple more chains than otherwise necessary.