Infectious Executable Stacks

This article was discussed on Hacker News.

In software development there are many concepts that at first glance seem useful and sound, but, after considering the consequences of their implementation and use, are actually horrifying. Examples include thread cancellation, variable length arrays, and memory aliasing. GCC’s closure extension to C is another, and this little feature compromises the entire GNU toolchain.

GNU C nested functions

GCC has its own dialect of C called GNU C. One feature unique to GNU C is nested functions, which allow C programs to define functions inside other functions:

void intsort1(int *base, size_t nmemb)
{
    int cmp(const void *a, const void *b)
    {
        return *(int *)a - *(int *)b;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}

The nested function above is straightforward and harmless. It’s nothing groundbreaking, and it is trivial for the compiler to implement. The cmp function is really just a static function whose scope is limited to the containing function, no different than a local static variable.

With one slight variation the nested function turns into a closure. This is where things get interesting:

void intsort2(int *base, size_t nmemb, _Bool invert)
{
    int cmp(const void *a, const void *b)
    {
        int r = *(int *)a - *(int *)b;
        return invert ? -r : r;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}

The invert variable from the outer scope is accessed from the inner scope. This has clean, proper closure semantics and works correctly just as you’d expect. It fits quite well with traditional C semantics. The closure itself is re-entrant and thread-safe. It’s automatically (read: stack) allocated, and so it’s automatically freed when the function returns, including when the stack is unwound via longjmp(). It’s a natural progression to support closures like this via nested functions. The eventual caller, qsort, doesn’t even know it’s calling a closure!

While this seems so useful and easy, its implementation has serious consequences that, in general, outweigh its benefits. In fact, in order to make this work, the whole GNU toolchain has been specially rigged!

How does it work? The function pointer, cmp, passed to qsort must somehow be associated with its lexical environment, specifically the invert variable. A static address won’t do. When I implemented closures as a toy library, I talked about the function address for each closure instance somehow needing to be unique.

GCC accomplishes this by constructing a trampoline on the stack. That trampoline has access to the local variables stored adjacent to it, also on the stack. GCC also generates a normal cmp function, like the simple nested function before, that accepts invert as an additional argument. The trampoline calls this function, passing the local variable as this additional argument.

Trampoline illustration

To illustrate this, I’ve manually implemented intsort2() below for x86-64 (System V ABI) without using GCC’s nested function extension:

int cmp(const void *a, const void *b, _Bool invert)
{
    int r = *(int *)a - *(int *)b;
    return invert ? -r : r;
}

void intsort3(int *base, size_t nmemb, _Bool invert)
{
    unsigned long fp = (unsigned long)cmp;
    volatile unsigned char buf[] = {
        // mov  edx, invert
        0xba, invert, 0x00, 0x00, 0x00,
        // mov  rax, cmp
        0x48, 0xb8, fp >>  0, fp >>  8, fp >> 16, fp >> 24,
                    fp >> 32, fp >> 40, fp >> 48, fp >> 56,
        // jmp  rax
        0xff, 0xe0
    };
    int (*trampoline)(const void *, const void *) = (void *)buf;
    qsort(base, nmemb, sizeof(*base), trampoline);
}

Here’s a complete example you can try yourself on nearly any x86-64 unix-like system: trampoline.c. It even works with Clang. The two notable systems where stack trampolines won’t work are OpenBSD and WSL.

(Note: The volatile is necessary because C compilers rightfully do not see the contents of buf as being consumed. Execution of the contents isn’t considered.)

In case you hadn’t already caught it, there’s a catch. The linker needs to link a binary that asks the loader for an executable stack (-z execstack):

$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c

That’s because buf contains x86 code implementing the trampoline:

mov  edx, invert    ; assign third argument
mov  rax, cmp       ; store cmp address in RAX register
jmp  rax            ; jump to cmp

(Note: The absolute jump through a 64-bit register is necessary because the trampoline on the stack and the jump target will be very far apart. Further, these days the program will likely be compiled as a Position Independent Executable (PIE), so cmp might itself have a high address rather than being loaded into the lowest 32 bits of the address space.)

However, executable stacks were phased out ~15 years ago because they make buffer overflows so much more dangerous! Attackers can inject and execute whatever code they like, typically shellcode. That’s why we need this unusual linker option.

You can see that the stack will be executable using our old friend, readelf:

$ readelf -l a.out
...
  GNU_STACK  0x00000000 0x00000000 0x00000000
             0x00000000 0x00000000 RWE   0x10
...

Note the “RWE” at the bottom right, meaning read-write-execute. This is a really bad sign in a real binary. Do any binaries installed on your system right now have an executable stack? I found one on mine.

When compiling the original version using a nested function there’s no need for that special linker option. That’s because GCC saw that it would need an executable stack and used this option automatically.

Or, more specifically, GCC stopped requesting a non-executable stack in the object file it produced. For the GNU Binutils linker, the default is an executable stack.

Fail open design

Since this is the default, the only way to get a non-executable stack is if every object file input to the linker explicitly declares that it does not need an executable stack. To request a non-executable stack, an object file must contain the (empty) section .note.GNU-stack. If even a single object file fails to do this, then the final program gets an executable stack.

Not only does one contaminated object file infect the binary, everything dynamically linked with it also gets an executable stack. Entire processes are infected! This occurs even via dlopen(), where the stack is dynamically made executable to accommodate the new shared object.

I’ve been bitten by this myself. In Baking Data with Serialization I did it completely by accident, and I didn’t notice my mistake until three years later. The GNU linker outputs object files without the special note by default even though the object file only contains data.

$ echo hello world >hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$

This is fixed with -z noexecstack:

$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
  [ 2] .note.GNU-stack  PROGBITS  00000000  0000004c
$

This may happen any time you link object files not produced by GCC, such as output from the NASM assembler or hand-crafted object files.
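
The conventional fix when writing assembly yourself is to emit the (empty) note directly from the source. In GNU as syntax:

.section .note.GNU-stack,"",@progbits

And the NASM equivalent:

section .note.GNU-stack noalloc noexec nowrite progbits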

Nested C closures are super slick, but they’re just not worth the risk of an executable stack, and they’re certainly not worth an entire toolchain being fail open about it.

Legitimate-ish Use of alloca()

Yesterday I wrote about a legitimate use for variable length arrays. While recently discussing this topic with a co-worker, I also thought of a semi-legitimate use for alloca(), a non-standard “function” for dynamically allocating memory on the stack.

void *alloca(size_t);

I say “function” in quotes because it’s not truly a function and cannot be implemented as a function or by a library. It’s implemented in the compiler and is essentially part of the language itself. It’s a tool allowing a function to manipulate its own stack frame.

Like VLAs, it has the problem that if you’re able to use alloca() safely, then you really don’t need it in the first place. Allocation failures are undetectable and once they happen it’s already too late.

Opaque structs

To set the scene, let’s talk about opaque structs. Suppose you’re writing a C library with a clean interface. It’s set up so that changing your struct fields won’t break the Application Binary Interface (ABI), and callers are largely unable to depend on implementation details, even by accident. To achieve this, it’s likely you’re making use of opaque structs in your interface. Callers only ever receive pointers to library structures, which are handed back into the interface when they’re used. The internal details are hidden away.

/* opaque float stack API */
struct stack *stack_create(void);
void          stack_destroy(struct stack *);
int           stack_push(struct stack *, float v);
float         stack_pop(struct stack *);

Callers can use the API above without ever knowing the layout or even the size of struct stack. Only a pointer to the struct is ever needed. However, in order for this to work, the library must allocate the struct itself. If this is a concern, then the library will typically allow the caller to supply an allocator via function pointers. To see a really slick version of this in practice, check out Lua’s lua_Alloc, a single function allocator API.
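
For illustration, such a hook for this library might look like the following sketch. This is a hypothetical addition, not part of the API above:

/* caller-supplied allocator in the style of lua_Alloc (hypothetical) */
typedef void *(*stack_allocf)(void *ud, void *ptr, size_t osize, size_t nsize);

struct stack *stack_create_custom(stack_allocf alloc, void *ud);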

Suppose we wanted to support something simpler: The library will advertise the size of the struct so the caller can allocate it.

/* API additions */
size_t stack_sizeof(void);
void   stack_init(struct stack *);  // like stack_create()
void   stack_free(struct stack *);  // like stack_destroy()

The implementation of stack_sizeof() would literally just be return sizeof(struct stack);. The caller might use it like so:

size_t len = stack_sizeof();
struct stack *s = malloc(len);
if (s) {
    stack_init(s);
    /* ... */
    stack_free(s);
    free(s);
}

However, that’s still a heap allocation. If this wasn’t an opaque struct, the caller could very naturally use automatic (i.e. stack) allocation, which is likely even preferred in this case. Is this still possible? Idea: Allocate it via a generic char array (VLA in this case).

size_t len = stack_sizeof();
char buf[len];
struct stack *s = (struct stack *)buf;
stack_init(s);

However, this is technically undefined behavior. While a char pointer is special and permitted to alias with anything, the inverse isn’t true. Pointers to other types don’t get a free pass to alias with a char array. Accessing a char value as if it were a different type just isn’t allowed. Why? Because the standard says so. If you want one of the practical reasons: the alignment might be incorrect.

Hmmm, is there another option? Maybe with alloca()!

size_t len = stack_sizeof();
struct stack *s = alloca(len);
stack_init(s);

Since len is expected to be small, it’s not any less safe than the non-opaque alternative. It doesn’t undermine the type system, either, since alloca() has the same semantics as malloc(). The downsides are:

  1. alloca() is non-standard, so this is even less portable than a VLA.

  2. The allocation size is hidden from the compiler behind stack_sizeof(), so the allocation is always truly dynamic.

Optimizing out alloca()?

The second issue can possibly be resolved if the size is available as a compile-time constant. This starts to break the abstraction provided by opaque structs, but they’re still mostly opaque. For example:

/* API additions */
#define STACK_SIZE 24

/* In practice, this would likely be horrific #ifdef spaghetti! */

The caller might use it like this:

struct stack *s = alloca(STACK_SIZE);
stack_init(s);

Now the compiler can see the allocation size, and potentially optimize away the alloca(). As of this writing, Clang (all versions) can optimize these fixed-size alloca() usages, but GCC (9.2) still does not. Here’s a simple example:

#include <alloca.h>

void
foo(void)
{
#ifdef ALLOCA
    volatile char *s = alloca(64);
#else
    volatile char s[64];
#endif
    s[63] = 0;
}

With the char array version, both GCC and Clang produce optimal code:

0000000000000000 <foo>:
   0:	c6 44 24 f8 00       	mov    BYTE PTR [rsp-0x1],0x0
   5:	c3                   	ret

Side note: This is on x86-64 Linux, which uses the System V ABI. The entire array falls within the red zone, so it doesn’t need to be explicitly allocated.

With -DALLOCA, Clang does the same, but GCC does the allocation inefficiently as if it were dynamic:

0000000000000000 <foo>:
   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   4:	48 83 ec 50          	sub    rsp,0x50
   8:	48 8d 44 24 0f       	lea    rax,[rsp+0xf]
   d:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  11:	c6 40 3f 00          	mov    BYTE PTR [rax+0x3f],0x0
  15:	c9                   	leave
  16:	c3                   	ret

It would make a slightly better case for alloca() here if GCC were better about optimizing it. Regardless, this is another neat little trick that I probably wouldn’t use in practice.

Legitimate Use of Variable Length Arrays

The C99 (ISO/IEC 9899:1999) standard of C introduced a new, powerful feature called Variable Length Arrays (VLAs). The size of an array with automatic storage duration (i.e. stack allocated) can be determined at run time. Each instance of the array may even have a different length. Unlike alloca(), they’re a sanctioned form of dynamic stack allocation.

At first glance, VLAs seem convenient, useful, and efficient. Heap allocations have a small cost because the allocator needs to do some work to find or request some free memory, and typically the operation must be synchronized since there may be other threads also making allocations. Stack allocations are trivial and fast by comparison: Allocation is a matter of bumping the stack pointer, and no synchronization is needed.

For example, here’s a function that non-destructively finds the median of a buffer of floats:

/* note: nmemb must be non-zero */
float
median(const float *a, size_t nmemb)
{
    float copy[nmemb];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}

It uses a VLA, copy, as a temporary copy of the input for sorting. The function doesn’t know at compile time how big the input will be, so it cannot just use a fixed size. With a VLA, it efficiently allocates exactly as much memory as needed on the stack.

Well, sort of. If nmemb is too large, then the VLA will silently overflow the stack. By silent I mean that the program has no way to detect it and avoid it. In practice, it can be a lot louder, from a segmentation fault in the best case, to an exploitable vulnerability in the worst case: stack clashing. If an attacker can control nmemb, they might choose a value that causes copy to overlap with other allocations, giving them control over those values as well.

If there’s any risk that nmemb is too large, it must be guarded.

#define COPY_MAX 4096

float
median(const float *a, size_t nmemb)
{
    if (nmemb > COPY_MAX)
        abort();  /* or whatever */
    float copy[nmemb];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}

However, if median is expected to safely accommodate COPY_MAX elements, it may as well always allocate an array of this size. If it can’t, then that’s not a safe maximum.

float
median(const float *a, size_t nmemb)
{
    if (nmemb > COPY_MAX)
        abort();
    float copy[COPY_MAX];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}

And rather than abort, you might still want to support arbitrary input sizes:

float
median(const float *a, size_t nmemb)
{
    float buf[COPY_MAX];
    float *copy = buf;
    if (nmemb > COPY_MAX) {
        copy = malloc(sizeof(a[0]) * nmemb);
        if (!copy)
            return NAN;  /* allocation failed */
    }
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    float result = copy[nmemb / 2];
    if (copy != buf)
        free(copy);
    return result;
}

Then small inputs are fast, but large inputs still work correctly. This is called small size optimization.

If the correct solution ultimately didn’t use a VLA, then what good are they? In general, VLAs are not useful. They’re time bombs. VLAs are nearly always the wrong choice. You must be careful to check that they don’t exceed some safe maximum, and there’s no reason not to always use the maximum. This problem was recognized for the C11 standard (ISO/IEC 9899:2011), where VLAs were made optional. A program containing a VLA will not necessarily compile on a C11 compiler.

Some purists also object to a special exception required for VLAs: The sizeof operator may evaluate its operand, and so it does not always evaluate to a compile-time constant. If the operand contains a VLA, then the result depends on a run-time value.
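
For example, the result of sizeof here isn’t known until the function runs:

size_t
vla_size(size_t n)
{
    float vla[n];
    return sizeof(vla);  /* evaluated at run time: n * sizeof(float) */
}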

Because they’re optional, it’s best to avoid even trivial VLAs like this:

float
median(const float *a, size_t nmemb)
{
    int max = 4096;
    if (nmemb > max)
        abort();
    float copy[max];
    memcpy(copy, a, sizeof(a[0]) * nmemb);
    qsort(copy, nmemb, sizeof(copy[0]), floatcmp);
    return copy[nmemb / 2];
}

It’s easy to prove that the array length is always 4096, but technically this is still a VLA. That would still be true even if max were const int, because the array length still isn’t an integer constant expression.
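
Unlike in C++, a const variable in C is never a constant expression. To get a plain array, the length must be an integer constant expression, such as a macro or an enumeration constant:

const int max = 4096;
float vla[max];    /* still a VLA in C */

enum { MAX = 4096 };
float fixed[MAX];  /* integer constant expression: a plain array */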

VLA overhead

Finally, there’s also the problem that VLAs just aren’t as efficient as you might hope. A function that does dynamic stack allocation requires additional stack management. It must track additional memory addresses and will require extra instructions.

void
fixed(int n)
{
    if (n <= 1<<14) {
        volatile char buf[1<<14];
        buf[n - 1] = 0;
    }
}

void
dynamic(int n)
{
    if (n <= 1<<14) {
        volatile char buf[n];
        buf[n - 1] = 0;
    }
}

Compiled with gcc -Os and viewed with objdump -d -Mintel:

0000000000000000 <fixed>:
   0:	81 ff 00 40 00 00    	cmp    edi,0x4000
   6:	7f 19                	jg     21 <fixed+0x21>
   8:	ff cf                	dec    edi
   a:	48 81 ec 88 3f 00 00 	sub    rsp,0x3f88
  11:	48 63 ff             	movsxd rdi,edi
  14:	c6 44 3c 88 00       	mov    BYTE PTR [rsp+rdi*1-0x78],0x0
  19:	48 81 c4 88 3f 00 00 	add    rsp,0x3f88
  20:	c3                   	ret    
  21:	c3                   	ret    

0000000000000022 <dynamic>:
  22:	81 ff 00 40 00 00    	cmp    edi,0x4000
  28:	7f 23                	jg     4d <dynamic+0x2b>
  2a:	55                   	push   rbp
  2b:	48 63 c7             	movsxd rax,edi
  2e:	ff cf                	dec    edi
  30:	48 83 c0 0f          	add    rax,0xf
  34:	48 63 ff             	movsxd rdi,edi
  37:	48 83 e0 f0          	and    rax,0xfffffffffffffff0
  3b:	48 89 e5             	mov    rbp,rsp
  3e:	48 89 e2             	mov    rdx,rsp
  41:	48 29 c4             	sub    rsp,rax
  44:	c6 04 3c 00          	mov    BYTE PTR [rsp+rdi*1],0x0
  48:	48 89 d4             	mov    rsp,rdx
  4b:	c9                   	leave  
  4c:	c3                   	ret    
  4d:	c3                   	ret    

Note the use of a base pointer, rbp and leave, in the second function in order to dynamically track the stack frame. (Hmm, in both cases GCC could easily shave off the extra ret at the end of each function. Missed optimization?)

The story is even worse when stack clash protection is enabled (-fstack-clash-protection). The compiler generates extra code to probe every page of allocation in case one of those pages is a guard page. That’s also more complex when the allocation is dynamic. The VLA version more than doubles in size (from 44 bytes to 101 bytes)!

Safe and useful variable length arrays

There is one convenient, useful, and safe form of VLAs: a pointer to a VLA. It’s convenient and useful because it makes some expressions simpler. It’s safe because there’s no arbitrary stack allocation.

Pointers to arrays are a rare sight in C code, whether variable length or not. That’s because, the vast majority of the time, C programmers implicitly rely on array decay: arrays quietly “decay” into pointers to their first element the moment you do almost anything with them. It’s also because pointers to arrays are really awkward to use.

For example, the function sum3 takes a pointer to an array of exactly three elements.

int
sum3(int (*array)[3])
{
    return (*array)[0] + (*array)[1] + (*array)[2];
}

The parentheses are necessary because, without them, array would be an array of pointers — a type far more common than a pointer to an array. To index into the array, first the pointer to the array must be dereferenced to the array value itself, then this intermediate array is indexed, triggering array decay. Conceptually there’s quite a bit to it, but, in practice, it’s all as efficient as the conventional version of sum3 that accepts a plain int *.

The caller must take the address of an array of exactly the right length:

int buf[] = {1, 2, 4};
int r = sum3(&buf);

Or if dynamically allocating the array:

int (*array)[3] = malloc(sizeof(*array));
(*array)[0] = 1;
(*array)[1] = 2;
(*array)[2] = 4;
int r = sum3(array);
free(array);

The mandatory parentheses and strict type requirements make this awkward and rarely useful. However, with VLAs perhaps it’s worth the trouble! Consider an NxN matrix expressed using a pointer to a VLA:

int n = /* run-time value */;
/* TODO: Check for integer overflow? See note. */
float (*identity)[n][n] = malloc(sizeof(*identity));
if (identity) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            (*identity)[y][x] = x == y;
        }
    }
}

When indexing, the parentheses are weird, but the indices have the convenient [y][x] format. The non-VLA alternative is to compute a 1D index manually from 2D indices (y*n+x):

int n = /* run-time value */;
/* TODO: Check for integer overflow. */
float *identity = malloc(sizeof(*identity) * n * n);
if (identity) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            identity[y*n + x] = x == y;
        }
    }
}

Note: What’s the behavior in the VLA version when n is so large that sizeof(*identity) doesn’t fit in a size_t? I couldn’t find anything in the standard about it. GCC and Clang both implicitly check for overflow and, when it occurs, the expression evaluates to zero. The undefined behavior sanitizer doesn’t complain when this happens.
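
For illustration, here’s one way to write such a guard explicitly (a sketch, assuming n is positive; SIZE_MAX comes from stdint.h):

/* Would n*n*sizeof(float) overflow size_t? */
int
nn_overflows(int n)
{
    return (size_t)n > SIZE_MAX / sizeof(float) / (size_t)n;
}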

So VLAs might be worth the trouble when using pointers to multi-dimensional, dynamically-allocated arrays. However, I’m still judicious about their use due to reduced portability. As a practical example, MSVC famously does not, and likely never will, support VLAs.

No, PHP Doesn't Have Closures

The PHP programming language is bizarre and, if nothing else, worthy of anthropological study. The only consistent property of PHP is how badly it’s designed, yet it somehow remains widely popular. There’s a social dynamic at play here that science has yet to unlock.

I don’t say this because I hate PHP. There’s no reason for that: I don’t write programs in PHP, never had to use it, and don’t expect to ever need it. Despite this, I just can’t look away from PHP in the same way I can’t look away from a car accident.

I recently came across a link to the PHP manual, and morbid curiosity caused me to look through it. It’s fun to pick an arbitrary section of the manual and see how many crazy design choices I can spot, or at least see what sort of strange terminology the manual has invented to describe a common concept. This time around, one such section was on anonymous functions, including closures. It was even worse than I expected.

In some circumstances, closures can be a litmus test. Closure semantics are not complex, but they’re subtle and a little tricky until you get the hang of them. If you’re interviewing a candidate, toss in a question or two about closures. Either they’re familiar and get it right away, or they’re unfamiliar and get nothing right. The latter is when it’s most informative. PHP itself falls clearly into the latter. Not only that, the example of a “closure” in the manual demonstrates a “closure” closing over a global variable!

I’d been told for years that PHP has closures, and I took that claim at face value. In fact, PHP has had “closures” since 5.3.0, released in June 2009, so I’m over a decade late in investigating it. However, as far as I can tell, nobody’s ever pointed out that PHP “closures” are, in fact, not actually closures.

Anonymous functions and closures

Before getting into why they’re not closures, let’s go over how it works, starting with a plain old anonymous function. PHP does have anonymous functions — the easy part.

function foo() {
    return function() {
        return 1;
    };
}

The function foo returns a function that returns 1. In PHP 7 you can call the returned function immediately like so:

$r = foo()();  // $r = 1

In PHP 5 this is a syntax error because, well, it’s PHP and its parser is about as clunky as Matlab’s.

In a well-designed language, you’d expect that this could also be a closure. That is, it closes over local variables, and the function may continue to access those variables later. For example:

function bar($n) {
    return function() {
        return $n;
    };
}

bar(1)();  // error: Undefined variable: n

This fails because you must explicitly tell PHP what variables you intend to access inside the anonymous function with use:

function bar($n) {
    return function() use ($n) {
        return $n;
    };
}

bar(1)();  // 1

If this actually closed over $n, this would be a legitimate closure. Having to tell the language exactly which variables are being closed over would be pretty dumb, but it still meets the definition of a closure.

But here’s the catch: It’s not actually closing over any variables. The names listed in use are actually extra, hidden parameters bound to the current value of those variables. In other words, this is nothing more than partial function evaluation.

function bar($n) {
    $f = function() use ($n) {
        return $n;
    };
    $n++;  // never used!
    return $f;
}

$r = bar(1)();  // $r = 1

Here’s the equivalent in JavaScript using the bind() method:

function bar(n) {
    let f = function(m) {
        return m;
    };
    return f.bind(null, n);
}

This is actually more powerful than PHP’s “closures” since any arbitrary expression can be used for the bound argument. In PHP it’s limited to a couple of specific forms. If JavaScript didn’t have proper closures, and instead we all had to rely on bind(), nobody would claim that JavaScript had closures. It shouldn’t be different for PHP.
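
For contrast, a real JavaScript closure needs no explicit binding, and it observes later changes to the captured variable:

function bar(n) {
    let f = function() {
        return n;  // genuinely closes over the variable n
    };
    n++;           // f sees this change
    return f;
}

let r = bar(1)();  // r = 2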

References

PHP does have references, and binding a reference to an anonymous function is kinda, sorta like a closure. But that’s still just partial function evaluation, only now the bound argument is a reference.

Here’s how to tell these reference captures aren’t actually closures: They work equally well for global variables as for local variables. So it’s still not closing over a lexical environment, just binding a reference to a parameter.

$counter = 0;

function bar($n) {
    global $counter;
    $f = function() use (&$n, &$counter) {
        $counter++;
        return $n;
    };
    $n++;  // now has an effect
    return $f;
}

$r = bar(1)();  // $r = 2, $counter = 1

In the example above, there’s no difference between $n, a local variable, and $counter, a global variable. It wouldn’t make sense for a closure to close over a global variable.

Emacs Lisp partial function application

Emacs Lisp famously didn’t get lexical scope, and therefore closures, until fairly recently. It was — and still is by default — a dynamic scope oddball. However, it’s long had an apply-partially function for partial function application. It returns a closure-like object, and did so when the language didn’t have proper closures. So it can be used to create a “closure” just like PHP:

(defun bar (n)
  (apply-partially (lambda (m) m) n))

This works regardless of lexical or dynamic scope because this construct isn’t really a closure, just as PHP’s isn’t a closure. PHP simply builds this partial function evaluation directly into the language with its special use syntax.

Monkey see, monkey do

Why does the shell command language use sigils? Because it’s built atop interactive command line usage, where bare words are taken literally and variables are the exception. Why does Perl use sigils? Because it was originally designed as an alternative to shell scripts, so it mimicked that syntax. Why does PHP use sigils? Because Perl did.

The situation with closures follows that pattern, and it comes up all over PHP. Its designers see a feature in another language, but don’t really understand its purpose or semantics. So when they attempt to add that feature to PHP, they get it disastrously wrong.

Keyringless GnuPG

This article was discussed on Hacker News.

My favorite music player is Audacious. It follows the Winamp Classic tradition of not trying to manage my music library. Instead it waits patiently for me to throw files and directories at it. These selections will be informally grouped into transient, disposable playlists of whatever I fancy that day.

This matters to me because my music collection is the result of around 25 years of hoarding music files from various sources including CD rips, Napster P2P sharing, and, most recently, YouTube downloads. It’s not well-organized, but it’s organized well enough. Each album has its own directory, and related albums are sometimes grouped together under a directory for a particular artist.

Over the years I’ve tried various music players, and some have either wanted to manage this library or hide the underlying file-organized nature of my collection. Both situations are annoying because I really don’t want or need that abstraction. I’m going just fine thinking of my music library in terms of files, thank you very much. Same goes for ebooks.

GnuPG is like a media player that wants to manage your whole music library. Rather than MP3s, it’s crypto keys on a keyring. Nearly every operation requires keys that have been imported into the keyring. Until GnuPG 2.2.8 (June 2018), which added the --show-keys command, you couldn’t even be sure what you were importing until after it was already imported. Hopefully it wasn’t garbage.

GnuPG does have a pretty good excuse. It’s oriented around the Web of Trust model, and it can’t follow this model effectively without having all the keys at once. However, even if you don’t buy into the Web of Trust, the GnuPG interface still requires you to play by its rules. Sometimes I’ve got a message, a signature, and a public key and I just want to verify that they’re all consistent with each other, damnit.

$ gpg --import foo.asc
gpg: key 1A719EF63AEB2CFE: public key "foo" imported
gpg: Total number processed: 1
gpg:               imported: 1
$ gpg --verify --trust-model always message.txt.sig message.txt
gpg: Signature made Fri 09 Aug 2019 05:44:43 PM EDT
gpg:                using EDDSA key ...1A719EF63AEB2CFE
gpg: Good signature from "foo" [unknown]
gpg: WARNING: Using untrusted key!
$ gpg --batch --yes --delete-key 1A719EF63AEB2CFE

Three commands and seven lines of output when one of each would do. Plus there’s a false warning: Wouldn’t an “always” trust model mean that this key is indeed trusted?

Signify

Compare this to OpenBSD’s signify. There’s no keyring, and it’s up to the user — or the program shelling out to signify — to supply the appropriate key. It’s like the music player that just plays whatever I give it. Here’s a simplified usage overview:

signify -G [-c comment] -p pubkey -s seckey
signify -S [-x sigfile] -s seckey -m message
signify -V [-x sigfile] -p pubkey -m message

When generating a new keypair (-G), the user must choose the destination files for the public and secret keys. When signing a message (a file), the user must supply the secret key and the message. When verifying a file, the user must supply the public key and the message. This is a popular enough model that other, compatible implementations with the same interface have been developed.

Signify is deliberately incompatible with OpenPGP and uses its own simpler, and less featureful, format. Wouldn’t it be nice to have a similar interface to verify OpenPGP signatures?

SimpleGPG

Well, I thought so. So I put together a shell script that wraps GnuPG and provides such an interface:

SimpleGPG

The interface is nearly identical to signify, and the GnuPG keyring is hidden away as if it didn’t exist. The main difference is that the keys and signatures produced and consumed by this tool are fully compatible with OpenPGP. You could use this script without requiring anyone else to adopt something new or different.

To avoid touching your real keyring, the script creates a temporary keyring directory each time it’s run. The GnuPG option --homedir instructs it to use this temporary keyring and ignore the usual one. The temporary keyring is destroyed when the script exits. This is kind of clunky, but there’s no way around it.

Verification looks roughly like this in the script:

$ tmp=$(mktemp -d simplegpg-XXXXXX)
$ gpg --homedir $tmp --import foo.asc
$ gpg --homedir $tmp --verify message.txt.sig message.txt
$ rm -rf $tmp

Generating a key is trivial, and there’s only a prompt for the protection passphrase. Like signify, it will generate an Ed25519 key and all outputs are ASCII-armored.

$ simplegpg -G -p keyname.asc -s keyname.pgp
passphrase:
passphrase (confirm):

Since signify doesn’t have a concept of a user ID for a key, just an “untrusted comment”, the user ID is not emphasized here. The default user ID will be “simplegpg key”, so, if you plan to share the key with regular GnuPG users who will need to import it into a keyring, you probably want to use -c to give it a more informative name.

Unfortunately, due to GnuPG’s very limited, keyring-oriented interface, key generation is about three times slower than it should be. That’s because the passphrase is run through the String-to-Key (S2K) algorithm three times:

  1. Immediately after the key is generated, the passphrase is converted to a key, the key is encrypted, and it’s put onto the temporary keyring.

  2. When exporting, the key passphrase is again run through the S2K to get the protection key to decrypt it.

  3. The export format uses a slightly different S2K algorithm, so this export S2K is now used to create yet another protection key.

Technically the second could be avoided since gpg-agent, which is always required, could be holding the secret key material. As far as I can tell, gpg-agent simply does not learn freshly-generated keys. I do not know why this is the case.

This is related to another issue. If you’re accustomed to GnuPG, you may notice that the passphrase prompt didn’t come from pinentry, a program specialized for passphrase prompts. GnuPG normally uses it for this. Instead, the script handles the passphrase prompt and passes the passphrase to GnuPG (via a file descriptor). This would not be necessary if gpg-agent did its job. Without this part of the script, users are prompted three times, via pinentry, for their passphrase when generating a key.
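
The mechanism is something along these lines, with the elided command and file name standing in for the script’s actual invocation:

$ gpg --homedir "$tmp" --batch --pinentry-mode loopback \
      --passphrase-fd 3 ... 3<passfile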

When signing messages, the passphrase prompt comes from pinentry since it’s initiated by GnuPG.

$ simplegpg -S -s keyname.pgp -m message.txt
passphrase:

This will produce message.txt.sig with an OpenPGP detached signature.

The passphrase prompt is for --import, not --detach-sign. As with key generation, the S2K is run more than necessary: twice instead of once. First to generate the decryption key, then a second time to generate a different encryption key for the keyring since the export format and keyring use different algorithms. Ugh.

But at least gpg-agent does its job this time, so only one passphrase prompt is necessary. In general, a downside of these temporary keyrings is that gpg-agent treats each as different keys, and you will need to enter your passphrase once for each message signed. Just like signify.

Verification, of course, requires no prompting and no S2K.

$ simplegpg -V -p keyname.asc -m message.txt

That’s all there is to keyringless OpenPGP signatures. Since I’m not interested in the Web of Trust or keyservers, I wish GnuPG was more friendly to this model of operation.

passphrase2pgp

I mentioned that SimpleGPG is fully compatible with other OpenPGP systems. This includes my own passphrase2pgp, where your secret key is stored only in your brain. No need for a secret key file. In the time since I first wrote about it, passphrase2pgp has gained the ability to produce signatures itself!

I’ve got my environment set up — $REALNAME, $EMAIL, and $KEYID per the README — so I don’t need to supply a user ID argument, nor will I be prompted to confirm my passphrase since it’s checked against a known fingerprint. Generating the public key, for sharing, looks like this:

$ passphrase2pgp -K --armor --public >keyname.asc

Or just:

$ passphrase2pgp -ap >keyname.asc

Like with signify and SimpleGPG, to sign a message I’m prompted for my passphrase. It takes longer since the “S2K” here is much stronger by necessity. The passphrase is used to generate the secret key, then from that the signature on the message:

$ passphrase2pgp -S message.txt

For the SimpleGPG user on the other side it all looks the same as before:

$ simplegpg -V -p keyname.asc -m message.txt

I’m probably going to start signing my open source software releases, and this is how I intend to do it.

The Long Key ID Collider

Over the last couple weeks I’ve spent a lot more time working with OpenPGP keys. It’s a consequence of polishing my passphrase-derived PGP key generator. I’ve tightened up the internals, and it’s enabled me to explore the corners of the format, try interesting things, and observe how various OpenPGP implementations respond to weird inputs.

For one particularly cool trick, take a look at these two (private) keys I generated yesterday. Here’s the first:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAjJgvdh3N2pegXPEuMe25nJ3gI7k8gEgQvCor
AExppm4AAQC0TNsuIRHkxaGjLNN6hQowRMxLXAMrkZfMcp1DTG8GBg1TzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAGq0APwOtCFVCxpv
d/gzKUg0SkdygmriV1UmrQ+KYx9dhzC6xwEAqwDGsSgSbCqPdkwqi/tOn+MwZ5N9
jYxy48PZGZ2V3ws=
=bBGR
-----END PGP PRIVATE KEY BLOCK-----

And the second:

-----BEGIN PGP PRIVATE KEY BLOCK-----

xVgEXTU3gxYJKwYBBAHaRw8BAQdAzjSPKjpOuJoLP6G0z7pptx4sBNiqmgEI0xiH
Z4Xb16kAAP0Qyon06UB2/gOeV/KjAjCi91MeoUd7lsA5yn82RR5bOxAkzQ9udWxs
cHJvZ3JhbS5jb23CXgQTFggAEAUCXTU3gwkQmSpe7h0QSfoAAEv4AQDLRqx10v3M
bwVnJ8BDASAOzrPw+Rz1tKbjG9r45iE7NQEAhm9QVtFd8SN337kIWcq8wXA6j1tY
+UeEsjg+SHzkqA4=
=QnLn
-----END PGP PRIVATE KEY BLOCK-----

Concatenate these and then import them into GnuPG to have a look at them. To avoid littering in your actual keyring, especially with private keys, use the --homedir option to set up a temporary keyring. I’m going to omit that option in the examples.

$ gpg --import < keys.asc
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: key 992A5EEE1D1049FA: public key "nullprogram.com" imported
gpg: key 992A5EEE1D1049FA: secret key imported
gpg: Total number processed: 2
gpg:               imported: 2
gpg:       secret keys read: 2
gpg:   secret keys imported: 2

The user ID is “nullprogram.com” since I made these and that’s me taking credit. “992A5EEE1D1049FA” is called the long key ID: a 64-bit value that identifies the key. It’s the lowest 64 bits of the full key ID, a 160-bit SHA-1 hash. In the old days everyone used a short key ID to identify keys, which was the lowest 32 bits of the key ID. For these keys, that would be “1D1049FA”. However, this was deemed way too short, and everyone has since switched to long key IDs, or even the full 160-bit key ID.

The key ID is nothing more than a SHA-1 hash of the key creation date — unsigned 32-bit unix epoch seconds — and the public key material. So secret keys have the same key ID as their associated public key. This makes sense since they’re a key pair and they go together.
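
Concretely, here’s a sketch of that computation in Go, where body is the raw Public-Key packet body (version, creation date, algorithm, and key material):

import "crypto/sha1"

// Full key ID per RFC 4880: SHA-1 over the octet 0x99, a two-octet
// big-endian length, and the Public-Key packet body. The long key ID
// is the lowest 64 bits of that hash.
func longKeyID(body []byte) []byte {
	h := sha1.New()
	h.Write([]byte{0x99, byte(len(body) >> 8), byte(len(body))})
	h.Write(body)
	fullID := h.Sum(nil) // 20 bytes
	return fullID[12:20]
}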

Look closely and you’ll notice that both keypairs have the same long key ID. If you hadn’t already guessed from the title of this article, these are two different keys with the same long key ID. In other words, I’ve created a long key ID collision. The GnuPG --list-keys command prints the full key ID since it’s so important:

$ gpg --list-keys
---------------------
pub   ed25519 2019-07-22 [SCA]
      A422F8B0E1BF89802521ECB2992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

pub   ed25519 2019-07-22 [SCA]
      F43BC80C4FC2603904E7BE02992A5EEE1D1049FA
uid           [ unknown] nullprogram.com

I was only targeting the lower 64 bits, but I actually managed to collide the lowest 68 bits by chance. So a long key ID still isn’t enough to truly identify any particular key.

This isn’t news, of course. Nor am I the first person to create a long key ID collision. In 2013, David Leon Gil published a long key ID collision for two 4096-bit RSA public keys. However, that is the only other example I was able to find. He did not include the private keys and did not elaborate on how he did it. I know he did generate viable keys, not just garbage for the public key portions, since they’re both self-signed.

Creating these keys was trickier than I had anticipated, and there’s an old, clever trick that makes it work. Building atop the work I did for passphrase2pgp, I created a standalone tool that will create a long key ID collision and print the two keypairs to standard output:

Example usage:

$ go get -u github.com/skeeto/pgpcollider
$ pgpcollider --verbose > keys.asc

This can take up to a day to complete when run like this. The tool can optionally coordinate many machines — see the --server / -S and --client / -C options — to work together, greatly reducing the total time. It took around 4 hours to create the keys above on a single machine, generating around 1 billion extra keys in the process. As discussed below, I actually got lucky that it only took 1 billion. If you modify the program to do short key ID collisions, it only takes a few seconds.

The rest of this article is about how it works.

Birthday Attacks

An important detail is that this technique doesn’t target any specific key ID. Cloning someone’s long key ID is still very expensive. No, this is a birthday attack. To find a collision in a space of 2^64, on average I only need to generate 2^32 samples — the square root of that space. That’s perfectly feasible on a regular desktop computer. To collide long key IDs, I need only generate about 4 billion IDs and efficiently do membership tests on that set as I go.

That last step is easier said than done. Naively, that might look like this (pseudo-code):

seen := map of long key IDs to keys
loop forever {
    key := generateKey()
    longID := key.ID[12:20]
    if longID in seen {
        output seen[longID]
        output key
        break
    } else {
        seen[longID] = key
    }
}

Consider the size of that map. Each long ID is 8 bytes, and we expect to store around 2^32 of them. That’s at minimum 32 GB of storage just to track all the long IDs. The map itself is going to have some overhead, too. Since these are literally random lookups, this all mostly needs to be in RAM or else lookups are going to be very slow and impractical.

And I haven’t even counted the keys yet. As a saving grace, these are Ed25519 keys, so that’s 32 bytes for the public key and 32 bytes for the private key, which I’ll need if I want to make a self-signature. (The signature itself will be larger than the secret key.) That’s around 256GB more storage, though at least this can be stored on the hard drive. However, to address these from the map I’d need at least 38 bits, plus some more in case it goes over. Just call it another 8 bytes.

So that’s, at a bare minimum, 64GB of RAM plus 256GB of other storage. Since nothing is ideal, we’ll need more than this. This is all still feasible, but will require expensive hardware. We can do a lot better.

Keys from seeds

The first thing you might notice is that we can jettison that 256GB of storage by being a little more clever about how we generate keys. Since we don’t actually care about the security of these keys, we can generate each key from a seed much smaller than the key itself. Instead of using 8 bytes to reference a key in storage, just use those 8 bytes to store the seed used to make the key.

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    key := generateKey(seed)
    longID := key.ID[12:20]
    if longID in seen {
        output generateKey(seen[longID])
        output key
        break
    } else {
        seen[longID] = seed
    }
}

I’m incrementing a counter to generate the seeds because each seed really must be unique; I don’t want the birthday paradox to apply to my seeds. The seeds themselves feed SplitMix64, the PRNG I’m using since I learned it’s the fastest for Go, so a simple increment to generate seeds is perfectly fine.
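
For reference, SplitMix64 is tiny. These are the constants from the reference implementation:

// splitmix64 advances the state and returns the next well-mixed value.
func splitmix64(state *uint64) uint64 {
	*state += 0x9e3779b97f4a7c15
	z := *state
	z = (z ^ z>>30) * 0xbf58476d1ce4e5b9
	z = (z ^ z>>27) * 0x94d049bb133111eb
	return z ^ z>>31
}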

Ultimately, this still uses utterly excessive amounts of memory. Wouldn’t it be crazy if we could somehow get this 64GB map down to just a few MBs of RAM? Well, we can!

Rainbow tables

For decades, password crackers have faced a similar problem. They want to precompute the hashes for billions of popular passwords so that they can efficiently reverse those password hashes later. However, storing all those hashes would be unnecessarily expensive, or even infeasible.

So they don’t. Instead they use rainbow tables. Password hashes are chained together into a hash chain, where a password hash leads to a new password, then to a hash, and so on. Then they store only the beginning and the end of each chain.

To lookup a hash in the rainbow table, run the hash chain algorithm starting from the target hash and, for each hash, check if it matches the end of one of the chains. If so, recompute that chain and note the step just before the target hash value. That’s the corresponding password.

For example, suppose the password “foo” hashes to 9bfe98eb, and we have a reduction function that maps a hash to some password. In this case, it maps 9bfe98eb to “bar”. A trivial reduction function could just be an index into a list of passwords. A hash chain starting from “foo” might look like this:

foo -> 9bfe98eb -> bar -> 27af0841 -> baz -> d9d4bbcb

In reality a chain would be a lot longer. Another chain starting from “apple” might look like this:

apple -> 7bbc06bc -> candle -> 82a46a63 -> dog -> 98c85d0a

We only store the tuples (foo, d9d4bbcb) and (apple, 98c85d0a) in our database. If the chains had been one million hashes long, we’d still only store those two tuples. That’s literally a million-to-one compression ratio!

Later on we’re faced with reversing the hash 27af0841, which isn’t listed directly in the database. So we run the chain forward from that hash until either we hit the maximum chain length (i.e. the password is not in the table), or we recognize a hash:

27af0841 -> baz -> d9d4bbcb

That d9d4bbcb hash is listed as being in the “foo” hash chain. So we regenerate that hash chain to discover that “bar” leads to 27af0841. Password cracked!

Collider rainbow table

My collider works very similarly. A hash chain works like this: Start with a 64-bit seed as before, generate a key, get the long key ID, then use the long key ID as the seed for the next key.

There’s one big difference. In the rainbow table the purpose is to run the hash function backwards by looking at the previous step in the chain. For the collider, I want to know if any of the hash chains collide. So long as each chain starts from a unique seed, it would mean we’ve found two different seeds that lead to the same long key ID.

Alternatively, it could be two different seeds that lead to the same key, which wouldn’t be useful, but that’s trivial to avoid.

A simple and efficient way to check if two chains contain the same sequence is to stop them at the same place in that sequence. Rather than run the hash chains for some fixed number of steps, they stop when they reach a distinguishing point. In my collider a distinguishing point is where the long key ID ends with at least N 0 bits, where N determines the average chain length. I chose 17 bits.
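
Treating the long key ID as a 64-bit integer, the test is a single mask. A sketch in Go:

// distinguished reports whether the low 17 bits are all zero, making
// chains 2^17 keys long on average.
func distinguished(longID uint64) bool {
	return longID&((1<<17)-1) == 0
}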

func computeChain(seed) {
    loop forever {
        key := generateKey(seed)
        longID := key.ID[12:20]
        if distinguished(longID) {
            return longID
        }
        seed = longID
    }
}

If two different hash chains end on the same distinguishing point, they’re guaranteed to have collided somewhere in the middle.

To determine where two chains collided, regenerate each chain and find the first long key ID they have in common. The seeds from the step just before produce the colliding keys.
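
In the same pseudo-code style, where step() maps a seed to the long key ID of the key it generates, and chainLength() counts the steps from a seed to its distinguishing point:

func findCollision(seedA, seedB) {
    // align both chains to the same distance from the
    // shared distinguishing point
    lenA := chainLength(seedA)
    lenB := chainLength(seedB)
    while lenA > lenB { seedA = step(seedA); lenA-- }
    while lenB > lenA { seedB = step(seedB); lenB-- }
    // walk in lockstep until the next IDs match
    while step(seedA) != step(seedB) {
        seedA = step(seedA)
        seedB = step(seedB)
    }
    output generateKey(seedA)
    output generateKey(seedB)
}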

counter := rand64()
seen := map of long key IDs to 64-bit seeds
loop forever {
    seed := counter
    counter++
    longID := computeChain(seed)
    if longID in seen {
        output findCollision(seed, seen[longID])
        break
    } else {
        seen[longID] = seed
    }
}

Hash chain computation is embarrassingly parallel, so the load can be spread efficiently across CPU cores. With these rainbow(-like) tables, my tool can generate and track billions of keys in mere megabytes of memory. The additional computational cost is the time it takes to generate a couple more chains than otherwise necessary.

Predictable, Passphrase-Derived PGP Keys

tl;dr: passphrase2pgp.

One of my long-term concerns has been losing my core cryptographic keys, or just not having access to them when I need them. I keep my important data backed up, and if that data is private then I store it encrypted. My keys are private, but how am I supposed to encrypt them? The chicken or the egg?

The OpenPGP solution is to (optionally) encrypt secret keys using a key derived from a passphrase. GnuPG prompts the user for this passphrase when generating keys and when using secret keys. This protects the keys at rest, and, with some caution, they can be included as part of regular backups. The OpenPGP specification, RFC 4880, has many options for deriving a key from this passphrase, called String-to-Key, or S2K, algorithms. None of the options are great.

In 2012, I selected the strongest S2K configuration and, along with a very strong passphrase, put my GnuPG keyring on the internet as part of my public dotfiles repository. It was a kind of super-backup that would guarantee their availability anywhere I’d need them.

My timing was bad because, with the release of GnuPG 2.1 in 2014, GnuPG fundamentally changed its secret keyring format. S2K options are now (quietly!) ignored when deriving the protection keys. Instead it auto-calibrates to much weaker settings. With this new version of GnuPG, I could no longer update the keyring in my dotfiles repository without significantly downgrading its protection.

By 2017 I was pretty irritated with the whole situation. I let my OpenPGP keys expire, and then I wrote my own tool to replace the only feature of GnuPG I was actively using: encrypting my backups with asymmetric encryption. One of its core features is that the asymmetric keypair can be derived from a passphrase using a memory-hard key derivation function (KDF). Attackers must commit a significant quantity of memory (expensive) when attempting to crack the passphrase, making the passphrase that much more effective.

Since the asymmetric keys themselves, not just the keys protecting them, are derived from a passphrase, I never need to back them up! They’re also always available whenever I need them. My keys are essentially stored entirely in my brain as if I was a character in a William Gibson story.

Tackling OpenPGP key generation

At the time I had expressed my interest in having this feature for OpenPGP keys. It’s something I’ve wanted for a long time. I first took a crack at it in 2013 (now the old-version branch) for generating RSA keys. RSA isn’t that complicated but it’s very easy to screw up. Since I was rolling it from scratch, I didn’t really trust myself not to subtly get it wrong. Plus I never figured out how to self-sign the key. GnuPG doesn’t accept secret keys that aren’t self-signed, so it was never useful.

I took another crack at it in 2018 with a much more brute force approach. When a program needs to generate keys, it will either read from /dev/u?random or, on more modern systems, call getentropy(3). These are all ultimately system calls, and I know how to intercept those with Ptrace. If I want to control key generation for any program, not just GnuPG, I could intercept these inputs and replace them with the output of a CSPRNG keyed by a passphrase.

Keyed: Linux Entropy Interception

In practice this doesn’t work at all. Real programs like GnuPG and OpenSSH’s ssh-keygen don’t rely solely on these entropy inputs. They also grab entropy from other places, like getpid(2), gettimeofday(2), and even extract their own scheduler and execution time noise. Without modifying these programs I couldn’t realistically control their key generation.

Besides, even if it did work, it would still be fragile and unreliable since these programs could always change how they use the inputs. So, ultimately, it was more of an experiment than something practical.

passphrase2pgp

For regular readers, it’s probably obvious that I recently learned Go. While searching for good project ideas for cutting my teeth, I noticed that Go’s “extended” standard library has a lot of useful cryptographic support, so the idea of generating the keys myself seemed worth revisiting.

Something else also happened since my previous attempt: The OpenPGP ecosystem now has widespread support for elliptic curve cryptography. So instead of RSA, I could generate a Curve25519 keypair, which, by design, is basically impossible to screw up. Not only would I be generating keys on my own terms, I’d be doing it in style, baby.

There are two different ways to use Curve25519:

  1. Digital signatures: Ed25519 (EdDSA)
  2. Diffie–Hellman (encryption): X25519 (ECDH)

In GnuPG terms, the first would be a “sign only” key and the second is an “encrypt only” key. But can’t you usually do both after you generate a new OpenPGP key? If you’ve used GnuPG, you’ve probably seen the terms “primary key” and “subkey”, but you probably haven’t had to think about them since it’s all usually automated.

The primary key is the one associated directly with your identity. It’s always a signature key. The OpenPGP specification says this is a signature key only by convention, but, practically speaking, it really must be since signatures are what hold everything together. Like packaging tape.

If you want to use encryption, independently generate an encryption key, then sign that key with the primary key, binding that key as a subkey to the primary key. This all happens automatically with GnuPG.

Fun fact: Two different primary keys can have the same subkey. Anyone could even bind any of your subkeys to their primary key! They only need to sign the public key! Though, of course, they couldn’t actually use your key since they’d lack the secret key. It would just be really confusing, and could, perhaps in certain situations, even cause some OpenPGP clients to malfunction. (Note to self: This demands investigation!)

It’s also possible to have signature subkeys. What good is that? Paranoid folks will keep their primary key only on a secure, air-gapped machine, then use only subkeys on regular systems. The subkeys can be revoked and replaced independently of the primary key if something were to go wrong.

In Go, generating an X25519 key pair is this simple (yes, it actually takes array pointers, which is rather weird):

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/curve25519"
)

func main() {
	var seckey, pubkey [32]byte
	rand.Read(seckey[:]) // FIXME: check for error
	seckey[0] &= 248
	seckey[31] &= 127
	seckey[31] |= 64
	curve25519.ScalarBaseMult(&pubkey, &seckey)
	fmt.Printf("pub %x\n", pubkey[:])
	fmt.Printf("sec %x\n", seckey[:])
}

The three bitwise operations are optional since the scalar multiplication does this internally, but they ensure that the secret key is in its canonical form. The actual Diffie–Hellman exchange requires just one more function call: curve25519.ScalarMult().
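
For example, here’s a sketch of the exchange under the same array-pointer API (names are illustrative):

// sharedSecret computes the X25519 shared secret from our secret key
// and their public key. Both parties arrive at the same 32 bytes.
func sharedSecret(seckey, theirPubkey *[32]byte) [32]byte {
	var shared [32]byte
	curve25519.ScalarMult(&shared, seckey, theirPubkey)
	return shared
}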

For Ed25519, the API is higher-level:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/ed25519"
)

func main() {
	seed := make([]byte, ed25519.SeedSize)
	rand.Read(seed) // FIXME: check for error
	key := ed25519.NewKeyFromSeed(seed)
	fmt.Printf("pub %x\n", key[32:])
	fmt.Printf("sec %x\n", key[:32])
}

Signing a message with this key is just one function call: ed25519.Sign().
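
Continuing the example above, with an illustrative message:

message := []byte("hello, world")
sig := ed25519.Sign(key, message)
// Verification needs only the public half of the key pair.
ok := ed25519.Verify(key.Public().(ed25519.PublicKey), message, sig)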

Unfortunately that’s the easy part. The other 400 lines of the real program are concerned only with encoding these values in the complex OpenPGP format. That’s the hard part. GnuPG’s --list-packets option was really useful for debugging this part.

OpenPGP specification

(Feel free to skip this section if the OpenPGP wire format isn’t interesting to you.)

Following the specification was a real challenge, especially since many of the details for Curve25519 only appear in still incomplete (and still erroneous) updates to the specification. I certainly don’t envy the people who have to parse arbitrary OpenPGP packets. It’s finicky and has arbitrary parts that don’t seem to serve any purpose, such as redundant prefix and suffix bytes on signature inputs. Fortunately I only had to worry about the subset that represents an unencrypted secret key export.

OpenPGP data is broken up into packets. Each packet begins with a tag identifying its type, followed by a length, which is itself variable-length encoded. All the packets produced by passphrase2pgp are short, so I could pretend lengths were all a single byte long.
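To give a flavor of it, here’s a minimal sketch of that simple case: a new-format packet header with a one-octet length, which is only valid for bodies shorter than 192 bytes. The helper is my own illustration, not a function from passphrase2pgp.

// Wrap a packet body in a new-format header: one tag octet
// (0xc0 | tag) followed by a one-octet body length (< 192 only).
func packet(tag byte, body []byte) []byte {
	header := []byte{0xc0 | tag, byte(len(body))}
	return append(header, body...)
}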

For a secret key export with one subkey, we need the following packets in this order:

  1. Secret-Key: Public-Key packet with secret key appended
  2. User ID: just a length-prefixed, UTF-8 string
  3. Signature: binds Public-Key packet (1) and User ID packet (2)
  4. Secret-Subkey: Public-Subkey packet with secret subkey appended
  5. Signature: binds Public-Key packet (1) and Public-Subkey packet (4)

A Public-Key packet contains the creation date, key type, and public key data. A Secret-Key packet is the same, but with the secret key literally appended on the end and a different tag. The Key ID is (essentially) a SHA-1 hash of the Public-Key packet, meaning the creation date is part of the Key ID. That’s important for later.
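Concretely, the V4 fingerprint is a SHA-1 hash over the octet 0x99, a two-octet length, and the Public-Key packet body, and the Key ID is the fingerprint’s low-order 8 bytes. A sketch, assuming crypto/sha1 is imported (the helper name is mine):

// Compute the 8-byte Key ID from a Public-Key packet body.
func keyid(body []byte) []byte {
	h := sha1.New()
	h.Write([]byte{0x99, byte(len(body) >> 8), byte(len(body))})
	h.Write(body)
	return h.Sum(nil)[12:] // low 64 bits of the fingerprint
}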

I had wondered if the SHAttered attack could be used to create two different keys with the same full Key ID. However, there’s no slack space anywhere in the input, so I doubt it.

User IDs are usually an RFC 2822 name and email address, but that’s only convention. It can literally be an empty string, though that wouldn’t be useful. OpenPGP clients that require anything more than an empty string, such as GnuPG during key generation, are adding artificial restrictions.

The first Signature packet indicates the signature date, the signature issuer’s Key ID, and then optional metadata about how the primary key is to be used and the capabilities of the key owner’s client. The signature itself is formed by concatenating the Public-Key packet portion of the Secret-Key packet, the User ID packet, and the previously described contents of the signature packet. The concatenation is hashed, the hash is signed, and the signature is appended to the packet. Since the options are included in the signature, they can’t be changed by another person.

In theory the signature is redundant. A client could accept the Secret-Key packet and User ID packet and consider the key imported. It would then create its own self-signature since it has everything it needs. However, my primary target for passphrase2pgp is GnuPG, and it will not accept secret keys that are not self-signed.

The Secret-Subkey packet is exactly the same as the Secret-Key packet except that it uses a different tag to indicate it’s a subkey.

The second Signature packet is constructed the same as the previous signature packet. However, it signs the concatenation of the Public-Key and Public-Subkey packets, binding the subkey to that primary key. This key may similarly have its own options.

To create a public key export from this input, a client need only chop off the secret keys and fix up the packet tags and lengths. The signatures remain untouched since they didn’t include the secret keys. That’s essentially the form in which other people will receive your key.

If someone else were to create a Signature packet binding your Public-Subkey packet with their Public-Key packet, they could set their own options on their version of the key. So my question is: Do clients properly track this separate set of options and separate owner for the same key? If not, they have a problem!

The format may not sound so complex from this description, but there are a ton of little details that all need to be correct. To make matters worse, the feedback is usually just a binary “valid” or “invalid”. The world could use an OpenPGP format debugger.

Usage

There is one required argument: either --uid (-u) or --load (-l). The former specifies a User ID since a key with an empty User ID is pretty useless. It’s my own artificial restriction on the User ID. The latter loads a previously-generated key which will come with a User ID.

To generate a key for use in GnuPG, just pipe the output straight into GnuPG:

$ passphrase2pgp --uid "Foo <foo@example.com>" | gpg --import

You will be prompted for a passphrase. That passphrase is run through Argon2id, a memory-hard KDF, with the User ID as the salt. Deriving the key requires 8 passes over 1GB of state, which takes my current computers around 8 seconds. With the --paranoid (-x) option enabled, that becomes 16 passes over 2GB (perhaps not paranoid enough?). The output is 64 bytes: 32 bytes to seed the primary key and 32 bytes to seed the subkey.
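In terms of golang.org/x/crypto/argon2, that derivation looks roughly like this, where passphrase and uid stand in for the user’s inputs and the thread count is my own guess (the exact parameters in passphrase2pgp may differ):

// 8 passes over 1GB (the memory parameter is in KiB); the 64-byte
// output is split into the two seeds.
seed := argon2.IDKey(passphrase, []byte(uid), 8, 1024*1024, 4, 64)
primarySeed, subkeySeed := seed[:32], seed[32:]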

Despite the aggressive KDF settings, you will still need to choose a strong passphrase. Anyone who has your public key can mount an offline attack. A 10-word Diceware or Pokerware passphrase is more than sufficient (~128 bits) while also being quite reasonable to memorize.

Since the User ID is the salt, an attacker couldn’t build a single rainbow table to attack passphrases for different people. (Though your passphrase really should be strong enough that this won’t matter!) The cost is that you’ll need to use exactly the same User ID again to reproduce the key. In theory you could change the User ID afterward to whatever you like without affecting the Key ID, though it will require a new self-signature.

The keys are not encrypted (no S2K), and there are few options you can choose when generating the keys. If you want to change any of this, use GnuPG’s --edit-key tool after importing. For example, to set a protection passphrase:

$ gpg --edit-key Foo
gpg> passwd

There’s a lot that can be configured from this interface.

If you just need the public key to publish or share, the --public (-p) option will suppress the private parts and output only a public key. It works well in combination with ASCII armor, --armor (-a). For example, to put your public key on the clipboard:

$ passphrase2pgp -u '...' -ap | xclip

The tool can create detached signatures (--sign, -S) entirely on its own, too, so you don’t need to import the keys into GnuPG just to make signatures:

$ passphrase2pgp --sign --uid '...' program.exe

This would create a file named program.exe.sig with the detached signature, ready to be verified by another OpenPGP implementation. In fact, you can hook it directly up to Git for signing your tags and commits without GnuPG:

$ git config --global gpg.program passphrase2pgp

This only works for signing, and it cannot verify (verify-tag or verify-commit).

It’s pretty tedious to enter the --uid option all the time, so, if omitted, passphrase2pgp will infer the User ID from the environment variables REALNAME and EMAIL. Combined with the KEYID environment variable (see the README for details), you can easily get away with never storing your keys: just generate them on demand.
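For example, with those variables set, the --uid option can be dropped entirely (the values below are placeholders):

$ export REALNAME='Foo' EMAIL='foo@example.com'
$ passphrase2pgp -p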

That’s how I intend to use passphrase2pgp. When I want to sign a file, I’ll only need one option, one passphrase prompt, and a few seconds of patience:

$ passphrase2pgp -S path/to/file

January 1, 1970

The first time you run the tool you might notice one offensive aspect of its output: Your key will be dated January 1, 1970 — i.e. unix epoch zero. This predates PGP itself by more than two decades, so it might alarm people who receive your key.

Why do this? As I noted before, the creation date is part of the Key ID. Use a different date, and, as far as OpenPGP is concerned, you have a different key. Since users probably don’t want to remember a specific datetime, at seconds resolution, in addition to their passphrase, passphrase2pgp uses the same hard-coded date by default. A date of January 1, 1970 is like NULL in a database: no data.

If you don’t like this, you can override it with the --time (-t) or --now (-n) options, but it’s up to you to remain consistent.

Vanity Keys

If you’re interested in vanity keys — e.g. where the Key ID spells out words or looks unusual — it wouldn’t take much work to hack up the passphrase2pgp source to generate your preferred vanity keys. It would easily beat anything else I could find online.
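One hypothetical approach, since the creation date feeds into the Key ID: hold the passphrase-derived seed fixed and scan timestamps until the Key ID has the prefix you want, then pass the winning timestamp via --time. In the sketch below, seed and keyidFor() are placeholders, not names from the actual source.

// Hypothetical vanity scan over creation timestamps.
for t := int64(0); ; t++ {
	if bytes.HasPrefix(keyidFor(seed, t), []byte{0xde, 0xad}) {
		fmt.Println("found creation time:", t)
		break
	}
}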

Reconsidering limited OpenPGP

Initially my intention was never to output an encryption subkey, and passphrase2pgp would only be useful for signatures. By default it still only produces a signature key, but you can get an encryption subkey with the --subkey (-s) option. I figured it might be useful to generate an encryption key, even if it’s not output by default. Users can always ask for it later if they have a need for it.

Why only a signing key? Nobody should be using OpenPGP for encryption anymore. Use better tools instead and retire this 20th-century cryptography. If you don’t have an encryption subkey, nobody can send you OpenPGP-encrypted messages.

In contrast, OpenPGP signatures are still kind of useful and lack a practical alternative. The Web of Trust failed to reach critical mass, but that doesn’t seem to matter much in practice. Important OpenPGP keys can be bootstrapped off TLS by strategically publishing them on HTTPS servers. Keybase.io has done interesting things in this area.

Further, GitHub officially supports OpenPGP signatures, and I believe GitLab does too. This is another way to establish trust for a key. IMHO, there’s generally too much emphasis on binding a person’s legal identity to their OpenPGP key (e.g. the idea behind key-signing parties). I suppose that’s useful for holding a person legally accountable if they do something wrong. I’d prefer to trust a key that has an established history of valuable community contributions, even if made only under a pseudonym.

So sometime in the future I may again advertise an OpenPGP public key. If I do, those keys would certainly be generated with passphrase2pgp. I may not even store the secret keys on a keyring, and instead generate them on the fly only when I occasionally need them.

Go Slices are Fat Pointers

This article was discussed on Hacker News.

One of the frequent challenges in C is that pointers are nothing but a memory address. A callee who is passed a pointer doesn’t truly know anything other than the type of object being pointed at, which says some things about alignment and how that pointer can be used… maybe. If it’s a pointer to void (void *) then not even that much is known.

The number of consecutive elements being pointed at is also not known. It could be as few as zero, in which case dereferencing would be illegal. This can be true even when the pointer is not null. Pointers can go one past the end of an array, at which point they point to zero elements. For example:

void foo(int *);

void bar(void)
{
    int array[4];
    foo(array + 4);  // pointer one past the end
}

In some situations, the number of elements is known, at least to the programmer. For example, the function might have a contract that says it must be passed at least N elements, or exactly N elements. This could be communicated through documentation.

/** Foo accepts 4 int values. */
void foo(int *);

Or it could be implied by the function’s prototype. Despite the following function appearing to accept an array, that’s actually a pointer, and the “4” isn’t relevant to the prototype.

void foo(int[4]);

C99 introduced a feature to make this a formal part of the prototype, though, unfortunately, I’ve never seen a compiler actually use this information.

void foo(int[static 4]);  // >= 4 elements, cannot be null

Another common pattern is for the callee to accept a count parameter. For example, the POSIX write() function:

ssize_t write(int fd, const void *buf, size_t count);

The necessary information describing the buffer is split across two arguments. That can become tedious, and it’s also a source of serious bugs if the two parameters aren’t in agreement (buffer overflow, information disclosure, etc.). Wouldn’t it be nice if this information was packed into the pointer itself? That’s essentially the definition of a fat pointer.

Fat pointers via bit hacks

If we assume some things about the target platform, we can encode fat pointers inside a plain pointer with some dirty pointer tricks, exploiting unused bits in the pointer value. For example, currently on x86-64, only the lower 48 bits of a pointer are actually used. The other 16 bits could carefully be used for other information, like communicating the number of elements or bytes:

// NOTE: x86-64 only! (requires stdint.h for uintptr_t)
unsigned char buf[1000];
uintptr_t addr = (uintptr_t)buf & 0xffffffffffff;
uintptr_t pack = ((uintptr_t)sizeof(buf) << 48) | addr;
void *fatptr = (void *)pack;

The other side can unpack this to get the components back out. Obviously 16 bits for the count will often be insufficient, so this would more likely be used for baggy bounds checks.

Further, if we know something about the alignment — say, that it’s 16-byte aligned — then we can also encode information in the least significant bits, such as a type tag.

Fat pointers via a struct

That’s all fragile, non-portable, and rather limited. A more robust approach is to lift pointers up into a richer, heavier type, like a structure.

struct fatptr {
    void *ptr;
    size_t len;
};

Functions accepting these fat pointers no longer need to accept a count parameter, and they’d generally accept the fat pointer by value.

ssize_t fatptr_write(int fd, struct fatptr);

In typical C implementations, the structure fields would be passed practically, if not exactly, the same way as the individual parameters would have been passed, so it’s really no less efficient.

To help keep this straight, we might employ some macros:

#define COUNTOF(array) \
    (sizeof(array) / sizeof(array[0]))

#define FATPTR(ptr, count) \
    (struct fatptr){ptr, count}

#define ARRAYPTR(array) \
    FATPTR(array, COUNTOF(array))

/* ... */

unsigned char buf[40];
fatptr_write(fd, ARRAYPTR(buf));

There are obvious disadvantages of this approach, like type confusion due to that void pointer, the inability to use const, and just being weird for C. I wouldn’t use it in a real program, but bear with me.

Before I move on, I want to add one more field to that fat pointer struct: capacity.

struct fatptr {
    void *ptr;
    size_t len;
    size_t cap;
};

This communicates not how many elements are present (len), but the total capacity of the buffer. Together with len, it lets callees know how much room is left for, say, appending new elements.

// Fill the remainder of an int buffer with a value.
void
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
}

Since the fat pointer is passed by value, the caller won’t see the callee’s update to len, so the modified fat pointer should be returned:

struct fatptr
fill(struct fatptr ptr, int value)
{
    int *buf = ptr.ptr;
    for (size_t i = ptr.len; i < ptr.cap; i++) {
        buf[i] = value;
    }
    ptr.len = ptr.cap;
    return ptr;
}

Congratulations, you’ve got slices! Except that in Go they’re a proper part of the language and so don’t rely on hazardous hacks or tedious bookkeeping. The fatptr_write() function above is nearly functionally equivalent to the Writer.Write() method in Go, which accepts a slice:

type Writer interface {
	Write(p []byte) (n int, err error)
}

The buf and count parameters are packed together as a slice, and the fd parameter is instead the receiver (the object being acted upon by the method).

Go slices

Go famously has pointers, including internal pointers, but not pointer arithmetic. You can take the address of (nearly) anything, but you can’t make that pointer point at anything else, even if you took the address of an array element. Pointer arithmetic would undermine Go’s type safety, so it can only be done through special mechanisms in the unsafe package.

But pointer arithmetic is really useful! It’s handy to take an address of an array element, pass it to a function, and allow that function to modify a slice (wink, wink) of the array. Slices are pointers that support exactly this sort of pointer arithmetic, but safely. Unlike the & operator which creates a simple pointer, the slice operator derives a fat pointer.

func fill([]int, int) []int

var array [8]int

// len == 0, cap == 8, like &array[0]
fill(array[:0], 1)
// array is [1, 1, 1, 1, 1, 1, 1, 1]

// len == 0, cap == 4, like &array[4]
fill(array[4:4], 2)
// array is [1, 1, 1, 1, 2, 2, 2, 2]

The fill function could take a slice of the slice, effectively moving the pointer around with pointer arithmetic, but without violating memory safety due to the additional “fat pointer” information. In other words, fat pointers can be derived from other fat pointers.
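For example, each reslice below derives a new fat pointer from the previous one, moving the base pointer forward while the bounds come along for the ride:

s := make([]int, 8) // len == 8, cap == 8
t := s[2:4]         // len == 2, cap == 6, like &s[2]
u := t[1:3]         // len == 2, cap == 5, like &s[3]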

Slices aren’t as universal as pointers, at least at the moment. You can take the address of any variable using &, but you can’t take a slice of any variable, even if it would be logically sound.

var foo int

// attempt to make len = 1, cap = 1 slice backed by foo
var fooslice []int = foo[:]   // compile-time error!

That wouldn’t be very useful anyway. However, if you really wanted to do this, the unsafe package can accomplish it. I believe the resulting slice would be perfectly safe to use:

// Convert to one-element array, then slice
fooslice = (*[1]int)(unsafe.Pointer(&foo))[:]

Update: Chris Siebenmann speculated about why this requires unsafe.

Of course, slices are super flexible and have many more uses that look less like fat pointers, but this is still how I tend to reason about slices when I write Go.
