nullprogram.com/blog/2017/06/21/
Stack clashing has been in the news lately due to some recently
discovered vulnerablities along with proof-of-concept
exploits. As the announcement itself notes, this is not a new issue,
though this appears to be the first time it’s been given this
particular name. I do know of one “good” use of stack clashing, where
it’s used for something productive than as part of an attack. In this
article I’ll explain how it works.
You can find the complete code for this article here, ready to run:
But first, what is a stack clash? Here’s a rough picture of the
typical way process memory is laid out. The stack starts at a high
memory address and grows downwards. Code and static data sit at low
memory, with a brk
pointer growing upward to make small allocations.
In the middle is the heap, where large allocations and memory mappings
take place.
Below the stack is a slim guard page that divides the stack and the
region of memory reserved for the heap. Reading or writing to that
memory will trap, causing the program to crash or some special action
to be taken. The goal is to prevent the stack from growing into the
heap, which could cause all sorts of trouble, like security issues.
The problem is that this thin guard page isn’t enough. It’s possible to
put a large allocation on the stack, never read or write to it, and
completely skip over the guard page, such that the heap and stack
overlap without detection.
Once this happens, writes into the heap will change memory on the
stack and vice versa. If an attacker can cause the program to make
such a large allocation on the stack, then legitimate writes into
memory on the heap can manipulate local variables or return pointers,
changing the program’s control flow. This can bypass buffer
overflow protections, such as stack canaries.
Binary trees and coroutines
Now, I’m going to abruptly change topics to discuss binary search
trees. We’ll get back to stack clash in a bit. Suppose we have a
binary tree which we would like to iterate depth-first. For this
demonstration, here’s the C interface to the binary tree.
struct tree {
struct tree *left;
struct tree *right;
char *key;
char *value;
};
void tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void tree_visit(struct tree *, void (*f)(char *, char *));
void tree_destroy(struct tree *);
An empty tree is the NULL pointer, hence the double-pointer for
insert. In the demonstration it’s an unbalanced search tree, but this
could very well be a balanced search tree with the addition of another
field on the structure.
For the traversal, first visit the root node, then traverse its left
tree, and finally traverse its right tree. It makes for a simple,
recursive definition — the sort of thing you’d teach a beginner.
Here’s a definition that accepts a callback, which the caller will use
to visit each key/value in the tree. This really is as simple as it
gets.
void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
if (t) {
f(t->key, t->value);
tree_visit(t->left, f);
tree_visit(t->right, f);
}
}
Unfortunately this isn’t so convenient for the caller, who has to
split off a callback function that lacks context, then hand
over control to the traversal function.
void
printer(char *k, char *v)
{
printf("%s = %s\n", k, v);
}
void
print_tree(struct tree *tree)
{
tree_visit(tree, printer);
}
Usually it’s much nicer for the caller if instead it’s provided an
iterator, which the caller can invoke at will. Here’s an interface
for it, just two functions.
struct tree_it *tree_iterator(struct tree *);
int tree_next(struct tree_it *, char **k, char **v);
The first constructs an iterator object, and the second one visits a
key/value pair each time it’s called. It returns 0 when traversal is
complete, automatically freeing any resources associated with the
iterator.
The caller now looks like this:
char *k, *v;
struct tree_it *it = tree_iterator(tree);
while (tree_next(it, &k, &v))
printf("%s = %s\n", k, v);
Notice I haven’t defined struct tree_it
. That’s because I’ve got
four different implementations, each taking a different approach. The
last one will use stack clashing.
Manual State Tracking
With just the standard facilities provided by C, there’s a some manual
bookkeeping that has to take place in order to convert the recursive
definition into an iterator. Depth-first traversal is a stack-oriented
process, and with recursion the stack is implicit in the call stack.
As an iterator, the traversal stack needs to be managed
explicitly. The iterator needs to keep track of the path it
took so that it can backtrack, which means keeping track of parent
nodes as well as which branch was taken.
Here’s my little implementation, which, to keep things simple, has a
hard depth limit of 32. It’s structure definition includes a stack of
node pointers, and 2 bits of information per visited node, stored
across a 64-bit integer.
struct tree_it {
struct tree *stack[32];
unsigned long long state;
int nstack;
};
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack[0] = t;
it->state = 0;
it->nstack = 1;
return it;
}
The 2 bits track three different states for each visited node:
- Visit the current node
- Traverse the left tree
- Traverse the right tree
It works out to the following. Don’t worry too much about trying to
understand how this works. My point is to demonstrate that converting
the recursive definition into an iterator complicates the
implementation.
int
tree_next(struct tree_it *it, char **k, char **v)
{
while (it->nstack) {
int shift = (it->nstack - 1) * 2;
int state = 3u & (it->state >> shift);
struct tree *t = it->stack[it->nstack - 1];
it->state += 1ull << shift;
switch (state) {
case 0:
*k = t->key;
*v = t->value;
if (t->left) {
it->stack[it->nstack++] = t->left;
it->state &= ~(3ull << (shift + 2));
}
return 1;
case 1:
if (t->right) {
it->stack[it->nstack++] = t->right;
it->state &= ~(3ull << (shift + 2));
}
break;
case 2:
it->nstack--;
break;
}
}
free(it);
return 0;
}
Wouldn’t it be nice to keep both the recursive definition while also
getting an iterator? There’s an exact solution to that: coroutines.
Coroutines
C doesn’t come with coroutines, but there are a number of libraries
available. We can also build our own coroutines. One way to do that is
with user contexts (<ucontext.h>
) provided by the X/Open System
Interfaces Extension (XSI), an extension to POSIX. This set of
functions allow programs to create their own call stacks and switch
between them. That’s the key ingredient for coroutines. Caveat: These
functions aren’t widely available, and probably shouldn’t be used in
new code.
Here’s my iterator structure definition.
#define _XOPEN_SOURCE 600
#include <ucontext.h>
struct tree_it {
char *k;
char *v;
ucontext_t coroutine;
ucontext_t yield;
};
It needs one context for the original stack and one context for the
iterator’s stack. Each time the iterator is invoked, it the program
will switch to the other stack, find the next value, then switch back.
This process is called yielding. Values are passed between context
using the k
(key) and v
(value) fields on the iterator.
Before I get into initialization, here’s the actual traversal
coroutine. It’s nearly the same as the original recursive definition
except for the swapcontext()
. This is the yield, pausing execution
and sending control back to the caller. The current context is saved
in the first argument, and the second argument becomes the current
context.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
swapcontext(&it->coroutine, &it->yield);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
While the actual traversal is simple again, initialization is more
complicated. The first problem is that there’s no way to pass pointer
arguments to the coroutine. Technically only int
arguments are
permitted. (All the online tutorials get this wrong.) To work around
this problem, I smuggle the arguments in as global variables. This
would cause problems should two different threads try to create
iterators at the same time, even on different trees.
static struct tree *tree_arg;
static struct tree_it *tree_it_arg;
static void
coroutine_init(void)
{
coroutine(tree_arg, tree_it_arg);
}
The stack has to be allocated manually, which I do with a call to
malloc()
. Nothing fancy is needed, though this means the new
stack won’t have a guard page. For the stack size, I use the suggested
value of SIGSTKSZ
. The makecontext()
function is what creates the
new context from scratch, but the new context must first be
initialized with getcontext()
, even though that particular snapshot
won’t actually be used.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
it->coroutine.uc_stack.ss_size = SIGSTKSZ;
it->coroutine.uc_link = &it->yield;
getcontext(&it->coroutine);
makecontext(&it->coroutine, coroutine_init, 0);
tree_arg = t;
tree_it_arg = it;
return it;
}
Notice I gave it a function pointer, a lot like I’m starting a new
thread. This is no coincidence. There’s a lot of similarity between
coroutines and multiple threads, as you’ll soon see.
Finally the iterator function itself. Since NULL isn’t a valid key, it
initializes the key to NULL before yielding to the iterator context.
If the iterator has no more nodes to visit, it doesn’t set the key,
which can be detected when control returns.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
swapcontext(&it->yield, &it->coroutine);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->coroutine.uc_stack.ss_sp);
free(it);
return 0;
}
}
That’s all it takes to create and operate a coroutine in C, provided
you’re on a system with these XSI extensions.
Semaphores
Instead of a coroutine, we could just use actual threads and a couple
of semaphores to synchronize them. This is a heavy implementation and
also probably shouldn’t be used in practice, but at least it’s fully
portable.
Here’s the structure definition:
struct tree_it {
struct tree *t;
char *k;
char *v;
sem_t visitor;
sem_t main;
pthread_t thread;
};
The main thread will wait on one semaphore and the iterator thread
will wait on the other. This should sound very familiar.
The actual traversal function looks the same, but with sem_post()
and sem_wait()
as the yield.
static void
visit(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
sem_post(&it->main);
sem_wait(&it->visitor);
visit(t->left, it);
visit(t->right, it);
}
}
There’s a separate function to initialize the iterator context again.
static void *
thread_entrance(void *arg)
{
struct tree_it *it = arg;
sem_wait(&it->visitor);
visit(it->t, it);
sem_post(&it->main);
return 0;
}
Creating the iterator only requires initializing the semaphores and
creating the thread:
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->t = t;
sem_init(&it->visitor, 0, 0);
sem_init(&it->main, 0, 0);
pthread_create(&it->thread, 0, thread_entrance, it);
return it;
}
The iterator function looks just like the coroutine version.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
sem_post(&it->visitor);
sem_wait(&it->main);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
pthread_join(it->thread, 0);
sem_destroy(&it->main);
sem_destroy(&it->visitor);
free(it);
return 0;
}
}
Overall, this is almost identical to the coroutine version.
Coroutines using stack clashing
Finally I can tie this back into the topic at hand. Without either XSI
extensions or Pthreads, we can (usually) create coroutines by abusing
setjmp()
and longjmp()
. Technically this violates two of the C’s
rules and relies on undefined behavior, but it generally works. This
is not my own invention, and it dates back to at least 2010.
From the very beginning, C has provided a crude “exception” mechanism
that allows the stack to be abruptly unwound back to a previous state.
It’s a sort of non-local goto. Call setjmp()
to capture an opaque
jmp_buf
object to be used in the future. This function returns 0
this first time. Hand that value to longjmp()
later, even in a
different function, and setjmp()
will return again, this time with a
non-zero value.
It’s technically unsuitable for coroutines because the jump is a
one-way trip. The unwound stack invalidates any jmp_buf
that was
created after the target of the jump. In practice, though, you can
still use these jumps, which is one rule being broken.
That’s where stack clashing comes into play. In order for it to be a
proper coroutine, it needs to have its own stack. But how can we do
that with these primitive C utilities? Extend the stack to overlap
the heap, call setjmp()
to capture a coroutine on it, then return.
Generally we can get away with using longjmp()
to return to this
heap-allocated stack.
Here’s my iterator definition for this one. Like the XSI context
struct, this has two jmp_buf
“contexts.” The stack
holds the
iterator’s stack buffer so that it can be freed, and the gap
field
will be used to prevent the optimizer from spoiling our plans.
struct tree_it {
char *k;
char *v;
char *stack;
volatile char *gap;
jmp_buf coroutine;
jmp_buf yield;
};
The coroutine looks familiar again. This time the yield is performed
with setjmmp()
and longjmp()
, just like swapcontext()
. Remember
that setjmp()
returns twice, hence the branch. The longjmp()
never
returns.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
if (!setjmp(it->coroutine))
longjmp(it->yield, 1);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
Next is the tricky part to cause the stack clash. First, allocate the
new stack with malloc()
so that we can get its address. Then use a
local variable on the stack to determine how much the stack needs to
grow in order to overlap with the allocation. Taking the difference
between these pointers is illegal as far as the language is concerned,
making this the second rule I’m breaking. I can imagine an
implementation where the stack and heap are in two separate
kinds of memory, and it would be meaningless to take the difference. I
don’t actually have to imagine very hard, because this is actually how
it used to work on the 8086 with its segmented memory
architecture.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack = malloc(STACK_SIZE);
char marker;
char gap[&marker - it->stack - STACK_SIZE];
it->gap = gap; // prevent optimization
if (!setjmp(it->yield))
coroutine_init(t, it);
return it;
}
I’m using a variable-length array (VLA) named gap
to indirectly
control the stack pointer, moving it over the heap. I’m assuming the
stack grows downward, since otherwise the sign would be wrong.
The compiler is smart and will notice I’m not actually using gap
,
and it’s happy to throw it away. In fact, it’s vitally important that
I don’t touch it since the guard page, along with a bunch of
unmapped memory, is actually somewhere in the middle of that array. I
only want the array for its side effect, but that side effect isn’t
officially supported, which means the optimizer doesn’t need to
consider it in its decisions. To inhibit the optimizer, I store the
array’s address where someone might potentially look at it, meaning
the array has to exist.
Finally, the iterator function looks just like the others, again.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
if (!setjmp(it->yield))
longjmp(it->coroutine, 1);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->stack);
free(it);
return 0;
}
}
And that’s it: a nasty hack using a stack clash to create a context
for a setjmp()
+longjmp()
coroutine.