nullprogram.com/blog/2015/03/19/
This article was discussed on Hacker News and on reddit.
Monday’s /r/dailyprogrammer challenge was to write a program to
read a recurrence relation definition and, through interpretation,
iterate it to some number of terms. It’s given an initial term (u(0))
and a sequence of operations, f, to apply to the previous term
(u(n + 1) = f(u(n))) to compute the next term. Since it’s an easy
challenge, the operations are limited to addition, subtraction,
multiplication, and division, with one operand each.
For example, the relation u(n + 1) = (u(n) + 2) * 3 - 5 would be
input as +2 *3 -5. If u(0) = 0 then,
- u(1) = 1
- u(2) = 4
- u(3) = 13
- u(4) = 40
- u(5) = 121
- …
Rather than write an interpreter to apply the sequence of operations,
for my submission (mirror) I took the opportunity to
write a simple x86-64 Just-In-Time (JIT) compiler. So rather than
stepping through the operations one by one, my program converts the
operations into native machine code and lets the hardware do the work
directly. In this article I’ll go through how it works and how I did
it.
Update: The follow-up challenge uses Reverse Polish
notation to allow for more complicated expressions. I wrote another
JIT compiler for my submission (mirror).
Allocating Executable Memory
Modern operating systems have page-granularity protections for
different parts of process memory: read, write, and execute.
Code can only be executed from memory with the execute bit set on its
page, memory can only be changed when its write bit is set, and some
pages aren’t allowed to be read. In a running process, the pages
holding program code and loaded libraries will have their write bit
cleared and execute bit set. Most of the other pages will have their
execute bit cleared and their write bit set.
The reason for this is twofold. First, it significantly increases the
security of the system. If untrusted input was read into executable
memory, an attacker could input machine code (shellcode) into the
buffer, then exploit a flaw in the program to cause control flow to
jump to and execute that code. If the attacker is only able to write
code to non-executable memory, this attack becomes a lot harder. The
attacker has to rely on code already loaded into executable pages
(return-oriented programming).
Second, it catches program bugs sooner and reduces their impact, so
there’s less chance for a flawed program to accidentally corrupt user
data. Accessing memory in an invalid way will cause a segmentation
fault, usually leading to program termination. For example, NULL
points to a special page with read, write, and execute disabled.
An Instruction Buffer
Memory returned by malloc()
and friends will be writable and
readable, but non-executable. If the JIT compiler allocates memory
through malloc()
, fills it with machine instructions, and jumps to
it without doing any additional work, there will be a segmentation
fault. So some different memory allocation calls will be made instead,
with the details hidden behind an asmbuf
struct.
#define PAGE_SIZE 4096

struct asmbuf {
    uint8_t code[PAGE_SIZE - sizeof(uint64_t)];
    uint64_t count;
};
To keep things simple here, I’m just assuming the page size is 4kB. In
a real program, we’d use sysconf(_SC_PAGESIZE)
to discover the page
size at run time. On x86-64, pages may be 4kB, 2MB, or 1GB, but this
program will work correctly as-is regardless.
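For instance, such a query might look like this (a minimal sketch; sysconf is declared in <unistd.h> and error handling is omitted):

long page_size = sysconf(_SC_PAGESIZE); // typically 4096 on x86-64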
Instead of malloc()
, the compiler allocates memory as an anonymous
memory map (mmap()
). It’s anonymous because it’s not backed by a
file.
struct asmbuf *
asmbuf_create(void)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    return mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
}
Windows doesn’t have POSIX mmap()
, so on that platform we use
VirtualAlloc()
instead. Here’s the equivalent in Win32.
struct asmbuf *
asmbuf_create(void)
{
    DWORD type = MEM_RESERVE | MEM_COMMIT;
    return VirtualAlloc(NULL, PAGE_SIZE, type, PAGE_READWRITE);
}
Anyone reading closely should notice that I haven’t actually requested
that the memory be executable, which is, like, the whole point of all
this! This was intentional. Some operating systems employ a security
feature called W^X: “write xor execute.” That is, memory is either
writable or executable, but never both at the same time. This makes
the shellcode attack I described before even harder. For well-behaved
JIT compilers it means memory protections need to be adjusted
after code generation and before execution.
The POSIX mprotect()
function is used to change memory protections.
void
asmbuf_finalize(struct asmbuf *buf)
{
    mprotect(buf, sizeof(*buf), PROT_READ | PROT_EXEC);
}
Or on Win32 (that last parameter is not allowed to be NULL
),
void
asmbuf_finalize(struct asmbuf *buf)
{
    DWORD old;
    VirtualProtect(buf, sizeof(*buf), PAGE_EXECUTE_READ, &old);
}
Finally, instead of free()
it gets unmapped.
void
asmbuf_free(struct asmbuf *buf)
{
    munmap(buf, PAGE_SIZE);
}
And on Win32,
void
asmbuf_free(struct asmbuf *buf)
{
    VirtualFree(buf, 0, MEM_RELEASE);
}
I won’t list the definitions here, but there are two “methods” for
inserting instructions and immediate values into the buffer. This will
be raw machine code, so the caller will be acting a bit like an
assembler.
asmbuf_ins(struct asmbuf *, int size, uint64_t ins);
asmbuf_immediate(struct asmbuf *, int size, const void *value);
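For reference, here is a sketch of what these might look like, consistent with how they’re used below but not necessarily the original definitions (memcpy needs <string.h>): instructions are emitted most-significant byte first, so the hex literals in the calls read in the same order as the disassembly, while immediates are copied through unchanged.

static void
asmbuf_ins(struct asmbuf *buf, int size, uint64_t ins)
{
    // emit the instruction bytes most-significant byte first
    for (int i = size - 1; i >= 0; i--)
        buf->code[buf->count++] = (ins >> (i * 8)) & 0xff;
}

static void
asmbuf_immediate(struct asmbuf *buf, int size, const void *value)
{
    // copy the operand bytes through unchanged (x86-64 is little endian)
    memcpy(buf->code + buf->count, value, size);
    buf->count += size;
}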
Calling Conventions
We’re only going to be concerned with three of x86-64’s many
registers: rdi
, rax
, and rdx
. These are 64-bit (r
) extensions
of the original 16-bit 8086 registers. The sequence of
operations will be compiled into a function that we’ll be able to call
from C like a normal function. Here’s what its prototype will look
like. It takes a signed 64-bit integer and returns a signed 64-bit
integer.
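long recurrence(long); // the name here is just for illustration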
The System V AMD64 ABI calling convention says that the first
integer/pointer function argument is passed in the rdi
register.
When our JIT compiled program gets control, that’s where its input
will be waiting. According to the ABI, the C program will be expecting
the result to be in rax
when control is returned. If our recurrence
relation is merely the identity function (it has no operations), the
only thing it will do is copy rdi
to rax
.
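mov rax, rdi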
There’s a catch, though. You might think all the mucky
platform-dependent stuff was encapsulated in asmbuf
. Not quite. As
usual, Windows is the oddball and has its own unique calling
convention. For our purposes here, the only difference is that the
first argument comes in rcx
rather than rdi
. Fortunately this only
affects the very first instruction and the rest of the assembly
remains the same.
The very last thing it will do, assuming the result is in rax
, is
return to the caller.
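ret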
So we know the assembly, but what do we pass to asmbuf_ins()
? This
is where we get our hands dirty.
Finding the Code
If you want to do this the Right Way, you go download the x86-64
documentation, look up the instructions we’re using, and manually work
out the bytes we need and how the operands fit into it. You know, like
they used to do out of necessity back in the 60’s.
Fortunately there’s a much easier way. We’ll have an actual assembler
do it and just copy what it does. Put both of the instructions above
in a file peek.s
and hand it to nasm
. It will produce a raw binary
with the machine code, which we’ll disassemble with ndisasm
(the
NASM disassembler).
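The file itself is tiny, something along these lines (note that nasm needs a bits 64 directive to emit 64-bit encodings into a flat binary):

bits 64
mov rax, rdi
ret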
$ nasm peek.s
$ ndisasm -b64 peek
00000000  4889F8            mov rax,rdi
00000003  C3                ret
That’s straightforward. The first instruction is 3 bytes and the
return is 1 byte.
asmbuf_ins(buf, 3, 0x4889f8); // mov rax, rdi
// ... generate code ...
asmbuf_ins(buf, 1, 0xc3); // ret
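On Windows, where the argument arrives in rcx, only that first instruction differs: mov rax, rcx encodes as 0x4889c8 (which you can confirm with the same nasm/ndisasm trick), so a sketch of the selection might be,

#ifdef _WIN32
asmbuf_ins(buf, 3, 0x4889c8); // mov rax, rcx (Win64 ABI)
#else
asmbuf_ins(buf, 3, 0x4889f8); // mov rax, rdi (System V ABI)
#endif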
For each operation, we’ll set it up so the operand will already be
loaded into rdi
regardless of the operator, similar to how the
argument was passed in the first place. A smarter compiler would embed
the immediate in the operator’s instruction if it’s small (32 bits or
fewer), but I’m keeping it simple. To sneakily capture the “template”
for this instruction I’m going to use 0x0123456789abcdef
as the
operand.
mov rdi, 0x0123456789abcdef
Which disassembled with ndisasm
is,
00000000  48BFEFCDAB896745  mov rdi,0x123456789abcdef
         -2301
Notice the operand listed in little-endian order immediately after
the instruction. That’s also easy!
long operand;
scanf("%ld", &operand);
asmbuf_ins(buf, 2, 0x48bf); // mov rdi, operand
asmbuf_immediate(buf, 8, &operand);
Apply the same discovery process individually for each operator you
want to support, accumulating the result in rax
for each.
switch (operator) {
case '+':
    asmbuf_ins(buf, 3, 0x4801f8);   // add rax, rdi
    break;
case '-':
    asmbuf_ins(buf, 3, 0x4829f8);   // sub rax, rdi
    break;
case '*':
    asmbuf_ins(buf, 4, 0x480fafc7); // imul rax, rdi
    break;
case '/':
    asmbuf_ins(buf, 3, 0x4831d2);   // xor rdx, rdx
    asmbuf_ins(buf, 3, 0x48f7ff);   // idiv rdi
    break;
}
As an exercise, try adding support for the modulus operator (%
), XOR
(^
), and bit shifts (<
, >
). With the addition of these
operators, you could define a decent PRNG as a recurrence relation. It
will also eliminate the closed form solution to this problem so
that we actually have a reason to do all this! Or, alternatively,
switch it all to floating point.
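As a hint for the XOR case, the same nasm/ndisasm process gives 0x4831f8 for xor rax, rdi, so the new case would mirror the existing ones (a sketch worth verifying yourself):

case '^':
    asmbuf_ins(buf, 3, 0x4831f8);   // xor rax, rdi
    break;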
Calling the Generated Code
Once we’re all done generating code, finalize the buffer to make it
executable, cast it to a function pointer, and call it. (I cast it as
a void *
just to avoid repeating myself, since that will implicitly
cast to the correct function pointer prototype.)
asmbuf_finalize(buf);
long (*recurrence)(long) = (void *)buf->code;
// ...
x[n + 1] = recurrence(x[n]);
That’s pretty cool if you ask me! Now this was an extremely simplified
situation. There’s no branching, no intermediate values, no function
calls, and I didn’t even touch the stack (push, pop). The recurrence
relation definition in this challenge is practically an assembly
language itself, so after the initial setup it’s a 1:1 translation.
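Putting the pieces together, the overall shape of the program is roughly this (a sketch: input parsing, error checking, and the per-operation switch are elided, printf assumes <stdio.h>, and u0 and count stand in for values read from the input):

struct asmbuf *buf = asmbuf_create();
asmbuf_ins(buf, 3, 0x4889f8);     // mov rax, rdi
// ... emit mov rdi, imm64 plus one operation per input token ...
asmbuf_ins(buf, 1, 0xc3);         // ret
asmbuf_finalize(buf);

long (*recurrence)(long) = (void *)buf->code;
long x = u0;                      // u(0)
for (long n = 1; n <= count; n++) {
    x = recurrence(x);
    printf("u(%ld) = %ld\n", n, x);
}
asmbuf_free(buf);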
I’d like to build a JIT compiler more advanced than this in the
future. I just need to find a suitable problem that’s more complicated
than this one, warrants having a JIT compiler, but is still simple
enough that I could, on some level, justify not using LLVM.