nullprogram.com/blog/2018/11/15/
This is a debugging war story.
Once upon a time I wrote a fancy data conversion utility. The input
was a complex binary format defined by a data dictionary supplied at
run time by the user alongside the input data. Since the converter was
typically used to process massive quantities of input, and the nature
of that input wasn’t known until run time, I wrote an x86-64 JIT
compiler to speed it up. The converter generated a fast, native
binary parser in memory according to the data dictionary
specification. Processing data now took much less time and everyone
rejoiced.
Then along came SELinux, Sheriff of Pedantry. Not liking all the
shenanigans with page protections, SELinux huffed and puffed and made
mprotect(2)
return EACCES
(“Permission denied”). Believing I was
following all the rules and so this would never happen, I foolishly
did not check the result and the converter was now crashing for its
users. What made SELinux so unhappy, and could this somehow be
resolved?
Allocating memory
Before going further, let’s back up and review how this works. Suppose I
want to generate code at run time and execute it. In the old days this
was as simple as writing some machine code into a buffer and jumping to
that buffer — e.g. by converting the buffer to a function pointer and
calling it.
typedef int (*jit_func)(void);
/* NOTE: This doesn't work anymore! */
jit_func
jit_compile(int retval)
{
unsigned char *buf = malloc(6);
if (buf) {
/* mov eax, retval */
buf[0] = 0xb8;
buf[1] = retval >> 0;
buf[2] = retval >> 8;
buf[3] = retval >> 16;
buf[4] = retval >> 24;
/* ret */
buf[5] = 0xc3;
}
return (jit_func)buf;
}
int
main(void)
{
jit_func f = jit_compile(1001);
printf("f() = %d\n", f());
free(f);
}
This situation was far too easy for malicious actors to abuse. An
attacker could supply instructions of their own choosing — i.e. shell
code — as input and exploit a buffer overflow vulnerability to execute
the input buffer. These exploits were trivial to craft.
Modern systems have hardware checks to prevent this from happening.
Memory containing instructions must have their execute protection bit
set before those instructions can be executed. This is useful both for
making attackers work harder and for catching bugs in programs — no more
executing data by accident.
This is further complicated by the fact that memory protections have
page granularity. You can’t adjust the protections for a 6-byte
buffer. You do it for the entire surrounding page — typically 4kB, but
sometimes as large as 2MB. This requires replacing that malloc(3)
with a more careful allocation strategy. There are a few ways to go
about this.
Anonymous memory mapping
The most common and most sensible is to create an anonymous memory
mapping: a file memory map that’s not actually backed by a file. The
mmap(2)
function has a flag specifically for this purpose:
MAP_ANONYMOUS
.
#include <sys/mman.h>
void *
anon_alloc(size_t len)
{
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
void *p = mmap(0, len, prot, flags, -1, 0);
return p != MAP_FAILED ? p : 0;
}
void
anon_free(void *p, size_t len)
{
munmap(p, len);
}
Unfortunately, MAP_ANONYMOUS
not part of POSIX. If you’re being super
strict with your includes — as I tend to be — this flag won’t be
defined, even on systems where it’s supported.
#define _POSIX_C_SOURCE 200112L
#include <sys/mman.h>
// MAP_ANONYMOUS undefined!
To get the flag, you must use the _BSD_SOURCE
, or, more recently,
the _DEFAULT_SOURCE
feature test macro to explicitly enable that
feature.
#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#include <sys/mman.h>
The POSIX way to do this is to instead map /dev/zero
. So, wanting to
be Mr. Portable, this is what I did in my tool. Take careful note of
this.
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
void *
anon_alloc(size_t len)
{
int fd = open("/dev/zero", O_RDWR);
if (fd == -1)
return 0;
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_PRIVATE;
void *p = mmap(0, len, prot, flags, fd, 0);
close(fd);
return p != MAP_FAILED ? p : 0;
}
Aligned allocation
Another, less common (and less portable) strategy is to lean on the
existing C memory allocator, being careful to allocate on page
boundaries so that the page protections don’t affect other allocations.
The classic allocation functions, like malloc(3)
, don’t allow for this
kind of control. However, there are a couple of aligned allocation
alternatives.
The first is posix_memalign(3)
:
int posix_memalign(void **ptr, size_t alignment, size_t size);
By choosing page alignment and a size that’s a multiple of the page
size, it’s guaranteed to return whole pages. When done, pages are freed
with free(3)
. Though, unlike unmapping, the original page protections
must first be restored since those pages may be reused.
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>
void *
anon_alloc(size_t len)
{
void *p;
long pagesize = sysconf(_SC_PAGE_SIZE); // TODO: cache this
size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
return posix_memalign(&p, pagesize, roundup) ? 0 : p;
}
If you’re using C11, there’s also aligned_alloc(3)
. This is the most
uncommon of all since most C programmers refuse to switch to a new
standard until it’s at least old enough to drive a car.
Changing page protections
So we’ve allocated our memory, but it’s not going to start in an
executable state. Why? Because a W^X (“write xor execute”)
policy is becoming increasingly common. Attempting to set both write
and execute protections at the same time may be denied. (In fact,
there’s an SELinux policy for this.)
As a JIT compiler, we need to write to a page and execute it. Again,
there are two strategies. The complicated strategy is to map the same
memory at two different places, one with the execute protection,
one with the write protection. This allows the page to be modified as
it’s being executed without violating W^X.
The simpler and more secure strategy is to write the machine
instructions, then swap the page over to executable using mprotect(2)
once it’s ready. This is what I was doing in my tool.
unsigned char *buf = anon_alloc(len);
/* ... write instructions into the buffer ... */
mprotect(buf, len, PROT_EXEC);
jit_func func = (jit_func)buf;
func();
At a high level, That’s pretty close to what I was actually doing. That
includes neglecting to check the result of mprotect(2)
. This worked
fine and dandy for several years, when suddenly (shown here in the style
of strace):
mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)
Then the program would crash trying to execute the buffer. Suddenly it
wasn’t allowed to make this buffer executable. My program hadn’t
changed. What had changed was the SELinux security policy on this
particular system.
Asking for help
The problem is that I don’t administer this (Red Hat) system. I can’t
access the logs and I didn’t set the policy. I don’t have any insight
on why this call was suddenly being denied. To make this more
challenging, the folks that manage this system didn’t have the
necessary knowledge to help with this either.
So to figure this out, I need to treat it like a black box and probe
at system calls until I can figure out just what SELinux policy I’m up
against. I only have practical experience administrating Debian
systems (and its derivatives like Ubuntu), which means I’ve hardly
ever had to deal with SELinux. I’m flying fairly blind here.
Since my real application is large and complicated, I code up a
minimal example, around a dozen lines of code: allocate a single page
of memory, write a single return (ret
) instruction into it, set it
as executable, and call it. The program checks for errors, and I can
run it under strace if that’s not insightful enough. This program is
also something simple I could provide to the system administrators,
since they were willing to turn some of the knobs to help narrow down
the problem.
However, here’s where I made a major mistake. Assuming the problem
was solely in mprotect(2)
, and wanting to keep this as absolutely
simple as possible, I used posix_memalign(3)
to allocate that page. I
saw the same EACCES
as before, and assumed I was demonstrating the
same problem. Take note of this, too.
Finding a resolution
Eventually I’d need to figure out what policy was blocking my JIT
compiler, then see if there was an alternative route. The system
loader still worked after all, and I could plainly see that with
strace. So it wasn’t a blanket policy that completely blocked the
execute protection. Perhaps the loader was given an exception?
However, the very first order of business was to actually check the
result from mprotect(2)
and do something more graceful rather than
crash. In my case, that meant falling back to executing a byte-code
virtual machine. I added the check, and now the program ran slower
instead of crashing.
The program runs on both Linux and Windows, and the allocation and
page protection management is abstracted. On Windows it uses
VirtualAlloc()
and VirtualProtect()
instead of mmap(2)
and
mprotect(2)
. Neither implementation checked that the protection
change succeeded, so I fixed the Windows implementation while I was at
it.
Thanks to Mingw-w64, I actually do most of my Windows
development on Linux. And, thanks to Wine, I mean
everything, including running and debugging. Calling
VirtualProtect()
in Wine would ultimately call mprotect(2)
in the
background, which I expected would be denied. So running the Windows
version with Wine under this SELinux policy would be the perfect test.
Right?
Except that mprotect(2)
succeeded under Wine! The Windows version
of my JIT compiler was working just fine on Linux. Huh?
This system doesn’t have Wine installed. I had built and packaged it
myself. This Wine build definitely has no SELinux exceptions.
Not only did the Wine loader work correctly, it can change page
protections in ways my own Linux programs could not. What’s different?
Debugging this with all these layers is starting to look silly, but
this is exactly why doing Windows development on Linux is so useful. I
run my program under Wine under strace:
$ strace wine ./mytool.exe
I study the system calls around mprotect(2)
. Perhaps there’s some
stricter alignment issue? No. Perhaps I need to include PROT_READ
?
No. The only difference I can find is they’re using the
MAP_ANONYMOUS
flag. So, armed with this knowledge, I modify my
minimal example to allocate 1024 pages instead of just one, and
suddenly it works correctly. I was most of the way to figuring this
all out.
Inside glibc allocation
Why did increasing the allocation size change anything? This is a
typical Linux system, so my program is linked against the GNU C
library, glibc. This library allocates memory from two places
depending on the allocation size.
For small allocations, glibc uses brk(2)
to extend the executable
image — i.e. to extend the .bss
section. These resources are not
returned to the operating system after they’re freed with free(3)
.
They’re reused.
For large allocations, glibc uses mmap(2)
to create a new, anonymous
mapping for that allocation. When freed with free(3)
, that memory is
unmapped and its resources are returned to the operating system.
By increasing the allocation size, it became a “large” allocation and
was backed by an anonymous mapping. Even though I didn’t use mmap(2)
,
to the operating system this would be indistinguishable to what Wine was
doing (and succeeding at).
Consider this little example program:
int
main(void)
{
printf("%p\n", malloc(1));
printf("%p\n", malloc(1024 * 1024));
}
When not compiled as a Position Independent Executable (PIE), here’s
what the output looks like. The first pointer is near where the program
was loaded, low in memory. The second pointer is a randomly selected
address high in memory.
And if you run it under strace, you’ll see that the first allocation
comes from brk(2)
and the second comes from mmap(2)
.
Two SELinux policies
With a little bit of research, I found the two SELinux policies
at play here. In my minimal example, I was blocked by allow_execheap
.
/selinux/booleans/allow_execheap
This prohibits programs from setting the execute protection on any
“heap” page.
The POSIX specification does not permit it, but the Linux
implementation of mprotect
allows changing the access protection of
memory on the heap (e.g., allocated using malloc
). This error
indicates that heap memory was supposed to be made executable. Doing
this is really a bad idea. If anonymous, executable memory is needed
it should be allocated using mmap
which is the only portable
mechanism.
Obviously this is pretty loose since I was still able to do it with
posix_memalign(3)
, which, technically speaking, allocates from the
heap. So this policy applies to pages mapped by brk(2)
.
The second policy was allow_execmod
.
/selinux/booleans/allow_execmod
The program mapped from a file with mmap
and the MAP_PRIVATE
flag
and write permission. Then the memory region has been written to,
resulting in copy-on-write (COW) of the affected page(s). This memory
region is then made executable […]. The mprotect
call will fail
with EACCES
in this case.
I don’t understand what purpose this policy serves, but this is what
was causing my original problem. Pages mapped to /dev/zero
are not
actually considered anonymous by Linux, at least as far as this
policy is concerned. I think this is a mistake, and that mapping the
special /dev/zero
device should result in effectively anonymous
pages.
From this I learned a little lesson about baking assumptions — that
mprotect(2)
was solely at fault — into my minimal debugging examples.
And the fix was ultimately easy: I just had to suck it up and use the
slightly less pure MAP_ANONYMOUS
flag.