nullprogram.com/blog/2018/06/23/
The ptrace(2)
(“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing strace — system call trace. With Ptrace, tracers
can pause tracees, inspect and set registers and memory, monitor
system calls, or even intercept system calls.
By intercept, I mean that the tracer can mutate system call arguments,
mutate the system call return value, or even block certain system calls.
Reading between the lines, this means a tracer can fully service system
calls itself. This is particularly interesting because it also means a
tracer can emulate an entire foreign operating system. This is done
without any special help from the kernel beyond Ptrace.
The catch is that a process can only have one tracer attached at a time,
so it’s not possible emulate a foreign operating system while also
debugging that process with, say, GDB. The other issue is that emulated
systems calls will have higher overhead.
For this article I’m going to focus on Linux’s Ptrace on
x86-64, and I’ll be taking advantage of a few Linux-specific extensions.
For the article I’ll also be omitting error checks, but the full source
code listings will have them.
You can find runnable code for the examples in this article here:
https://github.com/skeeto/ptrace-examples
strace
Before getting into the really interesting stuff, let’s start by
reviewing a bare bones implementation of strace. It’s no
DTrace, but strace is still incredibly useful.
Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The ptrace(2)
prototype generally looks something like this, though the specific
types may be different.
long ptrace(int request, pid_t pid, void *addr, void *data);
The pid
is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.
The request
field selects a specific Ptrace function, just like the
ioctl(2)
interface. For strace, only two are needed:
PTRACE_TRACEME
: This process is to be traced by its parent.
PTRACE_SYSCALL
: Continue, but stop at the next system call
entrance or exit.
PTRACE_GETREGS
: Get a copy of the tracee’s registers.
The other two fields, addr
and data
, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.
The strace interface is essentially a prefix to another command.
$ strace [strace options] program [arguments]
My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is fork(2)
and exec(2)
the
tracee process on the tail of argv
. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.
pid_t pid = fork();
switch (pid) {
case -1: /* error */
FATAL("%s", strerror(errno));
case 0: /* child */
ptrace(PTRACE_TRACEME, 0, 0, 0);
execvp(argv[1], argv + 1);
FATAL("%s", strerror(errno));
}
The parent waits for the child’s PTRACE_TRACEME
using wait(2)
. When
wait(2)
returns, the child will be paused.
Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
PTRACE_O_TRACEFORK
.
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);
All that’s left is a simple, endless loop that catches on system calls
one at a time. The body of the loop has four steps:
- Wait for the process to enter the next system call.
- Print a representation of the system call.
- Allow the system call to execute and wait for the return.
- Print the system call return value.
The PTRACE_SYSCALL
request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
wait(2)
is needed to wait for the tracee to enter the desired state.
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
When wait(2)
returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, the operating system has not yet serviced this system call.
This detail will be important later.
The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, the system call number is
passed in rax
, and the arguments (up to 6) are passed in
rdi
, rsi
, rdx
, r10
, r8
, and r9
. Reading the registers is
another Ptrace call, though there’s no need to wait(2)
since the
tracee isn’t changing state.
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
long syscall = regs.orig_rax;
fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
syscall,
(long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
(long)regs.r10, (long)regs.r8, (long)regs.r9);
There’s one caveat. For internal kernel purposes, the system
call number is stored in orig_rax
rather than rax
. All the other
system call arguments are straightforward.
Next it’s another PTRACE_SYSCALL
and wait(2)
, then another
PTRACE_GETREGS
to fetch the result. The result is stored in rax
.
ptrace(PTRACE_GETREGS, pid, 0, ®s);
fprintf(stderr, " = %ld\n", (long)regs.rax);
The output from this simple program is very crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use process_vm_readv(2)
to
read those buffers from the tracee in order to print them appropriately.
However, this does lay the groundwork for system call interception.
System call interception
Suppose we want to use Ptrace to implement something like OpenBSD’s
pledge(2)
, in which a process pledges to use only a
restricted set of system calls. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they processing input
and only a small set of system calls are needed.
Before entering this main loop, a process can limit itself to the few
operations that it needs. If the program has a flaw allowing it
to be exploited by bad input, the pledge significantly limits what the
exploit can accomplish.
Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call exit(2)
in
the tracer. Since it’s configured to also terminate the tracee.
Blocking the system call and allowing the child to continue is a
little trickier.
The tricky part is that there’s no way to abort a system call once
it’s started. When tracer returns from wait(2)
on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.
However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” EPERM
error
in errno
via the normal in-band signaling.
for (;;) {
/* Enter next system call */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
/* Is this system call permitted? */
int blocked = 0;
if (is_syscall_blocked(regs.orig_rax)) {
blocked = 1;
regs.orig_rax = -1; // set to invalid syscall
ptrace(PTRACE_SETREGS, pid, 0, ®s);
}
/* Run system call and stop on exit */
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
if (blocked) {
/* errno = EPERM */
regs.rax = -EPERM; // Operation not permitted
ptrace(PTRACE_SETREGS, pid, 0, ®s);
}
}
This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (open(2)
) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
to the tracee to dynamically drop privileges.
How could the tracee communicate to the tracer? Use an artificial
system call!
Creating an artificial system call
For my new pledge-like system call — which I call xpledge()
to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.
#define SYS_xpledge 10000
Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s pledge(2)
, which uses a string interface.
Actually designing robust and secure sets of privileges is really
complicated, as the pledge(2)
manpage shows. Here’s the entire
interface and implementation of the system call for the tracee:
#define _GNU_SOURCE
#include <unistd.h>
#define XPLEDGE_RDWR (1 << 0)
#define XPLEDGE_OPEN (1 << 1)
#define xpledge(arg) syscall(SYS_xpledge, arg)
If it passes zero for the argument, only a few basic system calls are
allowed, including those used to allocate memory (e.g. brk(2)
). The
PLEDGE_RDWR
bit allows various read and write system calls
(read(2)
, readv(2)
, pread(2)
, preadv(2)
, etc.). The
PLEDGE_OPEN
bit allows open(2)
.
To prevent privileges from being escalated back, pledge()
blocks
itself — though this also prevents dropping more privileges later down
the line.
In the xpledge tracer, I just need to check for this system call:
/* Handle entrance */
switch (regs.orig_rax) {
case SYS_pledge:
register_pledge(regs.rdi);
break;
}
The operating system will return ENOSYS
(Function not implemented)
since this isn’t a real system call. So on the way out I overwrite
this with a success (0).
/* Handle exit */
switch (regs.orig_rax) {
case SYS_pledge:
ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
break;
}
I wrote a little test program that opens /dev/urandom
, makes a read,
tries to pledge, then tries to open /dev/urandom
a second time, then
confirms it can read from the original /dev/urandom
file descriptor.
Running without a pledge tracer, the output looks like this:
$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
Making an invalid system call doesn’t crash an application. It just
fails, which is a rather convenient fallback. When run under the
tracer, it looks like this:
$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
The pledge succeeds but the second fopen(3)
does not since the tracer
blocked it with EPERM
.
This concept could be taken much further, to, say, change file paths or
return fake results. A tracer could effectively chroot its tracee,
prepending some chroot path to the root of any path passed through a
system call. It could even lie to the process about what user it is,
claiming that it’s running as root. In fact, this is exactly how the
Fakeroot NG program works.
Foreign system emulation
Suppose you don’t just want to intercept some system calls, but
all system calls. You’ve got a binary intended to run on another
operating system, so none of the system calls it makes will ever
work.
You could manage all this using only what I’ve described so far. The
tracer would always replace the system call number with a dummy, allow
it to fail, then service the system call itself. But that’s really
inefficient. That’s essentially three context switches for each system
call: one to stop on the entrance, one to make the always-failing
system call, and one to stop on the exit.
The Linux version of PTrace has had a more efficient operation for
this technique since 2005: PTRACE_SYSEMU
. PTrace stops only once
per a system call, and it’s up to the tracer to service that system
call before allowing the tracee to continue.
for (;;) {
ptrace(PTRACE_SYSEMU, pid, 0, 0);
waitpid(pid, 0, 0);
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, ®s);
switch (regs.orig_rax) {
case OS_read:
/* ... */
case OS_write:
/* ... */
case OS_open:
/* ... */
case OS_exit:
/* ... */
/* ... and so on ... */
}
}
To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this PTRACE_SYSEMU
tracer, a loader (to take the place of exec(2)
), and whatever system
libraries the binary needs (or only run static binaries).
In fact, this sounds like a fun weekend project.
See also