Years ago, OpenBSD gained two new security system calls, pledge(2)
(originally tame(2)) and unveil(2). In both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where type
safety isn’t an issue, such as Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?
As discussed previously, it’s quite easy to access C APIs from
Python through its ctypes
package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: openbsd.py.
I’ve chosen these extra constraints:
These are extra safety features, unnecessary for correctness: attempts to call these functions on systems where they don’t exist will silently do nothing, as though they had succeeded. They’re provided as a best effort.
Systems other than OpenBSD may support these functions, now or in the future, and it would be nice to automatically make use of them when available. This means no checking for OpenBSD specifically but instead feature sniffing for their presence.
The interfaces should be Pythonic as though they were implemented in Python itself. Raise exceptions for errors, and accept strings since they’re more convenient than bytes.
For reference, here are the function prototypes:
int pledge(const char *promises, const char *execpromises);
int unveil(const char *path, const char *permissions);
The string-oriented interface of pledge
will make this a whole
lot easier to implement.
The first step is to grab functions through ctypes
. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.
I came up with passing None to ctypes.CDLL
, which ultimately just passes
NULL
to dlopen(3)
. That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
pledge
attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for ctypes
.
_pledge = None
try:
    _pledge = ctypes.CDLL(None, use_errno=True).pledge
    _pledge.restype = ctypes.c_int
    _pledge.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _pledge = None
Catching a broad Exception isn’t great, but it’s the best we can do since the documentation is incomplete. From this block I’ve seen TypeError, AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if there are more possibilities, and I don’t want to risk missing them.
Note that I’m catching Exception rather than using a bare except
. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.
The same story for unveil
:
_unveil = None
try:
    _unveil = ctypes.CDLL(None, use_errno=True).unveil
    _unveil.restype = ctypes.c_int
    _unveil.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _unveil = None
The next and final step is to wrap the low-level calls in interfaces that hide their C and ctypes nature.
Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be NULL
, so None is allowed.
def pledge(promises: Optional[str], execpromises: Optional[str]):
    if not _pledge:
        return  # unimplemented
    r = _pledge(None if promises is None else promises.encode(),
                None if execpromises is None else execpromises.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
As usual, a return of -1 means there was an error, in which case we fetch
errno
and raise the appropriate OSError.
unveil
works a little differently since the first argument is a path.
Python functions that accept paths, such as open
, generally accept
either strings or bytes. On unix-like systems, paths are fundamentally
bytestrings and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, these functions accept both.
The unveil
wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.
def unveil(path: Union[str, bytes, None], permissions: Optional[str]):
    if not _unveil:
        return  # unimplemented
    r = _unveil(path.encode() if isinstance(path, str) else path,
                None if permissions is None else permissions.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
That’s it!
Let’s start with unveil
. Initially a process has access to the whole
file system with the usual restrictions. On the first call to unveil
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final NULL
which locks it in place for the rest of
the process’s existence.
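For example, a program might reveal a few paths with different permission strings before locking the list. A sketch — the paths here are made up:

unveil("/etc/ssl", "r")            # read-only: TLS certificates
unveil("/var/myapp/data", "rwc")   # hypothetical data dir: read, write, create
unveil(None, None)                 # lock it in for good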
Suppose a program has been tricked into accessing your shell history, perhaps by mishandling a path:
def hackme():
    try:
        with open(pathlib.Path.home() / ".bash_history"):
            print("You've been hacked!")
    except FileNotFoundError:
        print("Blocked by unveil.")

hackme()
If you’re a Bash user, this prints:
You've been hacked!
Using our new feature to restrict the program’s access first:
# restrict access to static program data
unveil("/usr/share", "r")
unveil(None, None)
hackme()
On OpenBSD this now prints:
Blocked by unveil.
Working just as it should!
With pledge
we declare what abilities we’d like to keep by supplying a
list of promises, pledging to use only those abilities afterward. A
common case is the stdio
promise which allows reading and writing of
open files, but not opening files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.
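Using the wrapper above, that pattern takes only a few lines. A sketch, with a made-up log path:

log = open("/tmp/myapp.log", "a")  # hypothetical log, opened during init
pledge("stdio", None)              # keep read/write on open files, drop the rest
log.write("still works\n")         # allowed: the file is already open
log.flush()
open("/etc/passwd")                # disallowed: the process dies here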
An invalid or unknown promise is an error. Does that work?
>>> pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument
So far so good. How about the functionality itself?
pledge("stdio", None)
hackme()
The program is instantly killed when making the disallowed system call:
Abort trap (core dumped)
If you want something a little softer, include the error
promise:
pledge("stdio error", None)
hackme()
Instead it’s an exception, which is a lot easier to debug from Python:
OSError: [Errno 78] Function not implemented
The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about pledge in complex runtimes like Python’s, which may
reasonably need to do many arbitrary, undocumented things at any time.
Most Bourne-like shells support a special RANDOM environment
variable that evaluates to a random value between 0 and 32,767 (i.e.
15 bits). Assignment to the variable seeds the generator. This variable
is an extension and did not appear in the original Unix Bourne
shell. Despite this, the different Bourne-like shells that implement
it have converged to the same interface, but only the interface.
Each implementation differs in interesting ways. In this article we’ll
explore how $RANDOM
is implemented in various Bourne-like shells.
Unfortunately I was unable to determine the origin of $RANDOM.
Nobody was doing a good job tracking source code changes before the
mid-1990s, so that history appears to be lost. Bash was first released
in 1989, but the earliest version I could find was 1.14.7, released in 1996.
KornShell was first released in 1983, but the earliest source I could
find was from 1993. In both cases $RANDOM already existed. My
guess is that it first appeared in one of these two shells, probably
KornShell.
Update: Quentin Barnes has informed me that his 1986 copy of
KornShell (a.k.a. ksh86) implements $RANDOM
. This predates Bash and
makes it likely that this feature originated in KornShell.
Of all the shells I’m going to discuss, Bash has the most interesting
history. It never made use of srand(3) / rand(3) and instead
uses its own generator — which is generally what I prefer. Prior
to Bash 4.0, it used the crummy linear congruential generator (LCG)
found in the C89 standard:
static unsigned long rseed = 1;

static int
brand ()
{
  rseed = rseed * 1103515245 + 12345;
  return ((unsigned int)((rseed >> 16) & 32767));
}
For some reason it was naïvely decided that $RANDOM
should never
produce the same value twice in a row. The caller of brand()
filters
the output and discards repeats before returning to the shell script.
This actually reduces the quality of the generator further since it
increases correlation between separate outputs.
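The filter amounts to something like this (my own sketch of the idea, not Bash’s code):

def no_repeats(gen):
    """Discard any output equal to the previous one, like $RANDOM."""
    last = None
    for value in gen:
        if value != last:
            last = value
            yield value

Since consecutive outputs can never match, each output rules out one of the 32,768 possibilities for the next — exactly the extra correlation described above.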
When the shell starts up, rseed
is seeded from the PID and the current
time in seconds. These values are literally summed and used as the seed.
/* Note: not the literal code, but equivalent. */
rseed = getpid() + time(0);
Subshells, which fork and initially share an rseed
, are given similar
treatment:
rseed = rseed + getpid() + time(0);
Notice there’s no hashing or mixing of these values, so there’s no avalanche effect. That would have prevented shells that start around the same time from having related initial random sequences.
With Bash 4.0, released in 2009, the algorithm was changed to a Park–Miller multiplicative LCG from 1988:
static int
brand ()
{
  long h, l;

  /* can't seed with 0. */
  if (rseed == 0)
    rseed = 123459876;
  h = rseed / 127773;
  l = rseed % 127773;
  rseed = 16807 * l - 2836 * h;
  return ((unsigned int)(rseed & 32767));
}
There’s actually a subtle mistake in this implementation compared to the generator described in the paper. This function will generate different numbers than the paper, and it will generate different numbers on different hosts! More on that later.
This algorithm is a much better choice than the previous LCG. There were many more options available in 2009 compared to 1989, but, honestly, this generator is pretty reasonable for this application. Bash is so slow that you’re never practically going to generate enough numbers for the small state to matter. Since the Park–Miller algorithm is older than Bash, they could have used this in the first place.
I considered submitting a patch to switch to something more modern.
However, given Bash’s constraints, it’s easier said than done.
Portability to weird systems is still a concern, and I expect they’d
reject a patch that started making use of long long
in the PRNG.
They still support pre-ANSI C compilers that don’t have 64-bit
arithmetic.
However, one thing that really could still be improved is the seeding. In Bash 4.x here’s what it looks like:
static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (tv.tv_sec ^ tv.tv_usec ^ getpid ());
}
Seeding is both better and worse. It’s better that it’s seeded from a higher resolution clock (microseconds), so two shells started close in time have more variation. However, it’s “mixed” with XOR, which, in this case, is worse than addition.
For example, imagine two Bash shells started one microsecond apart. Both
tv_usec
and getpid()
are incremented by one. Those increments are
likely to cancel each other out by an XOR, and they end up with the same
seed.
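The cancellation is easy to demonstrate with made-up values (both even, so the increments don’t carry):

t, p = 413802, 27182       # time and PID for the first shell
seed1 = t ^ p              # first shell's mix
seed2 = (t + 1) ^ (p + 1)  # second shell: one microsecond later, next PID
assert seed1 == seed2      # the increments cancelled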
Instead, each of those quantities should be hashed before mixing. Here’s
a rough example using my triple32()
hash (adapted to glorious
GNU-style pre-ANSI C):
static unsigned long
hash32 (x)
     unsigned long x;
{
  x ^= x >> 17;
  x *= 0xed5ad4bbUL;
  x &= 0xffffffffUL;
  x ^= x >> 11;
  x *= 0xac4c1b51UL;
  x &= 0xffffffffUL;
  x ^= x >> 15;
  x *= 0x31848babUL;
  x &= 0xffffffffUL;
  x ^= x >> 14;
  return x;
}

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (hash32 (tv.tv_sec) ^
          hash32 (hash32 (tv.tv_usec) ^ getpid ()));
}
I had said there’s a mistake in the Bash implementation of Park–Miller. Take a closer look at the types and the assignment to rseed:
/* The variables */
long h, l;
unsigned long rseed;
/* The assignment */
rseed = 16807 * l - 2836 * h;
The result of the subtraction can be negative, and that negative
value is converted to unsigned long
. The C standard says
ULONG_MAX + 1
is added to make the value positive. ULONG_MAX
varies by platform — typically long
is either 32 bits or 64 bits —
so the results also vary. Here’s how the paper defined it:
long test;
test = 16807 * l - 2836 * h;
if (test > 0)
    rseed = test;
else
    rseed = test + 2147483647;
As far as I can tell, this mistake doesn’t hurt the quality of the generator, but it does make the output platform-dependent. Seeding with 127773 sets h to 1 and l to 0, so the next rseed is -2836, which wraps to a different unsigned value depending on the width of long. The low 15 bits happen to agree, so the first outputs match, but the two streams diverge from there:
$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634
$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
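You don’t need two differently-compiled Bash binaries to see this. Here’s a quick Python sketch of the same arithmetic (my own translation of brand() above, not Bash’s code), masking to the width of the C unsigned long:

def brand(rseed, bits):
    """One step of Bash 4.x's generator with an unsigned long of `bits` bits."""
    h, l = rseed // 127773, rseed % 127773
    rseed = (16807 * l - 2836 * h) % (1 << bits)  # C unsigned wraparound
    return rseed, rseed & 32767

for bits in 32, 64:
    rseed, outputs = 127773, []
    for _ in range(2):
        rseed, value = brand(rseed, bits)
        outputs.append(value)
    print(bits, outputs)  # 32 [29932, 13634], then 64 [29932, 29115]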
In contrast to Bash, Zsh is the most straightforward: defer to
rand(3)
. Its $RANDOM
can return the same value twice in a row,
assuming that rand(3)
does.
zlong
randomgetfn(UNUSED(Param pm))
{
    return rand() & 0x7fff;
}

void
randomsetfn(UNUSED(Param pm), zlong v)
{
    srand((unsigned int)v);
}
A cool side effect is that you can override it with a custom generator if you want:
int
rand(void)
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}
Usage:
$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4
This trick also applies to the rest of the shells below.
KornShell originated in 1983, but it was finally released under an open source license in 2005. There’s a clone of KornShell called Public Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but I’ll get to that next.
KornShell defers to rand(3)
, but it does some additional naïve
filtering on the output. When the shell starts up, it generates 10
values from rand()
. If any of them is larger than 32,767, it shifts all generated numbers right by three.
#define RANDMASK 0x7fff

for (n = 0; n < 10; n++) {
    // Don't use lower bits when rand() generates large numbers.
    if (rand() > RANDMASK) {
        rand_shift = 3;
        break;
    }
}
Why not just look at RAND_MAX
? I guess they didn’t think of it.
Update: Quentin Barnes pointed out that RAND_MAX
didn’t exist
until POSIX standardization in 1988. The constant first appeared in
Unix in 1990. This KornShell code either predates the standard
or needed to work on systems that predate the standard.
Like Bash, repeated values are not allowed. I suspect one shell got this idea from the other.
do {
    cur = (rand() >> rand_shift) & RANDMASK;
} while (cur == last);
Who came up with this strange idea first?
I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I
ever touch in practice, and its $RANDOM
is the most interesting of the
pdksh forks — at least since 2014.
Like Zsh, pdksh simply defers to rand(3)
. However, OpenBSD’s rand(3)
is infamously and proudly non-standard. By default it returns
non-deterministic, cryptographic-quality results seeded from system
entropy (via the misnamed arc4random(3)
), à la /dev/urandom
.
Its $RANDOM
inherits this behavior.
setint(vp, (int64_t) (rand() & 0x7fff));
However, if a value is assigned to $RANDOM
in order to seed it, it
reverts to its old pre-2014 deterministic generation via
srand_deterministic(3)
.
srand_deterministic((unsigned int)intval(vp));
OpenBSD’s deterministic rand(3)
is the crummy LCG from the C89
standard, just like Bash 3.x. So if you assign to $RANDOM
, you’ll get
nearly the same results as Bash 3.x and earlier — the only difference
being that it can repeat numbers.
That’s a slick upgrade to the old interface without breaking anything,
making it my favorite version of $RANDOM in any shell.
The ptrace(2) (“process trace”) system call is usually associated with
(“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing strace — system call trace. With Ptrace, tracers
can pause tracees, inspect and set registers and memory, monitor
system calls, or even intercept system calls.
By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.
The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.
For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. I’ll also be omitting error checks here, but the full source code listings have them.
You can find runnable code for the examples in this article here:
https://github.com/skeeto/ptrace-examples
Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.
Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The ptrace(2)
prototype generally looks something like this, though the specific
types may be different.
long ptrace(int request, pid_t pid, void *addr, void *data);
The pid
is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.
The request
field selects a specific Ptrace function, just like the
ioctl(2)
interface. For strace, only three are needed:

PTRACE_TRACEME: This process is to be traced by its parent.
PTRACE_SYSCALL: Continue, but stop at the next system call entrance or exit.
PTRACE_GETREGS: Get a copy of the tracee’s registers.

The other two fields, addr and data, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.
The strace interface is essentially a prefix to another command.
$ strace [strace options] program [arguments]
My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is fork(2)
and exec(2)
the
tracee process on the tail of argv
. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.
pid_t pid = fork();
switch (pid) {
case -1: /* error */
    FATAL("%s", strerror(errno));
case 0: /* child */
    ptrace(PTRACE_TRACEME, 0, 0, 0);
    execvp(argv[1], argv + 1);
    FATAL("%s", strerror(errno));
}
The parent waits for the child’s PTRACE_TRACEME
using wait(2)
. When
wait(2)
returns, the child will be paused.
waitpid(pid, 0, 0);
Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
PTRACE_O_TRACEFORK
.
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);
All that’s left is a simple, endless loop that catches system calls one at a time. The body of the loop has four steps:

1. Wait for the process to enter the next system call.
2. Print a representation of the system call.
3. Allow the system call to execute and wait for the return.
4. Print the system call return value.
The PTRACE_SYSCALL
request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
wait(2)
is needed to wait for the tracee to enter the desired state.
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
When wait(2)
returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, the operating system has not yet serviced this system call.
This detail will be important later.
The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, the system call number is
passed in rax
, and the arguments (up to 6) are passed in
rdi
, rsi
, rdx
, r10
, r8
, and r9
. Reading the registers is
another Ptrace call, though there’s no need to wait(2)
since the
tracee isn’t changing state.
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;

fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
        syscall,
        (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
        (long)regs.r10, (long)regs.r8, (long)regs.r9);
There’s one caveat. For internal kernel purposes, the system
call number is stored in orig_rax
rather than rax
. All the other
system call arguments are straightforward.
Next it’s another PTRACE_SYSCALL
and wait(2)
, then another
PTRACE_GETREGS
to fetch the result. The result is stored in rax
.
ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);
The output from this simple program is very crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use process_vm_readv(2)
to
read those buffers from the tracee in order to print them appropriately.
However, this does lay the groundwork for system call interception.
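Incidentally, every Ptrace call here is reachable with the same ctypes trick discussed earlier for pledge(2), so a toy tracer doesn’t even require C. Here’s a rough Linux x86-64 sketch in Python (constants copied from Linux’s headers; error checking omitted):

import ctypes, os, sys

libc = ctypes.CDLL(None, use_errno=True)
libc.ptrace.restype = ctypes.c_long
libc.ptrace.argtypes = (ctypes.c_long,) * 4

PTRACE_TRACEME, PTRACE_PEEKUSER, PTRACE_SYSCALL = 0, 3, 24
ORIG_RAX = 15  # register slot index from <sys/reg.h> on x86-64

pid = os.fork()
if pid == 0:  # child: same as the C version
    libc.ptrace(PTRACE_TRACEME, 0, 0, 0)
    os.execvp(sys.argv[1], sys.argv[1:])

def stopped():
    """Wait on the tracee; False once it has exited."""
    return os.WIFSTOPPED(os.waitpid(pid, 0)[1])

stopped()  # the pause after PTRACE_TRACEME
while True:
    libc.ptrace(PTRACE_SYSCALL, pid, 0, 0)  # run to syscall entrance
    if not stopped():
        break
    nr = libc.ptrace(PTRACE_PEEKUSER, pid, 8 * ORIG_RAX, 0)
    libc.ptrace(PTRACE_SYSCALL, pid, 0, 0)  # run to syscall exit
    if not stopped():
        break
    print("syscall", nr)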
Suppose we want to use Ptrace to implement something like OpenBSD’s
pledge(2)
, in which a process pledges to use only a
restricted set of system calls. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input and need only a small set of system calls.
Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.
Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call exit(2) in the tracer, since the tracee is configured (via PTRACE_O_EXITKILL) to be terminated along with it.
Blocking the system call and allowing the child to continue is a
little trickier.
The tricky part is that there’s no way to abort a system call once
it’s started. When the tracer returns from wait(2) on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.
However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” EPERM
error
in errno
via the normal in-band signaling.
for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}
This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (open(2)
) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.
How could the tracee communicate to the tracer? Use an artificial system call!
For my new pledge-like system call — which I call xpledge()
to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.
#define SYS_xpledge 10000
Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s pledge(2)
, which uses a string interface.
Actually designing robust and secure sets of privileges is really
complicated, as the pledge(2)
manpage shows. Here’s the entire
interface and implementation of the system call for the tracee:
#define _GNU_SOURCE
#include <unistd.h>
#define XPLEDGE_RDWR (1 << 0)
#define XPLEDGE_OPEN (1 << 1)
#define xpledge(arg) syscall(SYS_xpledge, arg)
Passing zero for the argument allows only a few basic system calls, including those used to allocate memory (e.g. brk(2)). The XPLEDGE_RDWR bit allows various read and write system calls (read(2), readv(2), pread(2), preadv(2), etc.). The XPLEDGE_OPEN bit allows open(2).
To prevent privileges from being escalated back, xpledge() blocks
blocks
itself — though this also prevents dropping more privileges later down
the line.
In the xpledge tracer, I just need to check for this system call:
/* Handle entrance */
switch (regs.orig_rax) {
case SYS_xpledge:
    register_pledge(regs.rdi);
    break;
}
The operating system will return ENOSYS
(Function not implemented)
since this isn’t a real system call. So on the way out I overwrite
this with a success (0).
/* Handle exit */
switch (regs.orig_rax) {
case SYS_xpledge:
    ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
    break;
}
I wrote a little test program that opens /dev/urandom
, makes a read,
tries to pledge, then tries to open /dev/urandom
a second time, then
confirms it can read from the original /dev/urandom
file descriptor.
Running without a pledge tracer, the output looks like this:
$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:
$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
The pledge succeeds but the second fopen(3)
does not since the tracer
blocked it with EPERM
.
This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.
Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.
You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.
The Linux version of Ptrace has had a more efficient operation for this technique since 2005: PTRACE_SYSEMU. With it, the tracee stops only once per system call, and it’s up to the tracer to service that system call before allowing the tracee to continue.
for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
    case OS_read:
        /* ... */
    case OS_write:
        /* ... */
    case OS_open:
        /* ... */
    case OS_exit:
        /* ... */
    /* ... and so on ... */
    }
}
To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this PTRACE_SYSEMU
tracer, a loader (to take the place of exec(2)
), and whatever system
libraries the binary needs (or only run static binaries).
In fact, this sounds like a fun weekend project.
Software is installed in my home directory under ~/.local/bin, and the
package manager itself is just a 110 line Bourne shell script. It’s not
intended to replace the system’s package manager but, instead, to
complement it in some cases where I need more flexibility. I use it to
run custom versions of specific pieces of software — newer or older than
the system-installed versions, or with my own patches and modifications —
without interfering with the rest of the system, and without a need for
root access. It’s worked out really well so far and I expect to continue
making heavy use of it in the future.
It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):
Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.
I say it’s crude because its “install” (-I
) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or postinst
as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.
In other words, it doesn’t attempt to solve most of the hard problems tackled by package managers… except for three important issues:
It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.
Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.
It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.
The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.
The second caveat is that some builds tailor themselves by default to
the host (e.g. -march=native
). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.
While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install prefix to $HOME/.local
, running the build,
and then letting the “install” target do its thing.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.
A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing it on top usually works. Occasionally it
wouldn’t, and I’d have to blow away ~/.local
and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.
What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: DESTDIR
.
It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the DESTDIR
macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (--prefix
) configured before the build
began, the software is not expected to function properly when run in the
DESTDIR
location.
$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
A different tool is then used to copy these files into place and actually
install them. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
DESTDIR
(at the root of the actual install prefix):
$ tar czf package.tgz -C $DESTDIR$HOME/.local .
So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in ~/.local
.
$ cd $HOME/.local
$ tar xzf ~/package.tgz
But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s t
mode.
cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done
I’m using grep
to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If a file name contains a space character, it will be word-split incorrectly in the for loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets IFS
to just a newline.
If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.
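For what it’s worth, a tool that reads the archive structurally, rather than parsing tar’s text output, sidesteps the file name problem entirely. Here’s a sketch of the same uninstall logic in Python (not part of qpkg, which is pure Bourne shell):

import os, tarfile

def uninstall(package, prefix=os.path.expanduser("~/.local")):
    with tarfile.open(package) as tar:
        for member in tar:
            if not member.isdir():  # like grep -v '/$'
                try:
                    os.remove(os.path.join(prefix, member.name))
                except FileNotFoundError:
                    pass  # already gone; mimic rm -f

uninstall("package.tgz")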
There are five commands, each assigned to a capital letter: -B
, -C
,
-I
, -V
, and -U
. It’s an interface pattern inspired by Ted
Unangst’s signify (see signify(1)
). I also used this
pattern with Blowpipe and, in retrospect, wish I had also used
it with Enchive.
Build (-B)

Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by make
with the
appropriate -j
(jobs) option. It automatically sets the --prefix
argument when running the configure script.
If the build uses something other than an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.
Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.
Once qpkg
returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the qpkg
command:
$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
Since the second and third steps — creating the build directory and
moving into it — are so common, there’s an optional switch for them: -d.
This option’s argument is the build directory. qpkg
creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.
$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
With the software compiled, the next step is creating the package.
Create (-C)

The “create” command creates the DESTDIR
(_destdir
in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its x/
build directory:
$ qpkg -Cdx name
Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to make
in case there needs to be any additional tweaking.
When the “create” command finishes, there will be a new package named
name.tgz
in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.
$ rm -rf name-version/
$ rm -rf x/
This package is ready to install, though you may want to verify it first.
Verify (-V)

The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.
$ qpkg -V name.tgz
The most common conflict I’ve seen is in the info index (info/dir
)
file, which is safe to ignore since I don’t care about it.
If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.
Install (-I)

The “install” command is just the dumb tar xzf
explained above. It
will clobber anything in its way without warning, which is why, if that
matters, “verify” should be used first.
$ qpkg -I name.tgz
When qpkg
returns, the package has been installed and is probably
ready to go. A lot of packages complain that you need to run libtool to
finalize an installation, but I’ve never had a problem skipping it. This
dumb unpacking generally works fine.
Uninstall (-U)

Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.
$ qpkg -U name.tgz
Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.
I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.
Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages — binutils, the C library headers, a first-pass compiler, the C library itself, and the final compiler — a couple of which will conflict (one being a subset of another).
Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)
To deal with this, I added a “keep” (-k
) option that leaves the
DESTDIR
around after creating the package. To keep things tidy, the
intermediate packages exist and are installed, but the final, big
cross-compiler package accumulates into the DESTDIR
. The final
package at the end is actually the whole cross compiler in one package,
a superset of them all.
Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.
The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.
All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.
For some time Elfeed was experiencing a strange, spurious failure. Every so often users were seeing an error (spoiler warning) when updating feeds: “error in process sentinel: Search failed.” If you use Elfeed, you might have even seen this yourself. On the surface it appeared that curl, tasked with the responsibility for downloading feed data, was producing incomplete output despite reporting a successful run. Since the run was successful, Elfeed assumed certain data was in curl’s output buffer, but, since it wasn’t, it failed hard.
Unfortunately this issue was not reproducible. Manually running curl outside of Emacs never revealed any issues. Asking Elfeed to retry fetching the feeds would work fine. The issue would only randomly rear its head when Elfeed was fetching many feeds in parallel, under stress. By the time the error was discovered, the curl process had exited and vital debugging information was lost. Considering that this was likely to be a bug in Emacs itself, there really wasn’t a reliable way to capture the necessary debugging information from within Emacs Lisp. And, indeed, this later proved to be the case.
A quick-and-dirty workaround is to use condition-case to catch and
to catch and
swallow the error. When the bizarre issue shows up, rather than fail
badly in front of the user, Elfeed could attempt to swallow the error
— assuming it can be reliably detected — and treat the fetch as simply
a failure. That didn’t sit comfortably with me. Elfeed had done its
due diligence checking for errors already. Someone was lying to
Elfeed, and I intended to catch them with their pants on fire.
Someday.
I’d just need to witness the bug on one of my own machines. Elfeed is part of my daily routine, so surely I’d have to experience this issue myself someday. My plan was, should that day come, to run a modified Elfeed, instrumented to capture extra data. I would have also routinely run Emacs under GDB so that I could inspect the failure more deeply.
For now I just had to wait to hunt that zebra.
Over the holidays I re-discovered Bryan Cantrill, a systems software engineer who worked for Sun between 1996 and 2010, and is most well known for DTrace. My first exposure to him was in a BSD Now interview in 2015. I had re-watched that interview and decided there was a lot more I had to learn from him. He’s become a personal hero to me. So I scoured the internet for more of his writing and talks. Besides what I’ve already linked in this article, here are a couple more great presentations:
You can also find some of his writing scattered around the DTrace blog.
Some interesting operating system technology came out of Sun during its final 15 or so years — most notably DTrace and ZFS — and Bryan speaks about it passionately. Almost as a matter of luck, most of it survived the Oracle acquisition thanks to Sun releasing it as open source in just the nick of time. Otherwise it would have been lost forever. The scattered ex-Sun employees, still passionate about their prior work at Sun, along with some of their old customers have since picked up the pieces and kept going as a community under the name illumos. It’s like an open source flotilla.
Naturally I wanted to get my hands on this stuff to try it out for myself. Is it really as good as they say? Normally I stick to Linux, but it (generally) doesn’t have these Sun technologies. The main reason is license incompatibility. Sun released its code under the CDDL, which is incompatible with the GPL. Ubuntu does infamously include ZFS, but other distributions are unwilling to take that risk. Porting DTrace is a serious undertaking since it’s got its fingers throughout the kernel, which also makes the licensing issues even more complicated.
(Update February 2018: DTrace has been released under the GPLv2, allowing it to be legally integrated with Linux.)
Linux has a reputation for Not Invented Here (NIH) syndrome, and these
licensing issues certainly contribute to that. Rather than adopt ZFS
and DTrace, they’ve been reinvented from scratch: btrfs instead of
ZFS, and a slew of partial options instead of DTrace.
Normally I’m most interested in system call tracing, and my go-to is
strace, though it certainly has its limitations — including
this situation of debugging curl under Emacs. Another famous example
of NIH is Linux’s epoll(2)
, which is a broken
version of BSD kqueue(2)
.
So, if I want to try these for myself, I’ll need to install a different operating system. I’ve dabbled with OmniOS, an OS built on illumos, in virtual machines, using it as an alien environment to test some of my software (e.g. enchive). OmniOS has a philosophy called Keep Your Software To Yourself (KYSTY), which is really just code for “we don’t do packaging.” Honestly, you can’t blame them since they’re a tiny community. The best solution to this is probably pkgsrc, which is essentially a universal packaging system. Otherwise you’re on your own.
There’s also openindiana, which is a more friendly desktop-oriented illumos distribution. Still, the short of it is that you’re very much on your own when things don’t work. The situation is like running Linux a couple decades ago, when it was still difficult to do.
If you’re interested in trying DTrace, the easiest option these days is probably FreeBSD. It’s got a big, active community, thorough documentation, and a huge selection of packages. Its license (the BSD license, duh) is compatible with the CDDL, so both ZFS and DTrace have been ported to FreeBSD.
I’ve done all this talking but haven’t yet described what DTrace really is. I won’t pretend to write my own tutorial, but I’ll provide enough information to follow along. DTrace is a tracing framework for debugging production systems in real time, both for the kernel and for applications. The “production systems” part means it’s stable and safe — using DTrace won’t put your system at risk of crashing or damaging data. The “real time” part means it has little impact on performance. You can use DTrace on live, active systems with little impact. Both of these core design principles are vital for troubleshooting those really tricky bugs that only show up in production.
There are DTrace probes scattered all throughout the system: on system calls, scheduler events, networking events, process events, signals, virtual memory events, etc. Using a specialized language called D (unrelated to the general purpose programming language D), you can dynamically add behavior at these instrumentation points. Generally the behavior is to capture information, but it can also manipulate the event being traced.
Each probe is fully identified by a 4-tuple delimited by colons:
provider, module, function, and probe name. An empty element denotes a
sort of wildcard. For example, syscall::open:entry
is a probe at the
beginning (i.e. “entry”) of open(2)
. syscall:::entry
matches all
system call entry probes.
Unlike strace on Linux which monitors a specific process, DTrace applies to the entire system when active. To run curl under strace from Emacs, I’d have to modify Emacs’ behavior to do so. With DTrace I can instrument every curl process without making a single change to Emacs, and with negligible impact to Emacs. That’s a big deal.
So, when it comes to this Elfeed issue, FreeBSD is much better poised for debugging the problem. All I have to do is catch it in the act. However, it’s been months since that bug report and I’m not really making this connection yet. I’m just hoping I eventually find an interesting problem where I can apply DTrace.
So I’ve settled on FreeBSD as the playground for these technologies; I just have to decide where. I could always run it in a virtual machine, but it’s always more interesting to try things out on real hardware. FreeBSD supports the Raspberry Pi 2 as a Tier 2 system, and I had a Raspberry Pi 2 sitting around collecting dust, so I put it to use.
I wrote the image to an SD card, and for a few days I stretched my legs on this new system. I cloned a couple dozen of my own git repositories, ran the builds and the tests, and just got a feel for things. I tried out the ports system for the first time, mainly to discover that the low-powered Raspberry Pi 2 takes days to build some of the packages I want to try.
I mostly program in Vim these days, so it’s some days before I even set up Emacs. Eventually I do build Emacs, clone my configuration, fire it up, and give Elfeed a spin.
And that’s when the “search failed” bug strikes! Not just once, but dozens of times. Perfect! This low-powered platform is the jackpot for this particular bug, triggering it left and right. Given that I’ve got DTrace at my disposal, it’s the perfect place to debug this. Something is lying to Elfeed and DTrace will play the judge.
Before I dive in I see three possibilities:

1. curl is truncating its output but reporting success anyway.
2. Emacs is dropping some of curl’s output on its way to Elfeed.
3. Elfeed is wrong about what curl’s output should contain.

With DTrace I can observe what every curl process writes to Emacs, and I can also double check curl’s exit status. I come up with the following (newbie) DTrace script:
syscall::write:entry
/execname == "curl"/
{
    printf("%d WRITE %d \"%s\"\n",
           pid, arg2, stringof(copyin(arg1, arg2)));
}

syscall::exit:entry
/execname == "curl"/
{
    printf("%d EXIT %d\n", pid, arg0);
}
The /execname == "curl"/
is a predicate that (obviously) causes the
behavior to only fire for curl processes. The first probe has DTrace
print a line for every write(2)
from curl. arg0
, arg1
, and
arg2
correspond to the arguments of write(2)
: fd, buf, count. It
logs the process ID (pid) of the write, the length of the write, and
the actual contents written. Remember that these curl processes are
run in parallel by Emacs, so the pid allows me to associate the
separate writes and the exit status.
The second probe prints the pid and the exit status (the first argument
to exit(2)
).
I also want to compare this to exactly what is delivered to Elfeed when
curl exits, so I modify the process sentinel — the callback
that handles a subprocess exiting — to call write-file
before any
action is taken. I can compare these buffer dumps to the logs produced
by DTrace.
There are two important findings.
First, when the “search failed” bug occurs, the buffer was completely empty (95% of the time) or truncated at the end of the HTTP headers (5% of the time), right at the blank line. DTrace indicates that curl did its job in full, so it’s Emacs who’s the liar. It’s not delivering all of curl’s data to Elfeed. That’s pretty annoying.
Second, curl was line-buffered. Each line was a separate,
independent write(2)
. I was certainly not expecting this. Normally
the C library only does line buffering when the output is a terminal.
That’s because it’s guessing a user may be watching, expecting the
output to arrive a line at a time.
Here’s a sample of what it looked like in the log:
88188 WRITE 32 "Server: Apache/2.4.18 (Ubuntu)
"
88188 WRITE 46 "Location: https://blog.plover.com/index.atom
"
88188 WRITE 21 "Content-Length: 299
"
88188 WRITE 45 "Content-Type: text/html; charset=iso-8859-1
"
88188 WRITE 2 "
"
Why would curl think Emacs is a terminal?
Oh. That’s right. This is the same problem I ran into four years ago when writing EmacSQL. By default Emacs connects to subprocesses through a pseudo-terminal (pty). I called this a mistake in Emacs back then, and I still stand by that claim. The pty causes weird, annoying problems for little benefit:

- Programs detect the terminal and change their behavior, such as switching to line buffering, as curl did here.
- The pty’s line discipline interprets and translates the bytes passing through it (newline translation, control characters), which is hazardous for binary data.
- A pty is slower and has more moving parts than a plain pipe.
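The terminal detection is easy to reproduce outside of Emacs. Here’s a small Python sketch that runs the same child program over a pipe and then over a pty, with python3 standing in for curl:

import os, pty, subprocess

cmd = ["python3", "-c", "import sys; print(sys.stdout.isatty())"]

# Over a pipe, the child sees no terminal and buffers output in blocks.
print(subprocess.run(cmd, capture_output=True).stdout)  # b'False\n'

# Over a pty, the child thinks a user is watching, so it line buffers,
# and the line discipline translates \n to \r\n.
master, slave = pty.openpty()
subprocess.run(cmd, stdout=slave)
os.close(slave)
print(os.read(master, 1024))  # b'True\r\n'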
Just from eyeballing the DTrace log I knew what to do: dump the pty
and switch to a pipe. This is controlled with the
process-connection-type
variable, and fixing it is a
one-liner.
Not only did this completely resolve the truncation issue, Elfeed is noticeably faster at fetching feeds on all machines. It’s no longer receiving mountains of XML one line at a time, like sucking pudding through a straw. It’s now quite zippy even on my Raspberry Pi 2, which had never been the case before (without the “search failed” bug). Even if you were never affected by this bug, you will benefit from the fix.
I haven’t officially reported this as an Emacs bug yet because reproducibility is still an issue. It needs something better than “fire off a bunch of HTTP requests across the internet in parallel from a Raspberry Pi.”
The fix reminds me of that old boilermaker story about charging a lot of money just to swing a hammer. Once the problem arose, DTrace quickly helped to identify the place to hit Emacs with the hammer.
Finally, a big thanks to alphapapa for originally taking the time to report this bug months ago.