Years ago, OpenBSD gained two new security system calls, pledge(2)
(originally tame(2)) and unveil(2). In both, an application
surrenders capabilities at run-time. The idea is to perform initialization
like usual, then drop capabilities before handling untrusted input,
limiting unwanted side effects. This feature is applicable even where type
safety isn’t an issue, such as Python, where a program might still get
tricked into accessing sensitive files or making network connections when
it shouldn’t. So how can a Python program access these system calls?
As discussed previously, it’s quite easy to access C APIs from
Python through its ctypes
package, and this is no exception.
In this article I show how to do it. Here’s the full source if you want to
dive in: openbsd.py.
I’ve chosen these extra constraints:
These are extra safety features, unnecessary for correctness: attempts to call these functions on systems where they don’t exist will silently do nothing, as though they had succeeded. They’re provided as a best effort.
Systems other than OpenBSD may support these functions, now or in the future, and it would be nice to automatically make use of them when available. This means no checking for OpenBSD specifically but instead feature sniffing for their presence.
The interfaces should be Pythonic as though they were implemented in Python itself. Raise exceptions for errors, and accept strings since they’re more convenient than bytes.
For reference, here are the function prototypes:
int pledge(const char *promises, const char *execpromises);
int unveil(const char *path, const char *permissions);
The string-oriented interface of pledge
will make this a whole
lot easier to implement.
The first step is to grab functions through ctypes
. Like a lot of Python
documentation, this area is frustratingly imprecise and under-documented.
I want to grab a handle to the already-linked libc and search for either
function. However, getting that handle is a little different on each
platform, and in the process I saw four different exceptions, only one of
which is documented.
I came up with passing None to ctypes.CDLL
, which ultimately just passes
NULL
to dlopen(3)
. That’s really all I wanted. Currently on
Windows this is a TypeError. Once the handle is in hand, try to access the
pledge
attribute, which will fail with AttributeError if it doesn’t
exist. In the event of any exception, just assume the behavior isn’t
available. If found, I also define the function prototype for ctypes
.
_pledge = None
try:
    _pledge = ctypes.CDLL(None, use_errno=True).pledge
    _pledge.restype = ctypes.c_int
    _pledge.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _pledge = None
Catching a broad Exception isn’t great, but it’s the best we can do since the documentation is incomplete. From this block I’ve seen TypeError, AttributeError, FileNotFoundError, and OSError. I wouldn’t be surprised if there are more possibilities, and I don’t want to risk missing them.
Note that I’m catching Exception rather than using a bare except
. My
code will not catch KeyboardInterrupt nor SystemExit. This is deliberate,
and I never want to catch these.
The same story for unveil
:
_unveil = None
try:
    _unveil = ctypes.CDLL(None, use_errno=True).unveil
    _unveil.restype = ctypes.c_int
    _unveil.argtypes = ctypes.c_char_p, ctypes.c_char_p
except Exception:
    _unveil = None
The next and final step is to wrap the low-level calls in interfaces that hide their C and ctypes nature.
Python strings must be encoded to bytes before they can be passed to C
functions. Rather than make the caller worry about this, we’ll let them
pass friendly strings and have the wrapper do the conversion. Either may
also be NULL
, so None is allowed.
def pledge(promises: Optional[str], execpromises: Optional[str]):
    if not _pledge:
        return  # unimplemented
    r = _pledge(None if promises is None else promises.encode(),
                None if execpromises is None else execpromises.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
As usual, a return of -1 means there was an error, in which case we fetch
errno
and raise the appropriate OSError.
unveil
works a little differently since the first argument is a path.
Python functions that accept paths, such as open
, generally accept
either strings or bytes. On unix-like systems, paths are fundamentally
bytestrings and not necessarily Unicode, so it’s necessary to accept
bytes. Since strings are nearly always more convenient, these functions accept both.
The unveil
wrapper here will do the same. If it’s a string, encode it,
otherwise pass it straight through.
def unveil(path: Union[str, bytes, None], permissions: Optional[str]):
    if not _unveil:
        return  # unimplemented
    r = _unveil(path.encode() if isinstance(path, str) else path,
                None if permissions is None else permissions.encode())
    if r == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
That’s it!
Let’s start with unveil
. Initially a process has access to the whole
file system with the usual restrictions. On the first call to unveil
it’s immediately restricted to some subset of the tree. Each call reveals
a little more until a final NULL
which locks it in place for the rest of
the process’s existence.
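For example, a program might reveal a few paths with different permission strings before locking the list. A sketch — the paths here are made up:

unveil("/etc/ssl", "r")            # read-only: TLS certificates
unveil("/var/myapp/data", "rwc")   # hypothetical data dir: read, write, create
unveil(None, None)                 # lock it in for good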
Suppose a program has been tricked into accessing your shell history, perhaps by mishandling a path:
def hackme():
    try:
        with open(pathlib.Path.home() / ".bash_history"):
            print("You've been hacked!")
    except FileNotFoundError:
        print("Blocked by unveil.")

hackme()
If you’re a Bash user, this prints:
You've been hacked!
Using our new feature to restrict the program’s access first:
# restrict access to static program data
unveil("/usr/share", "r")
unveil(None, None)
hackme()
On OpenBSD this now prints:
Blocked by unveil.
Working just as it should!
With pledge
we declare what abilities we’d like to keep by supplying a
list of promises, pledging to use only those abilities afterward. A
common case is the stdio
promise which allows reading and writing of
open files, but not opening files. A program might open its log file,
then drop the ability to open files while retaining the ability to write
to its log.
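Using the wrapper above, that pattern takes only a few lines. A sketch, with a made-up log path:

log = open("/tmp/myapp.log", "a")  # hypothetical log, opened during init
pledge("stdio", None)              # keep read/write on open files, drop the rest
log.write("still works\n")         # allowed: the file is already open
log.flush()
open("/etc/passwd")                # disallowed: the process dies here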
An invalid or unknown promise is an error. Does that work?
>>> pledge("doesntexist", None)
OSError: [Errno 22] Invalid argument
So far so good. How about the functionality itself?
pledge("stdio", None)
hackme()
The program is instantly killed when making the disallowed system call:
Abort trap (core dumped)
If you want something a little softer, include the error
promise:
pledge("stdio error", None)
hackme()
Instead it’s an exception, which is a lot easier to debug from Python:
OSError: [Errno 78] Function not implemented
The core dump isn’t going to be much help to a Python program, so you
probably always want to use this promise. In general you need to be extra
careful about pledge in complex runtimes like Python’s, which may
reasonably need to do many arbitrary, undocumented things at any time.
Most Bourne-like shells support a special RANDOM environment
variable that evaluates to a random value between 0 and 32,767 (i.e.
15 bits). Assignment to the variable seeds the generator. This variable
is an extension and did not appear in the original Unix Bourne
shell. Despite this, the different Bourne-like shells that implement
it have converged to the same interface, but only the interface.
Each implementation differs in interesting ways. In this article we’ll
explore how $RANDOM
is implemented in various Bourne-like shells.
Unfortunately I was unable to determine the origin of $RANDOM.
Nobody was doing a good job tracking source code changes before the
mid-1990s, so that history appears to be lost. Bash was first released
in 1989, but the earliest version I could find was 1.14.7, released in 1996.
KornShell was first released in 1983, but the earliest source I could
find was from 1993. In both cases $RANDOM already existed. My
guess is that it first appeared in one of these two shells, probably
KornShell.
Update: Quentin Barnes has informed me that his 1986 copy of
KornShell (a.k.a. ksh86) implements $RANDOM
. This predates Bash and
makes it likely that this feature originated in KornShell.
Of all the shells I’m going to discuss, Bash has the most interesting
history. It never made use of srand(3) / rand(3) and instead
uses its own generator — which is generally what I prefer. Prior
to Bash 4.0, it used the crummy linear congruential generator (LCG)
found in the C89 standard:
static unsigned long rseed = 1;

static int
brand ()
{
  rseed = rseed * 1103515245 + 12345;
  return ((unsigned int)((rseed >> 16) & 32767));
}
For some reason it was naïvely decided that $RANDOM
should never
produce the same value twice in a row. The caller of brand()
filters
the output and discards repeats before returning to the shell script.
This actually reduces the quality of the generator further since it
increases correlation between separate outputs.
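The filter amounts to something like this (my own sketch of the idea, not Bash’s code):

def no_repeats(gen):
    """Discard any output equal to the previous one, like $RANDOM."""
    last = None
    for value in gen:
        if value != last:
            last = value
            yield value

Since consecutive outputs can never match, each output rules out one of the 32,768 possibilities for the next — exactly the extra correlation described above.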
When the shell starts up, rseed
is seeded from the PID and the current
time in seconds. These values are literally summed and used as the seed.
/* Note: not the literal code, but equivalent. */
rseed = getpid() + time(0);
Subshells, which fork and initially share an rseed
, are given similar
treatment:
rseed = rseed + getpid() + time(0);
Notice there’s no hashing or mixing of these values, so there’s no avalanche effect. That would have prevented shells that start around the same time from having related initial random sequences.
With Bash 4.0, released in 2009, the algorithm was changed to a Park–Miller multiplicative LCG from 1988:
static int
brand ()
{
  long h, l;

  /* can't seed with 0. */
  if (rseed == 0)
    rseed = 123459876;
  h = rseed / 127773;
  l = rseed % 127773;
  rseed = 16807 * l - 2836 * h;
  return ((unsigned int)(rseed & 32767));
}
There’s actually a subtle mistake in this implementation compared to the generator described in the paper. This function will generate different numbers than the paper, and it will generate different numbers on different hosts! More on that later.
This algorithm is a much better choice than the previous LCG. There were many more options available in 2009 compared to 1989, but, honestly, this generator is pretty reasonable for this application. Bash is so slow that you’re never practically going to generate enough numbers for the small state to matter. Since the Park–Miller algorithm is older than Bash, they could have used this in the first place.
I considered submitting a patch to switch to something more modern.
However, given Bash’s constraints, it’s easier said than done.
Portability to weird systems is still a concern, and I expect they’d
reject a patch that started making use of long long
in the PRNG.
They still support pre-ANSI C compilers that don’t have 64-bit
arithmetic.
However, one thing that really could still be improved is the seeding. In Bash 4.x here’s what it looks like:
static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (tv.tv_sec ^ tv.tv_usec ^ getpid ());
}
Seeding is both better and worse. It’s better that it’s seeded from a higher resolution clock (microseconds), so two shells started close in time have more variation. However, it’s “mixed” with XOR, which, in this case, is worse than addition.
For example, imagine two Bash shells started one microsecond apart. Both
tv_usec
and getpid()
are incremented by one. Those increments are
likely to cancel each other out by an XOR, and they end up with the same
seed.
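The cancellation is easy to demonstrate with made-up values (both even, so the increments don’t carry):

t, p = 413802, 27182       # time and PID for the first shell
seed1 = t ^ p              # first shell's mix
seed2 = (t + 1) ^ (p + 1)  # second shell: one microsecond later, next PID
assert seed1 == seed2      # the increments cancelled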
Instead, each of those quantities should be hashed before mixing. Here’s
a rough example using my triple32()
hash (adapted to glorious
GNU-style pre-ANSI C):
static unsigned long
hash32 (x)
     unsigned long x;
{
  x ^= x >> 17;
  x *= 0xed5ad4bbUL;
  x &= 0xffffffffUL;
  x ^= x >> 11;
  x *= 0xac4c1b51UL;
  x &= 0xffffffffUL;
  x ^= x >> 15;
  x *= 0x31848babUL;
  x &= 0xffffffffUL;
  x ^= x >> 14;
  return x;
}

static void
seedrand ()
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  sbrand (hash32 (tv.tv_sec) ^
          hash32 (hash32 (tv.tv_usec) ^ getpid ()));
}
I had said there’s a mistake in the Bash implementation of Park–Miller. Take a closer look at the types and the assignment to rseed:
/* The variables */
long h, l;
unsigned long rseed;
/* The assignment */
rseed = 16807 * l - 2836 * h;
The result of the subtraction can be negative, and that negative
value is converted to unsigned long
. The C standard says
ULONG_MAX + 1
is added to make the value positive. ULONG_MAX
varies by platform — typically long
is either 32 bits or 64 bits —
so the results also vary. Here’s how the paper defined it:
long test;
test = 16807 * l - 2836 * h;
if (test > 0)
    rseed = test;
else
    rseed = test + 2147483647;
As far as I can tell, this mistake doesn’t hurt the quality of the generator, but it does make the output platform-dependent. Seeding with 127773 sets h to 1 and l to 0, so the next rseed is -2836, which wraps to a different unsigned value depending on the width of long. The low 15 bits happen to agree, so the first outputs match, but the two streams diverge from there:
$ 32/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 13634
$ 64/bash -c 'RANDOM=127773; echo $RANDOM $RANDOM'
29932 29115
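You don’t need two differently-compiled Bash binaries to see this. Here’s a quick Python sketch of the same arithmetic (my own translation of brand() above, not Bash’s code), masking to the width of the C unsigned long:

def brand(rseed, bits):
    """One step of Bash 4.x's generator with an unsigned long of `bits` bits."""
    h, l = rseed // 127773, rseed % 127773
    rseed = (16807 * l - 2836 * h) % (1 << bits)  # C unsigned wraparound
    return rseed, rseed & 32767

for bits in 32, 64:
    rseed, outputs = 127773, []
    for _ in range(2):
        rseed, value = brand(rseed, bits)
        outputs.append(value)
    print(bits, outputs)  # 32 [29932, 13634], then 64 [29932, 29115]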
In contrast to Bash, Zsh is the most straightforward: defer to
rand(3)
. Its $RANDOM
can return the same value twice in a row,
assuming that rand(3)
does.
zlong
randomgetfn(UNUSED(Param pm))
{
    return rand() & 0x7fff;
}

void
randomsetfn(UNUSED(Param pm), zlong v)
{
    srand((unsigned int)v);
}
A cool side effect is that you can override it with a custom generator if you want:
int
rand(void)
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}
Usage:
$ gcc -shared -fPIC -o rand.so rand.c
$ LD_PRELOAD=./rand.so zsh -c 'echo $RANDOM $RANDOM $RANDOM'
4 4 4
This trick also applies to the rest of the shells below.
KornShell originated in 1983, but it was finally released under an open source license in 2005. There’s a clone of KornShell called Public Domain Korn Shell (pdksh) that’s been forked a dozen different ways, but I’ll get to that next.
KornShell defers to rand(3)
, but it does some additional naïve
filtering on the output. When the shell starts up, it generates 10
values from rand()
. If any of them is larger than 32,767, it shifts all generated numbers right by three.
#define RANDMASK 0x7fff

for (n = 0; n < 10; n++) {
    // Don't use lower bits when rand() generates large numbers.
    if (rand() > RANDMASK) {
        rand_shift = 3;
        break;
    }
}
Why not just look at RAND_MAX
? I guess they didn’t think of it.
Update: Quentin Barnes pointed out that RAND_MAX
didn’t exist
until POSIX standardization in 1988. The constant first appeared in
Unix in 1990. This KornShell code either predates the standard
or needed to work on systems that predate the standard.
Like Bash, repeated values are not allowed. I suspect one shell got this idea from the other.
do {
    cur = (rand() >> rand_shift) & RANDMASK;
} while (cur == last);
Who came up with this strange idea first?
I picked the OpenBSD variant of pdksh since it’s the only pdksh fork I
ever touch in practice, and its $RANDOM
is the most interesting of the
pdksh forks — at least since 2014.
Like Zsh, pdksh simply defers to rand(3)
. However, OpenBSD’s rand(3)
is infamously and proudly non-standard. By default it returns
non-deterministic, cryptographic-quality results seeded from system
entropy (via the misnamed arc4random(3)
), à la /dev/urandom
.
Its $RANDOM
inherits this behavior.
setint(vp, (int64_t) (rand() & 0x7fff));
However, if a value is assigned to $RANDOM
in order to seed it, it
reverts to its old pre-2014 deterministic generation via
srand_deterministic(3)
.
srand_deterministic((unsigned int)intval(vp));
OpenBSD’s deterministic rand(3)
is the crummy LCG from the C89
standard, just like Bash 3.x. So if you assign to $RANDOM
, you’ll get
nearly the same results as Bash 3.x and earlier — the only difference
being that it can repeat numbers.
That’s a slick upgrade to the old interface without breaking anything,
making it my favorite version of $RANDOM in any shell.
The ptrace(2) (“process trace”) system call is usually associated with
(“process trace”) system call is usually associated with
debugging. It’s the primary mechanism through which native debuggers
monitor debuggees on unix-like systems. It’s also the usual approach for
implementing strace — system call trace. With Ptrace, tracers
can pause tracees, inspect and set registers and memory, monitor
system calls, or even intercept system calls.
By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.
The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.
For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. I’ll also be omitting error checks here, but the full source code listings have them.
You can find runnable code for the examples in this article here:
https://github.com/skeeto/ptrace-examples
Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.
Ptrace has never been standardized. Its interface is similar across
different operating systems, especially in its core functionality, but
it’s still subtly different from system to system. The ptrace(2)
prototype generally looks something like this, though the specific
types may be different.
long ptrace(int request, pid_t pid, void *addr, void *data);
The pid
is the tracee’s process ID. While a tracee can have only one
tracer attached at a time, a tracer can be attached to many tracees.
The request
field selects a specific Ptrace function, just like the
ioctl(2)
interface. For strace, only three are needed:

PTRACE_TRACEME: This process is to be traced by its parent.
PTRACE_SYSCALL: Continue, but stop at the next system call entrance or exit.
PTRACE_GETREGS: Get a copy of the tracee’s registers.

The other two fields, addr and data, serve as generic arguments for
the selected Ptrace function. One or both are often ignored, in which
case I pass zero.
The strace interface is essentially a prefix to another command.
$ strace [strace options] program [arguments]
My minimal strace doesn’t have any options, so the first thing to do —
assuming it has at least one argument — is fork(2)
and exec(2)
the
tracee process on the tail of argv
. But before loading the target
program, the new process will inform the kernel that it’s going to be
traced by its parent. The tracee will be paused by this Ptrace system
call.
pid_t pid = fork();
switch (pid) {
case -1: /* error */
    FATAL("%s", strerror(errno));
case 0: /* child */
    ptrace(PTRACE_TRACEME, 0, 0, 0);
    execvp(argv[1], argv + 1);
    FATAL("%s", strerror(errno));
}
The parent waits for the child’s PTRACE_TRACEME
using wait(2)
. When
wait(2)
returns, the child will be paused.
waitpid(pid, 0, 0);
Before allowing the child to continue, we tell the operating system that
the tracee should be terminated along with its parent. A real strace
implementation may want to set other options, such as
PTRACE_O_TRACEFORK
.
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);
All that’s left is a simple, endless loop that catches system calls one at a time. The body of the loop has four steps:

1. Wait for the process to enter the next system call.
2. Print a representation of the system call.
3. Allow the system call to execute and wait for the return.
4. Print the system call return value.
The PTRACE_SYSCALL
request is used in both waiting for the next system
call to begin, and waiting for that system call to exit. As before, a
wait(2)
is needed to wait for the tracee to enter the desired state.
ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);
When wait(2)
returns, the registers for the thread that made the
system call are filled with the system call number and its arguments.
However, the operating system has not yet serviced this system call.
This detail will be important later.
The next step is to gather the system call information. This is where
it gets architecture specific. On x86-64, the system call number is
passed in rax
, and the arguments (up to 6) are passed in
rdi
, rsi
, rdx
, r10
, r8
, and r9
. Reading the registers is
another Ptrace call, though there’s no need to wait(2)
since the
tracee isn’t changing state.
struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;

fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
        syscall,
        (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
        (long)regs.r10, (long)regs.r8, (long)regs.r9);
There’s one caveat. For internal kernel purposes, the system
call number is stored in orig_rax
rather than rax
. All the other
system call arguments are straightforward.
Next it’s another PTRACE_SYSCALL
and wait(2)
, then another
PTRACE_GETREGS
to fetch the result. The result is stored in rax
.
ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);
The output from this simple program is very crude. There is no
symbolic name for the system call and every argument is printed
numerically, even if it’s a pointer to a buffer. A more complete strace
would know which arguments are pointers and use process_vm_readv(2)
to
read those buffers from the tracee in order to print them appropriately.
However, this does lay the groundwork for system call interception.
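Incidentally, every Ptrace call here is reachable with the same ctypes trick discussed earlier for pledge(2), so a toy tracer doesn’t even require C. Here’s a rough Linux x86-64 sketch in Python (constants copied from Linux’s headers; error checking omitted):

import ctypes, os, sys

libc = ctypes.CDLL(None, use_errno=True)
libc.ptrace.restype = ctypes.c_long
libc.ptrace.argtypes = (ctypes.c_long,) * 4

PTRACE_TRACEME, PTRACE_PEEKUSER, PTRACE_SYSCALL = 0, 3, 24
ORIG_RAX = 15  # register slot index from <sys/reg.h> on x86-64

pid = os.fork()
if pid == 0:  # child: same as the C version
    libc.ptrace(PTRACE_TRACEME, 0, 0, 0)
    os.execvp(sys.argv[1], sys.argv[1:])

def stopped():
    """Wait on the tracee; False once it has exited."""
    return os.WIFSTOPPED(os.waitpid(pid, 0)[1])

stopped()  # the pause after PTRACE_TRACEME
while True:
    libc.ptrace(PTRACE_SYSCALL, pid, 0, 0)  # run to syscall entrance
    if not stopped():
        break
    nr = libc.ptrace(PTRACE_PEEKUSER, pid, 8 * ORIG_RAX, 0)
    libc.ptrace(PTRACE_SYSCALL, pid, 0, 0)  # run to syscall exit
    if not stopped():
        break
    print("syscall", nr)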
Suppose we want to use Ptrace to implement something like OpenBSD’s
pledge(2)
, in which a process pledges to use only a
restricted set of system calls. The idea is that many
programs typically have an initialization phase where they need lots
of system access (opening files, binding sockets, etc.). After
initialization they enter a main loop in which they process input and need only a small set of system calls.
Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.
Using the same strace model, rather than print out all system calls,
we could either block certain system calls or simply terminate the
tracee when it misbehaves. Termination is easy: just call exit(2) in the tracer, since the tracee is configured (via PTRACE_O_EXITKILL) to be terminated along with it.
Blocking the system call and allowing the child to continue is a
little trickier.
The tricky part is that there’s no way to abort a system call once
it’s started. When the tracer returns from wait(2) on the entrance to
the system call, the only way to stop a system call from happening is
to terminate the tracee.
However, not only can we mess with the system call arguments, we can
change the system call number itself, converting it to a system call
that doesn’t exist. On return we can report a “friendly” EPERM
error
in errno
via the normal in-band signaling.
for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}
This simple example only checks against a whitelist or blacklist of
system calls. And there’s no nuance, such as allowing files to be
opened (open(2)
) read-only but not as writable, allowing anonymous
memory maps but not non-anonymous mappings, etc. There’s also no way
for the tracee to dynamically drop privileges.
How could the tracee communicate to the tracer? Use an artificial system call!
For my new pledge-like system call — which I call xpledge()
to
distinguish it from the real thing — I picked system call number 10000,
a nice high number that’s unlikely to ever be used for a real system
call.
#define SYS_xpledge 10000
Just for demonstration purposes, I put together a minuscule interface
that’s not good for much in practice. It has little in common with
OpenBSD’s pledge(2)
, which uses a string interface.
Actually designing robust and secure sets of privileges is really
complicated, as the pledge(2)
manpage shows. Here’s the entire
interface and implementation of the system call for the tracee:
#define _GNU_SOURCE
#include <unistd.h>
#define XPLEDGE_RDWR (1 << 0)
#define XPLEDGE_OPEN (1 << 1)
#define xpledge(arg) syscall(SYS_xpledge, arg)
Passing zero for the argument allows only a few basic system calls, including those used to allocate memory (e.g. brk(2)). The XPLEDGE_RDWR bit allows various read and write system calls (read(2), readv(2), pread(2), preadv(2), etc.). The XPLEDGE_OPEN bit allows open(2).
To prevent privileges from being escalated back, xpledge() blocks
blocks
itself — though this also prevents dropping more privileges later down
the line.
In the xpledge tracer, I just need to check for this system call:
/* Handle entrance */
switch (regs.orig_rax) {
case SYS_xpledge:
    register_pledge(regs.rdi);
    break;
}
The operating system will return ENOSYS
(Function not implemented)
since this isn’t a real system call. So on the way out I overwrite
this with a success (0).
/* Handle exit */
switch (regs.orig_rax) {
case SYS_xpledge:
    ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
    break;
}
I wrote a little test program that opens /dev/urandom
, makes a read,
tries to pledge, then tries to open /dev/urandom
a second time, then
confirms it can read from the original /dev/urandom
file descriptor.
Running without a pledge tracer, the output looks like this:
$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604
Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:
$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4
The pledge succeeds but the second fopen(3)
does not since the tracer
blocked it with EPERM
.
This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.
Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.
You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.
The Linux version of Ptrace has had a more efficient operation for this technique since 2005: PTRACE_SYSEMU. With it, the tracee stops only once per system call, and it’s up to the tracer to service that system call before allowing the tracee to continue.
for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
    case OS_read:
        /* ... */
    case OS_write:
        /* ... */
    case OS_open:
        /* ... */
    case OS_exit:
        /* ... */
    /* ... and so on ... */
    }
}
To run binaries for the same architecture from any system with a
stable (enough) system call ABI, you just need this PTRACE_SYSEMU
tracer, a loader (to take the place of exec(2)
), and whatever system
libraries the binary needs (or only run static binaries).
In fact, this sounds like a fun weekend project.
Software is installed in my home directory under ~/.local/bin, and the
package manager itself is just a 110 line Bourne shell script. It’s not
intended to replace the system’s package manager but, instead, to
complement it in some cases where I need more flexibility. I use it to
run custom versions of specific pieces of software — newer or older than
the system-installed versions, or with my own patches and modifications —
without interfering with the rest of the system, and without a need for
root access. It’s worked out really well so far and I expect to continue
making heavy use of it in the future.
It’s so simple that I haven’t even bothered putting the script in its own repository. It sits unadorned within my dotfiles repository with the name qpkg (“quick package”):
Sitting alongside my dotfiles means it’s always there when I need it, just as if it was a built-in command.
I say it’s crude because its “install” (-I
) procedure is little more
than a wrapper around tar. It doesn’t invoke libtool after installing a
library, and there’s no post-install script — or postinst
as Debian
calls it. It doesn’t check for conflicts between packages, though
there’s a command for doing so manually ahead of time. It doesn’t manage
dependencies, nor even have them as a concept. That’s all on the user to
screw up.
In other words, it doesn’t attempt to solve most of the hard problems tackled by package managers… except for three important issues:
It provides a clean, guaranteed-to-work uninstall procedure. Some Makefiles do have a token “uninstall” target, but it’s often unreliable.
Unlike blindly using a Makefile “install” target, I can check for conflicts before installing the software. I’ll know if and how a package clobbers an already-installed package, and I can manage, or ignore, that conflict manually as needed.
It produces a compact, reusable package file that I can reinstall later, even on a different machine (with a couple of caveats). I don’t need to keep around the original source and build directories should I want to install or uninstall later. I can also rapidly switch back and forth between different builds of the same software.
The first caveat is that the package will be configured for exactly my own home directory, so I usually can’t share it with other users, or install it on machines where I have a different home directory. Though I could still create packages for different installation prefixes.
The second caveat is that some builds tailor themselves by default to
the host (e.g. -march=native
). If care isn’t taken, those packages may
not be very portable. This is more common than I had expected and has
mildly annoyed me.
While the package manager is new, I’ve been building and installing
software in my home directory for years. I’d follow the normal process
of setting the install prefix to $HOME/.local
, running the build,
and then letting the “install” target do its thing.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ ./configure --prefix=$HOME/.local
$ make -j$(nproc)
$ make install
This worked well enough for years. However, I’ve come to rely a lot on this technique, and I’m using it for increasingly sophisticated purposes, such as building custom cross-compiler toolchains.
A common difficulty has been handling the release of new versions of
software. I’d like to upgrade to the new version, but lack a way to
cleanly uninstall the previous version. Simply clobbering the old
version by installing it on top usually works. Occasionally it
wouldn’t, and I’d have to blow away ~/.local
and start all over again.
With more and more software installed in my home directory, restarting
has become more and more of a chore that I’d like to avoid.
What I needed was a way to track exactly which files were installed so
that I could remove them later when I needed to uninstall. Fortunately
there’s a widely-used convention for exactly this purpose: DESTDIR
.
It’s expected that when a Makefile provides an “install” target, it
prefixes the installation path with the DESTDIR
macro, which is
assigned to the empty string by default. This allows the user to install
the software to a temporary location for the purposes of packaging.
Unlike the installation prefix (--prefix
) configured before the build
began, the software is not expected to function properly when run in the
DESTDIR
location.
$ DESTDIR=_destdir
$ mkdir $DESTDIR
$ make DESTDIR=$DESTDIR install
A different tool is then used to copy these files into place and actually
install them. This tool can track what files were installed, allowing them
to be removed later when uninstalling. My package manager uses the tar
program for both purposes. First it creates a package by packing up the
DESTDIR
(at the root of the actual install prefix):
$ tar czf package.tgz -C $DESTDIR$HOME/.local .
So a package is nothing more than a gzipped tarball. To install, it
unpacks the tarball in ~/.local
.
$ cd $HOME/.local
$ tar xzf ~/package.tgz
But how does it uninstall a package? It didn’t keep track of what was
installed. Easy! The tarball itself contains the package list, and it’s
printed with tar’s t
mode.
cd $HOME/.local
for file in $(tar tzf package.tgz | grep -v '/$'); do
    rm -f "$file"
done
I’m using grep
to skip directories, which are conveniently listed with
a trailing slash. Note that in the example above, there are a couple of
issues with file names containing whitespace. If a file name contains a space character, it will be word-split incorrectly in the for loop. A
Makefile couldn’t handle such a file in the first place, but, in case
it’s still necessary, my package manager sets IFS
to just a newline.
If the file name contains a newline, then my package manager relies on a cosmic ray striking just the right bit at just the right instant to make it all work out, because no version of tar can unambiguously print such file names. Crossing your fingers during this process may help.
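For what it’s worth, a tool that reads the archive structurally, rather than parsing tar’s text output, sidesteps the file name problem entirely. Here’s a sketch of the same uninstall logic in Python (not part of qpkg, which is pure Bourne shell):

import os, tarfile

def uninstall(package, prefix=os.path.expanduser("~/.local")):
    with tarfile.open(package) as tar:
        for member in tar:
            if not member.isdir():  # like grep -v '/$'
                try:
                    os.remove(os.path.join(prefix, member.name))
                except FileNotFoundError:
                    pass  # already gone; mimic rm -f

uninstall("package.tgz")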
There are five commands, each assigned to a capital letter: -B
, -C
,
-I
, -V
, and -U
. It’s an interface pattern inspired by Ted
Unangst’s signify (see signify(1)
). I also used this
pattern with Blowpipe and, in retrospect, wish I had also used
it with Enchive.
Build (-B)

Unlike the other four commands, the “build” command isn’t essential,
and is just for convenience. It assumes the build uses an Autoconf-like
configure script and runs it automatically, followed by make
with the
appropriate -j
(jobs) option. It automatically sets the --prefix
argument when running the configure script.
If the build uses something other than an Autoconf-like configure script, such as CMake, then you can’t use the “build” command and must perform the build yourself. For example, I must do this when building LLVM and Clang.
Before using the “build” command, the package must first be unpacked and patched if necessary. Then the package manager can take over to run the build.
$ tar xzf name-version.tar.gz
$ cd name-version/
$ patch -p1 < ../0001.patch
$ patch -p1 < ../0002.patch
$ patch -p1 < ../0003.patch
$ cd ..
$ mkdir build
$ cd build/
$ qpkg -B ../name-version/
In this example I’m doing an out-of-source build by invoking the configure script from a different directory. Did you know Autoconf scripts support this? I didn’t know until recently! Unfortunately some hand-written Autoconf-like scripts don’t, though this will be immediately obvious.
Once qpkg
returns, the program will be fully built — or stuck on a
build error if you’re unlucky. If you need to pass custom configure
options, just tack them on the qpkg
command:
$ qpkg -B ../name-version/ --without-libxml2 --with-ncurses
Since the second and third steps — creating the build directory and
moving into it — are so common, there’s an optional switch for them: -d.
This option’s argument is the build directory. qpkg
creates that
directory and runs the build inside it. In practice I just use “x” for
the build directory since it’s so quick to add “dx” to the command.
$ tar xzf name-version.tar.gz
$ qpkg -Bdx ../name-version/
With the software compiled, the next step is creating the package.
Create (-C)

The “create” command creates the DESTDIR
(_destdir
in the working
directory) and runs the “install” Makefile target to fill it with files.
Continuing with the example above and its x/
build directory:
$ qpkg -Cdx name
Where “name” is the name of the package, without any file name
extension. Like with “build”, extra arguments after the package name are
passed to make
in case there needs to be any additional tweaking.
When the “create” command finishes, there will be a new package named
name.tgz
in the working directory. At this point the source and build
directories are no longer needed, assuming everything went fine.
$ rm -rf name-version/
$ rm -rf x/
This package is ready to install, though you may want to verify it first.
Verify (-V)

The “verify” command checks for collisions against installed packages. It works like uninstallation, but rather than deleting files, it checks if any of the files already exist. If they do, it means there’s a conflict with an existing package. These file names are printed.
$ qpkg -V name.tgz
The most common conflict I’ve seen is in the info index (info/dir
)
file, which is safe to ignore since I don’t care about it.
If the package has already been installed, there will of course be tons of conflicts. This is the easiest way to check if a package has been installed.
Install (-I)

The “install” command is just the dumb tar xzf
explained above. It
will clobber anything in its way without warning, which is why, if that
matters, “verify” should be used first.
$ qpkg -I name.tgz
When qpkg
returns, the package has been installed and is probably
ready to go. A lot of packages complain that you need to run libtool to
finalize an installation, but I’ve never had a problem skipping it. This
dumb unpacking generally works fine.
Uninstall (-U)

Obviously the last command is “uninstall”. As explained above, this needs the original package that was given to the “install” command.
$ qpkg -U name.tgz
Just as “install” is dumb, so is “uninstall,” blindly deleting anything listed in the tarball. One thing I like about dumb tools is that there are no surprises.
I typically suffix the package name with the version number to help keep the packages organized. When upgrading to a new version of a piece of software, I build the new package, which, thanks to the version suffix, will have a distinct name. Then I uninstall the old package, and, finally, install the new one in its place. So far I’ve been keeping the old package around in case I still need it, though I could always rebuild it in a pinch.
Building a GCC cross-compiler toolchain is a tricky case that doesn’t fit so well with the build, create, and install process illustrated above. It would be nice for the cross-compiler to be a single, big package, but due to the way it’s built, it would need to be five or so packages — binutils, the C library headers, a first-pass compiler, the C library itself, and the final compiler — a couple of which will conflict (one being a subset of another).
Each step needs to be installed before the next step will work. (I don’t even want to think about cross-compiling a cross-compiler.)
To deal with this, I added a “keep” (-k
) option that leaves the
DESTDIR
around after creating the package. To keep things tidy, the
intermediate packages exist and are installed, but the final, big
cross-compiler package accumulates into the DESTDIR
. The final
package at the end is actually the whole cross compiler in one package,
a superset of them all.
Complicated situations like these are where I can really understand the value of Debian’s fakeroot tool.
The role filled by my package manager is actually pretty well suited for pkgsrc, which is NetBSD’s ports system made available to other unix-like systems. However, I just need something really lightweight that gives me absolute control — even more than I get with pkgsrc — in the dozen or so cases where I really need it.
All I need is a standard C toolchain in a unix-like environment (even a really old one), the source tarballs for the software I need, my 110 line shell script package manager, and one to two cans of elbow grease. From there I can bootstrap everything I might need without root access, even in a disaster. If the software I need isn’t written in C, it can ultimately get bootstrapped from some crusty old C compiler, which might even involve building some newer C compilers in between. After a certain point it’s C all the way down.
For some time Elfeed was experiencing a strange, spurious failure. Every so often users were seeing an error (spoiler warning) when updating feeds: “error in process sentinel: Search failed.” If you use Elfeed, you might have even seen this yourself. On the surface it appeared that curl, tasked with the responsibility for downloading feed data, was producing incomplete output despite reporting a successful run. Since the run was successful, Elfeed assumed certain data was in curl’s output buffer, but, since it wasn’t, it failed hard.
Unfortunately this issue was not reproducible. Manually running curl outside of Emacs never revealed any issues. Asking Elfeed to retry fetching the feeds would work fine. The issue would only randomly rear its head when Elfeed was fetching many feeds in parallel, under stress. By the time the error was discovered, the curl process had exited and vital debugging information was lost. Considering that this was likely to be a bug in Emacs itself, there really wasn’t a reliable way to capture the necessary debugging information from within Emacs Lisp. And, indeed, this later proved to be the case.
A quick-and-dirty workaround is to use condition-case to catch and
to catch and
swallow the error. When the bizarre issue shows up, rather than fail
badly in front of the user, Elfeed could attempt to swallow the error
— assuming it can be reliably detected — and treat the fetch as simply
a failure. That didn’t sit comfortably with me. Elfeed had done its
due diligence checking for errors already. Someone was lying to
Elfeed, and I intended to catch them with their pants on fire.
Someday.
I’d just need to witness the bug on one of my own machines. Elfeed is part of my daily routine, so surely I’d have to experience this issue myself someday. My plan was, should that day come, to run a modified Elfeed, instrumented to capture extra data. I would have also routinely run Emacs under GDB so that I could inspect the failure more deeply.
For now I just had to wait to hunt that zebra.
Over the holidays I re-discovered Bryan Cantrill, a systems software engineer who worked for Sun between 1996 and 2010, and is most well known for DTrace. My first exposure to him was in a BSD Now interview in 2015. I had re-watched that interview and decided there was a lot more I had to learn from him. He’s become a personal hero to me. So I scoured the internet for more of his writing and talks. Besides what I’ve already linked in this article, here are a couple more great presentations:
You can also find some of his writing scattered around the DTrace blog.
Some interesting operating system technology came out of Sun during its final 15 or so years — most notably DTrace and ZFS — and Bryan speaks about it passionately. Almost as a matter of luck, most of it survived the Oracle acquisition thanks to Sun releasing it as open source in just the nick of time. Otherwise it would have been lost forever. The scattered ex-Sun employees, still passionate about their prior work at Sun, along with some of their old customers have since picked up the pieces and kept going as a community under the name illumos. It’s like an open source flotilla.
Naturally I wanted to get my hands on this stuff to try it out for myself. Is it really as good as they say? Normally I stick to Linux, but it (generally) doesn’t have these Sun technologies. The main reason is license incompatibility. Sun released its code under the CDDL, which is incompatible with the GPL. Ubuntu does infamously include ZFS, but other distributions are unwilling to take that risk. Porting DTrace is a serious undertaking since it’s got its fingers throughout the kernel, which also makes the licensing issues even more complicated.
(Update February 2018: DTrace has been released under the GPLv2, allowing it to be legally integrated with Linux.)
Linux has a reputation for Not Invented Here (NIH) syndrome, and these
licensing issues certainly contribute to that. Rather than adopt ZFS
and DTrace, they’ve been reinvented from scratch: btrfs instead of
ZFS, and a slew of partial options instead of DTrace.
Normally I’m most interested in system call tracing, and my go-to is
strace, though it certainly has its limitations — including
this situation of debugging curl under Emacs. Another famous example
of NIH is Linux’s epoll(2)
, which is a broken
version of BSD kqueue(2)
.
So, if I want to try these for myself, I’ll need to install a different operating system. I’ve dabbled with OmniOS, an OS built on illumos, in virtual machines, using it as an alien environment to test some of my software (e.g. enchive). OmniOS has a philosophy called Keep Your Software To Yourself (KYSTY), which is really just code for “we don’t do packaging.” Honestly, you can’t blame them since they’re a tiny community. The best solution to this is probably pkgsrc, which is essentially a universal packaging system. Otherwise you’re on your own.
There’s also openindiana, which is a more friendly desktop-oriented illumos distribution. Still, the short of it is that you’re very much on your own when things don’t work. The situation is like running Linux a couple decades ago, when it was still difficult to do.
If you’re interested in trying DTrace, the easiest option these days is probably FreeBSD. It’s got a big, active community, thorough documentation, and a huge selection of packages. Its license (the BSD license, duh) is compatible with the CDDL, so both ZFS and DTrace have been ported to FreeBSD.
I’ve done all this talking but haven’t yet described what DTrace really is. I won’t pretend to write my own tutorial, but I’ll provide enough information to follow along. DTrace is a tracing framework for debugging production systems in real time, both for the kernel and for applications. The “production systems” part means it’s stable and safe — using DTrace won’t put your system at risk of crashing or damaging data. The “real time” part means it has little impact on performance. You can use DTrace on live, active systems with little impact. Both of these core design principles are vital for troubleshooting those really tricky bugs that only show up in production.
There are DTrace probes scattered all throughout the system: on system calls, scheduler events, networking events, process events, signals, virtual memory events, etc. Using a specialized language called D (unrelated to the general purpose programming language D), you can dynamically add behavior at these instrumentation points. Generally the behavior is to capture information, but it can also manipulate the event being traced.
Each probe is fully identified by a 4-tuple delimited by colons:
provider, module, function, and probe name. An empty element denotes a
sort of wildcard. For example, syscall::open:entry
is a probe at the
beginning (i.e. “entry”) of open(2)
. syscall:::entry
matches all
system call entry probes.
Unlike strace on Linux which monitors a specific process, DTrace applies to the entire system when active. To run curl under strace from Emacs, I’d have to modify Emacs’ behavior to do so. With DTrace I can instrument every curl process without making a single change to Emacs, and with negligible impact to Emacs. That’s a big deal.
So, when it comes to this Elfeed issue, FreeBSD is much better poised for debugging the problem. All I have to do is catch it in the act. However, it’s been months since that bug report and I’m not really making this connection yet. I’m just hoping I eventually find an interesting problem where I can apply DTrace.
So I’ve settled on FreeBSD as the playground for these technologies; I just have to decide where. I could always run it in a virtual machine, but it’s always more interesting to try things out on real hardware. FreeBSD supports the Raspberry Pi 2 as a Tier 2 system, and I had a Raspberry Pi 2 sitting around collecting dust, so I put it to use.
I wrote the image to an SD card, and for a few days I stretched my legs on this new system. I cloned a couple dozen of my own git repositories, ran the builds and the tests, and just got a feel for things. I tried out the ports system for the first time, mainly to discover that the low-powered Raspberry Pi 2 takes days to build some of the packages I want to try.
I mostly program in Vim these days, so it’s some days before I even set up Emacs. Eventually I do build Emacs, clone my configuration, fire it up, and give Elfeed a spin.
And that’s when the “search failed” bug strikes! Not just once, but dozens of times. Perfect! This low-powered platform is the jackpot for this particular bug, triggering it left and right. Given that I’ve got DTrace at my disposal, it’s the perfect place to debug this. Something is lying to Elfeed and DTrace will play the judge.
Before I dive in I see three possibilities:

1. curl is truncating its output but reporting success anyway.
2. Emacs is dropping some of curl’s output on its way to Elfeed.
3. Elfeed is wrong about what curl’s output should contain.

With DTrace I can observe what every curl process writes to Emacs, and I can also double check curl’s exit status. I come up with the following (newbie) DTrace script:
syscall::write:entry
/execname == "curl"/
{
    printf("%d WRITE %d \"%s\"\n",
           pid, arg2, stringof(copyin(arg1, arg2)));
}

syscall::exit:entry
/execname == "curl"/
{
    printf("%d EXIT %d\n", pid, arg0);
}
The /execname == "curl"/
is a predicate that (obviously) causes the
behavior to only fire for curl processes. The first probe has DTrace
print a line for every write(2)
from curl. arg0
, arg1
, and
arg2
correspond to the arguments of write(2)
: fd, buf, count. It
logs the process ID (pid) of the write, the length of the write, and
the actual contents written. Remember that these curl processes are
run in parallel by Emacs, so the pid allows me to associate the
separate writes and the exit status.
The second probe prints the pid and the exit status (the first argument
to exit(2)
).
I also want to compare this to exactly what is delivered to Elfeed when
curl exits, so I modify the process sentinel — the callback
that handles a subprocess exiting — to call write-file
before any
action is taken. I can compare these buffer dumps to the logs produced
by DTrace.
There are two important findings.
First, when the “search failed” bug occurs, the buffer was completely empty (95% of the time) or truncated at the end of the HTTP headers (5% of the time), right at the blank line. DTrace indicates that curl did its job in full, so it’s Emacs who’s the liar. It’s not delivering all of curl’s data to Elfeed. That’s pretty annoying.
Second, curl was line-buffered. Each line was a separate,
independent write(2)
. I was certainly not expecting this. Normally
the C library only does line buffering when the output is a terminal.
That’s because it’s guessing a user may be watching, expecting the
output to arrive a line at a time.
Here’s a sample of what it looked like in the log:
88188 WRITE 32 "Server: Apache/2.4.18 (Ubuntu)
"
88188 WRITE 46 "Location: https://blog.plover.com/index.atom
"
88188 WRITE 21 "Content-Length: 299
"
88188 WRITE 45 "Content-Type: text/html; charset=iso-8859-1
"
88188 WRITE 2 "
"
Why would curl think Emacs is a terminal?
Oh. That’s right. This is the same problem I ran into four years ago when writing EmacSQL. By default Emacs connects to subprocesses through a pseudo-terminal (pty). I called this a mistake in Emacs back then, and I still stand by that claim. The pty causes weird, annoying problems for little benefit:

- Programs detect the terminal and change their behavior, such as switching to line buffering, as curl did here.
- The pty’s line discipline interprets and translates the bytes passing through it (newline translation, control characters), which is hazardous for binary data.
- A pty is slower and has more moving parts than a plain pipe.
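The terminal detection is easy to reproduce outside of Emacs. Here’s a small Python sketch that runs the same child program over a pipe and then over a pty, with python3 standing in for curl:

import os, pty, subprocess

cmd = ["python3", "-c", "import sys; print(sys.stdout.isatty())"]

# Over a pipe, the child sees no terminal and buffers output in blocks.
print(subprocess.run(cmd, capture_output=True).stdout)  # b'False\n'

# Over a pty, the child thinks a user is watching, so it line buffers,
# and the line discipline translates \n to \r\n.
master, slave = pty.openpty()
subprocess.run(cmd, stdout=slave)
os.close(slave)
print(os.read(master, 1024))  # b'True\r\n'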
Just from eyeballing the DTrace log I knew what to do: dump the pty
and switch to a pipe. This is controlled with the
process-connection-type
variable, and fixing it is a
one-liner.
Not only did this completely resolve the truncation issue, Elfeed is noticeably faster at fetching feeds on all machines. It’s no longer receiving mountains of XML one line at a time, like sucking pudding through a straw. It’s now quite zippy even on my Raspberry Pi 2, which had never been the case before (without the “search failed” bug). Even if you were never affected by this bug, you will benefit from the fix.
I haven’t officially reported this as an Emacs bug yet because reproducibility is still an issue. It needs something better than “fire off a bunch of HTTP requests across the internet in parallel from a Raspberry Pi.”
The fix reminds me of that old boilermaker story about charging a lot of money just to swing a hammer. Once the problem arose, DTrace quickly helped to identify the place to hit Emacs with the hammer.
Finally, a big thanks to alphapapa for originally taking the time to report this bug months ago.