Linux System Calls, Error Numbers, and In-Band Signaling

Today I got an e-mail asking about a previous article on creating threads on Linux using raw system calls (specifically x86-64). The questioner was looking to use threads in a program without any libc dependency. However, he was concerned about checking for mmap(2) errors when allocating the thread’s stack. The mmap(2) man page says it returns -1 (a.k.a. MAP_FAILED) on error and sets errno. But how do you check errno without libc?

As a reminder here’s what the (unoptimized) assembly looks like.

stack_create:
    mov rdi, 0
    mov rsi, STACK_SIZE
    mov rdx, PROT_WRITE | PROT_READ
    mov r10, MAP_ANONYMOUS | MAP_PRIVATE | MAP_GROWSDOWN
    mov rax, SYS_mmap
    syscall
    ret

As usual, the system call return value is in rax, which becomes the return value for stack_create(). Again, its C prototype would look like this:

void *stack_create(void);

If you were to, say, intentionally botch the arguments to force an error, you might notice that the system call isn’t returning -1, but other negative values. What gives?

The trick is that errno is a C concept. That’s why it’s documented as errno(3): section 3 of the manual covers library functions, not system calls. Just think about how messy this thing is: it’s a thread-local value living in the application’s address space. The kernel rightfully has nothing to do with it. Instead, the mmap(2) wrapper in libc assigns errno (if needed) after the system call returns. This is how all system calls through libc work, even with the syscall(2) wrapper.

So how does the kernel report the error? It’s an old-fashioned return value. If you have any doubts, take it straight from the horse’s mouth: mm/mmap.c:do_mmap(). Here’s a sample of return statements.

if (!len)
        return -EINVAL;

/* Careful about overflows.. */
len = PAGE_ALIGN(len);
if (!len)
        return -ENOMEM;

/* offset overflow? */
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

/* Too many mappings? */
if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

It’s returning the negated error number. Simple enough.

If you think about it a moment, you might notice a complication: This is a form of in-band signaling. On success, mmap(2) returns a memory address. All those negative error numbers are potentially addresses that a caller might want to map. How can we tell the difference?

1) None of the possible error numbers align on a page boundary, so they’re not actually valid return values. NULL does lie on a page boundary, which is one reason why it’s not used as an error return value for mmap(2). The other is that you might actually want to map NULL, for better or worse.

2) Interpreted as addresses, those small negative values fall at the very top of the virtual address space, a region reserved exclusively for the kernel. On x86-64, any address with the most significant bit set (i.e. the sign bit of a signed integer) is one of these addresses. Processes aren’t allowed to map these addresses, and so mmap(2) will never return such a value on success.

So what’s a clean, safe way to go about checking for error values? It’s a lot easier to read musl than glibc, so let’s take a peek at how musl does it in its own mmap: src/mman/mmap.c.

if (off & OFF_MASK) {
    errno = EINVAL;
    return MAP_FAILED;
}
if (len >= PTRDIFF_MAX) {
    errno = ENOMEM;
    return MAP_FAILED;
}
if (flags & MAP_FIXED) {
    __vm_wait();
}
return (void *)syscall(SYS_mmap, start, len, prot, flags, fd, off);

Hmm, it looks like it’s returning the result directly. What happened to setting errno? Well, syscall() is actually a macro that runs the result through __syscall_ret().

#define syscall(...) __syscall_ret(__syscall(__VA_ARGS__))

Looking a little deeper: src/internal/syscall_ret.c.

long __syscall_ret(unsigned long r)
{
    if (r > -4096UL) {
        errno = -r;
        return -1;
    }
    return r;
}

Bingo. As documented, if the value falls within that “high” (unsigned) range of negative values for any system call, it’s an error number.

Getting back to the original question, we could employ this same check in the assembly code. However, since this is an anonymous memory map with a kernel-selected address, there’s only one possible error: ENOMEM (12). This error happens if the maximum number of memory maps has been reached, or if there’s no contiguous region available for the 4MB stack. The check only needs to test the result against -12.
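
For reference, here’s a sketch of that caller-side check expressed in C rather than assembly. It uses the general musl-style range test, with the -12 special case noted in a comment; checked_stack_create() is just an illustrative wrapper name, not part of the original program.

extern void *stack_create(void);  /* the assembly routine above */

static void *
checked_stack_create(void)
{
    void *p = stack_create();
    /* Any value in [-4095, -1] is a negated error number; for this
       particular call only -12 (-ENOMEM) is possible. */
    if ((unsigned long)p > -4096UL)
        return 0;  /* treat as failure */
    return p;
}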

Render the Mandelbrot Set with jq

One of my favorite data processing tools is jq, a command line JSON processor. It’s essentially awk for JSON. You supply a small script composed of transformations and filters, and jq evaluates the filters on each input JSON object, producing zero or more outputs per input. My most common use case is converting JSON data into CSV with jq’s @csv filter, which is then fed into SQLite (another favorite) for analysis.

On a recent pass over the manual, the while and until filters caught my attention, lighting up my Turing-completeness senses. These filters allow jq to compute an arbitrary recurrence, such as the Mandelbrot set.

Setting that aside for a moment, I said before that an input could produce zero or more outputs. The zero is when it gets filtered out, and one output is the obvious case. Some filters produce multiple outputs from a single input. There are a number of situations when this happens, but the important one is the range filter. For example,

$ echo 6 | jq 'range(1; .)'
1
2
3
4
5

The . is the input object, and range is producing one output for every number between 1 and . (exclusive). If an expression has multiple filters producing multiple outputs, under some circumstances jq will produce a Cartesian product: every combination is generated.

$ echo 4 | jq -c '{x: range(1; .), y: range(1; .)}'
{"x":1,"y":1}
{"x":1,"y":2}
{"x":1,"y":3}
{"x":2,"y":1}
{"x":2,"y":2}
{"x":2,"y":3}
{"x":3,"y":1}
{"x":3,"y":2}
{"x":3,"y":3}

So if my goal is the Mandelbrot set, I can use this to generate the complex plane, over which I will run the recurrence. For input, I’ll use a single object with the keys x, dx, y, and dy, defining the domain and range of the image. A reasonable input might be:

{"x": [-2.5, 1.5], "dx": 0.05, "y": [-1.5, 1.5], "dy": 0.1}

The “body” of the until will be the Mandelbrot set recurrence.

z(n+1) = z(n)^2 + c

As you might expect, jq doesn’t have support for complex numbers, so the components will be computed explicitly. I’ve worked it out before, so borrowing that I finally had my script:

#!/bin/sh
echo '{"x": [-2.5, 1.5], "dx": 0.05, "y": [-1.5, 1.5], "dy": 0.1}' | \
  jq -jr "{ \
     ci: range(.y[0]; .y[1] + .dy; .dy), \
     cr: range(.x[0]; .x[1]; .dx), \
     k: 0, \
     r: 0, \
     i: 0, \
   } | until(.r * .r + .i * .i > 4 or .k == 94; { \
         cr,
         ci,
         k: (.k + 1),
         r: (.r * .r - .i * .i + .cr),
         i: (.r * .i * 2 + .ci) \
       }) \
   | [.k + 32] | implode"

It iterates to a maximum depth of 94: the number of printable ASCII characters, excluding space. The final two filters convert the iteration count into an ASCII character, and the -j and -r options tell jq to produce joined, raw output. So, if you have jq installed and an exactly 80-character-wide terminal, go ahead and run that script. You should see something like this:

!!!!!!!!!!!!!!!!!!!"""""""""""""""""""""""""""""""""""""""""""""""""""
!!!!!!!!!!!!!!!!!"""""""""""""""""""""""""""""""""""""""""""""""""""""
!!!!!!!!!!!!!!!"""""""""""""""###########"""""""""""""""""""""""""""""
!!!!!!!!!!!!!!"""""""""#########################""""""""""""""""""""""
!!!!!!!!!!!!"""""""################$$$$$%3(%%$$$####""""""""""""""""""
!!!!!!!!!!!"""""################$$$$$$%%&'+)+J%$$$$####"""""""""""""""
!!!!!!!!!!"""################$$$$$$$%%%&()D8+(&%%$$$$#####""""""""""""
!!!!!!!!!""################$$$$$$$%%&&'(.~~~~2(&%%%%$$######""""""""""
!!!!!!!!""##############$$$$$$%%&'(((()*.~~~~-*)(&&&2%$$#####"""""""""
!!!!!!!""#############$$$$%%%%&&',J~0:~~~~~~~~~~4,./0/%$######""""""""
!!!!!!!"###########$$%%%%%%%&&&(.,^~~~~~~~~~~~~~~~~~4'&%$######"""""""
!!!!!!"#######$$$%%','''''''''(+4~~~~~~~~~~~~~~~~~~~1)3%$$######""""""
!!!!!!###$$$$$$%%%&'*04,-C-+))+8~~~~~~~~~~~~~~~~~~~~~/(&$$#######"""""
!!!!!!#$$$$$$%%%%&'(+2~~~~~~~/0~~~~~~~~~~~~~~~~~~~~~~?'%$$$######"""""
!!!!!!$$$$$&&&&'(,-.6~~~~~~~~~A~~~~~~~~~~~~~~~~~~~~~~(&%$$$######"""""
!!!!!!`ce~~ku{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~,('&%$$$#######""""
!!!!!!$$$$$&&&&'(,-.6~~~~~~~~~A~~~~~~~~~~~~~~~~~~~~~~(&%$$$######"""""
!!!!!!#$$$$$$%%%%&'(+2~~~~~~~/0~~~~~~~~~~~~~~~~~~~~~~?'%$$$######"""""
!!!!!!###$$$$$$%%%&'*04,-C-+))+8~~~~~~~~~~~~~~~~~~~~~/(&$$#######"""""
!!!!!!"#######$$$%%','''''''''(+4~~~~~~~~~~~~~~~~~~~1)3%$$######""""""
!!!!!!!"###########$$%%%%%%%&&&(.,^~~~~~~~~~~~~~~~~~4'&%$######"""""""
!!!!!!!""#############$$$$%%%%&&',J~0:~~~~~~~~~~4,./0/%$######""""""""
!!!!!!!!""##############$$$$$$%%&'(((()*.~~~~-*)(&&&2%$$#####"""""""""
!!!!!!!!!""################$$$$$$$%%&&'(.~~~~2(&%%%%$$######""""""""""
!!!!!!!!!!"""################$$$$$$$%%%&()D8+(&%%$$$$#####""""""""""""
!!!!!!!!!!!"""""################$$$$$$%%&'+)+L%$$$$####"""""""""""""""
!!!!!!!!!!!!"""""""################$$$$$%3(%%$$$####""""""""""""""""""
!!!!!!!!!!!!!!"""""""""#########################""""""""""""""""""""""
!!!!!!!!!!!!!!!"""""""""""""""###########"""""""""""""""""""""""""""""
!!!!!!!!!!!!!!!!!"""""""""""""""""""""""""""""""""""""""""""""""""""""
!!!!!!!!!!!!!!!!!!!"""""""""""""""""""""""""""""""""""""""""""""""""""

Tweaking the input parameters, it scales up nicely:

As demonstrated by the GIF, it’s very slow compared to more reasonable implementations, but I wouldn’t expect otherwise. It could be turned into a zoom animation just by feeding it more input objects with different parameters. It will produce one full “image” per input. Capturing an animation is left as an exercise for the reader.

Modifying the Middle of a zlib Stream

I recently ran into a problem where I needed to modify bytes at the beginning of an existing zlib stream. My program creates a file in a format I do not control, and the file format has a header indicating the total, uncompressed data size, followed immediately by the data. The tricky part is that the header and data are zlib compressed together, and I don’t know how much data there will be until I’ve collected it all. Sometimes it’s many gigabytes. I don’t know how to fill out the header when I start, and I can’t rewrite it when I’m done since it’s compressed in the zlib stream … or so I thought.

nelem samples[nelem]

My original solution was not to compress anything until it had gathered the entirety of the data. The input would get concatenated into a huge buffer, then finally compressed and written out. It’s not ideal, because the program uses a lot more memory than it theoretically needs to, especially if the data is highly compressible. It would be far better to compress the data as it arrives and somehow update the header later.

My first thought was to ask zlib to leave the header uncompressed, then enable compression (deflateParams()) for the data. I’d work out the magic offset and overwrite the uncompressed header bytes once I’m done. There are two major issues with this: the stream’s adler32 checksum covers the header bytes, and later compressed data may back-reference them. I’ll address each in turn.

Fixing the checksum

Ignoring the second problem for a moment, I could fix the checksum by computing it myself. When I overwrite my uncompressed header bytes, I could also overwrite the checksum at the end of the compressed stream. For illustration, here’s a simple example implementation of adler32 (from Wikipedia):

#define MOD_ADLER 65521

uint32_t
example_adler32(uint8_t *data, size_t len)
{
    uint32_t a = 1;
    uint32_t b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

If you think about this for a moment, you may notice this puts me back at square one. If I don’t know the header, then I don’t know the checksum value at the end of the header, going into the data buffer. I’d need to buffer all the data to compute the checksum. Fortunately adler32 has the nice property that two checksums can be concatenated as if they were one long stream. In a malicious context this is known as a length extension attack, but it’s a real benefit here.
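
To make the property concrete, here’s a hand-rolled combine, reusing MOD_ADLER and the conventions from the snippet above. It’s only to illustrate why concatenation works; as shown next, zlib already provides this.

uint32_t
example_adler32_combine(uint32_t adler1, uint32_t adler2, size_t len2)
{
    uint32_t a1 = adler1 & 0xffff, b1 = adler1 >> 16;
    uint32_t a2 = adler2 & 0xffff, b2 = adler2 >> 16;
    uint32_t rem = len2 % MOD_ADLER;
    /* a is just a running byte sum, so the pieces add directly (minus
       the second stream's initial 1); b additionally picks up (a1 - 1)
       once for each of the len2 bytes in the second stream. */
    uint32_t a = (a1 + a2 + MOD_ADLER - 1) % MOD_ADLER;
    uint32_t b = (b1 + b2 + (uint64_t)rem * (a1 + MOD_ADLER - 1)) % MOD_ADLER;
    return (b << 16) | a;
}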

It’s like the zlib authors anticipated my needs, because the zlib library has a function exactly for this:

uLong adler32_combine(uLong adler1, uLong adler2, z_off_t len2);

I just have to keep track of the data checksum adler2 and I can compute the proper checksum later.

uint64_t total = 0;
uint32_t data_adler = adler32(0, 0, 0); // initial value
while (processing_input) {
    // ...
    data_adler = adler32(data_adler, data, size);
    total += size;
}
// ...
uint32_t header_adler = adler32(0, 0, 0);
header_adler = adler32(header_adler, header, header_size);
uint32_t adler = adler32_combine(header_adler, data_adler, total);

Preventing back-references

This part is more complicated and it helps to have some familiarity with zlib. Every time zlib is asked to compress data, it’s given a flush parameter. Under normal operation, this value is always Z_NO_FLUSH until the end of the stream, in which case it’s finalized with Z_FINISH. Other flushing options force it to emit data sooner at the cost of reduced compression ratio. This would primarily be used to eliminate output latency on interactive streams (e.g. compressed SSH sessions).

The necessary flush option for this situation is Z_FULL_FLUSH. It forces out all output data and resets the dictionary: a fence. Future inputs cannot reference anything before a full flush. Since the header is uncompressed, it will not reference itself either. Ignoring the checksum problem, I can safely modify these bytes.

Putting it all together

To fully demonstrate all of this, I’ve put together an example using one of my favorite image formats, Netpbm P6.

In the P6 format, the image header is an ASCII description of the image’s dimensions followed immediately by raw pixel data.

P6
width height
depth
[RGB bytes]

It’s a bit contrived, but it’s the project I used to work it all out. The demo reads arbitrary raw byte data on standard input and uses it to produce a zlib-compressed PPM file on standard output. It doesn’t know the size of the input ahead of time, nor does it naively buffer it all. There’s no dynamic allocation (except for what zlib does internally), but the program can process arbitrarily large input. The only requirement is that standard output is seekable. Using the technique described above, it patches the header within the zlib stream with the final image dimensions after the input has been exhausted.

If you’re on a Debian system, you can use zlib-flate to decompress raw zlib streams (the gzip format wraps the same DEFLATE data, but the gzip tool can’t read bare zlib streams). Alternatively your system’s openssl program may have zlib support. Here’s running it on itself as input. Remember, you can’t pipe zppm straight into zlib-flate because its output needs to be seekable in order to write the header.

$ ./zppm < zppm > out.ppmz
$ zlib-flate -uncompress < out.ppmz > out.ppm

Unfortunately due to the efficiency-mindedness of zlib, its use requires careful bookkeeping that’s easy to get wrong. It’s a little machine that at each step needs to be either fed more input or its output buffer cleared. Even with all the error checking stripped away, it’s still too much to go over in full here, but I’ll summarize the parts.

First I process an empty buffer with compression disabled. The output buffer will be discarded, so the input buffer could be left uninitialized, but I don’t want to upset anyone. All I need is the output size, which I use to seek over the to-be-written header. I use Z_FULL_FLUSH as described, and there’s no loop because I presume my output buffer is easily big enough for this.

char bufin[4096];
char bufout[4096];

z_stream z = {
    .next_in = (void *)bufin,
    .avail_in = HEADER_SIZE,
    .next_out = (void *)bufout,
    .avail_out = sizeof(bufout),
};
deflateInit(&z, Z_NO_COMPRESSION);
memset(bufin, 0, HEADER_SIZE);
deflate(&z, Z_FULL_FLUSH);
fseek(stdout, sizeof(bufout) - z.avail_out, SEEK_SET);

Next I enable compression and reset the checksum. This makes zlib track the checksum for all of the non-header input. Otherwise I’d be throwing away all its checksum work and repeating it myself.

deflateParams(&z, Z_BEST_COMPRESSION, Z_DEFAULT_STRATEGY);
z.adler = adler32(0, 0, 0);

I won’t include it in this article, but what follows is a standard zlib compression loop, consuming all the input data. There’s one key difference compared to a normal zlib compression loop: when the input is exhausted, instead of Z_FINISH I use Z_SYNC_FLUSH to force everything out. The problem with Z_FINISH is that it will write the checksum, but we’re not ready for that.
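
For orientation only, a minimal sketch of such a loop (not the program’s actual code) might look like this, reusing bufin, bufout, and z from above. read_input() is a hypothetical stand-in for however the program gathers its data, returning 0 at end of input.

size_t len;
do {
    len = read_input(bufin, sizeof(bufin));  /* hypothetical input source */
    z.next_in = (void *)bufin;
    z.avail_in = len;
    int flush = len ? Z_NO_FLUSH : Z_SYNC_FLUSH;
    do {
        z.next_out = (void *)bufout;
        z.avail_out = sizeof(bufout);
        deflate(&z, flush);
        fwrite(bufout, 1, sizeof(bufout) - z.avail_out, stdout);
    } while (z.avail_out == 0);
} while (len);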

With all the input processed, it’s time to go back to rewrite the header. Rather than mess around with magic byte offsets, I start a second, temporary zlib stream and do the Z_FULL_FLUSH like before, but this time with the real header. In deciding the header size, I reserved 6 characters for the width and 10 characters for the height.

sprintf(bufin, "P6\n%-6lu\n%-10lu\n255\n", width, height);
uint32_t adler = adler32(0, 0, 0);
adler = adler32(adler, (void *)bufin, HEADER_SIZE);

z_stream zh = {
    .next_in = (void *)bufin,
    .avail_in = HEADER_SIZE,
    .next_out = (void *)bufout,
    .avail_out = sizeof(bufout),
};
deflateInit(&zh, Z_NO_COMPRESSION);
deflate(&zh, Z_FULL_FLUSH);
fseek(stdout, 0, SEEK_SET);
fwrite(bufout, 1, sizeof(bufout) - zh.avail_out, stdout);
fseek(stdout, 0, SEEK_END);
deflateEnd(&zh);

The header is now complete, so I can go back to finish the original compression stream. Again, I assume the output buffer is big enough for these final bytes.

z.adler = adler32_combine(adler, z.adler, z.total_in - HEADER_SIZE);
z.next_out = (void *)bufout;
z.avail_out = sizeof(bufout);
deflate(&z, Z_FINISH);
fwrite(bufout, 1, sizeof(bufout) - z.avail_out, stdout);
deflateEnd(&z);

It’s a lot more code than I expected, but it wasn’t too hard to work out. If you want to get into the nitty gritty and really hack a zlib stream, check out RFC 1950 and RFC 1951.

Inspecting C's qsort Through Animation

The C standard library includes a qsort() function for sorting arbitrary buffers given a comparator function. The name comes from its original Unix implementation, “quicker sort,” a variation of the well-known quicksort algorithm. The C standard doesn’t specify an algorithm, except to say that it may be unstable (C99 §7.20.5.2¶4) — equal elements have an unspecified order. As such, different C libraries use different algorithms, and even when using the same algorithm they make different implementation tradeoffs.

I added a drawing routine to a comparison function to see what the sort function was doing for different C libraries. Every time it’s called for a comparison, it writes out a snapshot of the array as a Netpbm PPM image. It’s easy to turn concatenated PPMs into a GIF or video. Here’s my code if you want to try it yourself:

Adjust the parameters at the top to taste. Rather than call rand() in the standard library, I included xorshift64star() with a hardcoded seed so that the array will be shuffled exactly the same across all platforms. This makes for a better comparison.
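
The program itself isn’t reproduced in this article, but a rough sketch of its shape might look like the following. This is an illustration, not the original code: the array size, seed, and coloring here are arbitrary choices.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N 60

static int array[N];

/* xorshift64*: same shuffle on every platform, unlike rand() */
static uint64_t
xorshift64star(uint64_t *s)
{
    *s ^= *s >> 12;
    *s ^= *s << 25;
    *s ^= *s >> 27;
    return *s * UINT64_C(0x2545f4914f6cdd1d);
}

/* write one PPM snapshot of the current array state */
static void
frame(void)
{
    printf("P6\n%d 1\n255\n", N);
    for (int i = 0; i < N; i++) {
        unsigned char rgb[3] = {
            (unsigned char)(255 * array[i] / N),
            0,
            (unsigned char)(255 - 255 * array[i] / N)
        };
        fwrite(rgb, sizeof(rgb), 1, stdout);
    }
}

/* comparator that snapshots the array on every call */
static int
cmp(const void *pa, const void *pb)
{
    frame();
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);
}

int
main(void)
{
    uint64_t rng = UINT64_C(0x7c3de5f65a1b2c3d);
    for (int i = 0; i < N; i++)
        array[i] = i;
    for (int i = N - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        int j = xorshift64star(&rng) % (i + 1);
        int t = array[i];
        array[i] = array[j];
        array[j] = t;
    }
    qsort(array, N, sizeof(array[0]), cmp);
    frame();
    return 0;
}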

To get an optimized GIF on unix-like systems, run it like so. (Microsoft’s UCRT currently has serious bugs with pipes, so it was run differently in that case.)

./a.out | convert -delay 10 ppm:- gif:- | gifsicle -O3 > sort.gif

The number of animation frames reflects the efficiency of the sort, but this isn’t really a benchmark. The input array is fully shuffled, and real data often isn’t. For a benchmark, have a look at a libc qsort() shootout of sorts instead.

To help you follow along, clicking on any animation will restart it.

glibc

Sorted in 307 frames. glibc prefers to use mergesort, which, unlike quicksort, isn’t an in-place algorithm, so it has to allocate memory. That allocation could fail for huge arrays, and, since qsort() can’t fail, it uses quicksort as a backup. You can really see the mergesort in action: changes are made that we cannot see until later, when it’s copied back into the original array.

diet

Sorted in 503 frames. diet libc is an alternative C standard library for Linux. It’s optimized for size, which shows through its slower performance. It looks like a quicksort that always chooses the last element as the pivot.

musl

Sorted in 637 frames. musl libc is another alternative C standard library for Linux. It’s my personal preference when I statically link Linux binaries. Its qsort() looks a lot like a heapsort, and with some research I see it’s actually smoothsort, a heapsort variant.

BSD

Sorted in 354 frames. I ran it on both OpenBSD and FreeBSD with identical results, so, unsurprisingly, they share an implementation. It’s quicksort, and what’s neat about it is at the beginning you can see it searching for a median for use as the pivot. This helps avoid the O(n^2) worst case.

BSD also includes a mergesort() with the same prototype, except with an int return for reporting failures. This one sorted in 247 frames. Like glibc before, there’s some behind-the-scenes work that isn’t captured. But even more, notice how the markers disappear during the merge? It’s running the comparator against copies, stored outside the original array. Sneaky!

Again, BSD also includes heapsort(), so I ran that too. It sorted in 418 frames. It definitely looks like a heapsort, and the worse performance is similar to musl’s. It seems heapsort is a poor fit for this data.

Cygwin

It turns out Cygwin borrowed its qsort() from BSD. It’s pixel identical to the above. I hadn’t noticed until I looked at the frame counts.

MSVCRT.DLL (MinGW) and UCRT (Visual Studio)

MinGW builds against MSVCRT.DLL, found on every Windows system despite its unofficial status. Until recently Microsoft didn’t include a C standard library as part of the OS, but that changed with their Universal CRT (UCRT) announcement. I thought I’d try them both.

Turns out they borrowed their old qsort() for the UCRT, and the result is the same: sorted in 417 frames. It chooses a pivot from the median of the ends and the middle, swaps the pivot to the middle, then partitions. Looking to the middle for the pivot makes sorting pre-sorted arrays much more efficient.

Pelles C

Finally I ran it against Pelles C, a C compiler for Windows. It sorted in 463 frames. I can’t find any information about it, but it looks like some sort of hybrid between quicksort and insertion sort. Like BSD qsort(), it finds a good median for the pivot, partitions the elements, and if a partition is small enough, it switches to insertion sort. This should behave well on mostly-sorted arrays, but poorly on well-shuffled arrays (like this one).

More Implementations

That’s everything that was readily accessible to me. If you can run it against something new, I’m certainly interested in seeing more implementations.

How to Read and Write Other Process Memory

I recently put together a little game memory cheat tool called MemDig. It can find the address of a particular game value (score, lives, gold, etc.) after being given that value at different points in time. With the address, it can then modify that value to whatever is desired.

I’ve been using tools like this going back 20 years, but I never tried to write one myself until now. There are many memory cheat tools to pick from these days, the most prominent being Cheat Engine. These tools use the platform’s debugging API, so of course any good debugger could do the same thing, though a debugger won’t be specialized appropriately (e.g. locating the particular address and locking its value).

My motivation was bypassing an in-app purchase in a single player Windows game. I wanted to convince the game I had made the purchase when, in fact, I hadn’t. Once I had it working successfully, I ported MemDig to Linux since I thought it would be interesting to compare. I’ll start with Windows for this article.

Windows

Only three Win32 functions are needed, and you could almost guess at how it works.

It’s very straightforward and, for this purpose, is probably the simplest API for any platform (see update).

As you probably guessed, you first need to open the process, given its process ID (integer). You’ll need to select the desired access rights as a bit set. To read memory, you need the PROCESS_VM_READ and PROCESS_QUERY_INFORMATION rights. To write memory, you need the PROCESS_VM_WRITE and PROCESS_VM_OPERATION rights. Alternatively you could just ask for all rights with PROCESS_ALL_ACCESS, but I prefer to be precise.

DWORD access = PROCESS_VM_READ |
               PROCESS_QUERY_INFORMATION |
               PROCESS_VM_WRITE |
               PROCESS_VM_OPERATION;
HANDLE proc = OpenProcess(access, FALSE, pid);

And then to read or write:

void *addr; // target process address
SIZE_T written;
ReadProcessMemory(proc, addr, &value, sizeof(value), &written);
// or
WriteProcessMemory(proc, addr, &value, sizeof(value), &written);

Don’t forget to check the return value and verify written. Finally, don’t forget to close it when you’re done.

CloseHandle(proc);

That’s all there is to it. For the full cheat tool you’d need to find the mapped regions of memory, via VirtualQueryEx. It’s not as simple, but I’ll leave that for another article.

Linux

Unfortunately there’s no standard, cross-platform debugging API for unix-like systems. Most have a ptrace() system call, though each works a little differently. Note that ptrace() is not part of POSIX, but appeared in System V Release 4 (SVr4) and BSD, then copied elsewhere. The following will all be specific to Linux, though the procedure is similar on other unix-likes.

In typical Linux fashion, if it involves other processes, you use the standard file API on the /proc filesystem. Each process has a directory under /proc named as its process ID. In this directory is a virtual file called “mem”, which is a file view of that process’ entire address space, including unmapped regions.

char file[64];
sprintf(file, "/proc/%ld/mem", (long)pid);
int fd = open(file, O_RDWR);

The catch is that while you can open this file, you can’t actually read from or write to it without attaching to the process as a debugger. You’ll just get EIO errors. To attach, use ptrace() with PTRACE_ATTACH. This asynchronously delivers a SIGSTOP signal to the target, which has to be waited on with waitpid().

You could select the target address with lseek(), but it’s cleaner and more efficient just to do it all in one system call with pread() and pwrite(). I’ve left out the error checking, but the return value of each function should be checked:

ptrace(PTRACE_ATTACH, pid, 0, 0);
waitpid(pid, NULL, 0);

off_t addr = ...; // target process address
pread(fd, &value, sizeof(value), addr);
// or
pwrite(fd, &value, sizeof(value), addr);

ptrace(PTRACE_DETACH, pid, 0, 0);

The process will (and must) be stopped during this procedure, so do your reads/writes quickly and get out. The kernel will deliver the writes to the other process’ virtual memory.

Like before, don’t forget to close.

close(fd);

To find the mapped regions in the real cheat tool, you would read and parse the virtual text file /proc/pid/maps. I don’t know if I’d call this stringly-typed method elegant — the kernel converts the data into string form and the caller immediately converts it right back — but that’s the official API.
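
The real tool does more with each line, but a minimal sketch of that scan, keeping only writable regions, might look like this (error checking omitted as before; the maps format is start-end, permissions, offset, device, inode, path):

char path[64];
sprintf(path, "/proc/%ld/maps", (long)pid);
FILE *maps = fopen(path, "r");
char line[4096];
while (fgets(line, sizeof(line), maps)) {
    unsigned long beg, end;
    char perms[8];
    /* keep only readable + writable mappings */
    if (sscanf(line, "%lx-%lx %7s", &beg, &end, perms) == 3 &&
            perms[0] == 'r' && perms[1] == 'w')
        printf("%lx-%lx\n", beg, end);
}
fclose(maps);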

Update: Konstantin Khlebnikov has pointed out the process_vm_readv() and process_vm_writev() system calls, available since Linux 3.2 (January 2012) and glibc 2.15 (March 2012). These system calls do not require ptrace(), nor does the remote process need to be stopped. They’re equivalent to ReadProcessMemory() and WriteProcessMemory(), except there’s no requirement to first “open” the process.
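
A minimal sketch of the read side, assuming a kernel and glibc new enough to provide the call (remote_read() is just an illustrative wrapper name):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* read len bytes at a remote address in the target process */
ssize_t
remote_read(pid_t pid, void *addr, void *buf, size_t len)
{
    struct iovec local  = {.iov_base = buf,  .iov_len = len};
    struct iovec remote = {.iov_base = addr, .iov_len = len};
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}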

Two Years as a High School Mentor

Over the past two years I’ve been mentoring a high school student, Dan Kelly, in software development through my workplace mentoring program. It’s been one of the most rewarding experiences of my life, far beyond my initial expectations. It didn’t just go well, it went extraordinarily well. One of my co-workers described the results as, “the single most successful technical investment I’ve ever seen made in a person.” Last week the mentorship effectively ended as he became a college freshman. While it’s fresh in my mind, I’d like to reflect on how it went and why this arrangement worked so particularly well for both of us.

Students come into work for a few hours at a time, a few days each week throughout the school year, typically after school. They’re supplied with their own computer in a mentor-arranged work space. In my case, Dan sat at a smaller makeshift desk in my office. The schedule is informal, and we frequently made adjustments as needed.

An indispensable feature is that the students are unpaid. Based on my observations, the primary reason many other mentorships aren’t nearly as effective is a failure to correctly leverage this. Some mentors put students directly to work on an existing, funded project. This is a mistake. Real world, practical projects are poor learning environments for beginners. The complexity inhibits development of the fundamentals, and, from a position of no experience, the student will have little ability to influence the project, stifling their motivation.

If being unpaid sounds unfair, remember that these students are coming in with essentially no programming experience. For several months, students are a time sink. Maybe they fiddled around with Python on their own, but it probably doesn’t really amount to anything meaningful (yet).

Being unpaid means students can be put to work on anything, even if unfunded — even if it’s only for the student’s own benefit. The very first thing I did with Dan was to walk him through an install of Debian (wiping the near-useless, corporate Windows image). I wanted him fully in control of, and responsible for, his own development environment. I never sat at his keyboard, and all exercises took place at his computer.

From here I took a bit of an unconventional approach: I decided he was going to learn from the bottom up, starting with C. There would be no Fisher-Price-esque graphical programming language (yes, some mentors actually do this), not even an easier, education-friendly, but production-ready language like Python. I wanted him to internalize pointers before trusting him with real work. I showed him how to invoke gcc, the various types of files involved, how to invoke the linker, and lent him my hard copy of first edition K&R.

Similarly, I wasn’t going to start him off with some toy text editor. Like with C, he was going to learn real production tools from the start. I gave the Emacs run-down and started him on the tutorial (C-h t). If he changed his mind later after getting more experience, wanting to use something different, that would be fine.

Once he got the hang of everything so far, I taught him Git and Magit. Finally we could collaborate.

The first three months were spent on small exercises. He’d learn a little bit from K&R, describe to me what he learned, and I’d come up with related exercises, not unlike those on DailyProgrammer, to cement the concepts. I’d also translate concepts into modern C (e.g. C99).

With K&R complete, he was ready to go beyond simple exercises. Unfortunately, I wasn’t able to involve him in my own work at the time. His primary interests included graphics, so we decided on a multi-threaded, networked multi-processor raytracer.

It was a lot more educational for us both than I expected. I spent a significant amount of personal time learning graphics well enough to keep ahead: color spaces, blending, anti-aliasing, brushing up on linear algebra, optics, lighting models, modeling formats, Blender, etc. In a few cases I simply learned it from him. It reminds me of a passage from The Moon is a Harsh Mistress (the namesake of this blog):

I liked Prof. He would teach anything. Wouldn’t matter that he knew nothing about it; if pupil wanted it, he would smile and set a price, locate materials, stay a few lessons ahead. Or barely even if he found it tough—never pretended to know more than he did. Took algebra from him and by time we reached cubics I corrected his probs as often as he did mine—but he charged into each lesson gaily.

I started electronics under him, soon was teaching him. So he stopped charging and we went along together until he dug up an engineer willing to daylight for extra money—whereupon we both paid new teacher and Prof tried to stick with me, thumb-fingered and slow, but happy to be stretching his mind.

Since this was first and foremost an educational project, I decided we would use no libraries other than the (POSIX) C standard library. He would know how nearly every aspect of the program worked. The input (textures) and output formats were Netpbm PPMs, BMP files, and YUV4MPEG2 (video), each of which is very easy to read and write. It loaded 3D models in the easy-to-parse Wavefront .obj format. It also supported a text rendered overlay, initially using our own font format and later with fonts in the Glyph Bitmap Distribution Format — again, easy to parse and use. The text made it possible to produce demo videos without any additional editing (see the above video).

After a year on the raytracer, he had implemented most of what he wanted and I had an opportunity to involve him in my funded research. By this point he was very capable, quickly paying back his entire education.

Together we made rapid progress — much more than I could alone. I can’t go into the specifics, but much of his work was built on lessons learned from the raytracer, including an OpenGL display and SIMD-optimized physics engine. I also taught him x86 assembly, which he applied to co-author a paper, ROP Gadget Prevalence and Survival under Compiler-based Binary Diversification Schemes (soon to be published, wherein this will become a link).

To reiterate, an important part of this entire journey was the influence he had over his own work. He had say in the direction of each project. Until he started earning college intern pay (post-graduation), I had no ability to make him do anything he didn’t want to do. I could only rely on his motivation. Fortunately what motivates him is what also motivates me, so to find the framing for that motivation I only had to imagine myself in his shoes.

A Comment on Respect

An important aspect I hadn’t noticed until the second year was respect. Most of Dan’s interactions with other co-workers were very respectful. They listened to what he had to say and treated it with the same respect as they would a regular, full-time co-worker’s. I can’t emphasize enough how invaluable this is for a student.

I bring this up because there were a handful of interactions that weren’t so respectful. A few individuals, when discovering his status, made jokes (“Hey, where’s my coffee?”) or wouldn’t take his comments seriously. Please don’t be one of these people.

What comes next?

In many ways, Dan is starting college in a stronger position than I was in when I finished college. The mentorship provided a ton of vocational, practical experience that doesn’t normally happen in a college course. However, there’s plenty of computer science theory that I’m not so great at teaching. For example, he got hands-on experience with practical data structures, but only picked up a shallow understanding of their analysis (Big O, etc.). College courses will fill those gaps, and they will be learned more easily with a thorough intuition of their context.

As he moves on, I’ll be starting all over again with a new student.

An Elfeed Database Analysis

The end of the month marks Elfeed’s third birthday. Surprising to nobody, it’s also been three years of heavy, daily use by me. While I’ve used Elfeed concurrently on a number of different machines over this period, I’ve managed to keep an Elfeed database index with a lineage going all the way back to the initial development stages, before the announcement. It’s a large, organically-grown database that serves as a daily performance stress test. Hopefully this means I’m one of the first people to have trouble if an invisible threshold is ever exceeded.

I’m also the sort of person who gets excited when I come across an interesting dataset, and I have this gem sitting right in front of me. So a couple of days ago I pushed a new Elfeed function, elfeed-csv-export, which exports a database index into three CSV files. These are intended to serve as three tables in a SQL database, exposing the database to interesting relational queries and joins. Entry content (HTML, etc.) has always been considered volatile, so this is not exported. The export function isn’t interactive (yet?), so if you want to generate your own you’ll need to (require 'elfeed-csv) and evaluate it yourself.

All the source code for performing the analysis below on your own database can be found here:

The three exported tables are feeds, entries, and tags. Here are the corresponding columns (optional CSV header) for each:

feeds:   url, title, canonical-url, author
entries: id, feed, title, link, date
tags:    entry, feed, tag

And here’s the SQLite schema I’m using for these tables:

CREATE TABLE feeds (
    url TEXT PRIMARY KEY,
    title TEXT,
    canonical_url TEXT,
    author TEXT
);

CREATE TABLE entries (
    id TEXT NOT NULL,
    feed TEXT NOT NULL REFERENCES feeds (url),
    title TEXT,
    link TEXT NOT NULL,
    date REAL NOT NULL,
    PRIMARY KEY (id, feed)
);

CREATE TABLE tags (
    entry TEXT NOT NULL,
    feed TEXT NOT NULL,
    tag TEXT NOT NULL,
    FOREIGN KEY (entry, feed) REFERENCES entries (id, feed)
);

Web authors are notoriously awful at picking actually-unique entry IDs, even when using the smarter option, Atom. I still simply don’t trust that entry IDs are unique, so, as usual, I’ve qualified them by their source feed URL, hence the primary key on both columns in entries.

At this point I wish I had collected a lot more information. If I were to start fresh today, Elfeed’s database schema would not only fully match Atom’s schema, but also exceed it with additional logging:

I may start tracking some of these. If I don’t, I’ll be kicking myself three years from now when I look at this again.

A look at my index

So just how big is my index? It’s 25MB uncompressed, 2.5MB compressed. I currently follow 117 feeds, but my index includes 43,821 entries from 309 feeds. These entries are marked with 53,360 tags from a set of 35 unique tags. Some of these datapoints are the result of temporarily debugging Elfeed issues and don’t represent content that I actually follow. I’m more careful these days to test in a temporary database so as to avoid contamination. Some are duplicates due to feeds changing URLs over the years. Some are artifacts from old bugs. This all represents a bit of noise, but should be negligible. During my analysis I noticed some of these anomalies and took a moment to clean up obviously bogus data (weird dates, etc.), all by adjusting tags.

The first thing I wanted to know is the weekday frequency. A number of times I’ve blown entire Sundays working on Elfeed, and, as if to frustrate my testing, it’s not unusual for several hours to pass between new entries on Sundays. Is this just my perception or are Sundays really that slow?

Here’s my query. I’m using SQLite’s strftime to shift the result into my local time zone, Eastern Time. This time zone is the source, or close to the source, of a large amount of the content. This also automatically accounts for daylight savings time, which can’t be done with a simple divide and subtract.

SELECT tag,
       cast(strftime('%w', date, 'unixepoch', 'localtime') AS INT) AS day,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, day;

The most frequent tag (13,666 appearances) is “youtube”, which marks every YouTube video, and I’ll use gnuplot to visualize it. The input “file” is actually a command since gnuplot is poor at filtering data itself, especially for histograms.

plot '< grep ^youtube, weekdays.csv' using 2:3 with boxes

Wow, things do quiet down dramatically on weekends! From the glass-half-full perspective, this gives me a chance to catch up when I inevitably fall behind on these videos during the week.

The same is basically true for other types of content, including “comic” (12,465 entries) and “blog” (7,505 entries).

However, “emacs” (2,404 entries) is a different story. It doesn’t slow down on the weekend, but Emacs users sure love to talk about Emacs on Mondays. In my own index, this spike largely comes from Planet Emacsen. Initially I thought maybe this was an artifact of Planet Emacsen’s date handling — i.e. perhaps it does a big fetch on Mondays and groups up the dates — but I double checked: they pass the date directly through from the original articles.

Conclusion: Emacs users love Mondays. Or maybe they hate Mondays and talk about Emacs as an escape.

I can reuse the same query to look at different time scales. When during the day do entries appear? Adjusting the time zone here becomes a lot more important.

SELECT tag,
       cast(strftime('%H', date, 'unixepoch', 'localtime') AS INT) AS hour,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, hour;

Emacs bloggers tend to follow a nice Eastern Time sleeping schedule. (I wonder how Vim bloggers compare, since, as an Emacs user, I naturally assume Vim users’ schedules are as undisciplined as their bathing habits.) However, this might also be the prolific Irreal breaking the curve.

The YouTube channels I follow are a bit more erratic, but there’s still a big drop in the early morning and a spike in the early afternoon. It’s unclear if the timestamp published in the feed is the upload time or the publication time. This would make a difference in the result (e.g. overnight video uploads).

Do you suppose there’s a slow month?

SELECT tag,
       cast(strftime('%m', date, 'unixepoch', 'localtime') AS INT) AS month,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, month;

December is a big drop across all tags, probably for the holidays. Both “comic” and “blog” also have an interesting drop in August. For brevity, I’ll only show one. This might be partially due to my not waiting until the end of this month for this analysis, since there are only 2.5 Augusts in my 3-year dataset.

Unfortunately the timestamp is the only direct numerical quantity in the data. So far I’ve been binning data points and counting to get a second numerical quantity. Everything else is text, so I’ll need to get more creative to find other interesting relationships.

So let’s have a look at the lengths of entry titles.

SELECT tag,
       length(title) AS length,
       count(*) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, length
ORDER BY length;

The shortest are the webcomics. I’ve complained about poor webcomic titles before, so this isn’t surprising. The spikes are from comics that follow a strict (uncreative) title format.

Emacs article titles follow a nice distribution. You can tell these are programmers because so many titles are exactly 32 characters long. Picking this number is such a natural instinct that we aren’t even aware of it. Or maybe all their database schemas have VARCHAR(32) title columns?

Blogs in general follow a nice distribution. The big spike is from the Dwarf Fortress development blog, which follows a strict date format.

The longest on average are YouTube videos. This is largely due to the kinds of videos I watch (“Let’s Play” videos), which tend to have long, predictable names.

And finally, here’s the most interesting-looking graph of them all.

SELECT ((date - 4*60*60) % (24*60*60)) / (60*60) AS day_time,
       length(title) AS length
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed;

This is the title length versus time of day (not binned). Each point is one of the 53,360 posts.

set style fill transparent solid 0.25 noborder
set style circle radius 0.04
plot 'length-vs-daytime.csv' using 1:2 with circles

(This is a good one to follow through to the full size image.)

Again, all Eastern Time since I’m self-centered like that. Vertical lines are authors rounding their post dates to the hour. Horizontal lines are the length spikes from above, such as the line of entries at title length 10 in the evening (Dwarf Fortress blog). There’s the mid-day cloud of entries of various title lengths, with the shortest title cloud around mid-morning. That’s probably when many of the webcomics come up.

Additional analysis could look further at textual content, beyond simply length, in some quantitative way (n-grams? soundex?). But mostly I really need to keep track of more data!

Automatic Deletion of Incomplete Output Files

Conventionally, a program that creates an output file will delete its incomplete output should an error occur while writing the file. It’s risky to leave behind a file that the user may rightfully confuse for a valid file. They might not have noticed the error.

For example, compression programs such as gzip, bzip2, and xz when given a compressed file as an argument will create a new file with the compression extension removed. They write to this file as the compressed input is being processed. If the compressed stream contains an error in the middle, the partially-completed output is removed.

There are exceptions of course, such as programs that download files over a network. The partial result has value, especially if the transfer can be continued from where it left off. The convention is to append another extension, such as “.part”, to indicate a partial output.

The straightforward solution is to always delete the file as part of error handling. A non-interactive program would report the error on standard error, delete the file, and exit with an error code. However, there are at least two situations where error handling would be unable to operate: unhandled signals (usually including a segmentation fault) and power failures. A partial or corrupted output file will be left behind, possibly looking like a valid file.

A common, more complex approach is to name the file differently from its final name while being written. If written successfully, the completed file is renamed into place. This is already required for durable replacement, so it’s basically free for many applications. In the worst case, where the program is unable to clean up, the obviously incomplete file is left behind only wasting space.
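
A minimal sketch of that convention, combining delete-on-error with rename-into-place. The write_output() name and the “.tmp” suffix are illustrative assumptions, it relies on POSIX rename() semantics (atomic replacement on the same filesystem), and it omits the fsync() a fully durable replacement would also need.

#include <stdio.h>

int
write_output(const char *final, const void *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", final);
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t n = fwrite(buf, 1, len, f);
    if (fclose(f) != 0 || n != len) {
        remove(tmp);  /* delete the incomplete output */
        return -1;
    }
    return rename(tmp, final);  /* only now does it get its real name */
}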

Looking to be more robust, I had the following misguided idea: Rely completely on the operating system to perform cleanup in the case of a failure. Initially the file would be configured to be automatically deleted when the final handle is closed. This takes care of all abnormal exits, and possibly even power failures. The program can just exit on error without deleting the file. Once written successfully, the automatic-delete indicator is cleared so that the file survives.

The target application for this technique supports both Linux and Windows, so I would need to figure it out for both systems. On Windows, there’s the flag FILE_FLAG_DELETE_ON_CLOSE. I’d just need to find a way to clear it. On POSIX, the file would be unlinked while being written, then linked into the filesystem on success. The latter turns out to be a lot harder than I expected.

Solution for Windows

I’ll start with Windows since the technique actually works fairly well here — ignoring the usual, dumb Win32 filesystem caveats. This is a little surprising, since it’s usually Win32 that makes these things far more difficult than they should be.

The primary Win32 function for opening and creating files is CreateFile. There are many options, but the key is FILE_FLAG_DELETE_ON_CLOSE. Here’s how an application might typically open a file for output.

DWORD access = GENERIC_WRITE;
DWORD create = CREATE_ALWAYS;
DWORD flags = FILE_FLAG_DELETE_ON_CLOSE;
HANDLE f = CreateFile("out.tmp", access, 0, 0, create, flags, 0);

This special flag asks Windows to delete the file as soon as the last handle to the file object is closed. Notice I said file object, not file, since these are different things. The catch: This flag is a property of the file object, not the file, and cannot be removed.

However, the solution is simple. Create a new link to the file so that it survives deletion. This even works for files residing on network shares.

CreateHardLink("out", "out.tmp", 0);
CloseHandle(f);  // deletes out.tmp file

The gotcha is that the underlying filesystem must be NTFS. FAT32 doesn’t support hard links. Unfortunately, since FAT32 remains the least common denominator and is still widely used for removable media, depending on the application, your users may expect support for saving files to FAT32. A workaround is probably required.

Solution for Linux

This is where things really fall apart. It’s just barely possible on Linux, it’s messy, and it’s not portable anywhere else. There’s no way to do this for POSIX in general.

My initial thought was to create a file then unlink it. Unlike the situation on Windows, files can be unlinked while they’re currently open by a process. These files are finally deleted when the last file descriptor (the last reference) is closed. Unfortunately, using unlink(2) to remove the last link to a file prevents that file from being linked again.

Instead, the solution is to use the relatively new (since Linux 3.11), Linux-specific O_TMPFILE flag when creating the file. Instead of a filename, this variation of open(2) takes a directory and creates an unnamed, temporary file in it. These files are special in that they’re permitted to be given a name in the filesystem at some future point.

For this example, I’ll assume the output is relative to the current working directory. If it’s not, you’ll need to open an additional file descriptor for the parent directory, and also use openat(2) to avoid possible race conditions (since paths can change from under you). The number of ways this can fail is already rapidly multiplying.

int fd = open(".", O_TMPFILE|O_WRONLY, 0600);

The catch is that only a handful of filesystems support O_TMPFILE. It’s like the FAT32 problem above, but worse. You could easily end up in a situation where it’s not supported, and will almost certainly require a workaround.

Linking a file from a file descriptor is where things get messier. The file descriptor must be linked with linkat(2) from its name on the /proc virtual filesystem, constructed as a string. The following snippet comes straight from the Linux open(2) manpage.

char buf[64];
sprintf(buf, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, buf, AT_FDCWD, "out", AT_SYMLINK_FOLLOW);

Even on Linux, /proc isn’t always available, such as within a chroot or a container, so this part can fail as well. In theory there’s a way to do this with the Linux-specific AT_EMPTY_PATH and avoid /proc, but I couldn’t get it to work.

// Note: this doesn't actually work for me.
linkat(fd, "", AT_FDCWD, "out", AT_EMPTY_PATH);

Given the poor portability (even within Linux), the number of ways this can go wrong, and that a workaround is definitely needed anyway, I’d say this technique is worthless. I’m going to stick with the tried-and-true approach for this one.
