Why Aren't There C Conferences?

This article was discussed on Hacker News.

Nearly every widely-used programming language has at least one regular conference dedicated to discussing it. Heck, even Lisp has one. It’s a place to talk about the latest developments of the language, recent and upcoming standards, and so on. However, C is a notable exception. Despite its role as the foundation of the entire software ecosystem, there aren’t any regular conferences about C. I have a couple of theories about why.

First, C is so fundamental and ubiquitous that a conference about C would be too general. There are so many different uses ranging across embedded development, operating system kernels, systems programming, application development, and, most recently, web development (WebAssembly). It’s just not a cohesive enough topic. Any conference that might be about C is instead focused on some particular subset of its application. It’s not a C conference, it’s a database conference, or an embedded conference, or a Linux conference, or a BSD conference, etc.

Second, C has a tendency to be conservative, changing and growing very slowly. This is a feature, and one that is often undervalued by developers. (In fact, I’d personally like to see a future revision that makes the C language specification smaller and simpler, rather than accumulating more features.) The last major revision to C happened in 1999 (C99). There was a minor revision in 2011 (C11), and an even smaller revision in 2018 (C17). If there were a C conference, recent changes to the language wouldn’t be a very fruitful topic.

However, the tooling has advanced significantly in recent years, especially with the advent of LLVM and Clang. This is largely driven by the C++ community, and C has benefited significantly as a side effect due to the overlap between the two languages. Those are topics worthy of conferences, but the conferences that cover them are really C++ conferences.

The closest thing we have to a C conference every year is CppCon. A lot of CppCon isn’t really just about C++, and the subjects of many of the talks are easily applied to C, since C++ builds so much upon C. In a sense, a subset of CppCon could be considered a C conference. That’s what I’m looking for when I watch the CppCon presentations each year on YouTube.

Starting last year, I began a list of all the talks that I thought would be useful to C programmers. Some are entirely relevant to C, others just have significant portions that are relevant to C. When someone asks about where they can find a C conference, I send them my list.

I’m sharing them here so you can bookmark this page and never return again.

2017

Here’s the list for CppCon 2017. These are roughly ordered from highest to lowest recommendation:

2018

The final CppCon 2018 videos were uploaded this week, so my 2018 listing can be complete:

There were three talks strictly about C++ that I thought were interesting from a language design perspective. So I think they’re worth recommending, too. (In fact, they’re a sort of ammo against using C++ due to its insane complexity.)

Bonus

Finally, here are a few more good presentations from other C++ conferences which you can just pretend are about C:

A JIT Compiler Skirmish with SELinux

This is a debugging war story.

Once upon a time I wrote a fancy data conversion utility. The input was a complex binary format defined by a data dictionary supplied at run time by the user alongside the input data. Since the converter was typically used to process massive quantities of input, and the nature of that input wasn’t known until run time, I wrote an x86-64 JIT compiler to speed it up. The converter generated a fast, native binary parser in memory according to the data dictionary specification. Processing data now took much less time and everyone rejoiced.

Then along came SELinux, Sheriff of Pedantry. Not liking all the shenanigans with page protections, SELinux huffed and puffed and made mprotect(2) return EACCES (“Permission denied”). Believing I was following all the rules and so this would never happen, I foolishly did not check the result and the converter was now crashing for its users. What made SELinux so unhappy, and could this somehow be resolved?

Allocating memory

Before going further, let’s back up and review how this works. Suppose I want to generate code at run time and execute it. In the old days this was as simple as writing some machine code into a buffer and jumping to that buffer — e.g. by converting the buffer to a function pointer and calling it.

#include <stdio.h>
#include <stdlib.h>

typedef int (*jit_func)(void);

/* NOTE: This doesn't work anymore! */
jit_func
jit_compile(int retval)
{
    unsigned char *buf = malloc(6);
    if (buf) {
        /* mov eax, retval */
        buf[0] = 0xb8;
        buf[1] = retval >>  0;
        buf[2] = retval >>  8;
        buf[3] = retval >> 16;
        buf[4] = retval >> 24;
        /* ret */
        buf[5] = 0xc3;
    }
    return (jit_func)buf;
}

int
main(void)
{
    jit_func f = jit_compile(1001);
    printf("f() = %d\n", f());
    free(f);
}

This situation was far too easy for malicious actors to abuse. An attacker could supply instructions of their own choosing — i.e. shell code — as input and exploit a buffer overflow vulnerability to execute the input buffer. These exploits were trivial to craft.

Modern systems have hardware checks to prevent this from happening. Memory containing instructions must have its execute protection bit set before those instructions can be executed. This is useful both for making attackers work harder and for catching bugs in programs — no more executing data by accident.

This is further complicated by the fact that memory protections have page granularity. You can’t adjust the protections for a 6-byte buffer. You do it for the entire surrounding page — typically 4kB, but sometimes as large as 2MB. This requires replacing that malloc(3) with a more careful allocation strategy. There are a few ways to go about this.

Anonymous memory mapping

The most common and most sensible is to create an anonymous memory mapping: a file memory map that’s not actually backed by a file. The mmap(2) function has a flag specifically for this purpose: MAP_ANONYMOUS.

#include <sys/mman.h>

void *
anon_alloc(size_t len)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_ANONYMOUS | MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, -1, 0);
    return p != MAP_FAILED ? p : 0;
}

void
anon_free(void *p, size_t len)
{
    munmap(p, len);
}

Unfortunately, MAP_ANONYMOUS is not part of POSIX. If you’re being super strict with your includes — as I tend to be — this flag won’t be defined, even on systems where it’s supported.

#define _POSIX_C_SOURCE 200112L
#include <sys/mman.h>
// MAP_ANONYMOUS undefined!

To get the flag, you must use the _BSD_SOURCE or, more recently, the _DEFAULT_SOURCE feature test macro to explicitly enable that feature.

#define _POSIX_C_SOURCE 200112L
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#include <sys/mman.h>

The POSIX way to do this is to instead map /dev/zero. So, wanting to be Mr. Portable, this is what I did in my tool. Take careful note of this.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

void *
anon_alloc(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        return 0;
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE;
    void *p = mmap(0, len, prot, flags, fd, 0);
    close(fd);
    return p != MAP_FAILED ? p : 0;
}

Aligned allocation

Another, less common (and less portable) strategy is to lean on the existing C memory allocator, being careful to allocate on page boundaries so that the page protections don’t affect other allocations. The classic allocation functions, like malloc(3), don’t allow for this kind of control. However, there are a couple of aligned allocation alternatives.

The first is posix_memalign(3):

int posix_memalign(void **ptr, size_t alignment, size_t size);

By choosing page alignment and a size that’s a multiple of the page size, it’s guaranteed to return whole pages. When done, pages are freed with free(3). Though, unlike unmapping, the original page protections must first be restored since those pages may be reused.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    void *p;
    long pagesize = sysconf(_SC_PAGE_SIZE); // TODO: cache this
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return posix_memalign(&p, pagesize, roundup) ? 0 : p;
}
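
Since free(3) returns these pages to the allocator’s pool rather than to the operating system, a matching free routine might first restore the default protections. This is a minimal sketch under that assumption, not code from the original tool, using the same page rounding as above:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

void
anon_free(void *p, size_t len)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    /* make the pages ordinary read/write memory again before reuse */
    mprotect(p, roundup, PROT_READ | PROT_WRITE);
    free(p);
}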

If you’re using C11, there’s also aligned_alloc(3). This is the most uncommon of all since most C programmers refuse to switch to a new standard until it’s at least old enough to drive a car.
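
For completeness, here’s roughly what the allocation could look like with aligned_alloc(3). This is a minimal sketch, not code from the tool, and it assumes a C11 compiler plus POSIX sysconf(3) for the page size:

#include <stdlib.h>
#include <unistd.h>

void *
anon_alloc(size_t len)
{
    /* C11 aligned_alloc requires the size to be a multiple of the
       alignment, so round the request up to whole pages */
    long pagesize = sysconf(_SC_PAGE_SIZE);
    size_t roundup = (len + pagesize - 1) / pagesize * pagesize;
    return aligned_alloc(pagesize, roundup);
}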

Changing page protections

So we’ve allocated our memory, but it’s not going to start in an executable state. Why? Because a W^X (“write xor execute”) policy is becoming increasingly common. Attempting to set both write and execute protections at the same time may be denied. (In fact, there’s an SELinux policy for this.)

As a JIT compiler, we need to write to a page and execute it. Again, there are two strategies. The complicated strategy is to map the same memory at two different places, one with the execute protection, one with the write protection. This allows the page to be modified as it’s being executed without violating W^X.
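
I didn’t go this route, but for illustration, here’s roughly what the double-mapping strategy could look like on Linux. This sketch assumes the Linux-specific memfd_create(2); the writable and executable views share the same underlying pages, so code written through w is immediately visible through x. Error handling is minimal:

#define _GNU_SOURCE /* for memfd_create */
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

struct dual_map {
    unsigned char *w; /* writable view   */
    unsigned char *x; /* executable view */
};

int
dual_map_create(struct dual_map *m, size_t len)
{
    int fd = memfd_create("jit", 0);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, len) == -1) {
        close(fd);
        return -1;
    }
    m->w = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    m->x = mmap(0, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
    close(fd);
    return m->w != MAP_FAILED && m->x != MAP_FAILED ? 0 : -1;
}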

The simpler and more secure strategy is to write the machine instructions, then swap the page over to executable using mprotect(2) once it’s ready. This is what I was doing in my tool.

unsigned char *buf = anon_alloc(len);
/* ... write instructions into the buffer ... */
mprotect(buf, len, PROT_EXEC); /* result not checked! */
jit_func func = (jit_func)buf;
func();

At a high level, that’s pretty close to what I was actually doing. That includes neglecting to check the result of mprotect(2). This worked fine and dandy for several years, until suddenly (shown here in the style of strace):

mprotect(ptr, len, PROT_EXEC) = -1 EACCES (Permission denied)

Then the program would crash trying to execute the buffer. Suddenly it wasn’t allowed to make this buffer executable. My program hadn’t changed. What had changed was the SELinux security policy on this particular system.

Asking for help

The problem is that I don’t administer this (Red Hat) system. I can’t access the logs and I didn’t set the policy. I don’t have any insight on why this call was suddenly being denied. To make this more challenging, the folks that manage this system didn’t have the necessary knowledge to help with this either.

So to figure this out, I need to treat it like a black box and probe at system calls until I can figure out just what SELinux policy I’m up against. I only have practical experience administrating Debian systems (and its derivatives like Ubuntu), which means I’ve hardly ever had to deal with SELinux. I’m flying fairly blind here.

Since my real application is large and complicated, I code up a minimal example, around a dozen lines of code: allocate a single page of memory, write a single return (ret) instruction into it, set it as executable, and call it. The program checks for errors, and I can run it under strace if that’s not insightful enough. This program is also something simple I could provide to the system administrators, since they were willing to turn some of the knobs to help narrow down the problem.

However, here’s where I made a major mistake. Assuming the problem was solely in mprotect(2), and wanting to keep this as absolutely simple as possible, I used posix_memalign(3) to allocate that page. I saw the same EACCES as before, and assumed I was demonstrating the same problem. Take note of this, too.
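
For the curious, the flawed minimal example looked roughly like this. This is a reconstruction, not the original code; the key detail is that the single page comes from posix_memalign(3), so it exercises a different SELinux policy than the real tool did:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int
main(void)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    void *page;
    if (posix_memalign(&page, pagesize, pagesize))
        return 1;
    ((unsigned char *)page)[0] = 0xc3; /* ret */
    if (mprotect(page, pagesize, PROT_EXEC) == -1) {
        perror("mprotect");
        return 1;
    }
    ((void (*)(void))page)(); /* call the empty function */
    puts("mprotect(PROT_EXEC) allowed");
    return 0;
}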

Finding a resolution

Eventually I’d need to figure out what policy was blocking my JIT compiler, then see if there was an alternative route. The system loader still worked after all, and I could plainly see that with strace. So it wasn’t a blanket policy that completely blocked the execute protection. Perhaps the loader was given an exception?

However, the very first order of business was to actually check the result from mprotect(2) and do something more graceful rather than crash. In my case, that meant falling back to executing a byte-code virtual machine. I added the check, and now the program ran slower instead of crashing.

The program runs on both Linux and Windows, and the allocation and page protection management is abstracted. On Windows it uses VirtualAlloc() and VirtualProtect() instead of mmap(2) and mprotect(2). Neither implementation checked that the protection change succeeded, so I fixed the Windows implementation while I was at it.
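
For reference, the Windows half of such an abstraction might look something like this. It’s a minimal sketch assuming the standard VirtualAlloc(), VirtualProtect(), and VirtualFree() calls, not the tool’s actual code, and this time the result of the protection change is checked:

#include <stddef.h>
#include <windows.h>

void *
exec_alloc(size_t len)
{
    return VirtualAlloc(0, len, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

int
exec_finalize(void *p, size_t len)
{
    /* flip the buffer from read/write to read/execute */
    DWORD old;
    return VirtualProtect(p, len, PAGE_EXECUTE_READ, &old) ? 0 : -1;
}

void
exec_free(void *p)
{
    VirtualFree(p, 0, MEM_RELEASE);
}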

Thanks to Mingw-w64, I actually do most of my Windows development on Linux. And, thanks to Wine, that really means everything, including running and debugging. Calling VirtualProtect() in Wine would ultimately call mprotect(2) in the background, which I expected would be denied. So running the Windows version with Wine under this SELinux policy would be the perfect test. Right?

Except that mprotect(2) succeeded under Wine! The Windows version of my JIT compiler was working just fine on Linux. Huh?

This system doesn’t have Wine installed. I had built and packaged it myself. This Wine build definitely has no SELinux exceptions. Not only did the Wine loader work correctly, it could change page protections in ways my own Linux programs could not. What’s different?

Debugging this with all these layers is starting to look silly, but this is exactly why doing Windows development on Linux is so useful. I run my program under Wine under strace:

$ strace wine ./mytool.exe

I study the system calls around mprotect(2). Perhaps there’s some stricter alignment issue? No. Perhaps I need to include PROT_READ? No. The only difference I can find is they’re using the MAP_ANONYMOUS flag. So, armed with this knowledge, I modify my minimal example to allocate 1024 pages instead of just one, and suddenly it works correctly. I was most of the way to figuring this all out.

Inside glibc allocation

Why did increasing the allocation size change anything? This is a typical Linux system, so my program is linked against the GNU C library, glibc. This library allocates memory from two places depending on the allocation size.

For small allocations, glibc uses brk(2) to extend the executable image — i.e. to extend the .bss section. These resources are not returned to the operating system after they’re freed with free(3). They’re reused.

For large allocations, glibc uses mmap(2) to create a new, anonymous mapping for that allocation. When freed with free(3), that memory is unmapped and its resources are returned to the operating system.

By increasing the allocation size, it became a “large” allocation and was backed by an anonymous mapping. Even though I didn’t use mmap(2), to the operating system this was indistinguishable from what Wine was doing (and succeeding at).

Consider this little example program:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    printf("%p\n", malloc(1));
    printf("%p\n", malloc(1024 * 1024));
}

When not compiled as a Position Independent Executable (PIE), here’s what the output looks like. The first pointer is near where the program was loaded, low in memory. The second pointer is a randomly selected address high in memory.

0x1077010
0x7fa9b998e010

And if you run it under strace, you’ll see that the first allocation comes from brk(2) and the second comes from mmap(2).

Two SELinux policies

With a little bit of research, I found the two SELinux policies at play here. In my minimal example, I was blocked by allow_execheap.

/selinux/booleans/allow_execheap

This prohibits programs from setting the execute protection on any “heap” page.

The POSIX specification does not permit it, but the Linux implementation of mprotect allows changing the access protection of memory on the heap (e.g., allocated using malloc). This error indicates that heap memory was supposed to be made executable. Doing this is really a bad idea. If anonymous, executable memory is needed it should be allocated using mmap which is the only portable mechanism.

Obviously this is pretty loose since I was still able to do it with a large enough posix_memalign(3) allocation, which, technically speaking, still allocates from the heap. So this policy only applies to pages mapped by brk(2).

The second policy was allow_execmod.

/selinux/booleans/allow_execmod

The program mapped from a file with mmap and the MAP_PRIVATE flag and write permission. Then the memory region has been written to, resulting in copy-on-write (COW) of the affected page(s). This memory region is then made executable […]. The mprotect call will fail with EACCES in this case.

I don’t understand what purpose this policy serves, but this is what was causing my original problem. Pages mapped to /dev/zero are not actually considered anonymous by Linux, at least as far as this policy is concerned. I think this is a mistake, and that mapping the special /dev/zero device should result in effectively anonymous pages.

From this I learned a little lesson about baking assumptions — that mprotect(2) was solely at fault — into my minimal debugging examples. And the fix was ultimately easy: I just had to suck it up and use the slightly less pure MAP_ANONYMOUS flag.

The Missing Computer Skills of High School Students

This article was discussed on Hacker News and on Reddit.

It’s been just over four years since I started mentoring high school students at work, and I recently began mentoring my fourth such student. That’s enough students for me to start observing patterns. Of course, each arrives with different computer knowledge and experience, but there have been two consistent and alarming gaps. One is a concept and the other is a skill, both of which I expect an advanced high schooler, especially one interested in computers, to have before they arrive. These gaps persist despite students taking computer classes at school.

Files, Directories, and Paths

The vital gap in concepts is files, directories, and, broadly speaking, paths. Students do initially arrive with a basic notion of files and directories (i.e. “folders”) and maybe some rough idea that there’s a hierarchy to it all. But they’ve never learned the notation: the location of a file specified by a sequence of directory components, which may be either relative or absolute. More specifically, they’ve never been exposed to the concepts of . (current directory) or .. (parent directory).

The first thing I do with my students is walk them through a Linux installation process and then sit them in front of a terminal. Since most non-option arguments are file paths, shell commands are rather limited if you don’t know anything about paths. You can’t navigate between directories or talk about files outside of your home directory. So one of the first things I have to teach is how paths work. We go through exercises constructing and reasoning about paths, and it takes some time and practice for students to really “get” them.

And this takes longer than you might think! Even once the student has learned the basic concepts, it still takes practice to internalize and reason about them. It’s been a consistent enough issue that I really should assemble a booklet to cover it, and perhaps some sort of interactive practice. I could just hand this to the student so they can learn on their own as they do with other materials.

Paths aren’t just important for command line use. They come up in everyday programming when programs need to access files. In some contexts they even have security implications regardless of the programming language. For example, care must be taken handling and validating paths supplied from an untrusted source. A web application may need to translate a path-like string in a query into a file path, and not understanding how .. works, or how paths need to be normalized before being compared, can make this dangerous.

I consider paths as falling under file and directory basics, and it’s part of the baseline for a person to be considered computer literate.

Touch Typing

The other major gap is touch typing. None of my students have been touch typists, and it slows them all down far more than they realize. I spend a lot of time next to them at the keyboard as they drive, so I’ve seen the costs myself. In a couple of cases, the students have to finger peck while looking at the keyboard.

An important step in mastering the use of computers is quickly iterating on new ideas and concepts — trying out and playing with things as they are learned. Being a slow typist not only retards this process; the tedium of poor, plodding keyboarding actively discourages experimentation and becomes a barrier. Advanced computer use isn’t much fun if you can’t type well.

To be fair, I’ve only been a proper touch typist for under two years. I wish I had learned it much earlier in life, and I really only have myself to blame that it took so long. Fortunately I had developed my own pseudo touch-typing style that required neither finger pecking nor looking at the keyboard. My main issue was accuracy, not that typing was tedious or slow.

The bad news is that, unlike paths, this is completely outside my control. First, one of the major guidelines of the mentoring program is that we’re not supposed to spend a lot of time on basic skills. Learning to touch type takes several weeks of daily effort. That’s just too much time that we don’t have. Second, this won’t work anyway unless the student is motivated to learn it. I have no idea how to provide that sort of motivation. (And if the student is motivated, they’ll do it on their own time anyway.) I think that’s where schools get stuck.

The really bad news is that this problem is going to get worse. The mobile revolution happened, and, for most people, mobile devices are gradually replacing the home computer, even laptops. I already know one student who doesn’t have access to a general purpose computer at home. The big difference between a tablet and a laptop is that a tablet is purely for consumption.

In the future, kids will be less and less exposed to keyboards, and productive computing in general. Keyboards are, and will continue to be for the foreseeable future, a vital tool for professionals. I wonder if the future will be a bit like, say, the 1980s again, where only a small fraction of kids will be exposed to a proper computer. Only instead of a PC clone, Commodore, or Apple II, it’s a Raspberry Pi.

Conclusions

I want to make something really clear: I’m not blaming the students for these gaps. It’s not in any way their fault. What they’re taught and exposed to is, at this point in life, largely outside of their control.

I lay most of the blame on the schools. My mentees have all taken high school programming classes of some sort, but these classes somehow manage to skip over the fundamentals. Instead it’s rote learning of some particular IDE without real understanding. Finally I can relate to all those mathematicians complaining about how math class is taught!

What can be done? If you’re a parent, make sure your kid has access to a general purpose computer, even if it’s only a Raspberry Pi or one of its clones, along with a keyboard and mouse. (Of course, if you’re reading this article you’re not one of the parents that needs this advice.) It’s good exposure all around.

After reflecting on this recently, I think one of the shortcomings of my mentoring is that I don’t spend enough time — generally none at all — at the keyboard driving with my mentee as the passenger, where they can really observe me in action. Usually it’s me approaching them to check on their progress, and the opportunity just isn’t there. Perhaps it would be motivating to see how efficient and fun computing can be at higher levels of proficiency — to show them how touch typing and a powerful text editor can lead to such a dramatic difference in experience. It would be the answer to that question of, “Why should I learn this?”

From Vimperator to Tridactyl

Earlier this month I experienced a life-changing event — or so I thought it would be. It was fully anticipated, and I had been dreading the day for almost a year, wondering what I was going to do. Could I overcome these dire straits? Would I ever truly accept the loss, or will I become a cranky old man who won’t stop talking about how great it all used to be?

So what was this big event? On September 5th, Mozilla officially and fully ended support for XUL extensions (XML User Interface Language), a.k.a. “legacy” extensions. The last Firefox release to support these extensions was Firefox 52 ESR, the browser I had been using for some time. A couple days later, Firefox 60 ESR entered Debian Stretch to replace it.

The XUL extension API was never well designed. It was clunky, quirky, and the development process for extensions was painful, requiring frequent restarts. It was bad enough that I was never interested in writing my own extensions. Poorly-written extensions unfairly gave Firefox a bad name, causing memory leaks and other issues, and Firefox couldn’t tame the misbehavior.

Yet this extension API was incredibly powerful, allowing for rather extreme UI transformations that really did turn Firefox into a whole new browser. For the past 15 years I wasn’t using Firefox so much as a highly customized browser based on Firefox. It’s how Firefox has really stood apart from everyone else, including Chrome.

The wide open XUL extension API was getting in the way of Firefox moving forward. Continuing to support it required sacrifices that Mozilla was less and less willing to make. To replace it, they introduced the WebExtensions API, modeled very closely after Chrome’s extension API. These extensions are sandboxed, much less trusted, and the ecosystem more closely resembles the “app store” model (Ugh!). This is great for taming poorly-behaved extensions, but they are far less powerful and capable.

The powerful, transformative extension I’d been using the past decade was Vimperator — and occasionally with temporary stints in its fork, Pentadactyl. It overhauled most of Firefox’s interface, turning it into a Vim-like modal interface. In normal mode I had single keys bound to all sorts of useful functionality.

The problem is that Vimperator is an XUL extension, and it’s not possible to fully implement it using the WebExtensions API. It needs capabilities that WebExtensions will likely never provide. Losing XUL extensions would mean being thrown back 10 years in terms of my UI experience. The possibility of having to use the web without it sounded unpleasant.

Fortunately there was a savior on the horizon already waiting for me: Tridactyl! It is essentially a from-scratch rewrite of Vimperator using the WebExtensions API. To my complete surprise, these folks have managed to recreate around 85% of what I had within the WebExtensions limitations. It will never be 100%, but it’s close enough to keep me happy.

What matters to me

There are some key things Vimperator gave me that I was afraid of losing.

I keep all my personal configuration dotfiles under source control. It’s a shame that Firefox, despite being so flexible, has never supported this approach to configuration. Fortunately Vimperator filled this gap with its .vimperatorrc file, which could not only be used to configure the extension but also access nearly everything on the about:config page. It’s the killer feature Firefox never had.

Since WebExtensions are sandboxed, they cannot (normally) access files. Fortunately there’s a workaround: native messaging. It’s a tiny, unsung backdoor that closes the loop on some vital features. Tridactyl makes it super easy to set up (:installnative), and doing so enables the .tridactylrc file to be loaded on startup. Due to WebExtensions limitations it’s not nearly as powerful as the old .vimperatorrc, but it covers most of my needs.

In Vimperator, when a text input is focused I could press CTRL+i to pop up my $EDITOR (Vim, Emacs, etc.) to manipulate the input much more comfortably. This is so, so nice when writing long form content on the web. The alternative is to copy-paste back and forth, which is tedious and error prone.

Since WebExtensions are sandboxed, they cannot (normally) start processes. Again, native messaging comes to the rescue and allows Tridactyl to reproduce this feature perfectly.

In Vimperator I could press f or F to enter a special mode that allowed me to simulate a click to a page element, usually a hyperlink. This could be used to navigate without touching the mouse. It’s really nice for “productive” browsing, where my fingers are already on home row due to typing (programming or writing), and I need to switch to a browser to look something up. I rarely touch the mouse when I’m in productive mode.

This actually mostly works fine under WebExtensions, too. However, due to sandboxing, WebExtensions aren’t active on any of Firefox’s “meta” pages (configuration, errors, etc.), or Mozilla’s domains. This means no mouseless navigation on these pages.

The good news is that Tridactyl has better mouseless browsing than Vimperator. Its “tag” overlay is alphabetic rather than numeric, so it’s easier to type. When it’s available, the experience is better.

In normal mode, which is the usual state Vimperator/Tridactyl is in, I’ve got useful functionality bound to single keys. There’s little straining for the CTRL key. I use d to close a tab, u to undo it. In my own configuration I use w and e to change tabs, and x and c to move through the history. I can navigate to any “quickmark” in three keystrokes. It’s all very fast and fluid.

Since WebExtensions are sandboxed, extensions have limited ability to capture these keystrokes. If the wrong browser UI element is focused, they don’t work. If the current page is one of those extension-restricted pages, these keys don’t work.

The worse problem of all, by far, is that WebExtensions are not active until the current page has loaded. This is the most glaring flaw in WebExtensions, and I’m surprised it still hasn’t been addressed. It negatively affects every single extension I use. What this means for Tridactyl is that for a second or so after navigating a link, I can’t interact with the extension, and the inputs are completely lost. This is incredibly frustrating. I have to wait on slow, remote servers to respond before regaining control of my own browser, and I often forget about this issue, which results in a bunch of eaten keystrokes. (Update: Months have passed and I’ve never gotten used to this issue. It irritates me a hundred times every day. This is by far Firefox’s worst design flaw.)

Other extensions

I’m continuing to use uBlock Origin. Nothing changes. As I’ve said before, an ad-blocker is by far the most important security tool on your computer. If you practice good computer hygiene, malicious third-party ads/scripts are the biggest threat vector for your system. A website telling you to turn off your ad-blocker should be regarded as suspiciously as being told to turn off your virus scanner (for all you Windows users who are still using one).

The opposite of mouseless browsing is keyboardless browsing. When I’m not being productive, I’m often not touching the keyboard, and navigating with just the mouse is most comfortable. However, clicking little buttons is not. So instead of clicking the backward and forward buttons, I prefer to swipe the mouse, i.e. make a gesture.

I previously used FireGestures, an XUL extension. I’m now using Gesturefy. (Update: Gesturefy doesn’t support ESR either.) I also considered Foxy Gestures, but it doesn’t currently support ESR releases. Unfortunately all mouse gesture WebExtensions suffer from the page load problem: any gesture given before the page loads is lost. It’s less of an annoyance than with Tridactyl, but it still trips me up. They also don’t work on extension-restricted pages.

Firefox 60 ESR is the first time I’m using a browser supported by uMatrix — another blessing from the author of uBlock Origin (Raymond Hill) — so I’ve been trying it out. Effective use requires some in-depth knowledge of how the web works, such as the same-origin policy, etc. It’s not something I’d recommend for most people.

GreaseMonkey was converted to the WebExtensions API a while back. As a result it’s a bit less capable than it used to be, and I had to adjust a couple of my own scripts before they’d work again. I use it as a “light extension” system.

XUL alternatives

Many people have suggested using one of the several Firefox forks that are maintaining XUL compatibility. I haven’t taken this seriously for a couple of reasons:

Even the Debian community gave up on that idea long ago, and they’ve made a special exception that allows recent versions of Firefox and Chrome into the stable release. Web browsers are huge and complex because web standards are huge and complex (a situation that concerns me in the long term). The vulnerabilities that pop up regularly are frightening.

In Back to the Future Part II, Biff Tannen was thinking too small. Instead of a sports almanac, he should have brought a copy of the CVE database.

This is why I also can’t just keep using an old version of Firefox. If I was unhappy with, say, the direction of Emacs 26, I could keep using Emacs 25 essentially forever, frozen in time. However, Firefox is internet software. Internet software decays and must be maintained.

Most importantly, the Vimperator extension is no longer maintained. There’s no reason to stick around this ghost town.

Special Tridactyl customizations

The syntax for .tridactylrc is a bit different than .vimperatorrc, so I couldn’t just reuse my old configuration file. Key bindings are simple enough to translate, and quickmarks are configured almost the same way. However, it took me some time to figure out the rest.

With Vimperator I’d been using Firefox’s obscure “bookmark keywords” feature, where a bookmark is associated with a single word. In Vimperator I’d use this as a prefix when opening a new tab to change the context of the location I was requesting.

For example, to visit the Firefox subreddit I’d press o to start opening a new tab, then r firefox. I had r registered via .vimperatorrc as the bookmark keyword for the URL template https://old.reddit.com/r/%s.

WebExtensions doesn’t expose bookmark keywords, and keywords are likely to be removed in a future Firefox release. So instead someone showed me this trick:

set searchurls.r   https://old.reddit.com/r/%s
set searchurls.w   https://en.wikipedia.org/w/index.php?search=%s
set searchurls.wd  https://en.wiktionary.org/wiki/?search=%s

These lines in .tridactylrc recreate the old functionality. Works like a charm!

Another initial annoyance is that WebExtensions only exposes the X clipboard (XA_CLIPBOARD), not the X selection (XA_PRIMARY). However, I nearly always use the X selection for copy-paste, so it was like I didn’t have any clipboard access. (Honestly, I’d prefer XA_CLIPBOARD didn’t exist at all.) Again, native messaging routes around the problem nicely, and it’s trivial to configure:

set yankto both
set putfrom selection

There’s an experimental feature, guiset, to remove most of Firefox’s UI elements so that it looks nearly like the old Vimperator. As of this writing, this feature works poorly, so I’m not using it. It’s really not important to me anyway.

Today’s status

So I’m back to about 85% of the functionality I had before the calamity, which is far better than I had imagined. Other than the frequent minor annoyances, I’m pretty satisfied.

In exchange I get better mouseless browsing and much better performance. I’m not kidding, the difference Firefox Quantum makes is night and day. In my own case, Firefox 60 ESR is using one third of the memory of Firefox 52 ESR (Update: after more experience with it, I realize it’s just as much of a memory hog as before), and I’m not experiencing the gradual memory leak. This really makes a difference on my laptop with 4GB of RAM.

So was it worth giving up that 15% capability for these improvements? Perhaps it was. Now that I’ve finally made the leap, I’m feeling a lot better about the whole situation.

Brute Force Incognito Browsing

Both Firefox and Chrome have a feature for creating temporary private browsing sessions. Firefox calls it Private Browsing and Chrome calls it Incognito Mode. Both work essentially the same way. A temporary browsing session is started without carrying over most existing session state (cookies, etc.), and no state (cookies, browsing history, cached data, etc.) is preserved after ending the session. Depending on the configuration, some browser extensions will be enabled in the private session, and their own internal state may be preserved.

The most obvious use is for visiting websites that you don’t want listed in your browsing history. Another use for more savvy users is to visit websites with a fresh, empty cookie file. For example, some news websites use a cookie to track the number of visits and require a subscription after a certain number of “free” articles. Manually deleting cookies is a pain (especially without a specialized extension), but opening the same article in a private session is two clicks away.

For web development there’s yet another use. A private session is a way to view your website from the perspective of a first-time visitor. You’ll be logged out and will have little or no existing state.

However, sometimes it just doesn’t go far enough. Some of those news websites have adapted, and in addition to counting the number of visits, they’ve figured out how to detect private sessions and block them. I haven’t looked into how they do this — maybe something to do with local storage, or detecting previously cached content. Sometimes I want a private session that’s truly fully isolated. The existing private session features just aren’t isolated enough or they behave differently, which is how they’re being detected.

Some time ago I put together a couple of scripts to brute force my own private sessions when I need them, generally for testing websites in a guaranteed fresh, fully-functioning instance. It also lets me run multiple such sessions in parallel. My scripts don’t rely on any private session feature of the browser, so the behavior is identical to a real browser, making it undetectable.

The downside is that, for better or worse, no browser extensions are carried over. In some ways this can be considered a feature, but a lot of the time I would like my ad-blocker to carry over. Your ad-blocker is probably the most important security software on your computer, so you should hesitate to give it up.

Another downside is that both Firefox and Chrome have some irritating first-time behaviors that can’t be disabled. The intent is to be newbie-friendly but it just gets in my way. For example, both bug me about logging into their browser platforms. Firefox starts with two tabs. Chrome creates a popup to ask me to configure a printer. Both start with a junk URL in the location bar so I can’t just middle-click paste (i.e. the X11 selection clipboard) into it. It’s definitely not designed for my use case.

Firefox

Here’s my brute force private session script for Firefox:

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/firefox-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
firefox -profile "$TEMP" -no-remote "$@"

It creates a temporary directory under $XDG_CACHE_HOME and tells Firefox to use the profile in that directory. No such profile exists, of course, so Firefox creates a fresh profile.

In theory I could just create a new profile alongside the default within my existing ~/.mozilla directory. However, I’ve never liked Firefox’s profile feature, especially with the intentionally unpredictable way it stores the profile itself: behind a random path. I also don’t trust it to be fully isolated and to fully clean up when I’m done.

Before starting Firefox, I register a trap with the shell to clean up the profile directory regardless of what happens. It doesn’t matter if Firefox exits cleanly, if it crashes, or if I CTRL-C it to death.

The -no-remote option prevents the new Firefox instance from joining onto an existing Firefox instance, which it really prefers to do even though it’s technically supposed to be a different profile.

Note the "$@", which passes arguments through to Firefox — most often the URL of the site I want to test.

Chromium

I don’t actually use Chrome but rather the open source version, Chromium. I think this script will also work with Chrome.

#!/bin/sh -e
DIR="${XDG_CACHE_HOME:-$HOME/.cache}"
mkdir -p -- "$DIR"
TEMP="$(mktemp -d -- "$DIR/chromium-XXXXXX")"
trap "rm -rf -- '$TEMP'" INT TERM EXIT
chromium --user-data-dir="$TEMP" \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

It’s essentially the same as the Firefox script; only the browser and its arguments have changed. I tell it not to ask about being the default browser, and --no-first-run disables some of the irritating first-time behaviors.

Chromium is very noisy on the command line, so I also redirect all output to /dev/null.

If you’re on Debian like me, its version of Chromium comes with a --temp-profile option that handles the throwaway profile automatically. So the script can be simplified:

#!/bin/sh -e
chromium --temp-profile \
         --no-default-browser-check \
         --no-first-run \
         "$@" >/dev/null 2>&1

In my own use case, these scripts have fully replaced the built-in private session features. In fact, since Chromium is not my primary browser, my brute force private session script is how I usually launch it. I only run it to test things, and I always want to test using a fresh profile.

Prospecting for Hash Functions

I recently got an itch to design my own non-cryptographic integer hash function. Firstly, I wanted to better understand how hash functions work, and the best way to learn is to do. For years I’d been treating them like magic: shoving input in one end and seeing random-looking, but deterministic, output come out the other end. Just how is the avalanche effect achieved?

Secondly, could I apply my own particular strengths to craft a hash function better than the handful of functions I could find online, especially the classic ones from Thomas Wang and Bob Jenkins? Instead of struggling with the mathematics, maybe I could software engineer my way to victory, working from the advantage of access to the excessive computational power of today.

Suppose, for example, I wrote a tool to generate a random hash function definition, then JIT compile it to a native function in memory, then execute that function across various inputs to evaluate its properties. My tool could rapidly repeat this process in a loop until it stumbled upon an incredible hash function the world had never seen. That’s what I actually did. I call it the Hash Prospector:

https://github.com/skeeto/hash-prospector

It only works on x86-64 because it uses the same JIT compiling technique I’ve discussed before: allocate a page of memory, write some machine instructions into it, set the page to executable, cast the page pointer to a function pointer, then call the generated code through the function pointer.

Generating a hash function

My focus is on integer hash functions: a function that accepts an n-bit integer and returns an n-bit integer. One of the important properties of an integer hash function is that it maps its inputs to outputs 1:1. In other words, there are no collisions. If there’s a collision, then some outputs aren’t possible, and the function isn’t making efficient use of its entropy.

This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, then the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input.

For example, XOR with a constant is trivially reversible: XOR the output with the same constant to reverse it. Addition with a constant is reversed by subtraction with the same constant. Since the integer operations are modular arithmetic, modulo 2^n for n-bit integers, multiplication by an odd number is reversible. Odd numbers are coprime with the power-of-two modulus, so there is some modular multiplicative inverse that reverses the operation.
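
To make the multiplication case concrete, here’s a small, standalone illustration (my example, not part of the prospector) that computes the modular multiplicative inverse of an odd 32-bit constant using Newton’s method. Multiplying by the inverse undoes the original multiplication, which is exactly what makes multiplication by an odd constant reversible:

#include <stdint.h>
#include <stdio.h>

/* inverse of an odd 32-bit integer, modulo 2^32 */
uint32_t
modinv32(uint32_t a)
{
    uint32_t x = (a * 3) ^ 2; /* correct to 5 bits */
    x *= 2 - a * x;           /* 10 bits */
    x *= 2 - a * x;           /* 20 bits */
    x *= 2 - a * x;           /* 40 bits, i.e. all 32 */
    return x;
}

int
main(void)
{
    uint32_t c = UINT32_C(0x2c1b3c6d); /* an arbitrary odd constant */
    uint32_t r = modinv32(c);
    printf("0x%08lx * 0x%08lx = %lu (mod 2^32)\n",
           (unsigned long)c, (unsigned long)r, (unsigned long)(c * r));
    return 0;
}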

Bret Mulvey’s hash function article provides a convenient list of some reversible operations available for constructing integer hash functions. This list was the catalyst for my little project. Here are the ones used by the hash prospector:

x  = ~x;
x ^= constant;
x *= constant | 1; // e.g. only odd constants
x += constant;
x ^= x >> constant;
x ^= x << constant;
x += x << constant;
x -= x << constant;
x <<<= constant; // left rotation

I’ve come across a couple more useful operations while studying existing integer hash functions, but I didn’t put these in the prospector.

hash += ~(hash << constant);
hash -= ~(hash << constant);

The prospector picks some operations at random and fills in their constants randomly within their proper constraints. For example, here’s an awful hash function I made it generate:

// do NOT use this!
uint32_t
badhash32(uint32_t x)
{
    x *= UINT32_C(0x1eca7d79);
    x ^= x >> 20;
    x  = (x << 8) | (x >> 24);
    x  = ~x;
    x ^= x << 5;
    x += UINT32_C(0x10afe4e7);
    return x;
}

That function is reversible, and it would be relatively straightforward to define its inverse. However, it has awful biases and poor avalanche. How do I know this?

The measure of a hash function

There are two key properties I’m looking for in randomly generated hash functions.

  1. High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.

  2. Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.

Initially I screwed up and only measured the first property. This led to some hash functions that seemed to be amazing before closer inspection, since, for a 32-bit hash function, they were flipping over 15 output bits on average. However, the particular bits being flipped were heavily biased, resulting in obvious patterns in the output.

For example, when hashing a counter starting from zero, the high bits would follow a regular pattern. 15 to 16 bits were being flipped each time, but it was always the same bits.

Conveniently it’s easy to measure both properties at the same time. For an n-bit integer hash function, create an n by n table initialized to zero. The rows are input bits and the columns are output bits. The entry at the ith row and jth column tracks the correlation between the ith input bit and the jth output bit.

Then exhaustively iterate over all 2^n inputs, and flip each bit one at a time. Increment the appropriate element in the table if the output bit flips.

When you’re done, ideally each element in the table is exactly 2^(n-1). That is, each output bit was flipped exactly half the time by each input bit. Therefore the bias of the hash function is the distance (the error) of the computed table from the ideal table.

For example, the ideal bias table for an 8-bit hash function would be:

128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128
128 128 128 128 128 128 128 128

The hash prospector computes the standard deviation in order to turn this into a single, normalized measurement. Lower scores are better.
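
Here’s a minimal, self-contained sketch of that measurement for a toy 8-bit hash, so the exhaustive loop finishes instantly. It’s not the prospector’s exact scoring code, just an illustration of the table-building procedure described above (compile with -lm):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* toy 8-bit hash under test */
static uint8_t
hash8(uint8_t x)
{
    x ^= x >> 4;
    x *= 0x35;
    x ^= x >> 3;
    return x;
}

int
main(void)
{
    static long table[8][8]; /* rows: input bits, columns: output bits */
    for (int x = 0; x < 256; x++) {
        uint8_t h = hash8(x);
        for (int i = 0; i < 8; i++) {
            uint8_t d = h ^ hash8(x ^ (1 << i)); /* flip input bit i */
            for (int j = 0; j < 8; j++)
                table[i][j] += (d >> j) & 1;     /* did output bit j flip? */
        }
    }
    double err = 0.0; /* deviation from the ideal count of 2^(8-1) = 128 */
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            double diff = (table[i][j] - 128) / 128.0;
            err += diff * diff;
        }
    }
    printf("bias = %.17g\n", sqrt(err / 64));
    return 0;
}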

However, there’s still one problem: the input space for a 32-bit hash function is over 4 billion values. The full test takes my computer about an hour and a half. Evaluating a 64-bit hash function is right out.

Again, Monte Carlo to the rescue! Rather than sample the entire space, just sample a random subset. This provides a good estimate in less than a second, allowing lots of terrible hash functions to be discarded early. The full test can be saved only for the known good 32-bit candidates. 64-bit functions will only ever receive the estimate.

What did I find?

Once I got the bias issue sorted out, and after hours and hours of running followed by some manual tweaking on my part, the prospector stumbled across this little gem:

// DO use this one!
uint32_t
prospector32(uint32_t x)
{
    x ^= x >> 15;
    x *= UINT32_C(0x2c1b3c6d);
    x ^= x >> 12;
    x *= UINT32_C(0x297a2d39);
    x ^= x >> 15;
    return x;
}

According to a full (i.e. not estimated) bias evaluation, this function beats the snot out of most of the 32-bit hash functions I could find. It even comes out ahead of this well-known hash function that I believe originates from the H2 SQL Database. (Update: Thomas Mueller has confirmed that, indeed, this is his hash function.)

uint32_t
hash32(uint32_t x)
{
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = ((x >> 16) ^ x) * UINT32_C(0x45d9f3b);
    x = (x >> 16) ^ x;
    return x;
}

It’s still an excellent hash function, just slightly more biased than mine.

Very briefly, prospector32() was the best 32-bit hash function I could find, and I thought I had a major breakthrough. Then I noticed the finalizer function for the 32-bit variant of MurmurHash3. It’s also a 32-bit hash function:

uint32_t
murmurhash32_mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x85ebca6b);
    x ^= x >> 13;
    x *= UINT32_C(0xc2b2ae35);
    x ^= x >> 16;
    return x;
}

This one is just barely less biased than mine. So I still haven’t discovered the best 32-bit hash function, only the second best one. :-)

A pattern emerges

If you’re paying close enough attention, you may have noticed that all three functions above have the same structure. The prospector had stumbled upon it all on its own without knowledge of the existing functions. It may not be so obvious for the second function, but here it is refactored:

uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x45d9f3b);
    x ^= x >> 16;
    x *= UINT32_C(0x45d9f3b);
    x ^= x >> 16;
    return x;
}

I hadn’t noticed this until after the prospector had come across it on its own. The pattern for all three is XOR-right-shift, multiply, XOR-right-shift, multiply, XOR-right-shift. There’s something particularly useful about this multiply-xorshift construction. The XOR-right-shift diffuses bits rightward and the multiply diffuses bits leftward. I like to think it’s “sloshing” the bits right, left, right, left.

It seems that multiplication is particularly good at diffusion, so it makes perfect sense to exploit it in non-cryptographic hash functions, especially since modern CPUs are so fast at it. Despite this, it’s not used much in cryptography due to issues with completing it in constant time.

I like to think of this construction in terms of a five-tuple. For the three functions it’s the following:

(15, 0x2c1b3c6d, 12, 0x297a2d39, 15)  // prospector32()
(16, 0x045d9f3b, 16, 0x045d9f3b, 16)  // hash32()
(16, 0x85ebca6b, 13, 0xc2b2ae35, 16)  // murmurhash32_mix32()

The prospector actually found lots of decent functions following this pattern, especially where the middle shift is smaller than the outer shift. Thinking of it in terms of this tuple, I specifically directed it to try different tuple constants. That’s what I meant by “tweaking.” Eventually my new function popped out with its really low bias.

The prospector has a template option (-p) if you want to try it yourself:

$ ./prospector -p xorr,mul,xorr,mul,xorr

If you really have your heart set on certain constants, such as my specific selection of shifts, you can lock those in while randomizing the other constants:

$ ./prospector -p xorr:15,mul,xorr:12,mul,xorr:15

Or the other way around:

$ ./prospector -p xorr,mul:2c1b3c6d,xorr,mul:297a2d39,xorr

My function seems a little strange using shifts of 15 bits rather than a nice, round 16 bits. However, changing those constants to 16 increases the bias. Similarly, neither of the two 32-bit constants is a prime number, but nudging those constants to the nearest prime increases the bias. These parameters really do seem to be a local minima in the bias, and using prime numbers isn’t important.

What about 64-bit integer hash functions?

So far I haven’t been able to improve on 64-bit hash functions. The main function to beat is SplittableRandom / SplitMix64:

uint64_t
splittable64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

I also came across this one:

uint64_t
hash64(uint64_t x)
{
    x ^= x >> 32;
    x *= UINT64_C(0xd6e8feb86659fd93);
    x ^= x >> 32;
    x *= UINT64_C(0xd6e8feb86659fd93);
    x ^= x >> 32;
    return x;
}

Again, these follow the same construction as before. There really is something special about it, and many other people have noticed, too.

Both functions have about the same bias. (Remember, I can only estimate the bias for 64-bit hash functions.) The prospector has found lots of functions with about the same bias, but nothing provably better. Until it does, I have no new 64-bit integer hash functions to offer.

String hash

I’m also experimenting with using my hash function as a sort of primitive for a string hash function. Here I’m using my function in the loop to mix in one byte at a time, and finishing it with the same finalizer as MurmurHash3.

uint32_t
prospector32s(const void *buf, uint32_t len, uint32_t key)
{
    uint32_t hash = key;
    const unsigned char *p = buf;
    for (uint32_t i = 0; i < len; i++) {
        hash += p[i];
        hash ^= hash >> 15;
        hash *= UINT32_C(0x2c1b3c6d);
        hash ^= hash >> 12;
        hash *= UINT32_C(0x297a2d39);
        hash ^= hash >> 15;
    }
    hash ^= len;
    hash ^= hash >> 16;
    hash *= UINT32_C(0x85ebca6b);
    hash ^= hash >> 13;
    hash *= UINT32_C(0xc2b2ae35);
    hash ^= hash >> 16;
    return hash + key;
}

It has the typical number of collisions when run on a large dictionary, so it seems decent enough, but I don’t know if this hash function is worth much. More experimentation is needed.

Right now the prospector does a completely random, unstructured search hoping to stumble upon something good by chance. Perhaps it would be worth using a genetic algorithm to breed those 5-tuples towards optimum? Others have had success in this area with simulated annealing.

There’s probably more to exploit from the multiply-xorshift construction that keeps popping up. If anything, the prospector is searching too broadly, looking at constructions that could never really compete no matter what the constants. In addition to everything above, I’ve been looking for good 32-bit hash functions that don’t use any 32-bit constants, but I’m really not finding any with a competitively low bias.

Update after one week

About one week after publishing this article I found an even better hash function. I believe this is the least biased 32-bit integer hash function of this form ever devised. It’s even less biased than the MurmurHash3 finalizer.

// exact bias: 0.17353355999581582
uint32_t
lowbias32(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x7feb352d);
    x ^= x >> 15;
    x *= UINT32_C(0x846ca68b);
    x ^= x >> 16;
    return x;
}

// inverse
uint32_t
lowbias32_r(uint32_t x)
{
    x ^= x >> 16;
    x *= UINT32_C(0x43021123);
    x ^= x >> 15 ^ x >> 30;
    x *= UINT32_C(0x1d69e2a5);
    x ^= x >> 16;
    return x;
}

If you’re willing to use an additional round of multiply-xorshift, this next function actually reaches the theoretical bias limit (bias = ~0.021) as exhibited by a perfect integer hash function:

// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= UINT32_C(0xed5ad4bb);
    x ^= x >> 11;
    x *= UINT32_C(0xac4c1b51);
    x ^= x >> 15;
    x *= UINT32_C(0x31848bab);
    x ^= x >> 14;
    return x;
}

It’s statistically indistinguishable from a random permutation of all 32-bit integers.

The Value of Undefined Behavior

In several places, the C and C++ language specifications use a curious, and fairly controversial, phrase: undefined behavior. For certain program constructs, the specification prescribes no specific behavior, instead allowing anything to happen. Such constructs are considered erroneous, and so the result depends on the particulars of the platform and implementation. The original purpose of undefined behavior was for implementation flexibility. In other words, it’s slack that allows a compiler to produce appropriate and efficient code for its target platform.

Specifying a particular behavior would have put unnecessary burden on implementations — especially in the earlier days of computing — making for inefficient programs on some platforms. For example, if the result of dereferencing a null pointer was defined to trap — to cause the program to halt with an error — then platforms that do not have hardware trapping, such as those without virtual memory, would be required to instrument, in software, each pointer dereference.

In the 21st century, undefined behavior has taken on a somewhat different meaning. Optimizers use it — or abuse it depending on your point of view — to lift constraints that would otherwise inhibit more aggressive optimizations. It’s not so much a fundamentally different application of undefined behavior, but it does take the concept to an extreme.

The reasoning works like this: A program that evaluates a construct whose behavior is undefined cannot, by definition, have any meaningful behavior, and so that program would be useless. As a result, compilers assume programs never invoke undefined behavior and use those assumptions to justify their optimizations.

Under this newer interpretation, mistakes involving undefined behavior are more punishing and surprising than before. Programs that seem to make some sense when run on a particular architecture may actually compile into a binary with a security vulnerability due to conclusions reached from an analysis of its undefined behavior.

This can be frustrating if your programs are intended to run on a very specific platform. In this situation, all behavior really could be locked down and specified in a reasonable, predictable way. Such a language would be like an extended, less portable version of C or C++. But your toolchain still insists on running your program on the abstract machine rather than the hardware you actually care about. However, even in this situation undefined behavior can still be desirable. I will provide a couple of examples in this article.

Signed integer overflow

To start things off, let’s look at one of my all time favorite examples of useful undefined behavior, a situation involving signed integer overflow. The result of a signed integer overflow isn’t just unspecified, it’s undefined behavior. Full stop.

This goes beyond a simple matter of whether or not the underlying machine uses a two’s complement representation. From the perspective of the abstract machine, just the act of a signed integer overflowing is enough to throw everything out the window, even if the overflowed result is never actually used in the program.

On the other hand, unsigned integer overflow is defined — or, more accurately, defined to wrap, not overflow. Both the undefined signed overflow and defined unsigned overflow are useful in different situations.
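
Before the larger example, here's a minimal illustration of the distinction (mine, not from the original):

#include <stdint.h>

uint32_t
next_unsigned(uint32_t x)
{
    return x + 1;  /* defined: UINT32_MAX + 1 wraps around to 0 */
}

int
next_signed(int x)
{
    return x + 1;  /* undefined behavior if x == INT_MAX */
}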

For example, here’s a fairly common situation, much like what actually happened in bzip2. Consider these two functions, identical except for their index types, which do a substring comparison:

int
cmp_signed(int i1, int i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}

int
cmp_unsigned(unsigned i1, unsigned i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}

In these functions, the indices i1 and i2 will always be small, non-negative values. Since they’re non-negative, they should be unsigned, right? Not necessarily. That puts an extra constraint on code generation and, at least on x86-64, makes for a less efficient function. Most of the time you actually don’t want overflow to be defined, and instead allow the compiler to assume it just doesn’t happen.

The constraint is that the behavior of i1 or i2 overflowing as an unsigned integer is defined, and the compiler is obligated to implement that behavior. On x86-64, where int is 32 bits, the result of the operation must be truncated to 32 bits one way or another, requiring extra instructions inside the loop.

In the signed case, incrementing the integers cannot overflow since that would be undefined behavior. This permits the compiler to perform the increment only in 64-bit precision without truncation if it would be more efficient, which, in this case, it is.

Here’s the output of Clang 6.0.0 with -Os on x86-64. Pay close attention to the main loop, which I named .loop:

cmp_signed:
        movsxd rdi, edi             ; use i1 as a 64-bit integer
        mov    al, [rdx + rdi]
        movsxd rsi, esi             ; use i2 as a 64-bit integer
        mov    cl, [rdx + rsi]
        jmp    .check

.loop:  mov    al, [rdx + rdi + 1]
        mov    cl, [rdx + rsi + 1]
        inc    rdx                  ; increment only the base pointer
.check: cmp    al, cl
        je     .loop

        movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx             ; return c1 - c2
        ret

cmp_unsigned:
        mov    eax, edi
        mov    al, [rdx + rax]
        mov    ecx, esi
        mov    cl, [rdx + rcx]
        cmp    al, cl
        jne    .ret
        inc    edi
        inc    esi

.loop:  mov    eax, edi             ; truncated i1 overflow
        mov    al, [rdx + rax]
        mov    ecx, esi             ; truncated i2 overflow
        mov    cl, [rdx + rcx]
        inc    edi                  ; increment i1
        inc    esi                  ; increment i2
        cmp    al, cl
        je     .loop

.ret:   movzx  eax, al
        movzx  ecx, cl
        sub    eax, ecx
        ret

As unsigned values, i1 and i2 can overflow independently, so they have to be handled as independent 32-bit unsigned integers. As signed values they can’t overflow, so they’re treated as if they were 64-bit integers and, instead, the pointer, buf, is incremented without concern for overflow. The signed loop is much more efficient (5 instructions versus 8).

The signed integer helps to communicate the narrow contract of the function — the limited range of i1 and i2 — to the compiler. In a variant of C where signed integer overflow is defined (i.e. -fwrapv), this capability is lost. In fact, using -fwrapv deoptimizes the signed version of this function.

Side note: Using size_t (an unsigned integer) is even better on x86-64 for this example since it’s already 64 bits and the function doesn’t need the initial sign/zero extension. However, this might simply move the sign extension out to the caller.
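
That variant is a one-line change (a sketch; as noted, the win may simply shift the extension to the caller):

#include <stddef.h>

int
cmp_size(size_t i1, size_t i2, unsigned char *buf)
{
    for (;;) {
        int c1 = buf[i1];
        int c2 = buf[i2];
        if (c1 != c2)
            return c1 - c2;
        i1++;
        i2++;
    }
}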

Strict aliasing

Another controversial undefined behavior is strict aliasing. This particular term doesn’t actually appear anywhere in the C specification, but it’s the popular name for C’s aliasing rules. In short, variables with types that aren’t compatible are not allowed to alias through pointers.

Here’s the classic example:

int
foo(int *a, int *b)
{
    *b = 0;    // store
    *a = 1;    // store
    return *b; // load
}

Naively one might assume the return *b could be optimized to a simple return 0. However, since a and b have the same type, the compiler must consider the possibility that they alias — that they point to the same place in memory — and must generate code that works correctly under these conditions.

If foo has a narrow contract that forbids a and b to alias, we have a couple of options for helping our compiler.

First, we could manually resolve the aliasing issue by returning 0 explicitly. In more complicated functions this might mean making local copies of values, working only with those local copies, then storing the results back before returning. Then aliasing would no longer matter.

int
foo(int *a, int *b)
{
    *b = 0;
    *a = 1;
    return 0;
}

Second, C99 introduced a restrict qualifier to communicate to the compiler that pointers passed to a function cannot alias. For example, the pointer parameters of memcpy() are qualified with restrict as of C99. Passing aliasing pointers through restrict parameters is undefined behavior, i.e. as far as the compiler is concerned, it never happens.

int foo(int *restrict a, int *restrict b);

The third option is to design an interface that uses incompatible types, exploiting strict aliasing. This happens all the time, usually by accident. For example, int and long are never compatible even when they have the same representation.

int foo(int *a, long *b);

If you use an extended or modified version of C without strict aliasing (-fno-strict-aliasing), then the compiler must assume everything aliases all the time, generating a lot more precautionary loads than necessary.

What irritates a lot of people is that compilers will still apply the strict aliasing rule even when it’s trivial for the compiler to prove that aliasing is occurring:

/* note: forbidden */
long a;
int *b = (int *)&a;

It’s not just a simple matter of making exceptions for these cases. The language specification would need to define all the rules about when and where incompatible types are permitted to alias, and developers would have to understand all these rules if they wanted to take advantage of the exceptions. It can’t just come down to trusting that the compiler is smart enough to see the aliasing when it’s sufficiently simple. It would need to be carefully defined.

Besides, there are probably conforming, portable solutions that, with contemporary compilers, will safely compile to the efficient code you actually want anyway.
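
One such approach (my sketch, not spelled out in the article) is to express the reinterpretation with memcpy(), which contemporary compilers typically reduce to a single move:

#include <stdint.h>
#include <string.h>

/* Read the bit pattern of a float without an aliasing-violating cast. */
uint32_t
float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}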

There is one special exception for strict aliasing: char * is allowed to alias with anything. This is important to keep in mind both when you intentionally want aliasing, but also when you want to avoid it. Writing through a char * pointer could force the compiler to generate additional, unnecessary loads.
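
Here's a hypothetical illustration (mine) of that cost:

int
sum_twice(int *a, char *tag)
{
    int total = *a;
    *tag = 1;      /* a char * store may alias *a ... */
    total += *a;   /* ... so the compiler must reload *a here */
    return total;
}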

In fact, there’s a whole dimension to strict aliasing that, even today, no compiler yet exploits: uint8_t is not necessarily unsigned char. That’s just one possible typedef definition for it. It could instead typedef to, say, some internal __byte type.

In other words, technically speaking, uint8_t does not have the strict aliasing exemption. If you wanted to write bytes to a buffer without the compiler worrying about aliasing with other pointers, this would be the tool to accomplish it. Unfortunately, so much existing code violates this part of strict aliasing that no toolchain is willing to exploit it for optimization purposes.

Other undefined behaviors

Some kinds of undefined behavior don’t have performance or portability benefits. They’re only there to make the compiler’s job a little simpler. Today, most of these are caught trivially at compile time as syntax or semantic issues (e.g. a pointer cast to a float).

Some others are obvious about their performance benefits and don’t require much explanation. For example, it’s undefined behavior to index out of bounds (with some special exceptions for one past the end), meaning compilers are not obligated to generate those checks, instead relying on the programmer to arrange, by whatever means, that it doesn’t happen.
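
A tiny illustration (mine): the compiler emits a plain load here, and it's entirely the caller's job to guarantee the index is in range.

int
nth(const int *a, int i)
{
    return a[i];  /* no implicit bounds check is generated */
}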

Undefined behavior is like nitro, a dangerous, volatile substance that makes things go really, really fast. You could argue that it’s too dangerous to use in practice, but the aggressive use of undefined behavior is not without merit.

Intercepting and Emulating Linux System Calls with Ptrace

The ptrace(2) (“process trace”) system call is usually associated with debugging. It’s the primary mechanism through which native debuggers monitor debuggees on unix-like systems. It’s also the usual approach for implementing strace — system call trace. With Ptrace, tracers can pause tracees, inspect and set registers and memory, monitor system calls, or even intercept system calls.

By intercept, I mean that the tracer can mutate system call arguments, mutate the system call return value, or even block certain system calls. Reading between the lines, this means a tracer can fully service system calls itself. This is particularly interesting because it also means a tracer can emulate an entire foreign operating system. This is done without any special help from the kernel beyond Ptrace.

The catch is that a process can only have one tracer attached at a time, so it’s not possible to emulate a foreign operating system while also debugging that process with, say, GDB. The other issue is that emulated system calls will have higher overhead.

For this article I’m going to focus on Linux’s Ptrace on x86-64, and I’ll be taking advantage of a few Linux-specific extensions. I’ll also be omitting error checks here, but the full source code listings have them.

You can find runnable code for the examples in this article here:

https://github.com/skeeto/ptrace-examples

strace

Before getting into the really interesting stuff, let’s start by reviewing a bare bones implementation of strace. It’s no DTrace, but strace is still incredibly useful.

Ptrace has never been standardized. Its interface is similar across different operating systems, especially in its core functionality, but it’s still subtly different from system to system. The ptrace(2) prototype generally looks something like this, though the specific types may be different.

long ptrace(int request, pid_t pid, void *addr, void *data);

The pid is the tracee’s process ID. While a tracee can have only one tracer attached at a time, a tracer can be attached to many tracees.

The request field selects a specific Ptrace function, just like the ioctl(2) interface. For strace, only two are needed:

  PTRACE_TRACEME: This process is to be traced by its parent.
  PTRACE_SYSCALL: Continue, but stop at the entrance to, or exit from, the next system call.

The other two fields, addr and data, serve as generic arguments for the selected Ptrace function. One or both are often ignored, in which case I pass zero.

The strace interface is essentially a prefix to another command.

$ strace [strace options] program [arguments]

My minimal strace doesn’t have any options, so the first thing to do — assuming it has at least one argument — is fork(2) and exec(2) the tracee process on the tail of argv. But before loading the target program, the new process will inform the kernel that it’s going to be traced by its parent. The tracee will be paused by this Ptrace system call.

pid_t pid = fork();
switch (pid) {
    case -1: /* error */
        FATAL("%s", strerror(errno));
    case 0:  /* child */
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execvp(argv[1], argv + 1);
        FATAL("%s", strerror(errno));
}

The parent waits for the child’s PTRACE_TRACEME using wait(2). When wait(2) returns, the child will be paused.

waitpid(pid, 0, 0);

Before allowing the child to continue, we tell the operating system that the tracee should be terminated along with its parent. A real strace implementation may want to set other options, such as PTRACE_O_TRACEFORK.

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL);

All that’s left is a simple, endless loop that catches on system calls one at a time. The body of the loop has four steps:

  1. Wait for the process to enter the next system call.
  2. Print a representation of the system call.
  3. Allow the system call to execute and wait for the return.
  4. Print the system call return value.

The PTRACE_SYSCALL request is used in both waiting for the next system call to begin, and waiting for that system call to exit. As before, a wait(2) is needed to wait for the tracee to enter the desired state.

ptrace(PTRACE_SYSCALL, pid, 0, 0);
waitpid(pid, 0, 0);

When wait(2) returns, the registers for the thread that made the system call are filled with the system call number and its arguments. However, the operating system has not yet serviced this system call. This detail will be important later.

The next step is to gather the system call information. This is where it gets architecture specific. On x86-64, the system call number is passed in rax, and the arguments (up to 6) are passed in rdi, rsi, rdx, r10, r8, and r9. Reading the registers is another Ptrace call, though there’s no need to wait(2) since the tracee isn’t changing state.

struct user_regs_struct regs;
ptrace(PTRACE_GETREGS, pid, 0, &regs);
long syscall = regs.orig_rax;

fprintf(stderr, "%ld(%ld, %ld, %ld, %ld, %ld, %ld)",
        syscall,
        (long)regs.rdi, (long)regs.rsi, (long)regs.rdx,
        (long)regs.r10, (long)regs.r8,  (long)regs.r9);

There’s one caveat. For internal kernel purposes, the system call number is stored in orig_rax rather than rax. All the other system call arguments are straightforward.

Next it’s another PTRACE_SYSCALL and wait(2), then another PTRACE_GETREGS to fetch the result. The result is stored in rax.

ptrace(PTRACE_GETREGS, pid, 0, &regs);
fprintf(stderr, " = %ld\n", (long)regs.rax);

The output from this simple program is very crude. There is no symbolic name for the system call and every argument is printed numerically, even if it’s a pointer to a buffer. A more complete strace would know which arguments are pointers and use process_vm_readv(2) to read those buffers from the tracee in order to print them appropriately.
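
For example, a helper for pulling such a buffer out of the tracee might look like this (a minimal sketch with error checks omitted; the helper name is mine):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Copy len bytes from the tracee's memory at addr into a local buffer. */
static ssize_t
copy_from_tracee(pid_t pid, void *local, unsigned long addr, size_t len)
{
    struct iovec lvec = {local, len};
    struct iovec rvec = {(void *)addr, len};
    return process_vm_readv(pid, &lvec, 1, &rvec, 1, 0);
}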

However, this does lay the groundwork for system call interception.

System call interception

Suppose we want to use Ptrace to implement something like OpenBSD’s pledge(2), in which a process pledges to use only a restricted set of system calls. The idea is that many programs have an initialization phase where they need lots of system access (opening files, binding sockets, etc.). After initialization they enter a main loop in which they process input, and only a small set of system calls is needed.

Before entering this main loop, a process can limit itself to the few operations that it needs. If the program has a flaw allowing it to be exploited by bad input, the pledge significantly limits what the exploit can accomplish.

Using the same strace model, rather than print out all system calls, we could either block certain system calls or simply terminate the tracee when it misbehaves. Termination is easy: just call exit(2) in the tracer, since PTRACE_O_EXITKILL was set and the tracee will be terminated along with it. Blocking the system call and allowing the child to continue is a little trickier.

The tricky part is that there’s no way to abort a system call once it’s started. When the tracer returns from wait(2) at the entrance to the system call, the only way to stop that system call from happening is to terminate the tracee.

However, not only can we mess with the system call arguments, we can change the system call number itself, converting it to a system call that doesn’t exist. On return we can report a “friendly” EPERM error in errno via the normal in-band signaling.

for (;;) {
    /* Enter next system call */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    /* Is this system call permitted? */
    int blocked = 0;
    if (is_syscall_blocked(regs.orig_rax)) {
        blocked = 1;
        regs.orig_rax = -1; // set to invalid syscall
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }

    /* Run system call and stop on exit */
    ptrace(PTRACE_SYSCALL, pid, 0, 0);
    waitpid(pid, 0, 0);

    if (blocked) {
        /* errno = EPERM */
        regs.rax = -EPERM; // Operation not permitted
        ptrace(PTRACE_SETREGS, pid, 0, &regs);
    }
}

This simple example only checks against a whitelist or blacklist of system calls. And there’s no nuance, such as allowing files to be opened (open(2)) read-only but not as writable, allowing anonymous memory maps but not non-anonymous mappings, etc. There’s also no way for the tracee to dynamically drop privileges.

How could the tracee communicate to the tracer? Use an artificial system call!

Creating an artificial system call

For my new pledge-like system call — which I call xpledge() to distinguish it from the real thing — I picked system call number 10000, a nice high number that’s unlikely to ever be used for a real system call.

#define SYS_xpledge 10000

Just for demonstration purposes, I put together a minuscule interface that’s not good for much in practice. It has little in common with OpenBSD’s pledge(2), which uses a string interface. Actually designing robust and secure sets of privileges is really complicated, as the pledge(2) manpage shows. Here’s the entire interface and implementation of the system call for the tracee:

#define _GNU_SOURCE
#include <unistd.h>

#define XPLEDGE_RDWR  (1 << 0)
#define XPLEDGE_OPEN  (1 << 1)

#define xpledge(arg) syscall(SYS_xpledge, arg)

If the program passes zero for the argument, only a few basic system calls are allowed, including those used to allocate memory (e.g. brk(2)). The XPLEDGE_RDWR bit allows various read and write system calls (read(2), readv(2), pread(2), preadv(2), etc.). The XPLEDGE_OPEN bit allows open(2).

To prevent privileges from being escalated back, xpledge() blocks itself — though this also prevents dropping more privileges later down the line.
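
To make that concrete, here's a rough sketch of the tracer-side bookkeeping (the function names match the calls used in the tracer above and below, but the exact system call lists are illustrative, not a complete or secure policy):

#include <sys/syscall.h>

static long pledge_flags = -1;  /* -1: not yet pledged, allow everything */

static void
register_pledge(long flags)
{
    pledge_flags = flags;
}

static int
is_syscall_blocked(long syscall)
{
    if (pledge_flags == -1)
        return 0;                       /* no pledge yet */
    switch (syscall) {
        case SYS_brk:
        case SYS_exit_group:
            return 0;                   /* always allowed */
        case SYS_read:
        case SYS_write:
            return !(pledge_flags & XPLEDGE_RDWR);
        case SYS_open:
        case SYS_openat:
            return !(pledge_flags & XPLEDGE_OPEN);
        default:
            return 1;                   /* everything else is blocked */
    }
}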

In the xpledge tracer, I just need to check for this system call:

/* Handle entrance */
switch (regs.orig_rax) {
    case SYS_xpledge:
        register_pledge(regs.rdi);
        break;
}

The operating system will return ENOSYS (Function not implemented) since this isn’t a real system call. So on the way out I overwrite this with a success (0).

/* Handle exit */
switch (regs.orig_rax) {
    case SYS_xpledge:
        ptrace(PTRACE_POKEUSER, pid, RAX * 8, 0);
        break;
}

I wrote a little test program that opens /dev/urandom, makes a read, tries to pledge, then tries to open /dev/urandom a second time, then confirms it can read from the original /dev/urandom file descriptor. Running without a pledge tracer, the output looks like this:

$ ./example
fread("/dev/urandom")[1] = 0xcd2508c7
XPledging...
XPledge failed: Function not implemented
fread("/dev/urandom")[2] = 0x0be4a986
fread("/dev/urandom")[1] = 0x03147604

Making an invalid system call doesn’t crash an application. It just fails, which is a rather convenient fallback. When run under the tracer, it looks like this:

$ ./xpledge ./example
fread("/dev/urandom")[1] = 0xb2ac39c4
XPledging...
fopen("/dev/urandom")[2]: Operation not permitted
fread("/dev/urandom")[1] = 0x2e1bd1c4

The pledge succeeds but the second fopen(3) does not since the tracer blocked it with EPERM.

This concept could be taken much further, to, say, change file paths or return fake results. A tracer could effectively chroot its tracee, prepending some chroot path to the root of any path passed through a system call. It could even lie to the process about what user it is, claiming that it’s running as root. In fact, this is exactly how the Fakeroot NG program works.

Foreign system emulation

Suppose you don’t just want to intercept some system calls, but all system calls. You’ve got a binary intended to run on another operating system, so none of the system calls it makes will ever work.

You could manage all this using only what I’ve described so far. The tracer would always replace the system call number with a dummy, allow it to fail, then service the system call itself. But that’s really inefficient. That’s essentially three context switches for each system call: one to stop on the entrance, one to make the always-failing system call, and one to stop on the exit.

The Linux version of Ptrace has had a more efficient operation for this technique since 2005: PTRACE_SYSEMU. With it, Ptrace stops only once per system call, and it’s up to the tracer to service that system call before allowing the tracee to continue.

for (;;) {
    ptrace(PTRACE_SYSEMU, pid, 0, 0);
    waitpid(pid, 0, 0);

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);

    switch (regs.orig_rax) {
        case OS_read:
            /* ... */

        case OS_write:
            /* ... */

        case OS_open:
            /* ... */

        case OS_exit:
            /* ... */

        /* ... and so on ... */
    }
}
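
As an illustration, servicing the hypothetical OS_write case might look something like this, assuming the foreign ABI passes arguments in the same registers and the usual headers (unistd.h, sys/uio.h) are included (a sketch, error handling omitted):

case OS_write: {
    /* copy the tracee's buffer out, do the real write(2), report the result */
    char buf[4096];
    size_t len = regs.rdx < sizeof(buf) ? regs.rdx : sizeof(buf);
    struct iovec lvec = {buf, len};
    struct iovec rvec = {(void *)regs.rsi, len};
    process_vm_readv(pid, &lvec, 1, &rvec, 1, 0);
    regs.rax = write(regs.rdi, buf, len);
    ptrace(PTRACE_SETREGS, pid, 0, &regs);
} break;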

To run binaries for the same architecture from any system with a stable (enough) system call ABI, you just need this PTRACE_SYSEMU tracer, a loader (to take the place of exec(2)), and whatever system libraries the binary needs (or only run static binaries).

In fact, this sounds like a fun weekend project.
