nullprogram.com/blog/2018/01/17/
Update: This article was featured on BSD Now 233 (starting
at 21:38).
For some time Elfeed was experiencing a strange, spurious
failure. Every so often users were seeing an error (spoiler
warning) when updating feeds: “error in process sentinel: Search
failed.” If you use Elfeed, you might have even seen this yourself.
On the surface it appeared that curl, which is responsible for
downloading feed data, was producing incomplete output despite
reporting a successful run. Since the run was successful, Elfeed
assumed certain data was present in curl's output buffer, and when it
wasn't, Elfeed failed hard.
Unfortunately this issue was not reproducible. Manually running curl
outside of Emacs never revealed any issues. Asking Elfeed to retry
fetching the feeds would work fine. The issue would only randomly rear
its head when Elfeed was fetching many feeds in parallel, under
stress. By the time the error was discovered, the curl process had
exited and vital debugging information was lost. Considering that
this was likely to be a bug in Emacs itself, there really wasn’t a
reliable way to capture the necessary debugging information from
within Emacs Lisp. And, indeed, this later proved to be the case.
A quick-and-dirty workaround is to use condition-case
to catch and
swallow the error. When the bizarre issue shows up, rather than fail
badly in front of the user, Elfeed could attempt to swallow the error
— assuming it can be reliably detected — and treat the fetch as simply
a failure. That didn’t sit comfortably with me. Elfeed had done its
due diligence checking for errors already. Someone was lying to
Elfeed, and I intended to catch them with their pants on fire.
Someday.
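
For illustration only, a swallow-the-error workaround might have looked
something like the sketch below. The buffer contents and the search for
the end of the HTTP headers are stand-ins for wherever the real search
happens inside Elfeed; only the search-failed error symbol is Emacs'
own (it's what produces the "Search failed" message).

;; Sketch of the rejected workaround: catch the `search-failed' error
;; instead of letting it escape from the process sentinel, and treat
;; the fetch as a plain failure. The buffer contents and the regexp are
;; stand-ins for what Elfeed really does with curl's output.
(condition-case nil
    (with-temp-buffer
      (insert "HTTP/1.1 200 OK\n")      ; truncated output: headers cut short
      (goto-char (point-min))
      (re-search-forward "\n\r?\n")     ; look for the end of the HTTP headers
      (message "headers complete"))
  (search-failed
   (message "Swallowing the error; treating the fetch as failed")
   nil))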
I’d just need to witness the bug on one of my own machines. Elfeed is
part of my daily routine, so surely I'd experience this issue myself
someday. My plan was, should that day come, to run a modified Elfeed,
instrumented to capture extra data. I would also routinely run Emacs
under GDB so that I could inspect the failure more deeply.
For now I just had to wait to hunt that zebra.
Bryan Cantrill, DTrace, and FreeBSD
Over the holidays I re-discovered Bryan Cantrill, a systems
software engineer who worked for Sun between 1996 and 2010, and who is
best known for DTrace. My first exposure to him was in a BSD
Now interview in 2015. I re-watched that interview and decided
there was a lot more I had to learn from him. He's become a personal
hero to me. So I scoured the internet for more of his writing and
talks. Besides what I’ve already linked in this article, here
are a couple more great presentations:
You can also find some of his writing scattered around the DTrace
blog.
Some interesting operating system technology came out of Sun during
its final 15 or so years — most notably DTrace and ZFS — and Bryan
speaks about it passionately. Almost as a matter of luck, most of it
survived the Oracle acquisition thanks to Sun releasing it as open
source just in the nick of time. Otherwise it would have been lost
forever. The scattered ex-Sun employees, still passionate about their
prior work at Sun, along with some of their old customers have since
picked up the pieces and kept going as a community under the name
illumos. It’s like an open source flotilla.
Naturally I wanted to get my hands on this stuff to try it out for
myself. Is it really as good as they say? Normally I stick to Linux,
but it (generally) doesn’t have these Sun technologies. The main
reason is license incompatibility. Sun released its code under the
CDDL, which is incompatible with the GPL. Ubuntu does
infamously include ZFS, but other distributions are
unwilling to take that risk. Porting DTrace is a serious undertaking
since it’s got its fingers throughout the kernel, which also makes the
licensing issues even more complicated.
(Update February 2018: DTrace has been released under the
GPLv2, allowing it to be legally integrated with Linux.)
Linux has a reputation for Not Invented Here (NIH) syndrome, and these
licensing issues certainly contribute to that. Rather than adopt ZFS
and DTrace, they’ve been reinvented from scratch: btrfs instead of
ZFS, and a slew of partial options instead of DTrace.
Normally I'm most interested in system call tracing, and my go-to is
strace, though it certainly has its limitations — including
this situation of debugging curl under Emacs. Another famous example
of NIH is Linux's epoll(2), which is a broken version of BSD kqueue(2).
So, if I want to try these for myself, I’ll need to install a
different operating system. I’ve dabbled with OmniOS, an OS
built on illumos, in virtual machines, using it as an alien
environment to test some of my software (e.g. enchive).
OmniOS has a philosophy called Keep Your Software To Yourself
(KYSTY), which is really just code for “we don’t do packaging.”
Honestly, you can’t blame them since they’re a tiny community.
The best solution to this is probably pkgsrc, which is
essentially a universal packaging system. Otherwise you’re on your
own.
There’s also openindiana, which is a more friendly
desktop-oriented illumos distribution. Still, the short of it is that
you’re very much on your own when things don’t work. The situation is
like running Linux a couple decades ago, when it was still difficult
to do.
If you’re interested in trying DTrace, the easiest option these days is
probably FreeBSD. It’s got a big, active community, thorough
documentation, and a huge selection of packages. Its license (the BSD
license, duh) is compatible with the CDDL, so both ZFS and DTrace have
been ported to FreeBSD.
What is DTrace?
I’ve done all this talking but haven’t yet described what DTrace
really is. I won’t pretend to write my own tutorial, but I’ll
provide enough information to follow along. DTrace is a tracing
framework for debugging production systems in real time, both for
the kernel and for applications. The “production systems” part means
it’s stable and safe — using DTrace won’t put your system at risk of
crashing or damaging data. The "real time" part means it has
negligible impact on performance, so you can use DTrace on live,
active systems. Both of these core design principles are vital for
troubleshooting those really tricky bugs that only show up in
production.
There are DTrace probes scattered all throughout the system: on
system calls, scheduler events, networking events, process events,
signals, virtual memory events, etc. Using a specialized language
called D (unrelated to the general purpose programming language D),
you can dynamically add behavior at these instrumentation points.
Generally the behavior is to capture information, but it can also
manipulate the event being traced.
Each probe is fully identified by a 4-tuple delimited by colons:
provider, module, function, and probe name. An empty element denotes a
sort of wildcard. For example, syscall::open:entry is a probe at the
beginning (i.e. "entry") of open(2). syscall:::entry matches all
system call entry probes.
Unlike strace on Linux, which monitors a specific process, DTrace
applies to the entire system when active. To run curl under strace
from Emacs, I’d have to modify Emacs’ behavior to do so. With DTrace I
can instrument every curl process without making a single change to
Emacs, and with negligible impact to Emacs. That’s a big deal.
So, when it comes to this Elfeed issue, FreeBSD is much better poised
for debugging the problem. All I have to do is catch it in the act.
However, it’s been months since that bug report and I’m not really
making this connection yet. I’m just hoping I eventually find an
interesting problem where I can apply DTrace.
FreeBSD on a Raspberry Pi 2
So I've settled on FreeBSD as the playground for these technologies; I
just have to decide where. I could always run it in a virtual machine,
but it’s always more interesting to try things out on real hardware.
FreeBSD supports the Raspberry Pi 2 as a Tier 2 system, and
I had a Raspberry Pi 2 sitting around collecting dust, so I put it to
use.
I wrote the image to an SD card, and for a few days I stretched my
legs on this new system. I cloned a couple dozen of my own git
repositories, ran the builds and the tests, and just got a feel for
things. I tried out the ports system for the first time, mainly to
discover that the low-powered Raspberry Pi 2 takes days to build some
of the packages I want to try.
I mostly program in Vim these days, so it’s some days before I
even set up Emacs. Eventually I do build Emacs, clone my
configuration, fire it up, and give Elfeed a spin.
And that’s when the “search failed” bug strikes! Not just once, but
dozens of times. Perfect! This low-powered platform is the jackpot for
this particular bug, triggering it left and right. Given that I’ve got
DTrace at my disposal, it’s the perfect place to debug this.
Something is lying to Elfeed and DTrace will play the judge.
Before I dive in I see three possibilities:
- curl is reporting success but truncating its output.
- Emacs is quietly truncating curl’s output.
- Emacs is misinterpreting curl’s exit status.
With DTrace I can observe what every curl process writes to Emacs, and
I can also double check curl’s exit status. I come up with the
following (newbie) DTrace script:
syscall::write:entry
/execname == "curl"/
{
    printf("%d WRITE %d \"%s\"\n",
           pid, arg2, stringof(copyin(arg1, arg2)));
}

syscall::exit:entry
/execname == "curl"/
{
    printf("%d EXIT %d\n", pid, arg0);
}
The /execname == "curl"/
is a predicate that (obviously) causes the
behavior to only fire for curl processes. The first probe has DTrace
print a line for every write(2)
from curl. arg0
, arg1
, and
arg2
correspond to the arguments of write(2)
: fd, buf, count. It
logs the process ID (pid) of the write, the length of the write, and
the actual contents written. Remember that these curl processes are
run in parallel by Emacs, so the pid allows me to associate the
separate writes and the exit status.
The second probe prints the pid and the exit status (the first argument
to exit(2)
).
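
(For reference, a standalone script like this is typically run as root
with something along the lines of dtrace -s curl.d, where curl.d is
whatever file the script was saved under; that file name is my own
placeholder, not anything from Elfeed.)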
I also want to compare this to exactly what is delivered to Elfeed when
curl exits, so I modify the process sentinel — the callback
that handles a subprocess exiting — to call write-file
before any
action is taken. I can compare these buffer dumps to the logs produced
by DTrace.
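
To give a sense of what that instrumentation might look like, here is a
minimal sketch under two assumptions: elfeed-curl--sentinel is my guess
at the name of the sentinel that handles Elfeed's curl processes, and
the temp-file naming is arbitrary.

;; Hypothetical instrumentation: dump the raw process buffer to a file
;; before the real sentinel runs, so it can be compared to the DTrace log.
(defun my/elfeed-dump-curl-output (process _event)
  "Save PROCESS's output buffer to a temp file for later inspection."
  (when (buffer-live-p (process-buffer process))
    (with-current-buffer (process-buffer process)
      (write-file (make-temp-file "elfeed-curl-" nil ".dump")))))

;; `elfeed-curl--sentinel' is an assumed name; adjust it to whatever
;; function actually handles curl's exit in your copy of Elfeed.
(advice-add 'elfeed-curl--sentinel :before #'my/elfeed-dump-curl-output)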
There are two important findings.
First, when the "search failed" bug occurred, the buffer was either
completely empty (95% of the time) or truncated at the end of the HTTP
headers (5% of the time), right at the blank line. DTrace indicates
that curl did its job in full, so it's Emacs who's the liar. It's not
delivering all of curl's data to Elfeed. That's pretty annoying.
Second, curl was line-buffered. Each line was a separate, independent
write(2). I was certainly not expecting this. Normally
the C library only does line buffering when the output is a terminal.
That’s because it’s guessing a user may be watching, expecting the
output to arrive a line at a time.
Here’s a sample of what it looked like in the log:
88188 WRITE 32 "Server: Apache/2.4.18 (Ubuntu)
"
88188 WRITE 46 "Location: https://blog.plover.com/index.atom
"
88188 WRITE 21 "Content-Length: 299
"
88188 WRITE 45 "Content-Type: text/html; charset=iso-8859-1
"
88188 WRITE 2 "
"
Why would curl think Emacs is a terminal?
Oh. That’s right. This is the same problem I ran into four years
ago when writing EmacSQL. By default Emacs connects to
subprocesses through a pseudo-terminal (pty). I called this a mistake
in Emacs back then, and I still stand by that claim. The pty causes
weird, annoying problems for little benefit:
- Interpreting control characters. Hope you weren’t transferring binary
data!
- Subprocesses will generally be line-buffered. This makes them
slower, though in some situations it might be desirable.
- Stdout and stderr get mixed together. (Optional since Emacs 25.)
- New! There’s a bug somewhere in Emacs that causes truncation when
ptys are used heavily in parallel.
Just from eyeballing the DTrace log I knew what to do: dump the pty
and switch to a pipe. This is controlled with the
process-connection-type
variable, and fixing it is a
one-liner.
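
The shape of that one-liner is roughly the following (a sketch, not
Elfeed's actual code; the buffer name, URL, and curl arguments are
placeholders for however Elfeed really invokes curl):

;; Bind `process-connection-type' to nil so the subprocess talks to
;; Emacs over a pipe instead of a pty. The arguments are placeholders.
(let ((process-connection-type nil))
  (start-process "curl" (generate-new-buffer " *curl*")
                 "curl" "--silent" "--location"
                 "https://example.com/feed.xml"))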
Not only did this completely resolve the truncation issue, but Elfeed
is also noticeably faster at fetching feeds on all machines. It's no
longer receiving mountains of XML one line at a time, like sucking
pudding through a straw. It's now quite zippy even on my Raspberry Pi
2, which had never been the case before, even when the "search failed"
bug didn't strike.
Even if you were never affected by this bug, you will benefit from the
fix.
I haven’t officially reported this as an Emacs bug yet because
reproducibility is still an issue. It needs something better than
“fire off a bunch of HTTP requests across the internet in parallel
from a Raspberry Pi.”
The fix reminds me of that old boilermaker story about
charging a lot of money just to swing a hammer. Once the problem
arose, DTrace quickly helped to identify the place to hit Emacs with
the hammer.
Finally, a big thanks to alphapapa for originally taking the time to
report this bug months ago.