Articles tagged elfeed at null program

I have officially retired from Emacs

2026-04-26T00:00:00Z

This article was discussed on reddit and on Hacker News.

This past Tuesday I typed C-x C-c in Emacs for the last time after 20 years of daily use, though I spent nearly half that time gradually retiring it, switching first to modal editing, then to Vim. Emacs is a platform, and I’d grown accustomed to its applications, especially those I built myself. There was no particular hurry, so replacements came slowly. With my newly-acquired superpowers I could knock out the last two pieces in a few days’ work, namely replacing M-x calc with stackcalc and Elfeed with Elfeed2. I’m especially excited about the latter because it already exceeds the original. Both are multi-platform, native C++ GUI applications using native UI components.

These actively-in-use packages require new maintainers (apply on the project’s issues/discussion):

No wonder it took so long for me to move on! I’m not handing these off to just anyone, and you’ll need to establish your reputation. Having already made contributions is a good sign, even if never merged. I’m willing to transfer them off my namespace, though you’ll need to manage the Melpa hand-off (on which I’ll sign off). If there are no takers, these projects will be archived but not deleted.

Trying out wxWidgets

The Emacs Calculator is amazing and the best calculator I’ve ever used, which is why nothing I could find was going to replace it. My clone uses GMP and MPFR for multi-precision, so it’s far faster, as is to be expected, but it’s not nearly at feature parity. It’s missing esoteric features including symbolic processing, though it’s enough to cover all of my own usage. I can add more features later. The Emacs Calculator manual served as a specification when building stackcalc.

Elfeed has been a cornerstone of my daily routines for the past 13 years. Nothing else I’ve found scratches that itch for me, so I’ve always known it would require a rewrite someday. Knowing it would take a few weeks of work, and that I already had the feed reader I wanted, made motivation difficult to find. Though now that I can accomplish ~3 weeks of old-way work in a new-way day, this sort of project becomes that much easier to start and finish. Though it’s not yet at a 1.0 release, after a couple days Elfeed2 was working well enough to replace the original Elfeed.

While Dear ImGui was the right choice for dcmake, it would not be so for these two applications. Active rendering doesn’t suit a feed reader left running all day, and I needed a richer toolkit. Professionally I work in Qt, but I wanted something lighter-weight for my projects, accessible via CMake FetchContent. That naturally led to wxWidgets. While it has issues — mitigatable character encoding problems, accidental quadratic time in many places — it’s worked better than I anticipated, letting me rapidly produce native-looking applications on Windows, macOS, and Linux.

Unlike Dear ImGui, wxWidgets is a platform, including sane I/O and path handling. I mostly don’t need platform layers when building applications like these. I can simply rely on wxWidgets’ utilities.

Both of these projects build out-of-the-box on w64devkit thanks to the dependencies being FetchContent-compatible. On all platforms you just need a C++ toolchain and CMake:

$ cmake -B build
$ cmake --build build

Now that I have experience with wxWidgets, learning its limitations and capabilities, it’s likely to be a foundation of most of my GUI projects to come, except where something like Dear ImGui is a better fit.

Debugging Emacs or: How I Learned to Stop Worrying and Love DTrace

2018-01-17T23:59:49Z

Update: This article was featured on BSD Now 233 (starting at 21:38).

For some time Elfeed was experiencing a strange, spurious failure. Every so often users were seeing an error (spoiler warning) when updating feeds: “error in process sentinel: Search failed.” If you use Elfeed, you might have even seen this yourself. From the surface it appeared that curl, tasked with the responsibility for downloading feed data, was producing incomplete output despite reporting a successful run. Since the run was successful, Elfeed assumed certain data was in curl’s output buffer, but, since it wasn’t, it failed hard.

Unfortunately this issue was not reproducible. Manually running curl outside of Emacs never revealed any issues. Asking Elfeed to retry fetching the feeds would work fine. The issue would only randomly rear its head when Elfeed was fetching many feeds in parallel, under stress. By the time the error was discovered, the curl process had exited and vital debugging information was lost. Considering that this was likely to be a bug in Emacs itself, there really wasn’t a reliable way to capture the necessary debugging information from within Emacs Lisp. And, indeed, this later proved to be the case.

A quick-and-dirty workaround is to use condition-case to catch and swallow the error. When the bizarre issue shows up, rather than fail badly in front of the user, Elfeed could attempt to swallow the error (assuming it can be reliably detected) and treat the fetch as simply a failure. That didn’t sit comfortably with me. Elfeed had done its due diligence checking for errors already. Someone was lying to Elfeed, and I intended to catch them with their pants on fire. Someday.

I’d just need to witness the bug on one of my own machines. Elfeed is part of my daily routine, so surely I’d have to experience this issue myself someday. My plan was, should that day come, to run a modified Elfeed, instrumented to capture extra data. I would have also routinely run Emacs under GDB so that I could inspect the failure more deeply.

For now I just had to wait to hunt that zebra.

Bryan Cantrill, DTrace, and FreeBSD

Over the holidays I re-discovered Bryan Cantrill, a systems software engineer who worked for Sun between 1996 and 2010, and is most well known for DTrace. My first exposure to him was in a BSD Now interview in 2015. I had re-watched that interview and decided there was a lot more I had to learn from him. He’s become a personal hero to me. So I scoured the internet for more of his writing and talks. Besides what I’ve already linked in this article, here are a couple more great presentations:

You can also find some of his writing scattered around the DTrace blog.

Some interesting operating system technology came out of Sun during its final 15 or so years — most notably DTrace and ZFS — and Bryan speaks about it passionately. Almost as a matter of luck, most of it survived the Oracle acquisition thanks to Sun releasing it as open source in just the nick of time. Otherwise it would have been lost forever. The scattered ex-Sun employees, still passionate about their prior work at Sun, along with some of their old customers have since picked up the pieces and kept going as a community under the name illumos. It’s like an open source flotilla.

Naturally I wanted to get my hands on this stuff to try it out for myself. Is it really as good as they say? Normally I stick to Linux, but it (generally) doesn’t have these Sun technologies. The main reason is license incompatibility. Sun released its code under the CDDL, which is incompatible with the GPL. Ubuntu does infamously include ZFS, but other distributions are unwilling to take that risk. Porting DTrace is a serious undertaking since it’s got its fingers throughout the kernel, which also makes the licensing issues even more complicated.

(Update February 2018: DTrace has been released under the GPLv2, allowing it to be legally integrated with Linux.)

Linux has a reputation for Not Invented Here (NIH) syndrome, and these licensing issues certainly contribute to that. Rather than adopt ZFS and DTrace, they’ve been reinvented from scratch: btrfs instead of ZFS, and a slew of partial options instead of DTrace. Normally I’m most interested in system call tracing, and my go-to is strace, though it certainly has its limitations, including this situation of debugging curl under Emacs. Another famous example of NIH is Linux’s epoll(2), which is a broken version of BSD kqueue(2).

So, if I want to try these for myself, I’ll need to install a different operating system. I’ve dabbled with OmniOS, an OS built on illumos, in virtual machines, using it as an alien environment to test some of my software (e.g. enchive). OmniOS has a philosophy called Keep Your Software To Yourself (KYSTY), which is really just code for “we don’t do packaging.” Honestly, you can’t blame them since they’re a tiny community. The best solution to this is probably pkgsrc, which is essentially a universal packaging system. Otherwise you’re on your own.

There’s also OpenIndiana, which is a friendlier, desktop-oriented illumos distribution. Still, the short of it is that you’re very much on your own when things don’t work. The situation is like running Linux a couple decades ago, when it was still difficult to do.

If you’re interested in trying DTrace, the easiest option these days is probably FreeBSD. It’s got a big, active community, thorough documentation, and a huge selection of packages. Its license (the BSD license, duh) is compatible with the CDDL, so both ZFS and DTrace have been ported to FreeBSD.

What is DTrace?

I’ve done all this talking but haven’t yet described what DTrace really is. I won’t pretend to write my own tutorial, but I’ll provide enough information to follow along. DTrace is a tracing framework for debugging production systems in real time, both for the kernel and for applications. The “production systems” part means it’s stable and safe — using DTrace won’t put your system at risk of crashing or damaging data. The “real time” part means it has little impact on performance. You can use DTrace on live, active systems with little impact. Both of these core design principles are vital for troubleshooting those really tricky bugs that only show up in production.

There are DTrace probes scattered all throughout the system: on system calls, scheduler events, networking events, process events, signals, virtual memory events, etc. Using a specialized language called D (unrelated to the general purpose programming language D), you can dynamically add behavior at these instrumentation points. Generally the behavior is to capture information, but it can also manipulate the event being traced.

Each probe is fully identified by a 4-tuple delimited by colons: provider, module, function, and probe name. An empty element denotes a sort of wildcard. For example, syscall::open:entry is a probe at the beginning (i.e. “entry”) of open(2). syscall:::entry matches all system call entry probes.

Unlike strace on Linux which monitors a specific process, DTrace applies to the entire system when active. To run curl under strace from Emacs, I’d have to modify Emacs’ behavior to do so. With DTrace I can instrument every curl process without making a single change to Emacs, and with negligible impact to Emacs. That’s a big deal.

So, when it comes to this Elfeed issue, FreeBSD is much better poised for debugging the problem. All I have to do is catch it in the act. However, it’s been months since that bug report and I’m not really making this connection yet. I’m just hoping I eventually find an interesting problem where I can apply DTrace.

FreeBSD on a Raspberry Pi 2

So I’ve settled on FreeBSD as the playground for these technologies; I just have to decide where. I could always run it in a virtual machine, but it’s always more interesting to try things out on real hardware. FreeBSD supports the Raspberry Pi 2 as a Tier 2 system, and I had a Raspberry Pi 2 sitting around collecting dust, so I put it to use.

I wrote the image to an SD card, and for a few days I stretched my legs on this new system. I cloned a couple dozen of my own git repositories, ran the builds and the tests, and just got a feel for things. I tried out the ports system for the first time, mainly to discover that the low-powered Raspberry Pi 2 takes days to build some of the packages I want to try.

I mostly program in Vim these days, so it’s some days before I even set up Emacs. Eventually I do build Emacs, clone my configuration, fire it up, and give Elfeed a spin.

And that’s when the “search failed” bug strikes! Not just once, but dozens of times. Perfect! This low-powered platform is the jackpot for this particular bug, triggering it left and right. Given that I’ve got DTrace at my disposal, it’s the perfect place to debug this. Something is lying to Elfeed and DTrace will play the judge.

Before I dive in I see three possibilities:

curl is reporting success but truncating its output.
Emacs is quietly truncating curl’s output.
Emacs is misinterpreting curl’s exit status.

With DTrace I can observe what every curl process writes to Emacs, and I can also double check curl’s exit status. I come up with the following (newbie) DTrace script:

syscall::write:entry
/execname == "curl"/
{
    printf("%d WRITE %d \"%s\"\n",
           pid, arg2, stringof(copyin(arg1, arg2)));
}

syscall::exit:entry
/execname == "curl"/
{
    printf("%d EXIT  %d\n", pid, arg0);
}

The /execname == "curl"/ is a predicate that (obviously) causes the behavior to only fire for curl processes. The first probe has DTrace print a line for every write(2) from curl. arg0, arg1, and arg2 correspond to the arguments of write(2): fd, buf, count. It logs the process ID (pid) of the write, the length of the write, and the actual contents written. Remember that these curl processes are run in parallel by Emacs, so the pid allows me to associate the separate writes and the exit status.

The second probe prints the pid and the exit status (the first argument to exit(2)).

I also want to compare this to exactly what is delivered to Elfeed when curl exits, so I modify the process sentinel — the callback that handles a subprocess exiting — to call write-file before any action is taken. I can compare these buffer dumps to the logs produced by DTrace.

There are two important findings.

First, when the “search failed” bug occurs, the buffer was completely empty (95% of the time) or truncated at the end of the HTTP headers (5% of the time), right at the blank line. DTrace indicates that curl did its job to the full, so it’s Emacs who’s the liar. It’s not delivering all of curl’s data to Elfeed. That’s pretty annoying.

Second, curl was line-buffered. Each line was a separate, independent write(2). I was certainly not expecting this. Normally the C library only does line buffering when the output is a terminal. That’s because it’s guessing a user may be watching, expecting the output to arrive a line at a time.

Here’s a sample of what it looked like in the log:

88188 WRITE 32 "Server: Apache/2.4.18 (Ubuntu)
"
88188 WRITE 46 "Location: https://blog.plover.com/index.atom
"
88188 WRITE 21 "Content-Length: 299
"
88188 WRITE 45 "Content-Type: text/html; charset=iso-8859-1
"
88188 WRITE 2 "
"

Why would curl think Emacs is a terminal?

Oh. That’s right. This is the same problem I ran into four years ago when writing EmacSQL. By default Emacs connects to subprocesses through a pseudo-terminal (pty). I called this a mistake in Emacs back then, and I still stand by that claim. The pty causes weird, annoying problems for little benefit:

Interpreting control characters. Hope you weren’t transferring binary data!
Subprocesses will generally get line buffered. This makes them slower, though in some situations it might be desirable.
Stdout and stderr get mixed together. (Optional since Emacs 25.)
New! There’s a bug somewhere in Emacs that causes truncation when ptys are used heavily in parallel.

Just from eyeballing the DTrace log I knew what to do: dump the pty and switch to a pipe. This is controlled with the process-connection-type variable, and fixing it is a one-liner.
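As a rough illustration of that kind of one-liner (not Elfeed's actual code), the sketch below binds process-connection-type to nil around start-process so the subprocess is connected through a pipe. The wrapper name my/start-with-pipe is hypothetical. Since Emacs 25, make-process also accepts :connection-type 'pipe for the same purpose.

```elisp
;; Hypothetical helper: start PROGRAM connected through a pipe
;; instead of the default pseudo-terminal (pty).
(defun my/start-with-pipe (name buffer program &rest args)
  "Like `start-process', but talk to PROGRAM over a pipe, not a pty."
  (let ((process-connection-type nil))  ; nil = pipe, t = pty
    (apply #'start-process name buffer program args)))
```

A process started this way reports no terminal, so (process-tty-name proc) returns nil, which is a quick way to confirm the pipe is in effect.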

Not only did this completely resolve the truncation issue, Elfeed is noticeably faster at fetching feeds on all machines. It’s no longer receiving mountains of XML one line at a time, like sucking pudding through a straw. It’s now quite zippy even on my Raspberry Pi 2, which had never been the case before (without the “search failed” bug). Even if you were never affected by this bug, you will benefit from the fix.

I haven’t officially reported this as an Emacs bug yet because reproducibility is still an issue. It needs something better than “fire off a bunch of HTTP requests across the internet in parallel from a Raspberry Pi.”

The fix reminds me of that old boilermaker story about charging a lot of money just to swing a hammer. Once the problem arose, DTrace quickly helped to identify the place to hit Emacs with the hammer.

Finally, a big thanks to alphapapa for originally taking the time to report this bug months ago.

Domain-Specific Language Compilation in Elfeed

2016-12-27T21:46:30Z

Last night I pushed another performance enhancement for Elfeed, this time reducing the time spent parsing feeds. It’s accomplished by compiling, during macro expansion, a jQuery-like domain-specific language within Elfeed.

Heuristic parsing

Given the nature of the domain — an under-specified standard and a lack of robust adherence — feed parsing is much more heuristic than strict. Sure, everyone’s feed XML is strictly conforming since virtually no feed reader tolerates invalid XML (thank you, XML libraries), but, for the schema, the situation resembles the de facto looseness of HTML. Sometimes important or required information is missing, or is only available in a different namespace. Sometimes, especially in the case of timestamps, it’s in the wrong format, or encoded incorrectly, or ambiguous. It’s real world data.

To get a particular piece of information, Elfeed looks in a number of different places within the feed, starting with the preferred source and stopping when the information is found. For example, to find the date of an Atom entry, Elfeed first searches for elements in this order:

<published>
<updated>
<date>
<modified>
<issued>

Failing to find any of these elements, or if no parsable date is found, it settles on the current time. Only the updated element is required, but published usually has the desired information, so it goes first. The last three are only valid for another namespace, but are useful fallbacks.

Before Elfeed even starts this search, the XML text is parsed into an s-expression using xml-parse-region — a pure Elisp XML parser included in Emacs. The search is made over the resulting s-expression.

For example, here’s a sample from the Atom specification.

<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link rel="alternate" href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>

</feed>

Which is parsed into this s-expression.

((feed ((xmlns . "http://www.w3.org/2005/Atom"))
       (title () "Example Feed")
       (link ((href . "http://example.org/")))
       (updated () "2003-12-13T18:30:02Z")
       (author () (name () "John Doe"))
       (id () "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6")
       (entry ()
              (title () "Atom-Powered Robots Run Amok")
              (link ((rel . "alternate")
                     (href . "http://example.org/2003/12/13/atom03")))
              (id () "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a")
              (updated () "2003-12-13T18:30:02Z")
              (summary () "Some text."))))

Each XML element is converted to a list. The first item is a symbol that is the element’s name. The second item is an alist of attributes — cons pairs of symbols and strings. And the rest are its children, both string nodes and other elements. I’ve trimmed the extraneous string nodes from the sample s-expression.

A subtle detail is that xml-parse-region doesn’t just return the root element. It returns a list of elements, which always happens to be a single element list, which is the root element. I don’t know why this is, but I’ve built everything to assume this structure as input.
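To make the structure concrete, here is a small sketch that parses a one-line XML string and picks the result apart with car, cadr, and cddr. The helper my/parse-xml-string and the sample XML are made up for illustration. The input contains no whitespace between elements, so there are no extraneous string nodes to trim.

```elisp
(require 'xml)

;; Hypothetical helper: parse STRING in a temporary buffer.
(defun my/parse-xml-string (string)
  "Return the result of `xml-parse-region' applied to STRING."
  (with-temp-buffer
    (insert string)
    (xml-parse-region (point-min) (point-max))))

(let* ((parsed (my/parse-xml-string
                (concat "<entry>"
                        "<link rel=\"alternate\" href=\"http://example.org/\"/>"
                        "<title>Robots</title>"
                        "</entry>")))
       (root (car parsed))           ; the result is a one-element list
       (link (nth 0 (cddr root)))    ; children start at the third item
       (title (nth 1 (cddr root))))
  (list (car root)                   ; element name, a symbol
        (cadr link)                  ; attribute alist
        (cddr title)))               ; children of <title>
;; => (entry ((rel . "alternate") (href . "http://example.org/")) ("Robots"))
```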

Elfeed strips all namespaces from both elements and attributes to make parsing simpler. As I said, it’s heuristic rather than strict, so namespaces are treated as noise.

A domain-specific language

Coding up Elfeed’s s-expression searches in straight Emacs Lisp would be tedious, error-prone, and difficult to understand. It’s a lot of loops, assoc, etc. So instead I invented a jQuery-like, CSS selector-like, domain-specific language (DSL) to express these searches concisely and clearly.

For example, all of the entry links are “selected” using this expression:

(feed entry link [rel "alternate"] :href)

Reading right-to-left, this matches every href attribute under every link element with the rel="alternate" attribute, under every entry element, under the feed root element. Symbols match element names, two-element vectors match elements with a particular attribute pair, and keywords (which must come last) narrow the selection to a specific attribute value.

Imagine hand-writing the code to navigate all these conditions for each piece of information that Elfeed requires. The RSS parser makes up to 16 such queries, and the Atom parser makes as many as 24. That would add up to a lot of tedious code.

The package (included with Elfeed) that executes this query is called “xml-query.” It comes in two flavors: xml-query and xml-query-all. The former returns just the first match, and the latter returns all matches. The naming parallels the querySelector() and querySelectorAll() DOM methods in JavaScript.

(let ((xml (elfeed-xml-parse-region)))
  (xml-query-all '(feed entry link [rel "alternate"] :href) xml))

;; => ("http://example.org/2003/12/13/atom03")

That date search I mentioned before looks roughly like this. The * matches text nodes within the selected element. It must come last just like the keyword matcher.

(or (xml-query '(feed entry published *))
    (xml-query '(feed entry updated *))
    (xml-query '(feed entry date *))
    (xml-query '(feed entry modified *))
    (xml-query '(feed entry issued *))
    (current-time))

Over the past three years, Elfeed has gained more and more of these selectors as it collects more and more information from feeds. Most recently, Elfeed collects author and category information provided by feeds. Each new query slows feed parsing a little bit, and it’s a perfect example of a program slowing down as it gains more features and capabilities.

But I don’t want Elfeed to slow down. I want it to get faster!

Optimizing the domain-specific language

Just like the primary jQuery function ($), both xml-query and xml-query-all are functions. The xml-query engine processes the selector from scratch on each invocation. It examines the first element, dispatches on its type/value to apply it to the input, and then recurses on the rest of the selector with the narrowed input, stopping when it hits the end of the list. That’s the way it’s worked from the start.
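The interpreting approach can be sketched with a toy version. This is not Elfeed's xml-query implementation; the names toy-query-all and toy-query--step are hypothetical, and the code assumes the s-expression layout described earlier (name, attribute alist, children). Each call dispatches on the type of the first selector element and recurses with the narrowed set of matches.

```elisp
(require 'cl-lib)

(defun toy-query--step (selector matches)
  "Apply SELECTOR to MATCHES, the list of elements matched so far."
  (let ((head (car selector)))
    (cond
     ;; End of the selector: whatever survived is the result.
     ((null selector) matches)
     ;; Keyword such as :href: collect that attribute's values.
     ((keywordp head)
      (let ((attr (intern (substring (symbol-name head) 1))))
        (delq nil (mapcar (lambda (node) (cdr (assq attr (cadr node))))
                          matches))))
     ;; Asterisk: collect the text children.
     ((eq head '*)
      (cl-loop for node in matches
               append (cl-remove-if-not #'stringp (cddr node))))
     ;; Vector such as [rel "alternate"]: keep elements with that pair.
     ((vectorp head)
      (let ((attr (aref head 0))
            (value (aref head 1)))
        (toy-query--step
         (cdr selector)
         (cl-remove-if-not
          (lambda (node) (equal value (cdr (assq attr (cadr node)))))
          matches))))
     ;; Plain symbol: descend into children with that element name.
     (t
      (toy-query--step
       (cdr selector)
       (cl-loop for node in matches
                append (cl-remove-if-not
                        (lambda (child)
                          (and (consp child) (eq (car child) head)))
                        (cddr node))))))))

(defun toy-query-all (selector xml)
  "Return all matches of SELECTOR in XML, a list of root elements."
  ;; Wrap XML in a fake root so the first symbol matches top-level nodes.
  (toy-query--step selector (list (cons 'root (cons nil xml)))))

(let ((xml '((feed ()
                   (title () "Example Feed")
                   (entry ()
                          (title () "Atom-Powered Robots Run Amok")
                          (link ((rel . "alternate")
                                 (href . "http://example.org/2003/12/13/atom03"))))))))
  (list (toy-query-all '(feed entry link [rel "alternate"] :href) xml)
        (toy-query-all '(feed title *) xml)))
;; => (("http://example.org/2003/12/13/atom03") ("Example Feed"))
```

Note how the cond runs again for every selector element on every call, even though the selector never changes. That repeated dispatch is the overhead in question.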

However, every selector argument in Elfeed is a static, quoted list. Unlike user-supplied filters, I know exactly what I want to execute ahead of time. It would be much better if the engine didn’t have to waste time reparsing the DSL for each query.

This is the classic split between interpreters and compilers. An interpreter reads input and immediately executes it, doing what the input tells it to do. A compiler reads input and, rather than execute it, produces output, usually in a simpler language, that, when evaluated, has the same effect as executing the input.

Rather than interpret the selector, it would be better to compile it into Elisp code, compile that into byte-code, and then have the Emacs byte-code virtual machine (VM) execute the query each time it’s needed. The extra work of parsing the DSL is performed ahead of time, the dispatch is entirely static, and the selector ultimately executes on a much faster engine (byte-code VM). This should be a lot faster!

So I wrote a function that accepts a selector expression and emits Elisp source that implements that selector: a compiler for my DSL. Having a readily-available syntax tree is one of the big advantages of homoiconicity, and this sort of function makes perfect sense in a lisp. For the external interface, this compiler function is called by a new pair of macros, xml-query* and xml-query-all*. These macros consume a static selector and expand into the compiled Elisp form of the selector.

To demonstrate, remember that link query from before? Here’s the macro version of that selection, but only returning the first match. Notice the selector is no longer quoted. This is because it’s consumed by the macro, not evaluated.

(xml-query* (feed entry link [rel "alternate"] :href) xml)

This will expand into the following code.

(catch 'done
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      (dolist (v (cddr v))
        (when (and (consp v) (eq (car v) 'entry))
          (dolist (v (cddr v))
            (when (and (consp v) (eq (car v) 'link))
              (let ((value (cdr (assq 'rel (cadr v)))))
                (when (equal value "alternate")
                  (let ((v (cdr (assq 'href (cadr v)))))
                    (when v
                      (throw 'done v))))))))))))

As soon as it finds a match, it’s thrown to the top level and returned. Without the DSL, the expansion is essentially what would have to be written by hand. This is exactly the sort of leverage you should be getting from a compiler. It compiles to around 130 byte-code instructions.

The xml-query-all* form is nearly the same, but instead of a throw, it pushes the result into the return list. Only the prologue (the outermost part) and the epilogue (the innermost part) are different.
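For a feel of how such a compiler can be put together, here is a much-reduced sketch. It is not Elfeed's implementation: toy-query*, toy--compile-scan, and toy--compile-rest are hypothetical names, it handles only element names, attribute vectors, and a trailing keyword, and it assumes the selector begins with an element name. The helpers build nested dolist and when forms at macro-expansion time, so no selector parsing is left at run time.

```elisp
(eval-and-compile
  (defun toy--compile-scan (selector list-form)
    "Emit a loop over LIST-FORM matching SELECTOR's leading element name."
    `(dolist (v ,list-form)
       (when (and (consp v) (eq (car v) ',(car selector)))
         ,(toy--compile-rest (cdr selector)))))

  (defun toy--compile-rest (selector)
    "Emit code applying SELECTOR to the element currently bound to `v'."
    (let ((head (car selector)))
      (cond
       ;; Nothing left: the current element is the match.
       ((null selector) '(throw 'done v))
       ;; Keyword: fetch the attribute and return it if present.
       ((keywordp head)
        (let ((attr (intern (substring (symbol-name head) 1))))
          `(let ((v (cdr (assq ',attr (cadr v)))))
             (when v (throw 'done v)))))
       ;; Vector: test an attribute pair on the current element.
       ((vectorp head)
        `(when (equal (cdr (assq ',(aref head 0) (cadr v))) ,(aref head 1))
           ,(toy--compile-rest (cdr selector))))
       ;; Symbol: loop over the current element's children.
       (t (toy--compile-scan selector '(cddr v)))))))

(defmacro toy-query* (selector xml)
  "Expand SELECTOR, an unquoted static list, into a search over XML."
  `(catch 'done
     ,(toy--compile-scan selector xml)))

(defvar toy-sample
  '((feed ((xmlns . "http://www.w3.org/2005/Atom"))
          (title () "Example Feed")
          (entry ()
                 (title () "Atom-Powered Robots Run Amok")
                 (link ((rel . "alternate")
                        (href . "http://example.org/2003/12/13/atom03")))))))

(toy-query* (feed entry link [rel "alternate"] :href) toy-sample)
;; => "http://example.org/2003/12/13/atom03"
```

Evaluating (macroexpand-1 '(toy-query* (feed entry link [rel "alternate"] :href) xml)) shows the generated loops, which is a convenient way to check for extraneous code.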

Parsing feeds is a hot spot for Elfeed, so I wanted the compiler’s output to be as efficient as possible. I had three goals for this:

No extraneous code. It’s easy for the compiler to emit unnecessary code. The byte-code compiler might be able to eliminate some of it, but I don’t want to rely on that. Except for the identifiers, it should basically look like a human wrote it.
Avoid function calls. I don’t want to pay function call overhead, and, with some care, it’s easy to avoid. In the xml-query* expansion, the only function call is throw, which is unavoidable. The xml-query-all* version makes no function calls whatsoever. Notice that I used assq rather than assoc. First, it only needs to match symbols, so it should be faster. Second, assq has its own byte-code instruction (158) and assoc does not.
No unnecessary memory allocations. The xml-query* expansion makes no allocations. The xml-query-all* version only conses once per output, which is the minimum possible.

The end result is at least as optimal as hand-written code, but without the chance of human error (typos, fat fingering) and sourced from an easy-to-read DSL.

Performance

In my tests, the xml-query macros are a full order of magnitude faster than the functions. Yes, ten times faster! It’s an even bigger gain than I expected.

In the full picture, xml-query is only one part of parsing a feed. Measuring the time starting from raw XML text (as delivered by cURL) to a list of database entry objects, I’m seeing an overall 25% speedup with the macros. The remaining time is dominated by xml-parse-region, which is mostly out of my control.

With xml-query so computationally cheap, I don’t need to worry about using it more often. Compared to parsing XML text, it’s virtually free.

When it came time to validate my DSL compiler, I was really happy that Elfeed had a test suite. I essentially rewrote a core component from scratch, and passing all of the unit tests was a strong sign that it was correct. Many times that test suite has provided confidence in changes made both by me and by others.

I’ll end by describing another possible application: Apply this technique to regular expressions, such that static strings containing regular expressions are compiled into Elisp/byte-code via macro expansion. I wonder if situationally this would be faster than Emacs’ own regular expression engine.

Faster Elfeed Search Through JIT Byte-code Compilation

2016-12-11T23:16:42Z

Today I pushed an update for Elfeed that doubles the speed of the search filter in the worst case. This is the user-entered expression that dynamically narrows the entry listing to a subset that meets certain criteria: published after a particular date, with/without particular tags, and matching/non-matching zero or more regular expressions. The filter is live, applied to the database as the expression is edited, so it’s important for usability that this search completes under a threshold that the user might notice.

The typical workaround for these kinds of interfaces is to make filtering/searching asynchronous. It’s possible to do this well, but it’s usually a terrible, broken design. If the user acts upon the asynchronous results — say, by typing the query and hitting enter to choose the current or expected top result — then the final behavior is non-deterministic, a race between the user’s typing speed and the asynchronous search. Elfeed will keep its synchronous live search.

For anyone not familiar with Elfeed, here’s a filter that finds all entries from within the past year tagged “youtube” (+youtube) that mention Linux or Linus (linu[xs]), but aren’t tagged “bsd” (-bsd), limited to the most recent 15 entries (#15):

@1-year-old +youtube linu[xs] -bsd #15

The database is primarily indexed over publication date, so filters on publication dates are the most efficient filters. Entries are visited in order starting with the most recently published, and the search can bail out early once it crosses the filter threshold. Time-oriented filters have been encouraged as the solution to keep the live search feeling lively.

Filtering Overview

The first step in filtering is parsing the filter text entered by the user. This string is broken into its components using the elfeed-search-parse-filter function. Date filter components are converted into a unix epoch interval, tags are interned into symbols, regular expressions are gathered up as strings, and the entry limit is parsed into a plain integer. Absence of a filter component is indicated by nil.

(elfeed-search-parse-filter "@1-year-old +youtube linu[xs] -bsd #15")
;; => (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)

Previously, the next step was to apply the elfeed-search-filter function with this structured filter representation to the database. Except for special early-bailout situations, it works left-to-right across the filter, checking each condition against each entry. This is analogous to an interpreter, with the filter being a program.

Thinking about it that way, what if the filter was instead compiled into an Emacs byte-code function and executed directly by the Emacs virtual machine? That’s what this latest update does.
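As a minimal sketch of the idea, restricted to tag tests only, the function below builds a lambda form from the parsed filter and hands it to byte-compile. The name toy-compile-tag-filter is hypothetical, and this is not Elfeed's actual filter compiler, which also has to handle dates, regular expressions, and the entry limit.

```elisp
(defun toy-compile-tag-filter (must-have must-not-have)
  "Return a byte-compiled predicate over an entry's list of tags.
MUST-HAVE and MUST-NOT-HAVE are lists of tag symbols."
  (byte-compile
   `(lambda (tags)
      (and ,@(mapcar (lambda (tag) `(memq ',tag tags)) must-have)
           ,@(mapcar (lambda (tag) `(not (memq ',tag tags))) must-not-have)
           t))))

(let ((f (toy-compile-tag-filter '(youtube) '(bsd))))
  (list (funcall f '(youtube linux))
        (funcall f '(youtube bsd))
        (funcall f '(linux))))
;; => (t nil nil)
```

The tags are baked into the generated code as constants, so the compiled function has no filter structure left to walk at run time.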

Benchmarks

With six different filter components, the actual filtering routine is a bit too complicated for an article, so I’ll set up a simpler, but roughly equivalent, scenario. With a reasonable cut-off date, the filter was already sufficiently fast, so for benchmarking I’ll focus on the worst case: no early bailout opportunities. An entry will be just a list of tags (symbols), and the filter will have to test every entry.

My real-world Elfeed database currently has 46,772 entries with 36 distinct tags. For my benchmark I’ll round this up to a nice 100,000 entries, and use 26 distinct tags (A–Z), which has the nice alphabet property and more closely reflects the number of tags I still care about.

First, here’s make-random-entry to generate a random list of 1–5 tags (i.e. an entry). The state parameter is the random state, allowing for deterministic benchmarks on a randomly-generated database.

(cl-defun make-random-entry (&key state (min 1) (max 5))
  (cl-loop repeat (+ min (cl-random (1+ (- max min)) state))
           for letter = (+ ?A (cl-random 26 state))
           collect (intern (format "%c" letter))))

The database is just a big list of entries. In Elfeed this is actually an AVL tree. Without dates, the order doesn’t matter.

(cl-defun make-random-database (&key state (count 100000))
  (cl-loop repeat count collect (make-random-entry :state state)))

Here’s my old time macro. An important change I’ve made since years ago is to call garbage-collect before starting the clock, eliminating bad samples from unlucky garbage collection events. Depending on what you want to measure, it may even be worth disabling garbage collection during the measurement by setting gc-cons-threshold to a high value.

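(defmacro measure-time (&rest body)
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    ;; The garbage-collect call lives inside the expansion so that it
    ;; runs on every measurement, not once at macro-expansion time.
    `(progn
       (garbage-collect)
       (let ((,start (float-time)))
         ,@body
         (- (float-time) ,start)))))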
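For the second option, disabling garbage collection during the measurement, a sketch might look like the following. The name measure-time-no-gc is made up, and most-positive-fixnum is just one way to pick "a high value".

```elisp
(defmacro measure-time-no-gc (&rest body)
  "Return the seconds BODY takes with garbage collection effectively off.
`gc-cons-threshold' is raised only for the duration of BODY."
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(progn
       (garbage-collect)              ; start from a clean heap
       (let ((gc-cons-threshold most-positive-fixnum)
             (,start (float-time)))
         ,@body
         (- (float-time) ,start)))))

(measure-time-no-gc
  (dotimes (_ 1000) (make-list 100 nil)))
```

Since gc-cons-threshold is a special variable, the let restores the previous value once the measurement finishes.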

Finally, the benchmark harness. It uses a hard-coded seed to generate the same pseudo-random database. The test is run against a filter function, f, 100 times in search of the same 6 tags, and the timing results are averaged.

(cl-defun benchmark (f &optional (n 100) (tags '(A B C D E F)))
  (let* ((state (copy-sequence [cl-random-state-tag -1 30 267466518]))
         (db (make-random-database :state state)))
    (cl-loop repeat n
             sum (measure-time
                   (funcall f db tags))
             into total
             finally return (/ total (float n)))))

The baseline will be memq (test for membership using identity, eq). There are two lists of tags to compare: the list that is the entry, and the list from the filter. This requires a nested loop for each entry, one explicit (cl-loop) and one implicit (memq), both with early bailout.

(defun memq-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (memq tag entry)
                    return t)))

Byte-code compiling everything and running the benchmark on my laptop I get:

(benchmark #'memq-count)
;; => 0.041 seconds

That’s actually not too bad. One of the advantages of this definition is that there are no function calls. The memq built-in function has its own opcode (62), and the rest of the definition is special forms and macros expanding to special forms (cl-loop). It’s exactly the thing I need to exploit to make filters faster.

As a sanity check, what would happen if I used member instead of memq? In theory it should be slower because it uses equal for tests instead of eq.

(defun member-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (member tag entry)
                    return t)))

It’s only slightly slower because member, like many other built-ins, also has an opcode (157). It’s just a tiny bit more overhead.

(benchmark #'member-count)
;; => 0.047 seconds

To test function call overhead while still using the built-in (i.e. written in C) memq, I’ll alias it so that the byte-code compiler is forced to emit a function call.

(defalias 'memq-alias 'memq)

(defun memq-alias-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (memq-alias tag entry)
                    return t)))

To verify that this is doing what I expect, I M-x disassemble the function and inspect the byte-code disassembly. Here’s a simple example.

(disassemble
 (byte-compile (lambda (list) (memq :foo list))))

When compiled under lexical scope (lexical-binding is true), here’s the disassembly. To understand what this means, see Emacs Byte-code Internals.

     constant  :foo
     stack-ref 1
     memq
     return

Notice the memq instruction. Try using memq-alias instead:

(disassemble
 (byte-compile (lambda (list) (memq-alias :foo list))))

Resulting in a function call:

     constant  memq-alias
     constant  :foo
     stack-ref 2
     call      2
     return

And the benchmark:

(benchmark #'memq-alias-count)
;; => 0.052 seconds

So the function call adds about 27% overhead. This means it would be a good idea to avoid calling functions in the filter if I can help it. I should rely on these special opcodes.

Suppose memq was written in Emacs Lisp rather than C. How much would that hurt performance? My version of my-memq below isn’t quite the same since it returns t rather than the sublist, but it’s good enough for this purpose. (I’m using cl-loop because writing early bailout in plain Elisp without recursion is, in my opinion, ugly.)

(defun my-memq (needle haystack)
  (cl-loop for element in haystack
           when (eq needle element)
           return t))

(defun my-memq-count (db tags)
  (cl-loop for entry in db count
           (cl-loop for tag in tags
                    when (my-memq tag entry)
                    return t)))

And the benchmark:

(benchmark #'my-memq-count)
;; => 0.137 seconds

Oof! It’s more than 3 times slower than the opcode. This means I should use built-ins as much as possible in the filter.

Dynamic vs. lexical scope

There’s one last thing to watch out for. Everything so far has been compiled with lexical scope. You should really turn this on by default for all new code that you write. It has three important advantages:

It allows the compiler to catch more mistakes.
It eliminates a class of bugs related to dynamic scope: Local variables are exposed to manipulation by callees.
Lexical scope has better performance.
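The second point is easy to demonstrate. This sketch uses the LEXICAL argument of eval to run the same form under each scoping rule. The names demo-x, demo-callee, and demo-run are invented for the example.

```elisp
(defun demo-callee ()
  "Assign to whatever binding of `demo-x' is visible at run time."
  (setq demo-x 'clobbered))

(defun demo-run (lexical)
  "Evaluate a small `let' form with lexical scope if LEXICAL is non-nil."
  (eval '(let ((demo-x 'original))
           (demo-callee)
           demo-x)
        lexical))

(demo-run nil) ;; dynamic scope => clobbered
(demo-run t)   ;; lexical scope => original
```

Under dynamic scope the callee reaches into the caller's local variable. Under lexical scope the local is private, and the callee's setq lands on the global value instead.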

Here are all the benchmarks with the default dynamic scope:

(benchmark #'memq-count)
;; => 0.065 seconds

(benchmark #'member-count)
;; => 0.070 seconds

(benchmark #'memq-alias-count)
;; => 0.074 seconds

(benchmark #'my-memq-count)
;; => 0.256 seconds

It makes each of these four variants roughly 1.4 to 1.9 times slower, and for no benefit. Under dynamic scope, local variables use the varref opcode (a global variable lookup) instead of the stack-ref opcode (a simple array index).

(defun norm (a b)
  (* (- a b) (- a b)))

Under dynamic scope, this compiles to:

     varref    a
     varref    b
     diff
     varref    a
     varref    b
     diff
     mult
     return

And under lexical scope (notice the variable names disappear):

     stack-ref 1
     stack-ref 1
     diff
     stack-ref 2
     stack-ref 2
     diff
     mult
     return

JIT-compiled filters

So far I’ve been moving in the wrong direction, making things slower rather than faster. How can I make it faster than the straight memq version? By compiling the filter into byte-code.

I won’t write the byte-code directly, but instead generate Elisp code and use the byte-code compiler on it. This is safer, will work correctly in future versions of Emacs, and leverages the optimizations performed by the byte-compiler. This sort of thing recently got a bad rap on Emacs Horrors, but I was happy to see that this technique is already established.

(defun jit-count (db tags)
  (let* ((memq-list (cl-loop for tag in tags
                             collect `(memq ',tag entry)))
         (function `(lambda (db)
                      (cl-loop for entry in db
                               count (or ,@memq-list))))
         (compiled (byte-compile function)))
    (funcall compiled db)))

It dynamically builds the code as an s-expression, runs that through the byte-code compiler, executes it, and throws it away. It’s “just-in-time,” though compiling to byte-code and not native code. For the benchmark tags of (A B C D E F), this builds the following:

(lambda (db)
  (cl-loop for entry in db
           count (or (memq 'A entry)
                     (memq 'B entry)
                     (memq 'C entry)
                     (memq 'D entry)
                     (memq 'E entry)
                     (memq 'F entry))))

Due to its short-circuiting behavior, or is a special form, so this function is just special forms and memq in its opcode form. It’s as fast as Elisp can get.

Having s-expressions is a real strength for lisp, since the alternative (in, say, JavaScript) would be to assemble the function by concatenating code strings. By contrast, this looks a lot like a regular lisp macro. Invoking the byte-code compiler does add some overhead compared to the interpreted filter, but it’s insignificant.

How much faster is this?

(benchmark #'jit-count)
;; => 0.017 seconds

It’s more than twice as fast! The big gain here is through loop unrolling. The explicit loop over the filter tags has been unrolled into the or expression. That section of byte-code looks like this:

     constant  A
     stack-ref 1
     memq
     goto-if-not-nil-else-pop 1
     constant  B
     stack-ref 1
     memq
     goto-if-not-nil-else-pop 1
     constant  C
     stack-ref 1
     memq
     goto-if-not-nil-else-pop 1
     constant  D
     stack-ref 1
     memq
     goto-if-not-nil-else-pop 1
     constant  E
     stack-ref 1
     memq
     goto-if-not-nil-else-pop 1
     constant  F
     stack-ref 1
     memq
1    return

In Elfeed, not only does it unroll these loops, it completely eliminates the overhead for unused filter components. Comparing to this benchmark, I’m seeing roughly matching gains in Elfeed’s worst case. In Elfeed, I also bind lexical-binding around the byte-compile call to force lexical scope, since otherwise it just uses the buffer-local value (usually nil).
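Here is a small sketch of both ideas, sticking with the benchmark's model of an entry as a list of tag symbols. The function sketch-compile-filter and its two-argument filter are assumptions for illustration, far simpler than Elfeed's real filter compiler.

```elisp
(require 'cl-lib)

(defun sketch-compile-filter (must-have must-not-have)
  "Return a byte-compiled predicate over an entry (a list of tag symbols).
Only the filter components that are present produce any code."
  (let* ((clauses
          (append
           (cl-loop for tag in must-have
                    collect `(memq ',tag entry))
           (cl-loop for tag in must-not-have
                    collect `(not (memq ',tag entry)))))
         (form `(lambda (entry) (and ,@clauses)))
         ;; Force lexical scope regardless of the current buffer's value.
         (lexical-binding t))
    (byte-compile form)))

(let ((pred (sketch-compile-filter '(youtube) '(bsd))))
  (list (funcall pred '(youtube linux))
        (funcall pred '(youtube bsd))))
;; => (t nil)
```

When must-not-have is nil, no code is emitted for it at all, so the compiled predicate never pays for a component the user didn't type.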

Filter compilation can be toggled on and off by setting elfeed-search-compile-filter. If you’re up to date, try out live filters with it both enabled and disabled. See if you can notice the difference.

Result summary

Here are the results in a table, all run with Emacs 24.4 on x86-64.

(ms)      memq      member    memq-alias my-memq   jit
lexical   41        47        52         137       17
dynamic   65        70        74         256       21

And the same benchmarks on Aarch64 (Emacs 24.5, ARM Cortex-A53), where I also occasionally use Elfeed, and where I have been very interested in improving performance.

(ms)      memq      member    memq-alias my-memq   jit
lexical   170       235       242        614       79
dynamic   274       340       345        1130      92

And here’s how you can run the benchmarks for yourself, perhaps with different parameters:

jit-bench.el

The header explains how to run the benchmark in batch mode:

$ emacs -Q -batch -f batch-byte-compile jit-bench.el
$ emacs -Q -batch -l jit-bench.elc -f benchmark-batch

An Elfeed Database Analysis

2016-08-12T03:20:16Z

The end of the month marks Elfeed’s third birthday. Surprising to nobody, it’s also been three years of heavy, daily use by me. While I’ve used Elfeed concurrently on a number of different machines over this period, I’ve managed to keep an Elfeed database index with a lineage going all the way back to the initial development stages, before the announcement. It’s a large, organically-grown database that serves as a daily performance stress test. Hopefully this means I’m one of the first people to have trouble if an invisible threshold is ever exceeded.

I’m also the sort of person who gets excited when I come across an interesting dataset, and I have this gem sitting right in front of me. So a couple of days ago I pushed a new Elfeed function, elfeed-csv-export, which exports a database index into three CSV files. These are intended to serve as three tables in a SQL database, exposing the database to interesting relational queries and joins. Entry content (HTML, etc.) has always been considered volatile, so this is not exported. The export function isn’t interactive (yet?), so if you want to generate your own you’ll need to (require 'elfeed-csv) and evaluate it yourself.

All the source code for performing the analysis below on your own database can be found here:

https://github.com/skeeto/elfeed-analysis

The three exported tables are feeds, entries, and tags. Here are the corresponding columns (optional CSV header) for each:

url, title, canonical-url, author
id, feed, title, link, date
entry, feed, tag

And here’s the SQLite schema I’m using for these tables:

CREATE TABLE feeds (
    url TEXT PRIMARY KEY,
    title TEXT,
    canonical_url TEXT,
    author TEXT
);

CREATE TABLE entries (
    id TEXT NOT NULL,
    feed TEXT NOT NULL REFERENCES feeds (url),
    title TEXT,
    link TEXT NOT NULL,
    date REAL NOT NULL,
    PRIMARY KEY (id, feed)
);

CREATE TABLE tags (
    entry TEXT NOT NULL,
    feed TEXT NOT NULL,
    tag TEXT NOT NULL,
    FOREIGN KEY (entry, feed) REFERENCES entries (id, feed)
);

Web authors are notoriously awful at picking actually-unique entry IDs, even when using the smarter option, Atom. I still simply don’t trust that entry IDs are unique, so, as usual, I’ve qualified them by their source feed URL, hence the primary key on both columns in entries.

At this point I wish I had collected a lot more information. If I were to start fresh today, Elfeed’s database schema would not only fully match Atom’s schema, but also exceed it with additional logging:

When was each entry actually fetched?
How did each entry change since the last fetch?
When and for what reason did a feed fetch fail?
When did an entry stop appearing in a feed?
How long did fetching take?
How long did parsing take?
Which computer (hostname) performed the fetch?
What interesting HTTP headers were included?
Even if not kept for archival, how large was the content?

I may start tracking some of these. If I don’t, I’ll be kicking myself three years from now when I look at this again.

A look at my index

So just how big is my index? It’s 25MB uncompressed, 2.5MB compressed. I currently follow 117 feeds, but my index includes 43,821 entries from 309 feeds. These entries are marked with 53,360 tags from a set of 35 unique tags. Some of these datapoints are the result of temporarily debugging Elfeed issues and don’t represent content that I actually follow. I’m more careful these days to test in a temporary database so as to avoid contamination. Some are duplicates due to feeds changing URLs over the years. Some are artifacts from old bugs. This all represents a bit of noise, but should be negligible. During my analysis I noticed some of these anomalies and took a moment to clean up obviously bogus data (weird dates, etc.), all by adjusting tags.

The first thing I wanted to know is the weekday frequency. A number of times I’ve blown entire Sundays working on Elfeed, and, as if to frustrate my testing, it’s not unusual for several hours to pass between new entries on Sundays. Is this just my perception or are Sundays really that slow?

Here’s my query. I’m using SQLite’s strftime to shift the result into my local time zone, Eastern Time. This time zone is the source, or close to the source, of a large amount of the content. This also automatically accounts for daylight saving time, which can’t be done with a simple divide and subtract.

SELECT tag,
       cast(strftime('%w', date, 'unixepoch', 'localtime') AS INT) AS day,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, day;

The most frequent tag (13,666 appearances) is “youtube”, which marks every YouTube video, and I’ll use gnuplot to visualize it. The input “file” is actually a command since gnuplot is poor at filtering data itself, especially for histograms.

plot '< grep ^youtube, weekdays.csv' using 2:3 with boxes

Wow, things do quiet down dramatically on weekends! From the glass-half-full perspective, this gives me a chance to catch up when I inevitably fall behind on these videos during the week.

The same is basically true for other types of content, including “comic” (12,465 entries) and “blog” (7,505 entries).

However, “emacs” (2,404 entries) is a different story. It doesn’t slow down on the weekend, but Emacs users sure love to talk about Emacs on Mondays. In my own index, this spike largely comes from Planet Emacsen. Initially I thought maybe this was an artifact of Planet Emacsen’s date handling — i.e. perhaps it does a big fetch on Mondays and groups up the dates — but I double checked: they pass the date directly through from the original articles.

Conclusion: Emacs users love Mondays. Or maybe they hate Mondays and talk about Emacs as an escape.

I can reuse the same query to look at different time scales. When during the day do entries appear? Adjusting the time zone here becomes a lot more important.

SELECT tag,
       cast(strftime('%H', date, 'unixepoch', 'localtime') AS INT) AS hour,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, hour;

Emacs bloggers tend to follow a nice Eastern Time sleeping schedule. (I wonder how Vim bloggers compare, since, as an Emacs user, I naturally assume Vim users’ schedules are as undisciplined as their bathing habits.) However, this also might be the prolific Irreal breaking the curve.

The YouTube channels I follow are a bit more erratic, but there’s still a big drop in the early morning and a spike in the early afternoon. It’s unclear if the timestamp published in the feed is the upload time or the publication time. This would make a difference in the result (e.g. overnight video uploads).

Do you suppose there’s a slow month?

SELECT tag,
       cast(strftime('%m', date, 'unixepoch', 'localtime') AS INT) AS month,
       count(id) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, month;

December is a big drop across all tags, probably for the holidays. Both “comic” and “blog” also have an interesting drop in August. For brevity, I’ll only show one. This might be partially due to my not waiting until the end of this month for this analysis, since there are only 2.5 Augusts in my 3-year dataset.

Unfortunately the timestamp is the only direct numerical quantity in the data. So far I’ve been binning data points and counting to get a second numerical quantity. Everything else is text, so I’ll need to get more creative to find other interesting relationships.

So let’s have a look at the lengths of entry titles.

SELECT tag,
       length(title) AS length,
       count(*) AS count
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed
GROUP BY tag, length
ORDER BY length;

The shortest are the webcomics. I’ve complained about poor webcomic titles before, so this isn’t surprising. The spikes are from comics that follow a strict (uncreative) title format.

Emacs article titles follow a nice distribution. You can tell these are programmers because so many titles are exactly 32 characters long. Picking this number is such a natural instinct that we aren’t even aware of it. Or maybe all their database schemas have VARCHAR(32) title columns?

Blogs in general follow a nice distribution. The big spike is from the Dwarf Fortress development blog, which follows a strict date format.

The longest on average are YouTube videos. This is largely due to the kinds of videos I watch (“Let’s Play” videos), which tend to have long, predictable names.

And finally, here’s the most interesting-looking graph of them all.

SELECT ((date - 4*60*60) % (24*60*60)) / (60*60) AS day_time,
       length(title) AS length
FROM entries
JOIN tags ON tags.entry = entries.id AND tags.feed = entries.feed;

This is the title length versus time of day (not binned). Each point is one of the 53,360 rows produced by the join, so an entry that carries several tags is plotted once per tag.
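The day_time arithmetic in that query is a fixed shift followed by a modulo. As a worked example, here is the same calculation in Emacs Lisp, with the made-up name sketch-hour-of-day and the 4-hour offset taken from the query.

```elisp
(defun sketch-hour-of-day (epoch &optional offset-hours)
  "Return the fractional hour of day for EPOCH, a unix timestamp.
OFFSET-HOURS is a fixed number of hours behind UTC, defaulting to 4."
  (let ((shifted (- epoch (* (or offset-hours 4) 60 60))))
    ;; Seconds since local midnight, converted to fractional hours.
    (/ (mod shifted (* 24 60 60)) 3600.0)))

(sketch-hour-of-day 19800) ;; 05:30 UTC => 1.5
(sketch-hour-of-day 0)     ;; 00:00 UTC => 20.0
```

A fixed 4-hour shift corresponds to Eastern Daylight Time, so unlike the 'localtime' modifier used in the earlier queries it does not follow daylight saving changes.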

set style fill transparent solid 0.25 noborder
set style circle radius 0.04
plot 'length-vs-daytime.csv' using 1:2 with circles

(This is a good one to follow through to the full size image.)

Again, all Eastern Time since I’m self-centered like that. Vertical lines are authors rounding their post dates to the hour. Horizontal lines are the length spikes from above, such as the line of entries at title length 10 in the evening (Dwarf Fortress blog). There’s a mid-day cloud of entries of various title lengths, with the shortest title cloud around mid-morning. That’s probably when many of the webcomics come up.

Additional analysis could look further at textual content, beyond simply length, in some quantitative way (n-grams? soundex?). But mostly I really need to keep track of more data!

Elfeed, cURL, and You

2016-06-16T18:22:16Z

This morning I pushed out an important update to Elfeed, my web feed reader for Emacs. The update should be available in MELPA by the time you read this. Elfeed now has support for fetching feeds using cURL through a curl inferior process. You’ll need the program in your PATH or configured through elfeed-curl-program-name.

I’ve been using it for a couple of days now, but, while I work out the remaining kinks, it’s disabled by default. So in addition to having cURL installed, you’ll need to set elfeed-use-curl to non-nil. Sometime soon it will be enabled by default whenever cURL is available. The original url-retrieve fetcher will remain in place for the time being. However, cURL may become a requirement someday.

Fetching with a curl inferior process has some huge advantages.

It’s much faster

The most obvious change is that you should experience a huge speedup on updates and better responsiveness during updates after the first cURL run. There are two important reasons:

Asynchronous DNS and TCP: Emacs 24 and earlier performs DNS queries synchronously even for asynchronous network processes. This is being fixed on some platforms (including Linux) in Emacs 25, but now we don’t have to wait.

On Windows it’s even worse: the TCP connection is also established synchronously. This is especially bad when fetching relatively small items such as feeds, because the DNS look-up and TCP handshake dominate the overall fetch time. It essentially makes the whole process synchronous.

Conditional GET: HTTP has two mechanisms to avoid transmitting information that a client has previously fetched. One is the Last-Modified header delivered by the server with the content. When querying again later, the client echoes the date back like a token in the If-Modified-Since header.

The second is the “entity tag,” an arbitrary server-selected token associated with each version of the content. The server delivers it along with the content in the ETag header, and the client hands it back later in the If-None-Match header, sort of like a cookie.

This is highly valuable for feeds because, unless the feed is particularly active, most of the time the feed hasn’t been updated since the last query. This avoids sending anything other than a handful of headers each way. In Elfeed’s case, it means it doesn’t have to parse the same XML over and over again.

Both of these being outside of cURL’s scope, Elfeed has to manage conditional GET itself. I had no control over the HTTP headers until now, so I couldn’t take advantage of it. Emacs’ url-retrieve function allows for sending custom headers through dynamically binding url-request-extra-headers, but this isn’t available when calling url-queue-retrieve since the request itself is created asynchronously.

Both the ETag and Last-Modified values are stored in the database and persist across sessions. This is the reason the full speedup isn’t realized until the second fetch. The initial cURL fetch doesn’t have these values.
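To show the shape of this, here is a sketch that builds a curl argument list from the two stored values. The function name and the exact set of flags are assumptions for illustration, not what Elfeed actually passes to curl.

```elisp
(defun sketch-conditional-get-args (url etag last-modified)
  "Build a curl argument list for URL with conditional GET headers.
ETAG and LAST-MODIFIED are the values saved from the previous
response, or nil on the first fetch."
  (append
   ;; --include puts the response headers in the output so the next
   ;; ETag and Last-Modified values can be read back and stored.
   (list "--silent" "--include")
   (when etag
     (list "-H" (format "If-None-Match: %s" etag)))
   (when last-modified
     (list "-H" (format "If-Modified-Since: %s" last-modified)))
   (list url)))

(sketch-conditional-get-args "https://example.com/feed.xml"
                             "\"abc123\"" nil)
;; => ("--silent" "--include" "-H" "If-None-Match: \"abc123\""
;;     "https://example.com/feed.xml")
```

On the first fetch both values are nil and no conditional headers are sent. On later fetches a 304 Not Modified response means there is nothing to parse.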

Fewer bugs

As mentioned previously, Emacs has a built-in URL retrieval library called url. The central function is url-retrieve which asynchronously fetches the content at an arbitrary URL (usually HTTP) and delivers the buffer and status to a callback when it’s ready. There’s also a queue front-end for it, url-queue-retrieve which limits the number of parallel connections. Elfeed hands this function a pile of feed URLs all at once and it fetches them N at a time.

Unfortunately both these functions are incredibly buggy. It’s been a thorn in my side for years.

Here’s what the interface looks like for both:

(url-retrieve URL CALLBACK &optional CBARGS SILENT INHIBIT-COOKIES)

It takes a URL and a callback. Seeing this, the sane, unsurprising expectation is the callback will be invoked exactly once for each time url-retrieve was called. In any case where the request fails, it should report it through the callback. This is not the case. The callback may be invoked any number of times, including zero.

In this example, suppose you have a webserver that will return an HTTP 404 for a requested URL. Below, I fire off 10 asynchronous requests in a row.

(defvar results ())
(dotimes (i 10)
  (url-retrieve "http://127.0.0.1:8080/404"
                (lambda (status) (push (cons i status) results))))

What would you guess is the length of results? It’s initially 0 before any requests complete and over time (a very short time) I would expect this to top out at 10. On Emacs 24, here’s the real answer:

(length results)
;; => 46

The same error is reported multiple times to the callback. At least the pattern is obvious.

(cl-count 0 results :key #'car)
;; => 9
(cl-count 1 results :key #'car)
;; => 8
(cl-count 2 results :key #'car)
;; => 7

(cl-count 9 results :key #'car)
;; => 1
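One defensive measure against the duplicate invocations is to wrap the callback in a guard that only lets the first call through. This is a sketch of that idea, not something Elfeed is known to do, and the name sketch-once is invented. It relies on lexical scope for the closure.

```elisp
;;; -*- lexical-binding: t; -*-

(defun sketch-once (callback)
  "Return a wrapper around CALLBACK that ignores all calls after the first."
  (let ((called nil))
    (lambda (&rest args)
      (unless called
        (setq called t)
        (apply callback args)))))

;; Hypothetical usage with the example above:
;; (url-retrieve "http://127.0.0.1:8080/404"
;;               (sketch-once (lambda (status) (push status results))))
```

This caps the count at one per request, but it can do nothing about the opposite failure where the callback is never invoked at all.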

Here’s another one, this time to the non-existent foo.example. The DNS query should never resolve.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://foo.example/"
                (lambda (status) (push (cons i status) results))))

What’s the length of results? This time it’s zero. Remember how DNS is synchronous? Because of this, DNS failures are reported synchronously as a signaled error. This gets a lot worse with url-queue-retrieve. Since the request is put off until later, DNS doesn’t fail until later, and you get neither a callback nor an error signal. This also puts the queue in a bad state and necessitated elfeed-unjam to manually clear it. This one should get fixed in Emacs 25 when DNS is asynchronous.

This last one assumes you don’t have anything listening on port 57432 (pulled out of nowhere) so that the connection fails.

(setf results ())
(dotimes (i 10)
  (url-retrieve "http://127.0.0.1:57432/"
                (lambda (status) (push (cons i status) results))))

On Linux, we finally get the sane result of 10. However, on Windows, it’s zero. The synchronous TCP connection will fail, signaling an error just like DNS failures. Not only is it broken, it’s broken in different ways on different platforms.
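For the synchronous failures specifically, a caller can at least catch the signaled error and route it to the callback. The wrapper below is a sketch with an invented name. It mimics the (:error ...) shape of the status plist, which is my assumption about what a callback would want to see.

```elisp
(defun sketch-retrieve (url callback)
  "Call `url-retrieve' on URL, reporting synchronous failures to CALLBACK.
If starting the request signals an error (e.g. a failed DNS lookup
or refused connection), CALLBACK receives (:error ERR) instead."
  (condition-case err
      (url-retrieve url callback)
    (error
     (funcall callback (list :error err))
     nil)))
```

This only covers errors signaled while the request is being started. It would not help with url-queue-retrieve, where the failure happens later and outside the caller's condition-case.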

There are many more cases of callback weirdness which depend on the connection and HTTP session being in various states when things go awry. These were just the easiest to demonstrate. By using cURL, I get to bypass this mess.

No more GnuTLS issues

At compile time, Emacs can optionally be linked against GnuTLS, giving it robust TLS support so long as the shared library is available. url-retrieve uses this for fetching HTTPS content. Unfortunately, this library is noisy and will occasionally echo non-informational messages in the minibuffer and in *Messages* that cannot be suppressed.

When not linked against GnuTLS, Emacs will instead run the GnuTLS command line program as an inferior process, just like Elfeed now does with cURL. Unfortunately this interface is very slow and frequently fails, basically preventing Elfeed from fetching HTTPS feeds. I suspect it’s in part due to an improper coding-system-for-read.

cURL handles all the TLS negotiation itself, so both these problems disappear. The compile-time configuration doesn’t matter.

Windows is now supported

Emacs’ Windows networking code is so unstable, even in Emacs 25, that I couldn’t make any practical use of Elfeed on that platform. Even the Cygwin emacs-w32 version couldn’t cut it. It hard crashes Emacs every time I’ve tried to fetch feeds. Fortunately the inferior process code is a whole lot more stable, meaning fetching with cURL works great. As of today, you can now use Elfeed on Windows. The biggest obstacle is getting cURL installed and configured.

Interface changes

With cURL, obviously the values of url-queue-timeout and url-queue-parallel-processes no longer have any meaning to Elfeed. If you set these for yourself, you should instead call the functions elfeed-set-timeout and elfeed-set-max-connections, which will do the appropriate thing depending on the value of elfeed-use-curl. Each also comes with a getter so you can query the current value.
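Put together, an init file fragment might look something like this. The specific values and the curl path are arbitrary examples, and I'm assuming the timeout is given in seconds and the other argument is a connection count.

```elisp
;; Opt in to the cURL fetcher while it is still off by default.
(setq elfeed-use-curl t)

;; Only needed when curl is not on PATH; this path is a made-up example.
(setq elfeed-curl-program-name "/usr/local/bin/curl")

;; Use the setters instead of touching url-queue-timeout and
;; url-queue-parallel-processes directly.  Values are illustrative.
(elfeed-set-timeout 30)
(elfeed-set-max-connections 8)
```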

The deprecated elfeed-max-connections has been removed.

Feed objects now have meta tags :etag, :last-modified, and :canonical-url. The latter can identify feeds that have been moved, though it needs a real UI.

See any bugs?

If you use Elfeed, grab the current update and give the cURL fetcher a shot. Please open a ticket if you find problems. Be sure to report your Emacs version, operating system, and cURL version.

As of this writing there’s just one thing missing compared to url-queue: connection reuse. cURL supports it, so I just need to code it up.

9 Elfeed Features You Might Not Know

2015-12-03T22:33:17Z

It’s been two years since I last wrote about Elfeed, my Atom/RSS feed reader for Emacs. I’ve used it every single day since, and I continue to maintain it with help from the community. So far 18 people besides me have contributed commits. Over the last couple of years it’s accumulated some new features, some more obvious than others.

Every time I mark a new release, I update the ChangeLog at the top of elfeed.el which lists what’s new. Since it’s easy to overlook many of the newer useful features, I thought I’d list the more important ones here.

Custom Entry Colors

You can now customize entry faces through elfeed-search-face-alist. This variable maps tags to faces. An entry inherits the face of any tag it carries. Previously “unread” was a special tag that got a bold face, but this is now implemented as nothing more than an initial entry in the alist.

I’ve been using it to mark different kinds of content (videos, podcasts, comics) with different colors.
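A minimal configuration sketch, assuming each element of the alist has the shape (TAG FACE). The face name and color here are invented.

```elisp
(defface my-elfeed-video-face
  '((t :foreground "#f9f"))
  "Face for Elfeed entries tagged `youtube' (hypothetical example).")

;; Wait for the variable to be defined before adding to it.
(with-eval-after-load 'elfeed-search
  (push '(youtube my-elfeed-video-face) elfeed-search-face-alist))
```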

Autotagging

You can specify the starting tags for entries from particular feeds directly in the feed listing. This has been a feature for a while now, but it’s not something you’d want to miss. It started out as a feature in my personal configuration that eventually migrated into Elfeed proper.

For example, your elfeed-feeds may initially look like this, especially if you imported from OPML.

("https://nullprogram.com/feed/"
 "http://nedroid.com/feed/"
 "https://www.youtube.com/feeds/videos.xml?user=quill18")

If you wanted certain tags applied to entries from each, you would need to putz around with elfeed-make-tagger. For the most common case — apply certain tags to all entries from a URL — it’s much simpler to specify the information as part of the listing itself,

(("https://nullprogram.com/feed/" blog emacs)
 ("http://nedroid.com/feed/" webcomic)
 ("https://www.youtube.com/feeds/videos.xml?user=quill18" youtube))

Today I only use custom tagger functions in my own configuration to filter within a couple of particularly noisy feeds.

Arbitrary Metadata

Metadata is more for Elfeed extensions (e.g. elfeed-org) than regular users. You can attach arbitrary, readable metadata to any Elfeed object (entry, feed). This metadata is automatically stored in the database. It’s a plist.

Metadata is accessed entirely through one setf-able function: elfeed-meta. For example, you might want to track when you’ve read something, not just that you’ve read it. You could use this to selectively update certain feeds or just to evaluate your own habits.

(defun my-elfeed-mark-read (entry)
  (elfeed-untag entry 'unread)
  (let ((date (format-time-string "%FT%T%z")))
    (setf (elfeed-meta entry :read-date) date)))

Two things motivated this feature. First, without a plist, if I added more properties in the future, I would need to change the database format to support them. I modified the database format to add metadata, requiring an upgrade function to quietly upgrade older databases as they were loaded. I’d really like to avoid this in the future.

Second, I wanted to make it easy for extension authors to store their own data. I still imagine an extension someday to update feeds intelligently based on their history. For example, the database doesn’t track when the feed was last fetched, just the date of the most recent entry (if any). A smart-update extension could use metadata to tag feeds with this information.

Elfeed itself already uses two metadata keys: :failures on feeds and :title on both. :failures counts the total number of times fetching that feed resulted in an error. You could use this to get a listing of troublesome feeds like so,

(cl-loop for url in (elfeed-feed-list)
         for feed = (elfeed-db-get-feed url)
         for failures = (elfeed-meta feed :failures)
         when failures
         collect (cons url failures))

The :title property allows for a custom title for both feeds and entries in the search buffer listing, assuming you’re using the default function (see below). It overrides the title provided by the feed itself. This is different from elfeed-entry-title and elfeed-feed-title, which are kept in sync with feed content. Metadata is not kept in sync with the feed itself.
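
As a minimal sketch of that override, assuming the URL below is one of the feeds listed in your elfeed-feeds and that the replacement title is just a placeholder, a custom feed title could be attached like this:

```elisp
(require 'elfeed)

;; Sketch: attach a custom display title to a feed through the :title
;; metadata key. The title string here is a placeholder.
(let ((feed (elfeed-db-get-feed "https://nullprogram.com/feed/")))
  (setf (elfeed-meta feed :title) "My Custom Title"))
```

Because this lives in metadata rather than in the feed struct's own title slot, a later feed update should leave it alone.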

Filter Inversion

You can invert filter components by prefixing them with !. For example, say you’re looking at all my posts from the past 6 months:

@6-months nullprogram.com

But say you’re tired of me and decide you want to see every entry from the past 6 months excluding my posts.

@6-months !nullprogram.com

Filter Limiter

Normally you limit the number of results by date, but you can now limit the result by count using #n. For example, to see my most recent 12 posts regardless of date,

nullprogram.com #12

This is used internally in the live filter to limit the number of results to the height of the screen. If you noticed that live filtering has been much more responsive in the last few months, this is probably why.

Bookmark Support

Elfeed properly integrates with Emacs’ bookmarks (thanks to groks). You can bookmark the current filter with M-x bookmark-set (C-x r m). By default, Emacs will persist bookmarks between sessions. To revisit a filter in the future, M-x bookmark-jump (C-x r b).

Since this requires no configuration, this may serve as an easy replacement for manually building “view” toggles — filters bound to certain keys — which I know many users have done, including me.
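
For comparison, here is a rough sketch of the kind of hand-built “view” binding that bookmarks can replace. The command name, the filter string, and the choice of the B key are all illustrative assumptions, not part of Elfeed.

```elisp
;; Hypothetical "view" command: jump straight to a fixed filter.
(defun my-elfeed-view-unread-blogs ()
  "Show unread entries tagged `blog' from the past 6 months."
  (interactive)
  (elfeed-search-set-filter "@6-months +unread +blog"))

;; Bind it once the search mode's keymap exists. The key is assumed free.
(with-eval-after-load 'elfeed-search
  (define-key elfeed-search-mode-map (kbd "B") #'my-elfeed-view-unread-blogs))
```

With bookmarks, the same effect needs no code: set the filter interactively, then save it with C-x r m.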

New Header

If you’ve updated very recently, you probably noticed Elfeed got a brand new header. Previously it faked a header by writing to the first line of the buffer. This is because somehow I had no idea Emacs had official support for buffer headers (despite notmuch using them all this time).

The new header includes additional information, such as the current filter, the number of unread entries, the total number of entries, and the number of unique feeds currently in view. You’ll see this as unread/total:feeds in the middle of the header.

As of this writing, the new header has not been made part of a formal release. So if you’re only tracking stable releases, you won’t see this for a while longer.

You can supply your own header via elfeed-search-header-function (thanks to Gergely Nagy).

Scoped Updates

As you already know, in the search buffer listing you can press G to update your feeds. But did you know it takes a prefix argument? Run as C-u G, it only updates feeds with entries currently listed in the buffer.

As of this writing, this is another feature not yet in a formal release. I’d been wanting something like this for a while but couldn’t think of a reasonable interface. Directly prompting the user for feeds is neither elegant nor composable. However, groks suggested the prefix argument, which composes perfectly with Elfeed’s existing idioms.

Listing Customizations

In addition to custom faces, there are a number of ways to customize the listing.

Choose the sort order with elfeed-sort-order.
Set a custom date format with elfeed-search-date-format.
Adjust field widths with elfeed-search-*-width.
Or override everything with elfeed-search-print-entry-function.
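
A small configuration sketch of the first three options follows. The values are purely illustrative, and the date format is assumed to take the form of a list of format string, column width, and alignment.

```elisp
;; Illustrative values only; adjust to taste.
(setq elfeed-sort-order 'ascending)          ; oldest entries first

;; Assumed shape: (FORMAT-STRING WIDTH ALIGNMENT).
(setq elfeed-search-date-format '("%Y-%m-%d %H:%M" 16 :left))

;; One of the elfeed-search-*-width variables.
(setq elfeed-search-title-max-width 100)
```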

Gergely Nagy has been throwing lots of commits at me over the last couple of weeks to open up lots of Elfeed’s behavior to customization, so there are more to come.

Thank You, Emacs Community

Apologies about any features I missed or anyone I forgot to mention who’s made contributions. The above comes from my ChangeLogs, the commit log, the GitHub issue listing, and my own memory, so I’m likely to have forgotten some things. A couple of these features I had forgotten about myself!

Elfeed Tips and Tricks

2013-11-26T00:38:20Z

This past weekend I had some questions from next-user-here (NUH) on my original Elfeed post about changing some of Elfeed’s behavior. NUH is an Elisp novice so accomplishing some of the requested modifications wasn’t obvious. A novice is mostly limited to setting variables, not defining advice or using hooks. I’ve also been using Elfeed daily for about three months now as my sole web feed reader and along the way I’ve developed some best practices. In addition to responding to some of NUH’s questions here, I’d like to share some tips and tricks.

Custom Entry Launchers

Currently you can press “b” to launch one or more entries in your browser. You can use “y” to copy a single entry to the clipboard. What if you want to add another action?

In my configuration I have a fancy binding that sends the entry URLs in the selected region to youtube-dl for downloading the videos. It’s too large to share as a snippet so here’s a small example of something similar using a program called xcowsay.

(defun xcowsay (message)
  (call-process "xcowsay" nil nil nil message))

(defun elfeed-xcowsay ()
  (interactive)
  (let ((entry (elfeed-search-selected :single)))
    (xcowsay (elfeed-entry-title entry))))

(define-key elfeed-search-mode-map "x" #'elfeed-xcowsay)

Now when I hit “x” over an entry in Elfeed I’m greeted by a cow announcing the title.

Entry Listing Customization

The search buffer you see when starting Elfeed, where entries are listed, can be customized a few different ways. First, this buffer does grow dynamically. After re-sizing the window/frame horizontally you just have to refresh the view by pressing g (an Emacs convention). How it fills out depends on the settings of these variables,

elfeed-search-title-max-width
elfeed-search-title-min-width
elfeed-search-trailing-width

They control how wide the different columns should be as the window size changes. An important caveat to this is that the cache stored in elfeed-search-cache must be cleared before the changes will be reflected in the display. This cache exists because building the display, assembling all the special faces, is actually quite CPU-intensive. It was an optimization I established early on.

(clrhash elfeed-search-cache)

If you set these variables in your start-up configuration you don’t need to worry about clearing the cache because it will already be empty. It’s only a concern when playing with the settings.

Date Display

Another question was about adding time to the entry listing. Elfeed only displays the entry’s date. Dates are formatted by the function elfeed-search-format-date. This can be redefined to display dates differently.

(defun elfeed-search-format-date (date)
  (format-time-string "%Y-%m-%d %H:%M" (seconds-to-time date)))

It’s given epoch seconds as a float and it returns a string to display as a date.
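
To see the conversion in isolation, here is a small stand-alone sketch. The helper name my-format-epoch is made up for this example, and it formats in UTC only so that the result doesn't depend on the local time zone.

```elisp
(defun my-format-epoch (date)
  "Format DATE, epoch seconds as a float, as a UTC date and time string."
  ;; The final `t' requests Universal Time; drop it to display local time.
  (format-time-string "%Y-%m-%d %H:%M" (seconds-to-time date) t))

(my-format-epoch 1379905205.0)
;; => "2013-09-23 03:00"
```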

Faces and Colors

All of the faces used in the display are declared for customization, so these can be changed to whatever you like.

elfeed-search-date-face
elfeed-search-title-face
elfeed-search-feed-face
elfeed-search-tag-face

Say you suffered a head injury and decided you want your Elfeed dates to be bold, purple, and underlined,

(custom-set-faces
 '(elfeed-search-date-face
   ((t :foreground "#f0f"
       :weight extra-bold
       :underline t))))

Database Manipulation

Feeds and entries in the database can be manipulated to become whatever you want them to be. Because Elfeed is regularly modifying the database, the trick is to perform the manipulation at just the right time.

Feed Title Changes

Say you want to change a feed title because you don’t like the title supplied by the feed. For example, the title to my blog’s feed is “null program” but instead you think it should be “Seriously Handsome Programmer” (head injury, remember?). The function elfeed-db-get-feed can be used to fetch a feed’s data structure from the database, given its exact URL as listed in your elfeed-feeds.

(let ((feed (elfeed-db-get-feed "https://nullprogram.com/feed/")))
  (setf (elfeed-feed-title feed) "Seriously Handsome Programmer"))

Hold it, that didn’t work. First, that display cache is getting in the way again. Feed titles change very infrequently so they’re cached aggressively. More importantly, next time you update your feeds Elfeed will re-synchronize the feed title with the official title. It’s going to fight against your intervention.

The solution is to do it with a little bit of advice just before the title is displayed. Advise the function elfeed-search-update with some “before” advice.

(defadvice elfeed-search-update (before nullprogram activate)
  (let ((feed (elfeed-db-get-feed "https://nullprogram.com/feed/")))
    (setf (elfeed-feed-title feed) "Seriously Handsome Programmer")))

Entry Tweaking

Automatic entry modification should happen immediately upon discovery so that it looks like the entry arrived that way. This is done through the elfeed-new-entry-hook. Generally this would be used for applying custom tags. These examples are from the documentation:

;; Mark all YouTube entries
(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :feed-url "youtube\\.com"
                              :add '(video youtube)))

;; Entries older than 2 weeks are marked as read
(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :before "2 weeks ago"
                              :remove 'unread))

;; Building subset feeds
(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :feed-url "example\\.com"
                              :entry-title '(not "something interesting")
                              :add 'junk
                              :remove 'unread))

Due to a feature I recently ported from my personal configuration, this tagger helper function is less necessary. You can put lists in your elfeed-feeds list to supply automatic tags.

(setq elfeed-feeds
      '(("https://nullprogram.com/feed/" blog emacs)
        "http://www.50ply.com/atom.xml"  ; no autotagging
        ("http://nedroid.com/feed/" webcomic)))

Content Tweaking

Going beyond tagging you could change the content of the feed. Say you want to make feeds 100 times better.

(defun hundred-times-better (entry)
  (let* ((original (elfeed-deref (elfeed-entry-content entry)))
         (replace (replace-regexp-in-string "keyboard" "leopard" original)))
    (setf (elfeed-entry-content entry) (elfeed-ref replace))))

(add-hook 'elfeed-new-entry-hook #'hundred-times-better)

The same trick could be used to remove advertising, change the date, change the title, etc. The elfeed-deref and elfeed-ref parts are needed to fetch and store content in the content database. Only a reference is stored on the structure. You can actually use these functions at any time outside of Elfeed, but they’ll eventually get garbage collected if Elfeed doesn’t know about them.

(setf ref (elfeed-ref "Hello, World"))
;; => [cl-struct-elfeed-ref "907d14fb3af2b0d4f18c2d46abe8aedce17367bd"]

(elfeed-deref ref)
;; => "Hello, World"

Deletion

A question that’s been asked a few times is if entries can be deleted. To start off, the answer to that question is “no.” There is no function provided to remove entries from the database. If you want to remove entries you’re probably taking the wrong approach.

The main problem with removal is that Elfeed needs to keep track of what it’s seen before. If an entry is removed and then rediscovered, it will reappear as unread. There are better ways to “remove” entries, such as tagging them specially.

On a moderately-powerful computer Elfeed can easily handle at least several tens of thousands of database entries. If “too many entries” ever becomes a performance problem I’d rather solve it by making the database faster than by removing information from the database. It’s already very date-oriented so that older entries are infrequently touched.

If storage is a concern, you shouldn’t get too worked up about that. As of this post I have about 6,000 entries in my database and the index file is only 3.5 MB. The content database after garbage collection, which is the data/ directory under ~/.elfeed/, with these 6k entries is 17MB. When I run M-x elfeed-db-compact, currently an experimental feature, it drops down to 1.8MB. That’s less than 1 kB per entry. It’s also less than my personal Liferea database of roughly the same amount of content (~15MB) before I wrote Elfeed.

If even this storage is still too much you can always blow away your data/ content database directory. This is safe to do even while Emacs is running. You’ll still see all of the entries listed in the search buffer but won’t be able to read them within Emacs until after the next database update (when it re-fetches the most recent entry content).

You can also clear out the content database from within Elisp by visiting every entry and clearing its content field.

(with-elfeed-db-visit (entry _)
  (setf (elfeed-entry-content entry) nil))

(elfeed-db-gc)  ;; garbage collect everything

The same sort of expression can be used to run over all known entries to perform other changes. If there was a delete function you might use it here to remove entries older than a certain date, then hope they’re not rediscovered.
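
As one example of such a change, this sketch marks everything older than a year as read instead of deleting it. The one-year cutoff is an arbitrary choice for illustration.

```elisp
(require 'elfeed)

;; Sketch: untag `unread' on every entry older than roughly one year.
(let ((cutoff (- (float-time) (* 365 24 60 60))))
  (with-elfeed-db-visit (entry _)
    ;; Entry dates are epoch seconds, so a plain numeric comparison works.
    (when (< (elfeed-entry-date entry) cutoff)
      (elfeed-untag entry 'unread))))
```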

If you never want to store entry content (you never read entries within Emacs), you can use a hook to always drop it on the floor as it arrives,

(add-hook 'elfeed-new-entry-hook
          (lambda (entry) (setf (elfeed-entry-content entry) nil)))

Questions?

If you have any questions or suggestions about how to make Elfeed do what you want it to do, feel free to ask. Some things may actually require that I make changes to Elfeed to support it, though I hope I’ve anticipated your particular need well enough to avoid that.

Atom vs. RSS

2013-09-23T06:23:51Z

From working on Elfeed, I’ve recently become fairly intimate with the Atom and RSS specifications. I needed to write a parser for each that would properly handle valid feeds but would also reasonably handle all sorts of broken feeds that it would come across. At this point I’m quite confident in saying that Atom is by far the better specification and I really wish RSS didn’t exist. This isn’t surprising: Atom was created specifically in response to RSS’s flawed and ambiguous specification.

One consequence of this realization is that I’ve added an Atom feed to this blog and made it the primary feed. Because so many people are still using the RSS feed, it will continue to be supported even though there are no longer links to it (Ha, try to find it now!). You may have noticed that I also started including the full post body in my feed entries. Now that my feed usage habits have changed, I felt that truncating content was actually rather rude. There’s still the issue that it contains relative URLs, but I’m not aware of any way to fix this with Jekyll. I also got a lot more precise with dates. Until recently, all posts occurred at midnight PST on the post date.

For reference, here are the specifications. Just these two documents cover about 99% of the web feeds out there.

Atom
RSS 2.0

Not that it matters too much, but it’s unfortunate that RSS has sort of “won” this format war. Of the feeds that I follow, about 75% are RSS and 25% are Atom. That’s still a significant number of web feeds and Atom is well-supported by all the clients that I’m aware of, so it’s in no danger of falling out of use. The broken (but still valid) RSS feeds I’ve come across probably wouldn’t be broken if they were originally created as Atom feeds. Atom is a stricter standard and, therefore, would have guided these authors to create their feeds correctly from the start. RSS encourages authors to do the wrong thing.

The Flaws of RSS

For reference, here’s a typical, friendly RSS 2.0 feed.

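<rss version="2.0">
  <channel>
    <title>Example RSS Feed</title>
    <item>
      <title>Example Item</title>
      <description>A summary.</description>
      <link>http://www.example.com/foo</link>
      <guid>http://www.example.com/foo</guid>
      <pubDate>Mon, 23 Sep 2013 03:00:05 GMT</pubDate>
    </item>
  </channel>
</rss>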

guid, the misnomer

Two of the biggest RSS flaws, flaws that forced me to make a major design compromise when writing Elfeed, have to do with the guid tag. That’s GUID, as in Globally Unique Identifier. Not only did it not appear until RSS 2.0, but the guid tag is not required. In practice an RSS client will be rereading the same feed items over and over, so it’s critical that it’s able to identify what items it’s seen before.

Without a guid tag it’s up to the client to guess what items have been seen already, and there’s no guidance in the specification for doing so. Without a guid tag, some clients use contents of the link tag as an identifier (Elfeed, The Old Reader). In practice it’s very unlikely for two unique items to have the same link. Other clients track the entire contents of the item, so when any part changes, such as the description, it’s treated as a brand new item (Liferea). Some guid-less feeds regularly change their description (advertising, etc.), so they’re not handled well by the latter clients. It’s a mess.
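
The link-as-identifier fallback can be sketched with Emacs' bundled xml.el. This is an illustration of the idea, not Elfeed's actual parser, and both helper names are made up for the example.

```elisp
(require 'xml)

(defun my-parse-xml-string (string)
  "Parse STRING as XML and return its first top-level node."
  (with-temp-buffer
    (insert string)
    (car (xml-parse-region (point-min) (point-max)))))

(defun my-rss-item-id (item)
  "Return the text of ITEM's guid element, falling back to its link."
  (let ((guid (car (xml-get-children item 'guid)))
        (link (car (xml-get-children item 'link))))
    ;; Prefer guid when present; otherwise use link. Returns nil if neither.
    (car (xml-node-children (or guid link)))))

(my-rss-item-id
 (my-parse-xml-string
  "<item><guid>abc</guid><link>http://example.com/foo</link></item>"))
;; => "abc"

(my-rss-item-id
 (my-parse-xml-string "<item><link>http://example.com/foo</link></item>"))
;; => "http://example.com/foo"
```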

In contrast, Atom’s id element is required. If someone doesn’t have one you can send them angry e-mails for having an invalid feed.

The bigger flaw of the guid tag is that, by default, guid tag content is not actually a GUID! This was a huge oversight by the specification’s authors. By default, the content of the guid tag must be a permanent URL. Only if the isPermaLink attribute is set to false can it actually be a GUID (but even that’s unlikely). If two different feeds contain items that link to content with the same permalink then that “GUID” is obviously no longer unique. Two unique items have the same “unique” ID. Doh! Even if the guid tag was required, I still couldn’t rely on it in Elfeed.

In contrast, Atom’s id element must contain an Internationalized Resource Identifier (IRI). This is guaranteed to be unique.

Unlike Atom, RSS feeds themselves also don’t have identifiers. Due to RSS guids never actually being GUIDs, in order to uniquely identify feed entries in Elfeed I have to use a tuple of the feed URL and whatever identifier I can gather from the entry itself. It’s a lot messier than it should be.
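
A minimal sketch of that tuple idea, using a plain hash table rather than Elfeed's real database: the table, the function names, and the example URLs are all hypothetical.

```elisp
;; Hypothetical seen-entries table keyed by (FEED-URL . ENTRY-ID) pairs.
;; The `equal' test compares the strings inside the pair, not identity.
(defvar my-seen-entries (make-hash-table :test 'equal))

(defun my-mark-seen (feed-url entry-id)
  "Record that ENTRY-ID from FEED-URL has been seen."
  (puthash (cons feed-url entry-id) t my-seen-entries))

(defun my-seen-p (feed-url entry-id)
  "Return non-nil if ENTRY-ID from FEED-URL was seen before."
  (gethash (cons feed-url entry-id) my-seen-entries))

(my-mark-seen "http://a.example/feed" "http://www.example.com/foo")
(my-seen-p "http://a.example/feed" "http://www.example.com/foo") ; => t
(my-seen-p "http://b.example/feed" "http://www.example.com/foo") ; => nil
```

The same permalink arriving through a second feed is treated as a different entry, which is exactly the compromise described above.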

In a purely Atom world, the GUID alone would be enough to identify an entry and the feed URL wouldn’t matter for identification: I wouldn’t care where the feed came from, just what it’s called. If the same feed was hosted at two different URLs, a user could list both, the second appearance acting as a backup mirror, and Elfeed would merge them effortlessly.

pubDate, the incorrectly specified

RSS didn’t have any sort of date tag until version 2.0! A standard specifically oriented around syndication sure took a long time to have date information. Before 2.0 the workaround was to pull in a date tag from another XML namespace, such as Dublin Core.

In contrast, Atom has always had published and updated tags for communicating date information.

Finally, in RSS 2.0, dates arrived in the form of the pubDate tag. For some reason the name “date” wasn’t good enough so they went with this ugly camel-case name. Despite all the extra time, they still screwed this part up. The specification says that dates must conform to the outdated RFC 822, then provides examples that aren’t RFC 822 dates! Doh! This is because RFC 822 only allows for 2-digit years, so no one should be using it anymore. The RSS authors unwittingly created yet another date specification: a mash-up of RFC 822 and its successor, RFC 2822. In practice everyone just pretends RSS uses RFC 2822.

In contrast, Atom consistently uses RFC 3339 dates, along with a couple of additional restrictions. These dates are much simpler to parse than RFC 2822, which is complex because it attempts to be backwards compatible with RFC 822.
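
For a feel of the two formats side by side, here is a sketch that leans on Emacs' built-in date parsers rather than Elfeed's own code. The wrapper names are invented for the example; both calls should produce the same epoch time for the same instant.

```elisp
(require 'time-date)
(require 'parse-time)

(defun my-rss-date (string)
  "Parse an RFC 2822-style STRING (RSS pubDate) into epoch seconds."
  (float-time (date-to-time string)))

(defun my-atom-date (string)
  "Parse an RFC 3339 STRING (Atom published/updated) into epoch seconds."
  (float-time (parse-iso8601-time-string string)))

(my-rss-date "Mon, 23 Sep 2013 03:00:05 GMT")
;; => 1379905205.0
(my-atom-date "2013-09-23T03:00:05Z")
;; => 1379905205.0
```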

RSS 1.0, the problem child

RSS changed a lot between versions. There was the 0.9x series, several of which were withdrawn. Later on there was version 1.0 (2000) and 2.0 (2002). The big problem here is that RSS 1.0 has very little in common with 0.9x and 2.0. It’s practically a whole different format. In order to officially support RSS, a client has to be able to parse all of these different formats. In fact, in Elfeed I have an entirely separate parser for RSS 1.0.

What’s so weird about RSS 1.0? If you thought the name “pubDate” was ugly you might want to skip this part. In practice it’s namespace hell. For example, look at this Gmane RSS 1.0 feed. Unlike the other RSS versions, the top level element is rdf:RDF. That’s not a typo.

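<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <channel>
    <title>RSS 1.0 Example</title>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.com/foo"/>
      </rdf:Seq>
    </items>
  </channel>
  <item>
    <title>Example Item</title>
    <description>A summary.</description>
    <link>http://www.example.com/foo</link>
  </item>
</rdf:RDF>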

Remember, if you want dates you’ll need to import another namespace.

Notice the completely redundant items tag. It’s not like you’re going to download a partial feed and use the items tag to avoid grabbing full content. It’s just noise.

Even more important: notice that the items are outside the channel tag! Why would they completely restructure everything in 1.0? It’s madness. Fortunately everything here was dumped in RSS 2.0 and, except for a very small number of feeds, it’s almost just a bad memory.

channel, the vestigial tag

Notice in the example RSS feed it goes rss -> channel -> item*. Having a channel tag suggests a single feed can have a number of different channels. Nope! Only one channel is allowed, meaning the channel tag serves absolutely no purpose. It’s just more noise. Why was this ever added?

The good news is that RSS has a category tag which serves this purpose much better anyway. Tagging is preferable to hierarchies — e.g. an item could only belong to one channel but it could belong to multiple categories.

Atom

Atom is a much cleaner specification, with much clearer intent, and without all the mistakes and ambiguities. It’s also more general, designed for the syndication of many types and shapes of content. This is what made it popular for use with podcasts. Everything I listed above I discovered myself while writing Elfeed. There are surely many other problems with RSS I haven’t noticed yet.

If I only had to support Atom, things would have been significantly simpler. At the moment I have no complaints about Atom. It’s given me no trouble.

Someday if you’re going to create a new feed for some content, please do the web a favor and choose Atom! You’re much more likely to get things right the first time and you’ll make someone else’s job a lot easier. As the author of a web feed client you can take my word for it.

The Elfeed Database

2013-09-09T05:53:41Z

The design of Elfeed’s database took some experimentation before any part of it was settled. A major design constraint was Emacs’ very limited file input/output. There’s no random access and, without the aid of an external program, files must always be read and written wholesale. That’s not database-friendly at all! In the end I settled on a design that minimized the size of the frequently rewritten parts, an index with two different data models, by storing immutable data in a loose-file, content-addressable database.

At the moment there really aren’t any pure-Elisp database solutions for Emacs. This is almost certainly due to the aforementioned I/O limitations. I ran into this same problem last year when I created an Emacs pastebin server. I attempted, and failed, to interface with a SQLite database through its command line program. Nic Ferrier has published a generic database interface, but it lacks concrete implementations.

As a bit of good news, as far as I know Emacs does properly handle atomic file updates across all platforms, so a pure-Elisp database developer would never have to worry about only writing half the database. It’s always a safe operation. Worst case scenario you’re left with an old version of data rather than no data at all.
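
If you'd rather make that guarantee explicit in your own code, the usual write-then-rename pattern can be sketched as follows. This is an application-level illustration under the assumption of a POSIX-like file system, not a description of what Emacs does internally, and the function name is made up.

```elisp
(defun my-write-file-atomically (file string)
  "Replace FILE's contents with STRING using write-then-rename."
  ;; Create the temporary file next to FILE so the rename stays on the
  ;; same file system.
  (let ((tmp (make-temp-file (concat (expand-file-name file) ".tmp"))))
    ;; Write the complete new contents to the temporary file first.
    (write-region string nil tmp nil 'silent)
    ;; Then move it into place, replacing any existing FILE.
    (rename-file tmp file t)))
```

A reader of FILE then sees either the old contents or the new contents, never a half-written mix.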

A real possibility for a database would be connecting to an established database server via TCP with an Emacs network process. If the server has a specified wire protocol Elisp could talk to it efficiently. In fact, there exists pg.el that does exactly this for PostgreSQL. Unfortunately I was not able to get this working with my pastebin, nor is this solution appropriate for Elfeed. It would be unreasonable to require users to first set up a PostgreSQL server just to read web feeds!

Ultimately it would seem that any efficient Emacs database requires the help of an external program. The notmuch mail client, which inspired Elfeed, does this. To access the notmuch database a command line program is run once for each request. A query is passed as a program argument and the output of the program is parsed into the result.

The Early Database

For the first few days of its existence Elfeed only had an in-memory database. Closing Emacs would lose everything. For my personal usage patterns, where I read, or at least address, all entries that arrive — and especially because I use Elfeed on a couple of different computers — I don’t really need to track things long term. I could easily mark everything after a certain date as read and forget about them. However, it would be nice to have and, more importantly, many people wouldn’t use Elfeed without persistence between Emacs sessions.

So, for the first database I did what I always do: dumped the data structure to a file using the printer and parsed it back in later using the reader. This is dead simple in Lisp, it’s very fast, and it even works for circular data structures. It’s something I missed so much with the much-less-capable JSON format earlier this year that I wrote a JavaScript library to do it.

(defun save-data (file data)
  (with-temp-file file
    (let ((standard-output (current-buffer))
          (print-circle t))  ; Allow circular data
      (prin1 data))))

(defun load-data (file)
  (with-temp-buffer
    (insert-file-contents file)
    (read (current-buffer))))

(save-data "demo.dat" '(a b c ["1" 2 3]))
(load-data "demo.dat")
;; => (a b c ["1" 2 3])

Anything with a printed representation can be serialized and stored this way, including symbols, strings, numbers, lists, vectors (structs, objects), hash tables, and even compiled functions (.elc files). Basically every Emacs library that stores data on disk uses this technique.

Unfortunately, this is where I hit another serious database constraint: print-circle is broken in Emacs 24.3, the current stable release. This means Elfeed cannot take advantage of this useful feature, at least not for a long time, as I had been counting on. The final database is slightly slower and larger than strictly required as a result.

The Content Database

After breaking the circular references of the in-memory database I finally had persistence for the first time. With the naive printer/reader approach it was slow, almost 1 second to write just a few thousand entries on my 6-year-old laptop (my minimum requirements target machine). I wanted Elfeed to support hundreds of thousands of entries, if not millions, so this was much too slow.

The big slowdown was writing out all the entry content each time the database is saved. These are large strings containing HTML that rarely change. There’s no reason to write these out every time, nor is there a reason to even keep them in memory all the time, as they’re rarely accessed. The solution is a loose-file, content-addressable database, very similar to an unpacked Git object database.

The content database stores immutable sequences of characters — not just raw bytes, but rather multibyte strings — using an unspecified coding system (right now it’s UTF-8 for all platforms). The filename for the content is the content hashed with SHA-1 (“content-addressable”). To limit the number of files per directory, these files are stored in subdirectories named by the first hex-encoded byte of the hash (just like Git). A database of 4 items might look like this:

data/
   18/
      18ff6f11945b1e9f3e3c4cae8b5275d36b9944e1
      184c06a83f0bc73a8345c6d886f9043bcae095f8
   6b/
      6b59ae257f2bea24703d8adf5747049c138dfc82
   cc/
      cc47d53872ae2a9186151ef1a68392a94e1f091f
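
The mapping from content to file name can be sketched in a few lines. This is an illustration of the layout rather than Elfeed's own code; the function name and the default root of "data" are assumptions, and the path separator assumes a Unix-like system.

```elisp
(defun my-content-path (content &optional root)
  "Return the file path under ROOT (default \"data\") for CONTENT."
  (let ((hash (secure-hash 'sha1 content)))
    ;; The first two hex digits (one byte) name the subdirectory.
    (concat (file-name-as-directory (or root "data"))
            (substring hash 0 2) "/" hash)))

(my-content-path "Hello, world!")
;; => "data/94/943a702d06f34599aee1f8da8ef9f7296031d699"
```

Since the path depends only on the content, storing the same string twice lands on the same file.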

Something really neat about the content database is that it’s completely agnostic about Elfeed. If it weren’t for Elfeed’s garbage collector, anyone could use it to store arbitrary content. The function elfeed-ref accepts a string and returns a reference into the database. Because of the hash, providing the same string in the future will return the same reference without actually performing a write. References are dereferenced with elfeed-deref.

(setf ref (elfeed-ref "Hello, world!"))
;; => [cl-struct-elfeed-ref "943a702d06f34599aee1f8da8ef9f7296031d699"]

(elfeed-deref ref)
;; => "Hello, world!"

With content stored elsewhere, entries are a struct containing only some small metadata: title, link, date, and a content database reference. Writing out many of them at once is much, much faster.

I don’t expect it happens often, but this also means content is de-duplicated. If two entries happen to have the same content they’ll share content database storage. A small savings.

At this point it’s really tempting to get fancier and really put this content database to use. The core index itself could be stored as raw content, and the root to accessing the database would be a single SHA-1 hash referencing it, again very similar to Git. If an index stores a reference to the previously written index, then the Elfeed database would be an immutable structure tracking its entire history. Such a change would cost virtually nothing in performance, just disk space.

Multiple Representations

With all the content out of the way, the database is now just a lean index. At this point it’s a hash table mapping feed IDs to feeds. Each feed contains a list of its entries. To build the entry listing for the elfeed-search buffer, Elfeed needs to visit each feed in the hash table, gather its entries into one giant list, then finally sort that list by date. At around O(n log n), that sort operation is a real performance killer. Completely unacceptable. To fix this we need to think about how the data is updated and used.
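
The gather-then-sort approach looks roughly like this. It's a simplified stand-in where each entry is just a (DATE . TITLE) pair and the function name is made up; the real structs carry more fields.

```elisp
(defun my-all-entries-sorted (feeds)
  "Collect entries from the FEEDS hash table and sort them newest first.
FEEDS maps a feed ID to a list of (DATE . TITLE) pairs."
  (let (all)
    ;; Gather every feed's entries into one big list (copies each list).
    (maphash (lambda (_id entries) (setq all (append entries all))) feeds)
    ;; This sort is the O(n log n) step paid on every listing.
    (sort all (lambda (a b) (> (car a) (car b))))))

(let ((feeds (make-hash-table :test 'equal)))
  (puthash "a" '((3 . "a-new") (1 . "a-old")) feeds)
  (puthash "b" '((2 . "b-mid")) feeds)
  (mapcar #'cdr (my-all-entries-sorted feeds)))
;; => ("a-new" "b-mid" "a-old")
```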

First, entries are always viewed in date order, no exceptions. From my experience of using web feeds for the last six years I never had a reason to list feed entries by any other order. The vast majority of the time, newer entries are most relevant, and if I need to look for something specific I can search for it.

We definitely want to store entries in date-order so we can create entry listings without performing a sort: something around O(n) or so. Inserting new entries into this structure should also be efficient.

Second, entries are never removed from the database. This isn’t e-mail. Even if a user doesn’t want to see an entry again, we have to keep track of it. Otherwise it will show up as new if it’s discovered in a feed again, which is likely. Things are added to the database and never removed. In Elfeed, I use a junk tag to completely hide entries I don’t want to see, and I always have a -junk element in my filter.

There’s an important caveat to this one that I had missed until after the public release: entry dates can change! When a previously discovered entry is read from a feed, Elfeed updates (read: mutates) the entry struct to reflect the new state. This includes the date. It’s very likely that a date-sorted representation won’t tolerate date changes underneath it since it’s keying off of them. Either we refuse to update the entry date, or we remove the entry, update the date, and then re-insert it (how it currently works).

Third, entries are generally added with a recent date. After the database is initially populated, it’s only picking up new items. Adding a recently-dated entry should therefore be faster than adding an older one. I didn’t get a chance to take advantage of this, but it’s something to keep in mind.

Fourth, entries need to be keyed by an ID string. Each entry has a unique, unchanging identifier string, either provided by the feed itself (RSS’s guid or Atom’s id) or generated intelligently by Elfeed. Especially because of the print-circle bug, we need to be able to talk about feeds in terms of their ID — an indirect pointer.

(Actually, even when RSS guid tags are present, they’re permalinks by default. So, unfortunately, RSS IDs are not at all resistant to collisions across feeds. To work around this, entry identifiers are a pair of strings: feed ID and entry ID. Atom doesn’t have this problem, but we’re stuck with the lowest common denominator.)

A date-oriented representation would be unable to efficiently look up an entry by its ID, so it needs to be supplemented by an ID-oriented representation. This means we need two representations in our database: date-oriented and ID-oriented.

So what do we use? Well, for keeping entries sorted by date we want some sort of balanced tree. A B-tree is probably a good choice. Rather than write one I went with an AVL tree since Emacs comes with a library for it (avl-tree). It’s already debugged and optimized! The bad news is that the internal structure is unspecified, so there are no guarantees that it can be serialized. A future update to the library may break the Elfeed database. I also had to hack into it to work around a security issue. The comparison function is embedded in the tree. After deserializing the database, Elfeed needs to ensure that no one stuck a malicious function in there.
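
Here is a minimal sketch of that workaround, not Elfeed's actual code. The names my-db-compare and my-db-trust-index are hypothetical. avl-tree--cmpfun is an internal accessor in avl-tree.el, so this relies on the same unspecified internals mentioned above and could break with a future version of the library.

```elisp
(require 'avl-tree)

(defun my-db-compare (a b)
  "Hypothetical trusted ordering: non-nil if A sorts before B."
  (< a b))

(defun my-db-trust-index (index)
  "Replace whatever comparison function INDEX carries with a known one.
Call this right after reading INDEX back from disk, before any
insertion, deletion, or lookup has a chance to invoke the stored one."
  ;; Internal accessor: an assumption about avl-tree.el's struct layout.
  (setf (avl-tree--cmpfun index) #'my-db-compare)
  index)
```

The stored function is never called or inspected. It is simply overwritten, so it doesn't matter what was in the file.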

The choice for an ID database was super-easy: a hash table. Due to the print-circle bug, this is actually the main representation. The AVL tree only stores IDs and it has to reach into the hash table to do any date comparisons. If print-circle were working I could store the same exact entry objects in the AVL tree as the hash table, so mutating them would update them in all representations. However, with print-circle off, on deserialization these would become unique objects and updates would break.
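
The two representations can be sketched in a few lines. This is an illustration under assumptions, not Elfeed's real schema: the my- names, the struct slots, and the use of plain numbers for dates are all hypothetical. The tree holds only IDs, its comparison function looks dates up in the hash table, and a date change is handled by removing the ID, mutating the entry, and re-inserting.

```elisp
(require 'avl-tree)
(require 'cl-lib)

;; Hypothetical entry: ID is a (FEED-ID . ENTRY-ID) pair, DATE a number.
(cl-defstruct my-entry id date title)

(defvar my-entries (make-hash-table :test 'equal)
  "ID-oriented representation: maps entry ID to entry struct.")

(defun my-compare (a b)
  "Non-nil if entry ID A sorts before B: newest first, ties broken by ID."
  (let ((date-a (my-entry-date (gethash a my-entries)))
        (date-b (my-entry-date (gethash b my-entries))))
    (if (= date-a date-b)
        (string< (prin1-to-string a) (prin1-to-string b))
      (> date-a date-b))))

(defvar my-index (avl-tree-create #'my-compare)
  "Date-oriented representation: an AVL tree holding only entry IDs.")

(defun my-add (entry)
  "Insert ENTRY, or update the stored entry if its ID is already known."
  (let* ((id (my-entry-id entry))
         (old (gethash id my-entries)))
    (if (null old)
        (progn
          (puthash id entry my-entries) ; must precede the tree insert
          (avl-tree-enter my-index id))
      ;; Remove while the old date is still in the table, then re-insert.
      (avl-tree-delete my-index id)
      (setf (my-entry-date old) (my-entry-date entry)
            (my-entry-title old) (my-entry-title entry))
      (avl-tree-enter my-index id))))

(defun my-listing ()
  "Return all entries newest first by walking the tree, with no sort."
  (mapcar (lambda (id) (gethash id my-entries))
          (avl-tree-flatten my-index)))

;; Usage: three entries, then a date change on the oldest one.
(my-add (make-my-entry :id '("feed-a" . "1") :date 100 :title "a"))
(my-add (make-my-entry :id '("feed-a" . "2") :date 300 :title "b"))
(my-add (make-my-entry :id '("feed-b" . "1") :date 200 :title "c"))
(my-add (make-my-entry :id '("feed-a" . "1") :date 400 :title "a"))
(mapcar #'my-entry-title (my-listing)) ; => ("a" "b" "c")
```

The tie-break on ID matters: without it, two entries sharing a date would compare as equal and one would replace the other in the tree.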

The Future

That’s where the database is today. I put in a few extra fields that aren’t actually used yet, so that there’s room to make a few changes without breaking the database. Perhaps someday I’ll work out a whole new database structure, or maybe a proper database library will come into existence, and this post will simply document the old database.

Introducing Elfeed, an Emacs Web Feed Reader

2013-09-04T05:33:10Z

Unsatisfied with the results of my recent search for a new web feed reader, I created my own from scratch, called Elfeed. It’s built on top of Emacs and is available for download through MELPA. I intend it to be highly extensible, a power user’s web feed reader. It supports both Atom and RSS.

https://github.com/skeeto/elfeed

The design of Elfeed was inspired by notmuch, which is my e-mail client of choice. I’ve enjoyed the notmuch search interface and the extensibility of the whole system — a side-effect of being written in Emacs Lisp — so much that I wanted a similar interface for my web feed reader.

The search buffer

Unlike many other feed readers, Elfeed is oriented around entries — the Atom term for articles — rather than feeds. It cares less about where entries came from and more about listing relevant entries for reading. This listing is the *elfeed-search* buffer. It looks like this,

This buffer is not necessarily about listing unread or recent entries, it’s a filtered view of all entries in the local Elfeed database. Hence the “search” buffer. Entries are marked with various tags, which play a role in view filtering — the notmuch model. By default, all new entries are tagged unread (customize with elfeed-initial-tags). I’ll cover the filtering syntax shortly.

From the search buffer there are a number of ways to interact with entries. You can select a single entry with the point, or multiple entries at once with a region, and interact with them.

b: visit the selected entries in a browser
y: copy the selected entry URL to the clipboard
r: mark selected entries as read
u: mark selected entries as unread
+: add a specific tag to selected entries
-: remove a specific tag from selected entries
RET: view selected entry in a buffer

(This list can be viewed within Emacs with the standard C-h m.)

The last action uses the Simple HTML Renderer (shr), now part of Emacs, to render entry content into a buffer for viewing. It will even fetch and display images in the buffer, assuming your Emacs has been built for it. (Note: the GNU-provided Windows build of Emacs doesn’t ship with the necessary libraries.) It looks a lot like reading an e-mail within Emacs,

The standard read-only keys are in action. Space pages down and backspace pages up. The n and p keys switch between the next and previous entries from the search buffer. The idea is that you should be able to hop into the first entry and work your way along reading them within Emacs when possible.

Configuration

Elfeed maintains a database in ~/.elfeed/ (configurable). It will start out empty because you need to tell it what feeds you’d like to follow. List your feeds in the elfeed-feeds variable. You would do this in your .emacs or other initialization files.

(setq elfeed-feeds
      '("http://www.50ply.com/atom.xml"
        "http://possiblywrong.wordpress.com/feed/"
        ;; ...
        "http://www.devrand.org/feeds/posts/default"))

Once set, hitting G (capitalized) in the search buffer or running elfeed-update will tell Elfeed to fetch each of these feeds and load in their entries. Entries will populate the search buffer as they are discovered (assuming they pass the current filter), where they can be immediately acted upon. Pressing g (lower case) refreshes the search buffer view without fetching any feeds.

Everything fetched will be added to the database for the next time you run Emacs. You don’t need to know how the database works in order to use Elfeed, but I’ll discuss some of the details of its format in another post.

Pressing s in the search buffer will allow you to edit the search filter in action.

There are three ways to filter entries, in order of efficiency: by age, by tag, and by regular expression. For an entry to be shown, it must pass each of the space-delimited components of the filter.

Ages are described by plain language relative time, starting with @. This component is ultimately parsed by Emacs’ timer-duration function. Here are some examples.

@1-year-old
@5-days-ago
@2-weeks
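
To show roughly how such a component could become a number of seconds, here is a small sketch built on Emacs’ timer-duration. The function name my-age-to-seconds and the way it strips the trailing "ago" or "old" are my assumptions for illustration. Elfeed’s own parsing may differ in the details.

```elisp
(defun my-age-to-seconds (component)
  "Convert an age COMPONENT such as \"@2-weeks\" into seconds.
Return nil if `timer-duration' cannot parse the result."
  (let* ((text (substring component 1)) ; drop the leading @
         ;; "ago" and "old" are not duration words, so remove them.
         (text (replace-regexp-in-string "-\\(ago\\|old\\)\\'" "" text))
         ;; `timer-duration' expects spaces rather than dashes.
         (text (replace-regexp-in-string "-" " " text)))
    (timer-duration text)))

(my-age-to-seconds "@2-weeks")
(my-age-to-seconds "@5-days-ago")
(my-age-to-seconds "@1-year-old")
```

Subtracting the result from the current time gives a cutoff date, and anything older than the cutoff fails the filter.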

Tag filters start with + and -. When +, entries must be tagged with that tag. When -, entries must not be tagged with that tag. Some examples,

+unread: show only unread posts.
-junk +unread: show unread entries, except those tagged junk.

Anything else is treated like a regular expression. However, the regular expression is applied only to titles and URLs for both entries and feeds. It’s not currently possible to filter on entry content, and I’ve found that I never want to do this anyway.

Putting it all together, here are some examples.

linu[xs] @1-year-old: only show entries about Linux or Linus from the last year.
-unread +youtube: only show previously-read entries tagged with youtube.

Note: the database is date-oriented, so age filtering is by far the fastest. Including an age limit will greatly increase the performance of the search buffer, so I recommend adding it to the default filter (elfeed-search-filter).
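
As a configuration sketch, assuming the default filter lives in a variable named elfeed-search-filter (as it does in current Elfeed), an age limit can be set like this:

```elisp
;; Show only unread entries from the last six months by default.
;; The variable name is an assumption; check your installed version.
(setq-default elfeed-search-filter "@6-months-ago +unread")
```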

Tagging

Generally you don’t want to spend time tagging entries. Fortunately this step can easily be automated using elfeed-make-tagger. To tag all YouTube entries with youtube and video,

(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :feed-url "youtube\\.com"
                              :add '(video youtube)))

Each function added to elfeed-new-entry-hook is called with the new entry as its argument. The elfeed-make-tagger function returns a function that applies tags to entries matching specific criteria.
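
To make the hook protocol concrete, here is a hand-written function with roughly the same effect as the YouTube tagger above. It is only a sketch: my-tag-youtube is a hypothetical name, and it assumes elfeed-entry-feed, elfeed-feed-url, and elfeed-tag behave as they do in current Elfeed.

```elisp
(require 'elfeed)

(defun my-tag-youtube (entry)
  "Tag ENTRY with video and youtube if its feed URL mentions youtube.com."
  (let* ((feed (elfeed-entry-feed entry))
         (url (and feed (elfeed-feed-url feed))))
    (when (and url (string-match-p "youtube\\.com" url))
      (elfeed-tag entry 'video 'youtube))))

(add-hook 'elfeed-new-entry-hook #'my-tag-youtube)
```

Writing the function by hand is only worthwhile when the criteria go beyond what elfeed-make-tagger can express.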

This tagger tags old entries as read. It’s handy for initializing an Elfeed database on a new computer, since I’ve likely already read most of the entries being discovered.

(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :before "2 weeks ago"
                              :remove 'unread))

Creating custom subfeeds

Tagging is also really handy for fixing some kinds of broken feeds or otherwise filtering out unwanted content. I like to use a junk tag to indicate uninteresting entries.

(add-hook 'elfeed-new-entry-hook
          (elfeed-make-tagger :feed-url "example\\.com"
                              :entry-title '(not "something interesting")
                              :add 'junk
                              :remove 'unread))

There are a few feeds I’d like to follow but do not because the entries lack dates. This makes them difficult to follow without a shared, persistent database. I’ve contacted the authors of these feeds to try to get them fixed but have not gotten any responses. I haven’t quite figured out how to do it yet, but I will eventually create a function for elfeed-new-entry-hook that adds reasonable dates to these feeds.

Custom actions

In my own .emacs.d configuration I’ve added a new entry action to Elfeed: video downloads with youtube-dl. When I hit d on a YouTube entry either in the entry “show” buffer or the search buffer, Elfeed will download that video onto my local drive. I consume quite a few YouTube videos on a regular basis (I’m a “cord-never”), so this has already saved me a lot of time.

Adding custom actions like this to Elfeed is exactly the extensibility I’m interested in supporting. I want this to be easy. After just a week of usage I’ve already customized Elfeed a lot for myself — very specific customizations which are not included with Elfeed.
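
A custom action of this kind might look like the following sketch. It is not my actual configuration: my-elfeed-search-download is a hypothetical name, it covers only the search buffer, and it assumes that elfeed-search-selected, elfeed-entry-link, and elfeed-search-mode-map exist as in current Elfeed and that a youtube-dl executable is on the PATH.

```elisp
(require 'elfeed)

(defun my-elfeed-search-download ()
  "Start a youtube-dl process for each selected entry (hypothetical)."
  (interactive)
  (dolist (entry (elfeed-search-selected))
    (let ((url (elfeed-entry-link entry)))
      (when url
        ;; Runs asynchronously; output goes to the *youtube-dl* buffer.
        (start-process "youtube-dl" "*youtube-dl*" "youtube-dl" url)))))

(define-key elfeed-search-mode-map (kbd "d") #'my-elfeed-search-download)
```

Because the download runs as an asynchronous process, Emacs stays responsive while it works.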

Web interface

Elfeed also includes a web interface! If you’ve loaded/installed elfeed-web, start it with elfeed-web-start and visit this URL in your browser (check your httpd-port).

http://localhost:8080/elfeed/
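
As a configuration sketch, assuming elfeed-web is installed and that it loads simple-httpd (which defines httpd-port), the setup can go in an initialization file:

```elisp
(require 'elfeed-web)

(setq httpd-port 8080) ; must match the port in the URL above
(elfeed-web-start)
```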

Elfeed exposes a RESTful JSON API, consumable by any application. The web interface builds on this using AngularJS, behaving as a single-page application. It includes a filter search box that filters out entries as you type. I think it’s pretty slick, though still a bit rough.

It still needs some work to truly be useful. I’m intending for this to become the “mobile” interface to Elfeed, for remote access on a phone or tablet. Patches welcome.

Try it out

After Google Reader closed I tried The Old Reader for a while. When that collapsed under its own popularity I decided to go with a local client reader. Canto was crushed under the weight of all my feeds, so I ended up using Liferea for a while. Frustrated at Liferea’s lack of extensibility and text-file configuration, I ended up writing Elfeed.

Elfeed is now serving 100% of my personal web feed reader needs. I think it’s already far better than any reader I’ve used before. Another case of “I should have done this years ago,” though I think I lacked the expertise to pull it off well until fairly recently.

At the moment I believe Elfeed is already the most extensible and powerful web feed reader in the world.