nullprogram.com/blog/2016/12/27/
Last night I pushed another performance enhancement for Elfeed, this
time reducing the time spent parsing feeds. It’s accomplished by
compiling, during macro expansion, a jQuery-like domain-specific
language within Elfeed.
Heuristic parsing
Given the nature of the domain — an under-specified standard
and a lack of robust adherence — feed parsing is much more heuristic
than strict. Sure, everyone’s feed XML is strictly conforming since
virtually no feed reader tolerates invalid XML (thank you, XML
libraries), but, for the schema, the situation resembles the de
facto looseness of HTML. Sometimes important or required information
is missing, or is only available in a different namespace.
Sometimes, especially in the case of timestamps, it’s in the wrong
format, or encoded incorrectly, or ambiguous. It’s real world data.
To get a particular piece of information, Elfeed looks in a number of
different places within the feed, starting with the preferred source
and stopping when the information is found. For example, to find the
date of an Atom entry, Elfeed first searches for elements in this
order:
<published>
<updated>
<date>
<modified>
<issued>
Failing to find any of these elements, or failing to find a parsable
date among them, it settles on the current time. Only the updated
element is required, but published usually has the desired
information, so it goes first. The last three are only valid in
another namespace, but are useful fallbacks.
Before Elfeed even starts this search, the XML text is parsed into an
s-expression using xml-parse-region — a pure Elisp XML parser included
in Emacs. The search is made over the resulting s-expression.
For example, here’s a sample from the Atom specification.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link rel="alternate" href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
</feed>
Which is parsed into this s-expression.
((feed ((xmlns . "http://www.w3.org/2005/Atom"))
       (title () "Example Feed")
       (link ((href . "http://example.org/")))
       (updated () "2003-12-13T18:30:02Z")
       (author () (name () "John Doe"))
       (id () "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6")
       (entry ()
              (title () "Atom-Powered Robots Run Amok")
              (link ((rel . "alternate")
                     (href . "http://example.org/2003/12/13/atom03")))
              (id () "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a")
              (updated () "2003-12-13T18:30:02Z")
              (summary () "Some text."))))
Each XML element is converted to a list. The first item is a symbol
that is the element’s name. The second item is an alist of attributes
— cons pairs of symbols and strings. And the rest are its children,
both string nodes and other elements. I’ve trimmed the extraneous
string nodes from the sample s-expression.
A subtle detail is that xml-parse-region doesn’t just return the root
element. It returns a list of elements, which always happens to be a
one-element list containing the root element. I don’t know why this
is, but I’ve built everything to assume this structure as input.
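Here’s a quick sketch of both details at once, assuming the sample XML
above is sitting in the current buffer: parse it, unwrap the
one-element list, and pick the root element apart.

(require 'xml)
(let* ((parsed (xml-parse-region (point-min) (point-max)))
       (feed (car parsed)))  ; unwrap the one-element list
  (list (car feed)           ; name: feed
        (cadr feed)          ; attributes: ((xmlns . "http://www.w3.org/2005/Atom"))
        (cddr feed)))        ; children: title, link, updated, ...
                             ; (plus the whitespace string nodes)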
Elfeed strips all namespaces from both elements and attributes to make
parsing simpler. As I said, it’s heuristic rather than strict, so
namespaces are treated as noise.
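The stripping itself is easy to picture. Here’s a hypothetical helper,
not Elfeed’s actual function, that drops the prefix from a tag symbol:

(defun my-strip-namespace (tag)
  "Drop any namespace prefix from TAG, e.g. dc:date becomes date."
  (intern (replace-regexp-in-string ".*:" "" (symbol-name tag))))

;; (my-strip-namespace 'dc:date) => date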
A domain-specific language
Coding up Elfeed’s s-expression searches in straight Emacs Lisp would
be tedious, error-prone, and difficult to understand. It’s a lot of
loops, assoc, etc. So instead I invented a jQuery-like, CSS
selector-like, domain-specific language (DSL) to express these
searches concisely and clearly.
For example, all of the entry links are “selected” using this
expression:
(feed entry link [rel "alternate"] :href)
Reading right-to-left, this matches every href attribute under every
link element with the rel="alternate" attribute, under every entry
element, under the feed root element. Symbols match element names,
two-element vectors match elements with a particular attribute pair,
and keywords (which must come last) narrow the selection to a specific
attribute value.
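For illustration, here are a few more selectors in the same DSL
against the sample feed above. These are hypothetical queries, not
necessarily ones Elfeed makes:

(feed title)                         ; the <title> element under <feed>
(feed entry link)                    ; every <link> under every <entry>
(feed entry link [rel "alternate"])  ; only links with rel="alternate"
(feed entry link :href)              ; the href of every entry link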
Imagine hand-writing the code to navigate all these conditions for
each piece of information that Elfeed requires. The RSS parser makes
up to 16 such queries, and the Atom parser makes as many as 24. That
would add up to a lot of tedious code.
The package (included with Elfeed) that executes this query is called
“xml-query.” It comes in two flavors: xml-query and xml-query-all. The
former returns just the first match, and the latter returns all
matches. The naming parallels the querySelector() and
querySelectorAll() DOM methods in JavaScript.
(let ((xml (elfeed-xml-parse-region)))
  (xml-query-all '(feed entry link [rel "alternate"] :href) xml))
;; => ("http://example.org/2003/12/13/atom03")
That date search I mentioned before looks roughly like this. The *
matches text nodes within the selected element. It must come last,
just like the keyword matcher.
(or (xml-query '(feed entry published *) xml)
    (xml-query '(feed entry updated *) xml)
    (xml-query '(feed entry date *) xml)
    (xml-query '(feed entry modified *) xml)
    (xml-query '(feed entry issued *) xml)
    (current-time))
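The “parsable” caveat matters, too. A minimal sketch of that final
fallback, assuming the matched text ended up in date-string (Elfeed’s
real date handling is more involved):

(require 'time-date)
(condition-case nil
    (date-to-time date-string)  ; signals an error on unparsable input
  (error (current-time)))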
Over the past three years, Elfeed has gained more and more of these
selectors as it collects more and more information from feeds. Most
recently, Elfeed collects author and category information provided by
feeds. Each new query slows feed parsing a little bit, and it’s a
perfect example of a program slowing down as it gains more features
and capabilities.
But I don’t want Elfeed to slow down. I want it to get faster!
Optimizing the domain-specific language
Just like the primary jQuery function ($), both xml-query and
xml-query-all are functions. The xml-query engine processes the
selector from scratch on each invocation. It examines the first
element, dispatches on its type/value to apply it to the input, and
then recurses on the rest of the selector with the narrowed input,
stopping when it hits the end of the list. That’s the way it’s worked
from the start.
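To make that dispatch concrete, here’s a toy interpreter in the same
spirit. This is a sketch of the idea, not Elfeed’s actual
implementation, covering the four matcher types described above:

(defun my-query-all (selector nodes)
  "Toy interpreter sketch: everything in NODES matching SELECTOR."
  (let ((m (car selector))
        (rest (cdr selector)))
    (cond
     ;; Keyword matcher: collect the named attribute of each element.
     ((keywordp m)
      (let ((attr (intern (substring (symbol-name m) 1)))
            (result ()))
        (dolist (node nodes (nreverse result))
          (let ((value (cdr (assq attr (cadr node)))))
            (when value (push value result))))))
     ;; * matcher: collect the text nodes of each element.
     ((eq m '*)
      (let ((result ()))
        (dolist (node nodes (nreverse result))
          (dolist (child (cddr node))
            (when (stringp child) (push child result))))))
     ;; Vector matcher: keep elements bearing the attribute pair.
     ((vectorp m)
      (let ((keep ()))
        (dolist (node nodes)
          (when (equal (cdr (assq (aref m 0) (cadr node))) (aref m 1))
            (push node keep)))
        (if rest (my-query-all rest (nreverse keep)) (nreverse keep))))
     ;; Symbol matcher: narrow to elements with this name, then
     ;; recurse, descending into children unless the next matcher
     ;; inspects the matched elements themselves.
     ((symbolp m)
      (let ((matches ()))
        (dolist (node nodes)
          (when (and (consp node) (eq (car node) m))
            (push node matches)))
        (setq matches (nreverse matches))
        (cond
         ((null rest) matches)
         ((or (keywordp (car rest)) (vectorp (car rest))
              (eq (car rest) '*))
          (my-query-all rest matches))
         (t
          (my-query-all rest (apply #'append (mapcar #'cddr matches))))))))))

;; (my-query-all '(feed entry link [rel "alternate"] :href) xml)
;; => ("http://example.org/2003/12/13/atom03")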
However, every selector argument in Elfeed is a static, quoted list.
Unlike user-supplied filters, I know exactly what I want to
execute ahead of time. It would be much better if the engine didn’t
have to waste time reparsing the DSL for each query.
This is the classic split between interpreters and compilers. An
interpreter reads input and immediately executes it, doing what the
input tells it to do. A compiler reads input and, rather than execute
it, produces output, usually in a simpler language, that, when
evaluated, has the same effect as executing the input.
Rather than interpret the selector, it would be better to compile it
into Elisp code, compile that into byte-code, and then have the
Emacs byte-code virtual machine (VM) execute the query each time it’s
needed. The extra work of parsing the DSL is performed ahead of time,
the dispatch is entirely static, and the selector ultimately executes
on a much faster engine (byte-code VM). This should be a lot faster!
So I wrote a function that accepts a selector expression and emits
Elisp source that implements that selector: a compiler for my DSL.
Having a readily-available syntax tree is one of the big advantages
of homoiconicity, and this sort of function makes perfect sense in a
lisp. For the external interface, this compiler function is called by
a new pair of macros, xml-query* and xml-query-all*. These macros
consume a static selector and expand into the compiled Elisp form of
the selector.
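To give a feel for the technique, here’s a heavily simplified compiler
sketch of my own, handling only element-name (symbol) matchers, plus a
macro wrapper. The real compiler in xml-query handles all the matcher
types:

(defun my-compile-selector (selector input)
  "Emit Elisp that walks INPUT per SELECTOR (symbol matchers only)."
  `(dolist (v ,input)
     (when (and (consp v) (eq (car v) ',(car selector)))
       ,(if (cdr selector)
            (my-compile-selector (cdr selector) '(cddr v))
          '(throw 'done v)))))

(defmacro my-xml-query* (selector xml)
  "Expand SELECTOR into inline code returning the first match in XML."
  `(catch 'done ,(my-compile-selector selector xml)))

;; (my-xml-query* (feed entry title) xml) expands into nested dolist
;; forms and returns the first <title> element of the first <entry>.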
To demonstrate, remember that link query from before? Here’s the macro
version of that selection, but only returning the first match. Notice
the selector is no longer quoted. This is because it’s consumed by the
macro, not evaluated.
(xml-query* (feed entry link [rel "alternate"] :href) xml)
This will expand into the following code.
(catch 'done
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      (dolist (v (cddr v))
        (when (and (consp v) (eq (car v) 'entry))
          (dolist (v (cddr v))
            (when (and (consp v) (eq (car v) 'link))
              (let ((value (cdr (assq 'rel (cadr v)))))
                (when (equal value "alternate")
                  (let ((v (cdr (assq 'href (cadr v)))))
                    (when v
                      (throw 'done v))))))))))))
As soon as it finds a match, it’s thrown to the top level and
returned. Without the DSL, the expansion is essentially what would
have to be written by hand. This is exactly the sort of leverage
you should be getting from a compiler. It compiles to around 130
byte-code instructions.
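You can inspect those instructions yourself, since byte-compile and
disassemble are built into Emacs (this assumes Elfeed’s xml-query
package is loaded):

(disassemble
 (byte-compile
  (lambda (xml)
    (xml-query* (feed entry link [rel "alternate"] :href) xml))))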
The xml-query-all* form is nearly the same, but instead of a throw, it
pushes each result onto the return list. Only the prologue (the
outermost part) and the epilogue (the innermost part) are different.
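I won’t reproduce the exact expansion, but a hedged sketch of the
shape looks like this. The real expansion may differ in details, such
as avoiding the final nreverse call per the goals below:

(let ((result ()))
  (dolist (v xml)
    (when (and (consp v) (eq (car v) 'feed))
      (dolist (v (cddr v))
        (when (and (consp v) (eq (car v) 'entry))
          (dolist (v (cddr v))
            (when (and (consp v) (eq (car v) 'link))
              (let ((value (cdr (assq 'rel (cadr v)))))
                (when (equal value "alternate")
                  (let ((v (cdr (assq 'href (cadr v)))))
                    (when v
                      (push v result))))))))))
  (nreverse result))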
Parsing feeds is a hot spot for Elfeed, so I wanted the compiler’s
output to be as efficient as possible. I had three goals for this:
- No extraneous code. It’s easy for the compiler to emit unnecessary
  code. The byte-code compiler might be able to eliminate some of it,
  but I don’t want to rely on that. Except for the identifiers, it
  should basically look like a human wrote it.
- Avoid function calls. I don’t want to pay function call overhead,
  and, with some care, it’s easy to avoid. In the xml-query* expansion,
  the only function call is throw, which is unavoidable. The
  xml-query-all* version makes no function calls whatsoever. Notice
  that I used assq rather than assoc. First, it only needs to match
  symbols, so it should be faster. Second, assq has its own byte-code
  instruction (158) and assoc does not.
- No unnecessary memory allocations. The xml-query* expansion makes no
  allocations. The xml-query-all* version only conses once per output,
  which is the minimum possible.
The end result is at least as efficient as hand-written code, but
without the chance of human error (typos, fat-fingering), and sourced
from an easy-to-read DSL.
In my tests, the xml-query macros are a full order of magnitude
faster than the functions. Yes, ten times faster! It’s an even
bigger gain than I expected.
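Your absolute numbers will vary, but the comparison is easy to
reproduce with Emacs’ built-in benchmark library, assuming xml is
bound to a parsed feed as before:

(require 'benchmark)
(benchmark-run 100000
  (xml-query-all '(feed entry link [rel "alternate"] :href) xml))
(benchmark-run 100000
  (xml-query-all* (feed entry link [rel "alternate"] :href) xml))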
In the full picture, xml-query is only one part of parsing a feed.
Measuring the time starting from raw XML text (as delivered by
cURL) to a list of database entry objects, I’m seeing an
overall 25% speedup with the macros. The remaining time is dominated
by xml-parse-region, which is mostly out of my control.
With xml-query so computationally cheap, I don’t need to worry about
using it more often. Compared to parsing XML text, it’s virtually
free.
When it came time to validate my DSL compiler, I was really happy
that Elfeed had a test suite. I essentially rewrote a core component
from scratch, and passing all of the unit tests was a strong sign that
it was correct. Many times that test suite has provided confidence in
changes made both by me and by others.
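One particularly useful kind of test pits the two engines against each
other. This is a hypothetical sketch in ERT, not one of Elfeed’s
actual tests, assuming xml-query is loaded:

(require 'ert)
(ert-deftest my-xml-query-equivalence ()
  "The compiled macro should agree with the interpreted function."
  (let ((xml '((feed () (entry () (title () "Hello"))))))
    (should (equal (xml-query-all '(feed entry title *) xml)
                   (xml-query-all* (feed entry title *) xml)))))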
I’ll end by describing another possible application: apply this
technique to regular expressions, such that static strings containing
regular expressions are compiled into Elisp/byte-code via macro
expansion. I wonder if, situationally, this would be faster than
Emacs’ own regular expression engine.