Faster Elfeed Search Through JIT Byte-code Compilation
Today I pushed an update for Elfeed that doubles the speed of the search filter in the worse case. This is the user-entered expression that dynamically narrows the entry listing to a subset that meets certain criteria: published after a particular date, with/without particular tags, and matching/non-matching zero or more regular expressions. The filter is live, applied to the database as the expression is edited, so it’s important for usability that this search completes under a threshold that the user might notice.
The typical workaround for these kinds of interfaces is to make filtering/searching asynchronous. It’s possible to do this well, but it’s usually a terrible, broken design. If the user acts upon the asynchronous results — say, by typing the query and hitting enter to choose the current or expected top result — then the final behavior is non-deterministic, a race between the user’s typing speed and the asynchronous search. Elfeed will keep its synchronous live search.
For anyone not familiar with Elfeed, here’s a filter that finds all
entries from within the past year tagged “youtube” (
mention Linux or Linus (
linu[sx]), but aren’t tagged “bsd” (
limited to the most recent 15 entries (
@1-year-old +youtube linu[xs] -bsd #15
The database is primarily indexed over publication date, so filters on publication dates are the most efficient filters. Entries are visited in order starting with the most recently published, and the search can bail out early once it crosses the filter threshold. Time-oriented filters have been encouraged as the solution to keep the live search feeling lively.
The first step in filtering is parsing the filter text entered by the
user. This string is broken into its components using the
elfeed-search-parse-filter function. Date filter components are
converted into a unix epoch interval, tags are interned into symbols,
regular expressions are gathered up as strings, and the entry limit is
parsed into a plain integer. Absence of a filter component is
indicated by nil.
(elfeed-search-parse-filter "@1-year-old +youtube linu[xs] -bsd #15") ;; => (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)
Previously, the next step was to apply the
function with this structured filter representation to the database.
Except for special early-bailout situations, it works left-to-right
across the filter, checking each condition against each entry. This is
analogous to an interpreter, with the filter being a program.
Thinking about it that way, what if the filter was instead compiled into an Emacs byte-code function and executed directly by the Emacs virtual machine? That’s what this latest update does.
With six different filter components, the actual filtering routine is a bit too complicated for an article, so I’ll set up a simpler, but roughly equivalent, scenario. With a reasonable cut-off date, the filter was already sufficiently fast, so for benchmarking I’ll focus on the worst case: no early bailout opportunities. An entry will be just a list of tags (symbols), and the filter will have to test every entry.
My real-world Elfeed database currently has 46,772 entries with 36 distinct tags. For my benchmark I’ll round this up to a nice 100,000 entries, and use 26 distinct tags (A–Z), which has the nice alphabet property and more closely reflects the number of tags I still care about.
make-random-entry to generate a random list of 1–5
tags (i.e. an entry). The
state parameter is the random state,
allowing for deterministic benchmarks on a randomly-generated
(cl-defun make-random-entry (&key state (min 1) (max 5)) (cl-loop repeat (+ min (cl-random (1+ (- max min)) state)) for letter = (+ ?A (cl-random 26 state)) collect (intern (format "%c" letter))))
The database is just a big list of entries. In Elfeed this is actually an AVL tree. Without dates, the order doesn’t matter.
(cl-defun make-random-database (&key state (count 100000)) (cl-loop repeat count collect (make-random-entry :state state)))
Here’s my old time macro. An important change I’ve made since
years ago is to call
garbage-collect before starting the clock,
eliminating bad samples from unlucky garbage collection events.
Depending on what you want to measure, it may even be worth disabling
garbage collection during the measurement by setting
gc-cons-threshold to a high value.
(defmacro measure-time (&rest body) (declare (indent defun)) (garbage-collect) (let ((start (make-symbol "start"))) `(let ((,start (float-time))) ,@body (- (float-time) ,start))))
Finally, the benchmark harness. It uses a hard-coded seed to generate
the same pseudo-random database. The test is run against the a filter
f, 100 times in search for the same 6 tags, and the timing
results are averaged.
(cl-defun benchmark (f &optional (n 100) (tags '(A B C D E F))) (let* ((state (copy-sequence [cl-random-state-tag -1 30 267466518])) (db (make-random-database :state state))) (cl-loop repeat n sum (measure-time (funcall f db tags)) into total finally return (/ total (float n)))))
The baseline will be
memq (test for membership using identity,
eq). There are two lists of tags to compare: the list that is the
entry, and the list from the filter. This requires a nested loop for
each entry, one explicit (
cl-loop) and one implicit (
with early bailout.
(defun memq-count (db tags) (cl-loop for entry in db count (cl-loop for tag in tags when (memq tag entry) return t)))
Byte-code compiling everything and running the benchmark on my laptop I get:
(benchmark #'memq-count) ;; => 0.041 seconds
That’s actually not too bad. One of the advantages of this definition
is that there are no function calls. The
memq built-in function has
its own opcode (62), and the rest of the definition is special forms
and macros expanding to special forms (
cl-loop). It’s exactly the
thing I need to exploit to make filters faster.
As a sanity check, what would happen if I used
member instead of
memq? In theory it should be slower because it uses
tests instead of
(defun member-count (db tags) (cl-loop for entry in db count (cl-loop for tag in tags when (member tag entry) return t)))
It’s only slightly slower because
member, like many other
built-ins, also has an opcode (157). It’s just a tiny bit
(benchmark #'member-count) ;; => 0.047 seconds
To test function call overhead while still using the built-in (e.g.
written in C)
memq, I’ll alias it so that the byte-code compiler is
forced to emit a function call.
(defalias 'memq-alias 'memq) (defun memq-alias-count (db tags) (cl-loop for entry in db count (cl-loop for tag in tags when (memq-alias tag entry) return t)))
To verify that this is doing what I expect, I
M-x disassemble the
function and inspect the byte-code disassembly. Here’s a simple
(disassemble (byte-compile (lambda (list) (memq :foo list))))
When compiled under lexical scope (
lexical-binding is true), here’s
the disassembly. To understand what this means, see Emacs Byte-code
0 constant :foo 1 stack-ref 1 2 memq 3 return
memq instruction. Try using
(disassemble (byte-compile (lambda (list) (memq-alias :foo list))))
Resulting in a function call:
0 constant memq-alias 1 constant :foo 2 stack-ref 2 3 call 2 4 return
And the benchmark:
(benchmark #'memq-alias-count) ;; => 0.052 seconds
So the function call adds about 27% overhead. This means it would be a good idea to avoid calling functions in the filter if I can help it. I should rely on these special opcodes.
memq was written in Emacs Lisp rather than C. How much would
that hurt performance? My version of
my-memq below isn’t quite the
same since it returns t rather than the sublist, but it’s good enough
for this purpose. (I’m using
cl-loop because writing early bailout
in plain Elisp without recursion is, in my opinion, ugly.)
(defun my-memq (needle haystack) (cl-loop for element in haystack when (eq needle element) return t)) (defun my-memq-count (db tags) (cl-loop for entry in db count (cl-loop for tag in tags when (my-memq tag entry) return t)))
And the benchmark:
(benchmark #'my-memq-count) ;; => 0.137 seconds
Oof! It’s more than 3 times slower than the opcode. This means I should use built-ins as much as possible in the filter.
Dynamic vs. lexical scope
There’s one last thing to watch out for. Everything so far has been compiled with lexical scope. You should really turn this on by default for all new code that you write. It has three important advantages:
- It allows the compiler to catch more mistakes.
- It eliminates a class of bugs related to dynamic scope: Local variables are exposed to manipulation by callees.
- Lexical scope has better performance.
Here are all the benchmarks with the default dynamic scope:
(benchmark #'memq-count) ;; => 0.065 seconds (benchmark #'member-count) ;; => 0.070 seconds (benchmark #'memq-alias-count) ;; => 0.074 seconds (benchmark #'my-memq-count) ;; => 0.256 seconds
It halves the performance in this benchmark, and for no benefit. Under
dynamic scope, local variables use the
varref opcode — a global
variable lookup — instead of the
stack-ref opcode — a simple array
(defun norm (a b) (* (- a b) (- a b)))
Under dynamic scope, this compiles to:
0 varref a 1 varref b 2 diff 3 varref a 4 varref b 5 diff 6 mult 7 return
And under lexical scope (notice the variable names disappear):
0 stack-ref 1 1 stack-ref 1 2 diff 3 stack-ref 2 4 stack-ref 2 5 diff 6 mult 7 return
So far I’ve been moving in the wrong direction, making things slower
rather than faster. How can I make it faster than the straight
version? By compiling the filter into byte-code.
I won’t write the byte-code directly, but instead generate Elisp code and use the byte-code compiler on it. This is safer, will work correctly in future versions of Emacs, and leverages the optimizations performed by the byte-compiler. This sort of thing recently got a bad rap on Emacs Horrors, but I was happy to see that this technique is already established.
(defun jit-count (db tags) (let* ((memq-list (cl-loop for tag in tags collect `(memq ',tag entry))) (function `(lambda (db) (cl-loop for entry in db count (or ,@memq-list)))) (compiled (byte-compile function))) (funcall compiled db)))
It dynamically builds the code as an s-expression, runs that through
the byte-code compiler, executes it, and throws it away. It’s
“just-in-time,” though compiling to byte-code and not native
code. For the benchmark tags of
(A B C D E F), this builds
(lambda (db) (cl-loop for entry in db count (or (memq 'A entry) (memq 'B entry) (memq 'C entry) (memq 'D entry) (memq 'E entry) (memq 'F entry))))
Due to its short-circuiting behavior,
or is a special form, so this
function is just special forms and
memq in its opcode form. It’s as
fast as Elisp can get.
How much faster is this?
(benchmark #'jit-count) ;; => 0.017s
It’s more than twice as fast! The big gain here is through loop
unrolling. The outer loop has been unrolled into the
That section of byte-code looks like this:
0 constant A 1 stack-ref 1 2 memq 3 goto-if-not-nil-else-pop 1 6 constant B 7 stack-ref 1 8 memq 9 goto-if-not-nil-else-pop 1 12 constant C 13 stack-ref 1 14 memq 15 goto-if-not-nil-else-pop 1 18 constant D 19 stack-ref 1 20 memq 21 goto-if-not-nil-else-pop 1 24 constant E 25 stack-ref 1 26 memq 27 goto-if-not-nil-else-pop 1 30 constant F 31 stack-ref 1 32 memq 33:1 return
In Elfeed, not only does it unroll these loops, it completely
eliminates the overhead for unused filter components. Comparing to
this benchmark, I’m seeing roughly matching gains in Elfeed’s worst
case. In Elfeed, I also bind
lexical-binding around the
byte-compile call to force lexical scope, since otherwise it just
uses the buffer-local value (usually nil).
Filter compilation can be toggled on and off by setting
elfeed-search-compile-filter. If you’re up to date, try out live
filters with it both enabled and disabled. See if you can notice the
Here are the results in a table, all run with Emacs 24.4 on x86-64.
(ms) memq member memq-alias my-memq jit lexical 41 47 52 137 17 dynamic 65 70 74 256 21
And the same benchmarks on Aarch64 (Emacs 24.5, ARM Cortex-A53), where I also occasionally use Elfeed, and where I have been very interested in improving performance.
(ms) memq member memq-alias my-memq jit lexical 170 235 242 614 79 dynamic 274 340 345 1130 92
And here’s how you can run the benchmarks for yourself, perhaps with different parameters:
The header explains how to run the benchmark in batch mode:
$ emacs -Q -batch -f batch-byte-compile jit-bench.el $ emacs -Q -batch -l jit-bench.elc -f benchmark-batch