nullprogram.com/blog/2016/12/11/
Today I pushed an update for Elfeed that doubles the speed
of the search filter in the worst case. This is the user-entered
expression that dynamically narrows the entry listing to a subset that
meets certain criteria: published after a particular date,
with/without particular tags, and matching/non-matching zero or more
regular expressions. The filter is live, applied to the database as
the expression is edited, so it’s important for usability that this
search completes under a threshold that the user might notice.
The typical workaround for these kinds of interfaces is to make
filtering/searching asynchronous. It’s possible to do this well, but
it’s usually a terrible, broken design. If the user acts upon the
asynchronous results — say, by typing the query and hitting enter to
choose the current or expected top result — then the final behavior is
non-deterministic, a race between the user’s typing speed and the
asynchronous search. Elfeed will keep its synchronous live search.
For anyone not familiar with Elfeed, here’s a filter that finds all
entries from within the past year tagged “youtube” (+youtube) that
mention Linux or Linus (linu[xs]), but aren’t tagged “bsd” (-bsd),
limited to the most recent 15 entries (#15):

@1-year-old +youtube linu[xs] -bsd #15
The database is primarily indexed over publication date, so filters on
publication dates are the most efficient filters. Entries are visited
in order starting with the most recently published, and the search can
bail out early once it crosses the filter threshold. Time-oriented
filters have been encouraged as the solution to keep the live search
feeling lively.
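As a sketch of that early bailout (hypothetical code, not Elfeed’s
actual traversal; entry-date is an invented accessor), with entries
visited newest-first the loop can stop at the first entry older than
the cutoff:

;; Hypothetical: entries arrive newest-first, and `entry-date' is a
;; stand-in accessor. The `while' clause stops the scan at the first
;; entry published before the cutoff.
(defun visit-entries-since (entries cutoff)
  (cl-loop for entry in entries
           while (>= (entry-date entry) cutoff)
           collect entry))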
Filtering Overview
The first step in filtering is parsing the filter text entered by the
user. This string is broken into its components using the
elfeed-search-parse-filter function. Date filter components are
converted into a unix epoch interval, tags are interned into symbols,
regular expressions are gathered up as strings, and the entry limit is
parsed into a plain integer. Absence of a filter component is
indicated by nil.
(elfeed-search-parse-filter "@1-year-old +youtube linu[xs] -bsd #15")
;; => (31557600.0 (youtube) (bsd) ("linu[xs]") nil 15)
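Read positionally, that result breaks down as follows (my annotation,
matching the description above; the exact slot order is Elfeed’s
internal representation):

;; (31557600.0    ; "after" cutoff: one year, in seconds
;;  (youtube)     ; tags the entry must have
;;  (bsd)         ; tags the entry must not have
;;  ("linu[xs]")  ; regexps that must match
;;  nil           ; regexps that must not match (absent)
;;  15)           ; entry limit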
Previously, the next step was to apply the elfeed-search-filter
function with this structured filter representation to the database.
Except for special early-bailout situations, it works left-to-right
across the filter, checking each condition against each entry. This is
analogous to an interpreter, with the filter being a program.
Thinking about it that way, what if the filter was instead compiled
into an Emacs byte-code function and executed directly by the Emacs
virtual machine? That’s what this latest update does.
Benchmarks
With six different filter components, the actual filtering routine is
a bit too complicated for an article, so I’ll set up a simpler, but
roughly equivalent, scenario. With a reasonable cut-off date, the
filter was already sufficiently fast, so for benchmarking I’ll focus
on the worst case: no early bailout opportunities. An entry will be
just a list of tags (symbols), and the filter will have to test every
entry.
My real-world Elfeed database currently has 46,772 entries with
36 distinct tags. For my benchmark I’ll round this up to a nice
100,000 entries and use 26 distinct tags (A–Z), which map neatly
onto the alphabet and more closely reflect the number of tags I
still care about.
First, here’s make-random-entry to generate a random list of 1–5
tags (i.e. an entry). The state parameter is the random state,
allowing for deterministic benchmarks on a randomly-generated
database.
(cl-defun make-random-entry (&key state (min 1) (max 5))
  (cl-loop repeat (+ min (cl-random (1+ (- max min)) state))
           for letter = (+ ?A (cl-random 26 state))
           collect (intern (format "%c" letter))))
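A quick usage sketch (the exact tags depend on the random state):

;; Generate three random entries; e.g. ((Q C) (X) (T K B P)).
(let ((state (cl-make-random-state)))
  (cl-loop repeat 3 collect (make-random-entry :state state)))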
The database is just a big list of entries. In Elfeed this is actually
an AVL tree. Without dates, the order doesn’t matter.
(cl-defun make-random-database (&key state (count 100000))
  (cl-loop repeat count collect (make-random-entry :state state)))
Here’s my old time macro. An important change since the original
version is to call garbage-collect before starting the clock,
eliminating bad samples from unlucky garbage collection events.
Depending on what you want to measure, it may even be worth disabling
garbage collection during the measurement by setting
gc-cons-threshold to a high value.
(defmacro measure-time (&rest body)
  (declare (indent defun))
  (let ((start (make-symbol "start")))
    `(progn
       (garbage-collect)  ; flush garbage before the clock starts
       (let ((,start (float-time)))
         ,@body
         (- (float-time) ,start)))))
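Usage looks like this, returning elapsed wall-clock seconds:

;; Time a million-iteration busy loop; returns e.g. 0.06.
(measure-time
  (cl-loop repeat 1000000 sum 1))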
Finally, the benchmark harness. It uses a hard-coded seed to generate
the same pseudo-random database each run. The test runs a filter
function, f, against the database 100 times, searching for the same
6 tags, and the timing results are averaged.
(cl-defun benchmark (f &optional (n 100) (tags '(A B C D E F)))
  (let* ((state (copy-sequence [cl-random-state-tag -1 30 267466518]))
         (db (make-random-database :state state)))
    (cl-loop repeat n
             sum (measure-time
                   (funcall f db tags))
             into total
             finally return (/ total (float n)))))
The baseline will be memq (test for membership using identity,
eq). There are two lists of tags to compare: the list that is the
entry, and the list from the filter. This requires a nested loop for
each entry, one explicit (cl-loop) and one implicit (memq), both
with early bailout.
(defun memq-count (db tags)
  (cl-loop for entry in db
           count (cl-loop for tag in tags
                          when (memq tag entry)
                          return t)))
Byte-code compiling everything and running the benchmark on my
laptop, I get:
(benchmark #'memq-count)
;; => 0.041 seconds
That’s actually not too bad. One of the advantages of this definition
is that there are no function calls. The memq built-in function has
its own opcode (62), and the rest of the definition is special forms
and macros expanding to special forms (cl-loop). It’s exactly the
thing I need to exploit to make filters faster.
As a sanity check, what would happen if I used member instead of
memq? In theory it should be slower because it uses equal for tests
instead of eq.
(defun member-count (db tags)
  (cl-loop for entry in db
           count (cl-loop for tag in tags
                          when (member tag entry)
                          return t)))
It’s only slightly slower because member, like many other built-ins,
also has an opcode (157). It’s just a tiny bit more overhead.
(benchmark #'member-count)
;; => 0.047 seconds
To test function call overhead while still using the built-in (i.e.
written in C) memq, I’ll alias it so that the byte-code compiler is
forced to emit a function call.
(defalias 'memq-alias 'memq)
(defun memq-alias-count (db tags)
  (cl-loop for entry in db
           count (cl-loop for tag in tags
                          when (memq-alias tag entry)
                          return t)))
To verify that this is doing what I expect, I M-x disassemble the
function and inspect the byte-code disassembly. Here’s a simple
example.
(disassemble
 (byte-compile (lambda (list) (memq :foo list))))
When compiled under lexical scope (lexical-binding is true), here’s
the disassembly. To understand what this means, see Emacs Byte-code
Internals.
0 constant :foo
1 stack-ref 1
2 memq
3 return
Notice the memq instruction. Try using memq-alias instead:
(disassemble
 (byte-compile (lambda (list) (memq-alias :foo list))))
Resulting in a function call:
0 constant memq-alias
1 constant :foo
2 stack-ref 2
3 call 2
4 return
And the benchmark:
(benchmark #'memq-alias-count)
;; => 0.052 seconds
So the function call adds about 27% overhead. It would be a good idea
to avoid calling functions in the filter if I can help it, relying
instead on these special opcodes.
Suppose memq was written in Emacs Lisp rather than C. How much would
that hurt performance? My version, my-memq below, isn’t quite the
same since it returns t rather than the sublist, but it’s good enough
for this purpose. (I’m using cl-loop because writing early bailout in
plain Elisp without recursion is, in my opinion, ugly; a while-based
sketch follows the benchmark below.)
(defun my-memq (needle haystack)
  (cl-loop for element in haystack
           when (eq needle element)
           return t))

(defun my-memq-count (db tags)
  (cl-loop for entry in db
           count (cl-loop for tag in tags
                          when (my-memq tag entry)
                          return t)))
And the benchmark:
(benchmark #'my-memq-count)
;; => 0.137 seconds
Oof! It’s more than 3 times slower than the opcode. This means I
should use built-ins as much as possible in the filter.
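For reference, here’s the plain-Elisp early bailout alluded to above,
as a while loop (a sketch of the uglier alternative):

;; Equivalent of my-memq without cl-loop or recursion: loop with
;; `while', setting a flag to bail out on the first match.
(defun my-memq-while (needle haystack)
  (let (found)
    (while (and haystack (not found))
      (when (eq needle (car haystack))
        (setq found t))
      (setq haystack (cdr haystack)))
    found))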
Dynamic vs. lexical scope
There’s one last thing to watch out for. Everything so far has been
compiled with lexical scope. You should really turn this on by default
for all new code that you write. It has three important advantages:
- It allows the compiler to catch more mistakes.
- It eliminates a class of bugs related to dynamic scope: Local
variables are exposed to manipulation by callees.
- Lexical scope has better performance.
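Lexical scope is enabled per file with a cookie on the first line of
the file:

;;; -*- lexical-binding: t; -*-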
Here are all the benchmarks with the default dynamic scope:
(benchmark #'memq-count)
;; => 0.065 seconds
(benchmark #'member-count)
;; => 0.070 seconds
(benchmark #'memq-alias-count)
;; => 0.074 seconds
(benchmark #'my-memq-count)
;; => 0.256 seconds
In this benchmark, dynamic scope costs up to half the performance,
for no benefit. Under dynamic scope, local variables use the varref
opcode (a global variable lookup) instead of the stack-ref opcode (a
simple array index).
(defun norm (a b)
  (* (- a b) (- a b)))
Under dynamic scope, this compiles to:
0 varref a
1 varref b
2 diff
3 varref a
4 varref b
5 diff
6 mult
7 return
And under lexical scope (notice the variable names disappear):
0 stack-ref 1
1 stack-ref 1
2 diff
3 stack-ref 2
4 stack-ref 2
5 diff
6 mult
7 return
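Both listings can be reproduced by binding lexical-binding around the
compile (a sketch; disassemble displays its output in a buffer):

;; Dynamic scope: expect `varref' instructions.
(let ((lexical-binding nil))
  (disassemble (byte-compile '(lambda (a b) (* (- a b) (- a b))))))

;; Lexical scope: expect `stack-ref' instructions.
(let ((lexical-binding t))
  (disassemble (byte-compile '(lambda (a b) (* (- a b) (- a b))))))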
JIT-compiled filters
So far I’ve been moving in the wrong direction, making things slower
rather than faster. How can I make it faster than the straight memq
version? By compiling the filter into byte-code.
I won’t write the byte-code directly, but instead generate Elisp code
and use the byte-code compiler on it. This is safer, will work
correctly in future versions of Emacs, and leverages the optimizations
performed by the byte-compiler. This sort of thing recently got a bad
rap on Emacs Horrors, but I was happy to see that this
technique is already established.
(defun jit-count (db tags)
  (let* ((memq-list (cl-loop for tag in tags
                             collect `(memq ',tag entry)))
         (function `(lambda (db)
                      (cl-loop for entry in db
                               count (or ,@memq-list))))
         (compiled (byte-compile function)))
    (funcall compiled db)))
It dynamically builds the code as an s-expression, runs that through
the byte-code compiler, executes it, and throws it away. It’s
“just-in-time,” though compiling to byte-code rather than native
code. For the benchmark tags of (A B C D E F), this builds the
following:
(lambda (db)
  (cl-loop for entry in db
           count (or (memq 'A entry)
                     (memq 'B entry)
                     (memq 'C entry)
                     (memq 'D entry)
                     (memq 'E entry)
                     (memq 'F entry))))
Due to its short-circuiting behavior, or is a special form, so this
function is just special forms and memq in its opcode form. It’s as
fast as Elisp can get.
Having s-expressions is a real strength for lisp, since the
alternative (in, say, JavaScript) would be to assemble the function by
concatenating code strings. By contrast, this looks a lot like a
regular lisp macro. Invoking the byte-code compiler does add some
overhead compared to the interpreted filter, but it’s insignificant.
How much faster is this?
(benchmark #'jit-count)
;; => 0.017 seconds
It’s more than twice as fast! The big gain here is through loop
unrolling: the inner loop over the filter tags has been unrolled into
the or expression. That section of byte-code looks like this:
0 constant A
1 stack-ref 1
2 memq
3 goto-if-not-nil-else-pop 1
6 constant B
7 stack-ref 1
8 memq
9 goto-if-not-nil-else-pop 1
12 constant C
13 stack-ref 1
14 memq
15 goto-if-not-nil-else-pop 1
18 constant D
19 stack-ref 1
20 memq
21 goto-if-not-nil-else-pop 1
24 constant E
25 stack-ref 1
26 memq
27 goto-if-not-nil-else-pop 1
30 constant F
31 stack-ref 1
32 memq
33:1 return
In Elfeed, not only does it unroll these loops, it completely
eliminates the overhead for unused filter components. Compared to
this benchmark, I’m seeing roughly matching gains in Elfeed’s worst
case. I also bind lexical-binding around the byte-compile call to
force lexical scope, since otherwise it just uses the buffer-local
value (usually nil).
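That binding is just a let around the compile, something like this
minimal sketch (not Elfeed’s exact code; function is the generated
lambda, as in jit-count above):

;; Force lexical scope for the generated filter, regardless of the
;; current buffer's value of `lexical-binding'.
(let ((lexical-binding t))
  (byte-compile function))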
Filter compilation can be toggled on and off by setting
elfeed-search-compile-filter. If you’re up to date, try out live
filters with it both enabled and disabled. See if you can notice the
difference.
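For example, to fall back to the interpreted filter:

;; Disable JIT filter compilation; live search uses the interpreter.
(setq elfeed-search-compile-filter nil)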
Result summary
Here are the results in a table, all run with Emacs 24.4 on x86-64.

(ms)      memq  member  memq-alias  my-memq  jit
lexical     41      47          52      137   17
dynamic     65      70          74      256   21
And the same benchmarks on AArch64 (Emacs 24.5, ARM Cortex-A53),
where I also occasionally use Elfeed, and where I have been very
interested in improving performance.
(ms)      memq  member  memq-alias  my-memq  jit
lexical    170     235         242      614   79
dynamic    274     340         345     1130   92
And here’s how you can run the benchmarks for yourself, perhaps with
different parameters. The header of jit-bench.el explains how to run
the benchmark in batch mode:

$ emacs -Q -batch -f batch-byte-compile jit-bench.el
$ emacs -Q -batch -l jit-bench.elc -f benchmark-batch