nullprogram.com/blog/2017/01/30/
Not everything written in Emacs Lisp needs to be fast. Most of Emacs
itself — around 82% — is written in Emacs Lisp because those parts
are generally not performance-critical. Otherwise these functions
would be built-ins written in C. Extensions to Emacs don’t have a
choice and — outside of a few exceptions like dynamic modules
and inferior processes — must be written in Emacs Lisp, including
their performance-critical bits. Common performance hot spots are
automatic indentation, AST parsing, and interactive
completion.
Here are 5 guidelines, each very specific to Emacs Lisp, that will
result in faster code. The non-intrusive guidelines could be applied
at all times as a matter of style — choosing one equally expressive
and maintainable form over another just because it performs better.
There’s one caveat: These guidelines are focused on Emacs 25.1 and
“nearby” versions. Emacs is constantly evolving. Changes to the
virtual machine and byte-code compiler may transform
currently-slow expressions into fast code, obsoleting some of these
guidelines. In the future I’ll add notes to this article for anything
that changes.
(1) Use lexical scope
This guideline refers to the following being the first line of every
Emacs Lisp source file you write:
;;; -*- lexical-binding: t; -*-
This point is worth mentioning again and again. Not only will your
code be more correct, it will be measurably faster. Dynamic
scope is still opt-in through the explicit use of special variables,
so there’s absolutely no reason not to be using lexical scope. If
you’ve written clean, dynamic scope code, then switching to lexical
scope won’t have any effect on its behavior.
Along similar lines, special variables are a lot slower than local,
lexical variables. Only use them when necessary.
(2) Prefer built-in functions
Built-in functions are written in C and are, as expected,
significantly faster than the equivalent written in Emacs Lisp.
Complete as much work as possible inside built-in functions, even if
it might mean taking more conceptual steps overall.
For example, what’s the fastest way to accumulate a list of items? That is, new items go on the tail but, for algorithmic reasons, the list must be constructed from the head.

You might be tempted to keep track of the tail of the list, appending new elements directly to the tail with setcdr (via setf below).
(require 'cl-lib)  ; for `cl-psetf'

(defun fib-track-tail (n)
  (let* ((a 0)
         (b 1)
         (head (list 1))
         (tail head))
    (dotimes (_ n head)
      (cl-psetf a b
                b (+ a b))
      (setf (cdr tail) (list b)
            tail (cdr tail)))))
(fib-track-tail 8)
;; => (1 1 2 3 5 8 13 21 34)
Actually, it’s much faster to construct the list in reverse, then
destructively reverse it at the end.
(defun fib-nreverse (n)
  (let* ((a 0)
         (b 1)
         (list (list 1)))
    (dotimes (_ n (nreverse list))
      (cl-psetf a b
                b (+ a b))
      (push b list))))
It might not look it, but nreverse is very fast. Not only is it a built-in, it’s got its own opcode. Using push in a loop, then finishing with nreverse is the canonical and fastest way to accumulate a list of items.

In fib-track-tail, the added complexity of tracking the tail in Emacs Lisp is much slower than zipping over the entire list a second time in C.
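If you want to check this on your own machine, the built-in benchmark library gives a quick, informal measurement. This sketch assumes the two definitions above are loaded; the repetition count is arbitrary:

(require 'benchmark)

;; Each form returns (elapsed-seconds gc-count gc-elapsed):
(benchmark-run 10000 (fib-track-tail 100))
(benchmark-run 10000 (fib-nreverse 100))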
(3) Avoid unnecessary lambda functions
I’m talking about mapcar and friends.
;; Slower
(defun expt-list (list e)
  (mapcar (lambda (x) (expt x e)) list))
Listen, I know you love dash.el and higher-order functions, but this habit ain’t cheap. The byte-code compiler does not know how to inline these lambdas, so there’s an additional per-element function call overhead.
Worse, if you’re using lexical scope like I told you, the above example forms a closure over e. This means a new function object is created (e.g. make-byte-code) each time expt-list is called. To be clear, I don’t mean that the lambda is recompiled each time — the same byte-code string is shared between all instances of the same lambda. A unique function vector (#[...]) and constants vector are allocated and initialized each time expt-list is invoked.
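You can observe this allocation directly. In this hypothetical sketch, two calls with the same argument return distinct function objects even though they share byte-code:

;;; -*- lexical-binding: t; -*-

(defun make-power-fn (e)
  "Return a closure that raises its argument to the power E."
  (lambda (x) (expt x e)))

;; Two separate closure objects are allocated:
(eq (make-power-fn 2) (make-power-fn 2))
;; => nil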
Related mini-guideline: Don’t create any more garbage than strictly
necessary in performance-critical code.
Compare to an implementation with an explicit loop, using the nreverse list-accumulation technique.
(defun expt-list-fast (list e)
  (let ((result ()))
    (dolist (x list (nreverse result))
      (push (expt x e) result))))
- No unnecessary garbage is created.
- No unnecessary per-element function calls.
This is the fastest possible definition for this function, and it’s
what you need to use in performance-critical code.
Personally I prefer the list comprehension approach, using cl-loop from cl-lib.
(defun expt-list-fast (list e)
  (cl-loop for x in list
           collect (expt x e)))
The cl-loop macro will expand into essentially the previous definition, making them practically equivalent. It takes some getting used to, but writing efficient loops is a whole lot less tedious with cl-loop.
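If you’re curious, you can confirm the equivalence by inspecting the expansion yourself; a quick sanity check, not something you’d leave in code:

(require 'cl-lib)

(pp (macroexpand-all
     '(cl-loop for x in list
               collect (expt x e))))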
In Emacs 24.4 and earlier, catch/throw is implemented by converting the body of the catch into a lambda function and calling it. If code inside the catch accesses a variable outside the catch (very likely), then, in lexical scope, it turns into a closure, resulting in the garbage function object like before.

In Emacs 24.5 and later, the byte-code compiler uses a new opcode, pushcatch. It’s a whole lot more efficient, and there’s no longer a reason to shy away from catch/throw in performance-critical code. This is important because it’s often the only way to perform an early bailout.
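As a concrete illustration of that early-bailout pattern, here’s a hypothetical helper; on Emacs 24.5 and later the catch compiles to a cheap pushcatch rather than a closure:

(defun first-divisible-by-7 (list)
  "Return the first element of LIST divisible by 7, or nil."
  (catch 'done
    (dolist (x list)
      (when (= 0 (% x 7))
        (throw 'done x)))))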
(4) Prefer using functions with dedicated opcodes
When following the guideline about using built-in functions, you might
have several to pick from. Some built-in functions have dedicated
virtual machine opcodes, making them much faster to invoke. Prefer
these functions when possible.
How can you tell when a function has an assigned opcode? Take a peek at the byte-defop listings in bytecomp.el. Optimization often involves getting into the weeds, so don’t be shy.
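Another way to check, if digging through bytecomp.el feels indirect, is to byte-compile a tiny wrapper and read the disassembly; a dedicated instruction shows up by name instead of a generic call:

;; The listing should contain an `assq' instruction, not a `call':
(disassemble (byte-compile (lambda (k alist) (assq k alist))))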
For example, the assq and assoc functions search for a matching key in an association list (alist). Both are built-in functions, and the only difference is that the former compares keys with eq (e.g. symbol or integer keys) and the latter with equal (typically string keys). The difference in performance between eq and equal isn’t as important as another factor: assq has its own opcode (158).

This means in performance-critical code you should prefer assq, perhaps even going as far as restructuring your alists specifically to have eq keys. That last step is probably a trade-off, which means you’ll want to make some benchmarks to help with that decision.
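A rough benchmark along those lines might look like this sketch; the alists and repetition count are made up for illustration:

(require 'benchmark)

(let ((sym-alist '((alpha . 1) (beta . 2) (gamma . 3)))
      (str-alist '(("alpha" . 1) ("beta" . 2) ("gamma" . 3))))
  ;; Compare an opcode-backed `assq' lookup against `assoc':
  (list (benchmark-run 1000000 (assq 'gamma sym-alist))
        (benchmark-run 1000000 (assoc "gamma" str-alist))))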
Another example is eq, =, eql, and equal. Some macros and functions use eql, especially cl-lib, which inherits eql as a default from Common Lisp. Take cl-case, which is like switch from the C family of languages. It compares elements with eql.
(defun op-apply (op a b)
  (cl-case op
    (:norm (+ (* a a) (* b b)))
    (:disp (abs (- a b)))
    (:isin (/ b (sin a)))))
The cl-case expands into a cond. Since Emacs byte-code lacks support for jump tables, there’s not much room for cleverness.

Update: Emacs 26.1, released May 2018, introduced a jump table opcode.
(defun op-apply (op a b)
  (cond
   ((eql op :norm) (+ (* a a) (* b b)))
   ((eql op :disp) (abs (- a b)))
   ((eql op :isin) (/ b (sin a)))))
It turns out eql is pretty much always the worst choice for cl-case. Of the four equality functions I listed, the only one lacking an opcode is eql. A faster definition would use eq. (In theory, cl-case could have done this itself because it knows all the keys are symbols.)
(defun op-apply (op a b)
  (cond
   ((eq op :norm) (+ (* a a) (* b b)))
   ((eq op :disp) (abs (- a b)))
   ((eq op :isin) (/ b (sin a)))))
Fortunately eq can safely compare integers in Emacs Lisp. You only need eql when comparing symbols, integers, and floats all at once, which is unusual.
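A few illustrative comparisons, evaluated interpreted (the byte-compiler’s constant folding can change eq results on float literals):

(eq 'foo 'foo)        ; => t, symbols are unique objects
(eq 100 100)          ; => t, fixnums compare reliably with `eq'
(eql 1.5 1.5)         ; => t, `eql' compares floats by value
(eq 1.5 (+ 0.5 1.0))  ; => nil, a freshly computed float object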
(5) Unroll loops using and/or
Consider the following function which checks its argument against a list of numbers, bailing out on the first match. I used % instead of mod since the former has an opcode (166) and the latter does not.
(defun detect (x)
  (catch 'found
    (dolist (f '(2 3 5 7 11 13 17 19 23 29 31))
      (when (= 0 (% x f))
        (throw 'found f)))))
The byte-code compiler doesn’t know how to unroll loops. Fortunately that’s something we can do for ourselves using and and or. The compiler will turn this into clean, efficient jumps in the byte-code.
(defun detect-unrolled (x)
  (or (and (= 0 (% x 2)) 2)
      (and (= 0 (% x 3)) 3)
      (and (= 0 (% x 5)) 5)
      (and (= 0 (% x 7)) 7)
      (and (= 0 (% x 11)) 11)
      (and (= 0 (% x 13)) 13)
      (and (= 0 (% x 17)) 17)
      (and (= 0 (% x 19)) 19)
      (and (= 0 (% x 23)) 23)
      (and (= 0 (% x 29)) 29)
      (and (= 0 (% x 31)) 31)))
In Emacs 24.4 and earlier with the old-fashioned lambda-based catch, the unrolled definition is seven times faster. With the faster pushcatch-based catch it’s about twice as fast. This means the loop overhead accounts for about half the work of the first definition of this function.
Update: It was pointed out in the comments that this particular example is equivalent to a cond. That’s literally true all the way down to the byte-code, and it would be a clearer way to express the unrolled code. In real code it’s often not quite equivalent.
Unlike some of the other guidelines, this is certainly something you’d
only want to do in code you know for sure is performance-critical.
Maintaining unrolled code is tedious and error-prone.
I’ve had the most success with this approach not by unrolling these loops myself, but by using a macro, or similar, to generate the unrolled form.
(defmacro with-detect (var list)
  (cl-loop for e in list
           collect `(and (= 0 (% ,var ,e)) ,e) into conditions
           finally return `(or ,@conditions)))

(defun detect-unrolled (x)
  (with-detect x (2 3 5 7 11 13 17 19 23 29 31)))
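A quick way to convince yourself the macro does the right thing is to look at its expansion:

(macroexpand '(with-detect x (2 3 5)))
;; => (or (and (= 0 (% x 2)) 2)
;;        (and (= 0 (% x 3)) 3)
;;        (and (= 0 (% x 5)) 5))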
How can I find more optimization opportunities myself?
Use M-x disassemble to inspect the byte-code for your own hot spots. Observe how the byte-code changes in response to changes in your functions. Take note of the sorts of forms that allow the byte-code compiler to produce the best code, and then exploit it where you can.
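For instance, disassembling the two definitions from guideline 5 side by side shows exactly where the loop overhead goes; disassemble byte-compiles the function first if necessary:

;; Each pops up a buffer with the byte-code listing:
(disassemble 'detect)
(disassemble 'detect-unrolled)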