Measure Elisp Object Memory Usage with Calipers

January 26, 2014

nullprogram.com/blog/2014/01/26/

A couple of weeks ago I wrote a library to measure the retained memory footprint of arbitrary Elisp objects for the purposes of optimization. It’s called Caliper.

https://github.com/skeeto/caliper

Note, Caliper requires predd, my predicate dispatch library. Neither of these packages are on MELPA or Marmalade since they’re mostly for fun.

The reason I wanted this was that I came across a post on reddit where someone had scraped 217,000 Jeopardy! questions from J! Archive and dumped them out into a single, large JSON file. The significance of the effort is that it dealt with some of the inconsistencies of J! Archive’s data presentation, normalizing them for the JSON output.

JEOPARDY_QUESTIONS1.json.gz (12MB, 53MB uncompressed)

When I want to examine a JSON dataset like this I have three preferred options:

Load it into a browser page and poke at it from JavaScript remotely with Skewer. With the JSON text weighing in at 53MB and with such a large object count, I decided this was too large for a browser page. It definitely could be done, it’s just that the browser is not the place to be working on large datasets.
Load it into Clojure. I’m familiar with Clojure’s data.json. This is not a bad choice, but there’s something else I always reach for first if I can.
Load it into Emacs using json.el (part of Emacs). This is what I ended up doing.

(defvar jeopardy
  (with-temp-buffer
    (insert-file-contents "/tmp/JEOPARDY_QUESTIONS1.json")
    (json-read)))

(length jeopardy)
;; => 216930

Here, jeopardy is bound to a vector of 216,930 association lists (alists). I’m curious exactly how much heap memory this data structure is using. To find out, we need to walk the data structure and sum the sizes of everything we come across. However, care must be taken not to count the identical objects twice, such as symbols, which, being interned, appear many times in this data.

Measuring Object Sizes

This is lisp so let’s start with the cons cell. A cons cell is just a pair of pointers, called car and cdr.

These are used to assemble lists.

So a cons cell itself — the shallow size — is two words: 16 bytes on a 64-bit operating system. To make sure Elisp doesn’t happen to have any additional information attached to cons cells, let’s take a look at the Emacs source code.

struct Lisp_Cons
  {
    /* Car of this cons cell.  */
    Lisp_Object car;

    union
    {
      /* Cdr of this cons cell.  */
      Lisp_Object cdr;

      /* Used to chain conses on a free list.  */
      struct Lisp_Cons *chain;
    } u;
  };

The return value from garbage-collect backs this up. The first value after each type is the shallow size of that type. From here on, all values have been computed for 64-bit Emacs running on x86-64 GNU/Linux.

(garbage-collect)
;; => ((conses 16 9923172 2036943)
;;     (symbols 48 57017 54)
;;     (miscs 40 10203 18892)
;;     (strings 32 4810027 197961)
;;     (string-bytes 1 104599635)
;;     (vectors 16 103138)
;;     (vector-slots 8 2921744 131076)
;;     (floats 8 12494 5816)
;;     (intervals 56 119911 69249)
;;     (buffers 960 134)
;;     (heap 1024 593412 133853))

A Lisp_Object is just a pointer to a lisp object. The retained size of a cons cell is its shallow size plus, recursively, the retained size of the objects in its car and cdr.

Integers and Floats

Integers are a special case. Elisp uses what is called tagged integers. They’re not heap-allocated objects. Instead they’re embedded inside the object pointers. That is, those Lisp_Object pointers in Lisp_Cons will hold integers directly. This means to Caliper integers have retained size of 0. We can use this to verify Caliper’s return value for cons cells.

(caliper-object-size 100)
;; => 0

(caliper-object-size (cons 100 200))
;; => 16

Tagged integers are fast and save on memory. They also compare properly with eq, which is just a pointer (identity) comparison. However, because a few bits need to be reserved for differentiating them from actual pointers these integers have a restricted dynamic range.

Floats are not tagged and exist as immutable objects in the heap. That’s why eql is still useful in Elisp — it’s like eq but will handle numbers properly. (By convention you should use eql for integers, too.)

Symbols and Strings

Not counting the string’s contents, a string’s base size is 32 bytes according to garbage-collect. The length of the string can’t be used here because that counts characters, which vary in size. There’s a string-bytes function for this. A string’s size is 32 plus its string-bytes value.

(string-bytes "naïveté")
;; => 9
(caliper-object-size "naïveté")
;; => 41  (i.e. 32 + 9)

As you can see from above, symbols are huge. Without even counting either the string holding the name of the symbol or the symbol’s plist, a symbol is 48 bytes.

(caliper-object-size 'hello)
;; => 1038

This 1,038 bytes is a little misleading. The symbol itself is 48 bytes, the string "hello" is 37 bytes, and the plist is nil. The retained size of nil is significant. On my system, nil’s plist has 4 key-value pairs, which themselves have retained sizes. When examining symbols, caliper doesn’t care if they’re interned or not, including symbols like nil and t. However, nil is only counted once, so it will have little impact on a large data structure.

Miscellaneous

Outside of vectors, measuring object sizes starts to get fuzzy. For example, it’s not possible to examine the exact internals of a hash table from Elisp. We can see its contents and the number of elements it can hold without re-sizing, but there’s intermediate structure that’s not visible. Caliper makes rough estimates for each of these types.

Circularity and Double Counting

To avoid double counting objects, a hash table with a test of eq is dynamically bound by the top level call. It’s used like a set. Before an object is examined, the hash table is checked. If the object is listed, the reported size is 0 (it consumes no additional space than already accounted for).

This automatically solves the circularity problem. There’s no way we can traverse into the same data structure a second time because we’ll stop when we see it twice.

Using Caliper

So what’s the total retained size of the jeopardy structure? About 124MB.

(caliper-object-size jeopardy)
;; => 130430198

For fun, let’s see if how much we can improve on this.

json.el will return alists for objects by default, but this can be changed by setting json-object-type to something else. Initially I thought maybe using plists instead would save space, but I later realized that plists use exactly the same number of cons cells as alists. If this doesn’t sound right, try to picture the cons cells in your head (an exercise for the reader).

(defvar jeopardy
  (let ((json-object-type 'plist))
    (with-temp-buffer
      (insert-file-contents "~/JEOPARDY_QUESTIONS1.json")
      (setf (point) (point-min))
      (json-read))))

(caliper-object-size jeopardy)
;; => 130430077 (plist)

Strangely this is 121 bytes smaller. I don’t know why yet, but in the scope of 124MB that’s nothing.

So what do these questions look like?

(elt jeopardy 0)
;; => (:show_number "4680"
;;     :round "Jeopardy!"
;;     :answer "Copernicus"
;;     :value "$200"
;;     :question "..." ;; omitted
;;     :air_date "2004-12-31"
;;     :category "HISTORY")

They’re (now) plists of 7 pairs. All of the keys are symbols, and, as such, are interned and consuming very little memory. All of the values are strings. Surely we can do better here. The strings can be interned and the numbers can be turned into tagged integers. The :category values would probably be good candidates for conversion into symbols.

Here’s an interesting fact about Jeopardy! that can be exploited for our purposes. While Jeopardy! covers a broad range of trivia, it does so very shallowly. The same answers appear many times. For example, the very first answer from our dataset, Copernicus, appears 14 times. That makes even the answers good candidates for interning.

(cl-loop for question across jeopardy
         for answer = (plist-get question :answer)
         count (string= answer "Copernicus"))
;; => 14

A string pool is trivial to implement. Just use a weak, equal hash table to track strings. Making it weak keeps it from leaking memory by holding onto strings for longer than necessary.

(defvar string-pool
  (make-hash-table :test 'equal :weakness t))

(defun intern-string (string)
  (or (gethash string string-pool)
      (setf (gethash string string-pool) string)))

(defun jeopardy-fix (question)
  (cl-loop for (key value) on question by #'cddr
           collect key
           collect (cl-case key
                     (:show_number (read value))
                     (:value (if value (read (substring value 1))))
                     (:category (intern value))
                     (otherwise (intern-string value)))))

(defvar jeopardy-interned
  (cl-map 'vector #'jeopardy-fix jeopardy))

So how are we looking now?

(caliper-object-size jeopardy-interned)
;; => 83254322

That’s down to 79MB of memory. Not bad! If we print-circle this, taking advantage of string interning in the printed representation, I wonder how it compares to the original JSON.

(with-temp-buffer
  (let ((print-circle nil))
    (prin1 jeopardy-interned (current-buffer))
    (buffer-size)))
;; => 45554437

About 44MB, down from JSON’s 53MB. With print-circle set to nil it’s about 48MB.

emacs
elisp

Have a comment on this article? Start a discussion in my public inbox by sending an email to ~skeeto/public-inbox@lists.sr.ht [mailing list etiquette] , or see existing discussions.