A couple of weeks ago I wrote a library to measure the retained memory
footprint of arbitrary Elisp objects for the purposes of optimization.
It’s called Caliper.
Note, Caliper requires predd, my predicate dispatch library.
Neither of these packages are on MELPA or Marmalade since they’re
mostly for fun.
The reason I wanted this was that I came across a post on reddit where
someone had scraped 217,000 Jeopardy! questions from
J! Archive and dumped them out into a single, large JSON
file. The significance of the effort is that it dealt with some of the
inconsistencies of J! Archive’s data presentation, normalizing them
for the JSON output.
When I want to examine a JSON dataset like this I have three preferred
with Skewer. With the JSON text weighing in at 53MB and
with such a large object count, I decided this was too large for a
browser page. It definitely could be done, it’s just that the
browser is not the place to be working on large datasets.
- Load it into Clojure. I’m familiar with Clojure’s
data.json. This is not a bad choice, but there’s
something else I always reach for first if I can.
- Load it into Emacs using json.el (part of Emacs). This is what I
ended up doing.
;; => 216930
jeopardy is bound to a vector of 216,930 association lists
(alists). I’m curious exactly how much heap memory this data structure
is using. To find out, we need to walk the data structure and sum the
sizes of everything we come across. However, care must be taken not to
count the identical objects twice, such as symbols, which, being
interned, appear many times in this data.
Measuring Object Sizes
This is lisp so let’s start with the cons cell. A cons cell is just a
pair of pointers, called car and cdr.
These are used to assemble lists.
So a cons cell itself — the shallow size — is two words: 16 bytes
on a 64-bit operating system. To make sure Elisp doesn’t happen to
have any additional information attached to cons cells, let’s take a
look at the Emacs source code.
/* Car of this cons cell. */
/* Cdr of this cons cell. */
/* Used to chain conses on a free list. */
struct Lisp_Cons *chain;
The return value from
garbage-collect backs this up. The first value
after each type is the shallow size of that type. From here on, all
values have been computed for 64-bit Emacs running on x86-64
;; => ((conses 16 9923172 2036943)
;; (symbols 48 57017 54)
;; (miscs 40 10203 18892)
;; (strings 32 4810027 197961)
;; (string-bytes 1 104599635)
;; (vectors 16 103138)
;; (vector-slots 8 2921744 131076)
;; (floats 8 12494 5816)
;; (intervals 56 119911 69249)
;; (buffers 960 134)
;; (heap 1024 593412 133853))
Lisp_Object is just a pointer to a lisp object. The retained
size of a cons cell is its shallow size plus, recursively, the
retained size of the objects in its car and cdr.
Integers and Floats
Integers are a special case. Elisp uses what is called tagged
integers. They’re not heap-allocated objects. Instead they’re
embedded inside the object pointers. That is, those
Lisp_Cons will hold integers directly. This means to
Caliper integers have retained size of 0. We can use this to verify
Caliper’s return value for cons cells.
;; => 0
(caliper-object-size (cons 100 200))
;; => 16
Tagged integers are fast and save on memory. They also compare
eq, which is just a pointer (identity) comparison.
However, because a few bits need to be reserved for differentiating
them from actual pointers these integers have a restricted dynamic
Floats are not tagged and exist as immutable objects in the heap.
eql is still useful in Elisp — it’s like
eq but will
handle numbers properly. (By convention you should use
Symbols and Strings
Not counting the string’s contents, a string’s base size is 32 bytes
length of the string can’t be
used here because that counts characters, which vary in size. There’s
string-bytes function for this. A string’s size is 32 plus its
;; => 9
;; => 41 (i.e. 32 + 9)
As you can see from above, symbols are huge. Without even counting
either the string holding the name of the symbol or the symbol’s
plist, a symbol is 48 bytes.
;; => 1038
This 1,038 bytes is a little misleading. The symbol itself is 48
bytes, the string
"hello" is 37 bytes, and the plist is nil. The
retained size of
nil is significant. On my system, nil’s plist has 4
key-value pairs, which themselves have retained sizes. When examining
symbols, caliper doesn’t care if they’re interned or not, including
t. However, nil is only counted once, so it
will have little impact on a large data structure.
Outside of vectors, measuring object sizes starts to get fuzzy. For
example, it’s not possible to examine the exact internals of a hash
table from Elisp. We can see its contents and the number of elements
it can hold without re-sizing, but there’s intermediate structure
that’s not visible. Caliper makes rough estimates for each of these
Circularity and Double Counting
To avoid double counting objects, a hash table with a test of
dynamically bound by the top level call. It’s used like a set. Before
an object is examined, the hash table is checked. If the object is
listed, the reported size is 0 (it consumes no additional space than
already accounted for).
This automatically solves the circularity problem. There’s no way we
can traverse into the same data structure a second time because we’ll
stop when we see it twice.
So what’s the total retained size of the
jeopardy structure? About
;; => 130430198
For fun, let’s see if how much we can improve on this.
json.el will return alists for objects by default, but this can be
changed by setting
json-object-type to something else. Initially I
thought maybe using plists instead would save space, but I later
realized that plists use exactly the same number of cons cells as
alists. If this doesn’t sound right, try to picture the cons cells
in your head (an exercise for the reader).
(let ((json-object-type 'plist))
(setf (point) (point-min))
;; => 130430077 (plist)
Strangely this is 121 bytes smaller. I don’t know why yet, but in the
scope of 124MB that’s nothing.
So what do these questions look like?
(elt jeopardy 0)
;; => (:show_number "4680"
;; :round "Jeopardy!"
;; :answer "Copernicus"
;; :value "$200"
;; :question "..." ;; omitted
;; :air_date "2004-12-31"
;; :category "HISTORY")
They’re (now) plists of 7 pairs. All of the keys are symbols, and, as
such, are interned and consuming very little memory. All of the values
are strings. Surely we can do better here. The strings can be interned
and the numbers can be turned into tagged integers. The :category
values would probably be good candidates for conversion into symbols.
Here’s an interesting fact about Jeopardy! that can be exploited for
our purposes. While Jeopardy! covers a broad range of trivia,
it does so very shallowly. The same answers appear many
times. For example, the very first answer from our dataset,
Copernicus, appears 14 times. That makes even the answers good
candidates for interning.
(cl-loop for question across jeopardy
for answer = (plist-get question :answer)
count (string= answer "Copernicus"))
;; => 14
A string pool is trivial to implement. Just use a weak,
table to track strings. Making it weak keeps it from leaking memory by
holding onto strings for longer than necessary.
(make-hash-table :test 'equal :weakness t))
(defun intern-string (string)
(or (gethash string string-pool)
(setf (gethash string string-pool) string)))
(defun jeopardy-fix (question)
(cl-loop for (key value) on question by #'cddr
collect (cl-case key
(:show_number (read value))
(:value (if value (read (substring value 1))))
(:category (intern value))
(otherwise (intern-string value)))))
(cl-map 'vector #'jeopardy-fix jeopardy))
So how are we looking now?
;; => 83254322
That’s down to 79MB of memory. Not bad! If we
taking advantage of string interning in the printed representation, I
wonder how it compares to the original JSON.
(let ((print-circle nil))
(prin1 jeopardy-interned (current-buffer))
;; => 45554437
About 44MB, down from JSON’s 53MB. With
print-circle set to nil it’s about